
Language Modeling
BERT on WikiText
WikiText
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
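For reference, the corpus can be loaded with the Hugging Face datasets library. The sketch below assumes the 103-million-token raw variant; the exact split and preprocessing used in these experiments are not specified here.

```python
# Load WikiText (assuming the WikiText-103 raw variant) via Hugging Face datasets.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext["train"][10]["text"])  # inspect one raw passage
```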
BERT, from Google Research
One of the first models to be considered an LLM
When training from scratch, we found that a network width yielding 79 million parameters achieved the best loss on this dataset. We then trained other sizes with and without Dendrites.
No other modifications were made to depth or to any other parameters in the training pipeline.
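The width sweep can be illustrated with standard Hugging Face components. The sketch below is an assumption-laden example, not the exact pipeline used here: the hidden sizes, head counts, and fixed depth of 12 layers are placeholders chosen only to show how width is varied while everything else stays constant.

```python
# Illustrative width sweep for BERT (assumed sizes, not the configurations used
# in these experiments). Only width-related dimensions change; depth is fixed.
from transformers import BertConfig, BertForMaskedLM

def bert_at_width(hidden_size: int, num_heads: int) -> BertForMaskedLM:
    config = BertConfig(
        hidden_size=hidden_size,
        num_attention_heads=num_heads,
        intermediate_size=4 * hidden_size,  # standard 4x feed-forward expansion
        num_hidden_layers=12,               # depth left unchanged across sizes
    )
    return BertForMaskedLM(config)

# Hypothetical widths: report parameter counts to find the size nearest a budget.
for hidden, heads in [(256, 4), (384, 6), (512, 8), (768, 12)]:
    model = bert_at_width(hidden, heads)
    print(f"hidden={hidden}: {model.num_parameters() / 1e6:.1f}M parameters")
```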
By adding one Dendrite to a BERT model with 11M parameters, we matched the 79M model's score with 75% fewer parameters.
Our final model contains 19M parameters, including the Dendrite parameters.
Adding six Dendrites provides a further loss improvement while keeping the model 30% smaller than the optimal model without Dendrites.
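As a rough conceptual illustration only, and not the actual Dendrite implementation or its API, the sketch below shows one way a small learned auxiliary branch could be attached to an existing linear layer so that the added parameters stay a small fraction of the base model. The module name, branch width, and gating are all assumptions.

```python
# Conceptual sketch only: a small auxiliary branch attached to an existing
# linear layer. This is NOT the actual Dendrite implementation; the class name,
# branch width, and zero-initialized gate are illustrative assumptions.
import torch
import torch.nn as nn

class DendriteAugmentedLinear(nn.Module):
    def __init__(self, base: nn.Linear, dendrite_width: int = 32):
        super().__init__()
        self.base = base                                  # original layer, unchanged
        self.dendrite = nn.Sequential(                    # small added branch
            nn.Linear(base.in_features, dendrite_width),
            nn.Tanh(),
            nn.Linear(dendrite_width, base.out_features),
        )
        self.gate = nn.Parameter(torch.zeros(1))          # starts with no effect

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.gate * self.dendrite(x)
```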