
Language Modeling
BERT on WikiText
WikiText
The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
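For reference, the corpus can be loaded with the Hugging Face datasets library. The sketch below assumes the 103-million-token raw variant; the exact split and preprocessing used in these experiments are not specified here.

```python
# Load WikiText (assuming the WikiText-103 raw variant) via Hugging Face datasets.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext["train"][10]["text"])  # inspect one raw passage
```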
BERT, from Google Research
One of the first models to be considered an LLM
When training from scratch, we found that a network width yielding 79 million parameters achieved the best loss on this dataset. We then trained other sizes with and without Dendrites.
No other modifications were made to depth or to any other parameters in the training pipeline.
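The width sweep can be illustrated with standard Hugging Face components. The sketch below is an assumption-laden example, not the exact pipeline used here: the hidden sizes, head counts, and fixed depth of 12 layers are placeholders chosen only to show how width is varied while everything else stays constant.

```python
# Illustrative width sweep for BERT (assumed sizes, not the configurations used
# in these experiments). Only width-related dimensions change; depth is fixed.
from transformers import BertConfig, BertForMaskedLM

def bert_at_width(hidden_size: int, num_heads: int) -> BertForMaskedLM:
    config = BertConfig(
        hidden_size=hidden_size,
        num_attention_heads=num_heads,
        intermediate_size=4 * hidden_size,  # standard 4x feed-forward expansion
        num_hidden_layers=12,               # depth left unchanged across sizes
    )
    return BertForMaskedLM(config)

# Hypothetical widths: report parameter counts to find the size nearest a budget.
for hidden, heads in [(256, 4), (384, 6), (512, 8), (768, 12)]:
    model = bert_at_width(hidden, heads)
    print(f"hidden={hidden}: {model.num_parameters() / 1e6:.1f}M parameters")
```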
By adding one Dendrite to a BERT model with 11M parameters, we matched the 79M model's score with 75% fewer parameters.
Our final model contains 19M parameters, including the Dendrite parameters.
Adding six Dendrites provides a further loss improvement while keeping the model 30% smaller than the optimal model without Dendrites.
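As a rough conceptual illustration only, and not the actual Dendrite implementation or its API, the sketch below shows one way a small learned auxiliary branch could be attached to an existing linear layer so that the added parameters stay a small fraction of the base model. The module name, branch width, and gating are all assumptions.

```python
# Conceptual sketch only: a small auxiliary branch attached to an existing
# linear layer. This is NOT the actual Dendrite implementation; the class name,
# branch width, and zero-initialized gate are illustrative assumptions.
import torch
import torch.nn as nn

class DendriteAugmentedLinear(nn.Module):
    def __init__(self, base: nn.Linear, dendrite_width: int = 32):
        super().__init__()
        self.base = base                                  # original layer, unchanged
        self.dendrite = nn.Sequential(                    # small added branch
            nn.Linear(base.in_features, dendrite_width),
            nn.Tanh(),
            nn.Linear(dendrite_width, base.out_features),
        )
        self.gate = nn.Parameter(torch.zeros(1))          # starts with no effect

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.gate * self.dendrite(x)
```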