Language Modeling

BERT on WikiText

  • WikiText

    • The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

  • BERT out of Google Research

    • One of the first models to be considered an LLM

  • When training from scratch, we found that the network width yielding 79 million parameters achieved the best loss on this dataset. We then trained other widths with and without Dendrites (see the configuration sketch after this list).

    • No modifications were made to depth or to any other settings in the training pipeline

  • By adding one Dendrite to an 11M-parameter BERT model, we matched the 79M model’s loss with 75% fewer parameters.

    • The final model contains 19M parameters, including the Dendrite parameters

  • Adding six Dendrites improves the loss further while keeping the model 30% smaller than the optimal model without Dendrites.
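
As a rough illustration of the width sweep referenced above, the sketch below builds from-scratch BERT configurations at several hidden sizes and reports their parameter counts. It assumes the Hugging Face `transformers` and `datasets` libraries and the WikiText-103 variant of the dataset; the hidden sizes shown are illustrative guesses, not the exact widths we trained, and the Dendrite additions themselves are not part of this sketch.

```python
# Minimal sketch of a width sweep for from-scratch BERT on WikiText.
# Assumptions: Hugging Face `transformers` and `datasets`, WikiText-103-raw,
# and illustrative hidden sizes (not the exact widths used in the results above).
from datasets import load_dataset
from transformers import BertConfig, BertForMaskedLM

# WikiText language modeling data (>100M tokens in the -103 variant)
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

def bert_with_width(hidden_size: int) -> BertForMaskedLM:
    """Build a from-scratch BERT whose size is controlled only by its width."""
    config = BertConfig(
        hidden_size=hidden_size,
        num_attention_heads=hidden_size // 64,  # keep per-head size fixed at 64
        intermediate_size=4 * hidden_size,      # standard 4x feed-forward expansion
        num_hidden_layers=12,                   # depth left unchanged across the sweep
    )
    return BertForMaskedLM(config)

# Sweep widths and report parameter counts; the Dendrite variants described above
# start from one of the smaller models and add Dendrite parameters on top of it.
for width in (128, 256, 512, 768):
    model = bert_with_width(width)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"hidden_size={width}: {n_params / 1e6:.1f}M parameters")
```

Only width is varied here, mirroring the setup above in which depth and the rest of the training pipeline are held fixed while the loss of each size is compared.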