Language models are continually increasing in size, following the belief that larger models are inherently better, a principle known as the scaling laws of large language models. Scaling typically means adding more layers and parameters, which demands significant computational resources and makes training such large language models complex.
However, the Mixture of Experts (MoE) architecture offers an alternative that addresses some of these scaling issues. There is speculation that models like ChatGPT and GPT-4 employ this approach. While I can't confirm that, there is a paper showing that MoE models surpass GPT-3 in many cases.
Let's explore this paper to better understand the claimed advantages and improvements over GPT-3.
Consider a standard transformer block. The major computational bottleneck in transformers is the FFN, or feed-forward network, component. Instead of a single FFN, can we replace it with multiple FFN modules? We can. These individual FFN modules are called experts. Because each token is routed to only a few of them, the model gains capacity without a proportional increase in compute per token.
There are, however, many challenges in training mixture-of-experts models: the router must keep the load balanced across experts, and training can be unstable.
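To make this concrete, here is a minimal sketch of a mixture-of-experts layer in NumPy. It is illustrative only, not GLaM's actual implementation: the weights are random, and each token is routed to its single best-scoring expert (top-1) for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4

# One FFN "expert": two linear layers with a ReLU in between,
# i.e. the standard transformer feed-forward block.
def make_expert():
    return (rng.standard_normal((d_model, d_ff)) * 0.1,
            rng.standard_normal((d_ff, d_model)) * 0.1)

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

experts = [make_expert() for _ in range(n_experts)]

# Router: a linear projection that scores each expert for each token.
w_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(x):
    """x: (n_tokens, d_model). Each token is processed only by its
    highest-scoring expert, so compute per token stays constant no
    matter how many experts the layer holds."""
    scores = x @ w_gate                 # (n_tokens, n_experts)
    choice = scores.argmax(axis=-1)     # chosen expert index per token
    out = np.empty_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = ffn(x[mask], w1, w2)
    return out

tokens = rng.standard_normal((5, d_model))
print(moe_layer(tokens).shape)  # (5, 8)
```

The key property to notice is that adding more experts grows the parameter count but not the per-token FLOPs, since each token still passes through exactly one FFN.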
The GLaM paper is the first to add mixture of experts to a decoder-only language model. With half the number of parameters used at inference, it performs better than GPT-3. Here is their recipe for success.
Model - Use the mixture-of-experts architecture shown above. They are the first to test it on a decoder-only language model with the next-word-prediction objective. They add mixture of experts in every other layer of the transformer, with between 32 and 256 experts per layer. When routing tokens, they select only the top-2 experts.
Dataset - The paper collects a new dataset of 1.6 trillion tokens from the web, following the same preprocessing process as GPT-3. They also include conversations from social media to increase the diversity of the data.
Evaluation - Similar to other papers, they test zero-shot and few-shot performance on natural language generation and natural language understanding datasets. These are either QA datasets that require generation or tasks where the model chooses from multiple options.
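The top-2 routing from the model recipe can be sketched as follows. This is a simplified illustration with random weights, not the paper's code: each token's router logits are computed, the two highest-scoring experts are selected, and a softmax over just those two logits gives the mixing weights for combining their outputs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, n_experts = 6, 8, 32

x = rng.standard_normal((n_tokens, d_model))
w_gate = rng.standard_normal((d_model, n_experts)) * 0.1

logits = x @ w_gate                          # (n_tokens, n_experts)
top2 = np.argsort(logits, axis=-1)[:, -2:]   # indices of the 2 best experts
top2_logits = np.take_along_axis(logits, top2, axis=-1)

# Softmax over only the two selected logits yields the mixing weights,
# so each token's output is a weighted sum of exactly two expert outputs.
exp = np.exp(top2_logits - top2_logits.max(axis=-1, keepdims=True))
weights = exp / exp.sum(axis=-1, keepdims=True)

print(top2.shape, weights.shape)  # (6, 2) (6, 2)
```

Since only 2 of the 32-256 experts run for any given token, the activated parameter count per token is a small fraction of the total, which is how GLaM serves a much larger model at a fraction of the inference cost.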
Looking at the results, I will leave you with three things I learned from the paper.
Dense models use too much computation - MoE models achieve better results using only 50% of the parameters of GPT-3.
MoE models scale - Adding x experts to a bigger model improves performance more than adding the same x experts to a smaller model.
Better convergence - The MoE model obtains better results than a dense model when both are trained on the same number of tokens.
These models are private; they are not open source. Even GPT-4 has been purported to be a mixture-of-experts model. Keep an eye out for these kinds of models.