Language models keep growing in size, following the belief that larger models are inherently better, a principle captured by the scaling laws of large language models. Scaling typically means adding more layers and parameters, which demands significant computational resources and makes training such large language models challenging.

However, the Mixture of Experts approach offers an alternative that addresses some of these scaling issues. There is speculation that models like ChatGPT and GPT-4 employ it. While I can't confirm that, there is a paper showing that mixture-of-experts models can surpass GPT-3 in many cases.

Let's explore this paper to better understand the claimed advantages and improvements over GPT-3.

Mixture of Experts

[Figure: a transformer block, with the feedforward (FFN) sub-layer highlighted]

This is a transformer block. The major computational bottleneck in transformers is the FFN, the feedforward neural network component. Instead of a single FFN layer, can we replace it with multiple FFN modules? We can. These individual FFN modules are called experts. Doing this has several advantages:

  1. Experts can specialise in certain kinds of inputs: depending on the input, we can choose the expert that is most suitable to handle it. For example, one expert might handle English and another Hindi, or one expert might handle one kind of word and another a different kind.
  2. Scaling the model size without really scaling the compute: adding more experts per layer increases the size of the model, but we can choose only a few experts per token for processing, which keeps the compute per token low (see the sketch after this list).
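
To make this concrete, here is a minimal sketch of such a layer in PyTorch, assuming a simple linear router and top-2 selection. The class names, dimensions, and gating details are illustrative, not taken from any particular implementation.

```python
# Minimal sketch of a mixture-of-experts FFN layer (illustrative, simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """One expert: a standard transformer feedforward block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Replaces a single FFN with `num_experts` expert FFNs and a top-k router."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(ExpertFFN(d_model, d_ff) for _ in range(num_experts))
        self.router = nn.Linear(d_model, num_experts)  # one score per expert
        self.top_k = top_k

    def forward(self, x):                      # x: (num_tokens, d_model)
        gate_logits = self.router(x)           # (num_tokens, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e   # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

Even though the layer holds `num_experts` full FFNs, each token only passes through `top_k` of them, which is why the parameter count and the compute per token can be scaled almost independently.
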

There are several challenges in training mixture of experts models:

  1. How do you decide which expert handles which tokens? This is called routing. If routing is not done properly, a few experts end up receiving most of the tokens while the rest are rarely used, and the model gains little from the extra capacity. A sketch of a common fix follows this list.
  2. How many experts to add? In some cases it is clear; for example, if you have 3 languages, you might use one expert per language. In most cases, however, it is not obvious how many experts the model needs, and the choice is left to experimentation.
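
As a rough illustration of how the routing problem is usually tackled, here is a sketch of an auxiliary load-balancing loss in PyTorch. The formulation follows the widely used fraction-of-tokens times mean-gate-probability form from the GShard / Switch Transformer line of work, not anything specific to GLaM, and the function name is my own.

```python
# Sketch of an auxiliary load-balancing loss that discourages routing collapse.
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits: torch.Tensor, top1_indices: torch.Tensor) -> torch.Tensor:
    """gate_logits: (num_tokens, num_experts); top1_indices: (num_tokens,)."""
    num_experts = gate_logits.size(-1)
    probs = F.softmax(gate_logits, dim=-1)                    # router probabilities
    # fraction of tokens whose top-1 choice is each expert
    counts = F.one_hot(top1_indices, num_experts).float().mean(dim=0)
    # mean router probability assigned to each expert
    mean_probs = probs.mean(dim=0)
    # minimised when tokens are spread uniformly across experts
    return num_experts * torch.sum(counts * mean_probs)
```

This term is typically added to the language-modelling loss with a small weight, so the router learns to spread tokens across experts without hurting the main objective too much.
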

The GLaM paper is the first to add mixture of experts to a decoder-only language model. Using half the number of parameters at inference, it performs better than GPT-3. Here is their recipe for success.

Model - Use the mixture of experts architecture shown above. They are the first to test it on a decoder-only language model with the next-word prediction objective. They add mixture of experts in every other layer of the transformer. The number of experts ranges between 32 and 256 per layer, and when routing tokens they select only the top-2 experts.
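
The every-other-layer layout is easy to picture with a short sketch. The code below reuses the ExpertFFN and MoELayer classes from the earlier sketch and leaves out the attention sub-layers so the focus stays on the FFN/MoE alternation; the expert count and top-2 routing follow the description above, everything else is illustrative.

```python
# Sketch of the GLaM-style layout: a mixture-of-experts FFN in every other block.
import torch.nn as nn

def build_ffn_stack(num_layers: int, d_model: int, d_ff: int,
                    num_experts: int = 64, top_k: int = 2) -> nn.ModuleList:
    """Return the FFN sub-layers of a decoder: dense and MoE blocks alternate."""
    blocks = []
    for i in range(num_layers):
        if i % 2 == 1:
            # odd layers: many experts, but only top_k are active per token
            blocks.append(MoELayer(d_model, d_ff, num_experts=num_experts, top_k=top_k))
        else:
            # even layers: an ordinary dense FFN
            blocks.append(ExpertFFN(d_model, d_ff))
    return nn.ModuleList(blocks)
```

Note that the total parameter count grows roughly linearly with the number of experts, while the compute per token depends only on how many experts are activated, which is what lets GLaM hold far more parameters than GPT-3 while activating fewer of them at inference.
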

Dataset - The paper collects a new dataset of 1.6 trillion tokens from the web, following the same preprocessing process as GPT-3. They also include conversations from social media to increase the diversity of the data.

Evaluation - Similar to other papers, they test zero-shot and few-shot performance on natural language generation and natural language understanding datasets. These are either QA datasets that require generating an answer or tasks where the model chooses from multiple options.
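
For the multiple-choice style tasks, a common way to evaluate a language model is to score each candidate option by its log-likelihood under the model and pick the highest. Here is a sketch of that idea, assuming a generic causal LM and tokenizer with a Hugging Face-style interface; the function names are mine and this is not the paper's exact evaluation code.

```python
# Sketch: score multiple-choice options by the model's log-likelihood of each option.
import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prompt: str, option: str) -> float:
    """Sum of token log-probabilities the model assigns to `option` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)  # predictions for tokens 1..end
    targets = full_ids[:, 1:]                          # next-token targets
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # simplification: re-tokenising prompt+option can merge tokens at the boundary
    option_len = full_ids.size(1) - prompt_ids.size(1)
    return token_lp[:, -option_len:].sum().item()      # score only the option tokens

def pick_answer(model, tokenizer, prompt: str, options: list[str]) -> str:
    return max(options, key=lambda o: score_option(model, tokenizer, prompt, o))
```
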

Looking at the results, I will leave you with three things that I learned from the paper.

Three Things that I learned from the paper

Dense models use too much computation - MoE models achieve better results while using only about 50% of the parameters of GPT-3 at inference.

MoE models scale - When the same number of experts is added to a bigger model, it performs better than when they are added to a smaller model, so the benefits of MoE hold up as the base model grows.

Better convergence - The MoE model obtains better results than a dense model when both are trained on the same number of tokens.

These models are private, not open source. Even GPT-4 has been purported to be a mixture of experts model. Keep an eye out for these kinds of models.