Mixture of Experts (MoE) Architecture
Does it have a big future?
There are several ways to reduce the cost of running inference with a model. Not all of a model's parameters are needed, so after training one can thin out the weights that are no longer useful. The most commonly used techniques to reduce the memory footprint and computational cost of an LLM are:
Quantization
Pruning
It is the elimination of weights that are considered unimportant, without affecting performance too much (a minimal sketch follows this list).
Knowledge distillation
It leverages the knowledge of an LLM to train a much smaller and more specialised model.
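To make the pruning idea concrete, here is a minimal magnitude-pruning sketch in PyTorch. The single linear layer, its size, and the 30% sparsity target are illustrative assumptions, not a recipe used by any particular model.

```python
import torch
import torch.nn as nn

# Hypothetical layer standing in for one weight matrix of a larger model.
layer = nn.Linear(1024, 1024)

# Magnitude pruning: zero out the 30% of weights with the smallest absolute value.
with torch.no_grad():
    magnitudes = layer.weight.abs()
    threshold = torch.quantile(magnitudes.flatten(), 0.30)
    mask = (magnitudes > threshold).float()
    layer.weight.mul_(mask)

sparsity = 1.0 - mask.mean().item()
print(f"Fraction of weights pruned: {sparsity:.2%}")  # roughly 30%
```

In practice the pruned model is usually fine-tuned briefly afterwards to recover any lost accuracy.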
These techniques are applied to models after training. There is also research that focuses on changing the structure of the model itself to make it more efficient:
Compact architecture design
An approach that tries to replace self-attention (the most expensive component of the model) with cheaper alternatives.
Dynamic networks
Networks in which only certain substructures are active at any given time, such as a Mixture of Experts.
Mixture of Experts
It enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, an MoE model should reach the same quality as its dense counterpart much faster during pretraining. It has two main elements:
Sparse MoE layers
These layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts“, where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks, or even an MoE itself, leading to hierarchical MoEs.
A gate network or router
It determines which tokens are sent to which expert. For example, in the image below, the token “More“ is sent to the second expert, and the token “Parameters“ is sent to the first one. As we will explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network.
As mentioned, MoE models replace every FFN layer of the transformer with an MoE layer, which is composed of a gate network and a certain number of experts.
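To make the structure concrete, below is a minimal PyTorch sketch of an MoE layer with a top-2 router. The dimensions, expert count, and the simple loop-based dispatch are illustrative assumptions, not the implementation of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard FFN block; in an MoE layer, each expert is one of these."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    """Replaces a dense FFN layer: a learned router picks top-k experts per token."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.router = nn.Linear(d_model, n_experts)  # the gate network
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                routed = idx[:, k] == e        # tokens sent to expert e in slot k
                if routed.any():
                    out[routed] += weights[routed, k:k+1] * expert(x[routed])
        return out

# Usage: 10 tokens of width 512, each routed to 2 of 8 experts.
tokens = torch.randn(10, 512)
moe = MoELayer(d_model=512, d_ff=2048)
print(moe(tokens).shape)  # torch.Size([10, 512])
```

Production implementations dispatch tokens in parallel and add a load-balancing loss so that no expert is starved, but the routing logic is the same idea as in this loop.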
Challenges
Although MoEs provide benefits like efficient pretraining and faster inference compared to dense models, they also come with challenges:
Training
MoEs enable significantly more compute-efficient pretraining, but they have historically struggled to generalise during fine-tuning, leading to overfitting.
Inference
Although an MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded into RAM, so memory requirements are high.
For example, we need enough memory to hold a dense 47B-parameter model to run inference with Mixtral 8x7B, rather than 8 × 7B = 56B. This is because in MoE models only the FFN layers are treated as individual experts, while the rest of the model parameters are shared.
At the same time, assuming just two experts are used per token, the inference speed (in FLOPs) is like that of a 12B model, because it computes two 7B matrix multiplications per token, but with some layers shared (see the sketch below).
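The arithmetic can be sketched as follows. The layer count, hidden sizes, and shared-parameter estimate are rough assumptions modelled on public descriptions of Mixtral 8x7B, and they land close to the ~47B total / ~12B active figures above.

```python
# Back-of-the-envelope parameter count for a Mixtral-8x7B-style model
# (approximate figures; the exact architecture details are an assumption here).
n_layers  = 32
d_model   = 4096
d_ff      = 14336
n_experts = 8
active_k  = 2                                # experts used per token

expert_params = 3 * d_model * d_ff           # SwiGLU FFN: three weight matrices
shared_params = 1.6e9                        # attention, embeddings, norms (rough estimate)

total_params  = shared_params + n_layers * n_experts * expert_params
active_params = shared_params + n_layers * active_k  * expert_params

print(f"total:  ~{total_params / 1e9:.0f}B parameters must sit in memory")   # ~47B
print(f"active: ~{active_params / 1e9:.0f}B parameters are used per token")  # ~13B
```

The gap between the two numbers is exactly the memory-versus-compute trade-off described above: you pay for all experts in RAM, but only for the active ones in FLOPs.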
Summary
MoEs are pretrained much faster than dense models and have faster inference than a dense model with the same number of parameters. However, they require a lot of VRAM, as all experts must be loaded in memory, and they also face challenges in fine-tuning.
Acknowledgements
https://levelup.gitconnected.com/llm-redundancy-it-is-time-for-a-massive-layoff-of-layers-948827aa926e
https://huggingface.co/blog/moe


