In a recent interview, George Hotz claimed that GPT-4 is just an eight-way mixture of 220B-parameter models, i.e. a Mixture of Experts (MoE) model. That estimate puts GPT-4 at roughly 1.76 trillion parameters (8 x 220 billion).
Most models reuse the same parameters for every input. Mixture of Experts models instead select different parameters depending on the example, so you end up with a sparsely activated ensemble.
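For a sense of scale, here's the back-of-the-envelope arithmetic, assuming the rumoured figures above and a typical top-2 routing scheme (both the per-expert size and the value of k are assumptions, not confirmed numbers):

```python
# Rough parameter arithmetic for the rumoured GPT-4 MoE setup.
# The 220B-per-expert figure and top-2 routing are assumptions, not confirmed.
n_experts = 8
params_per_expert = 220e9
active_experts_per_token = 2  # typical top-k routing uses k = 2

total_params = n_experts * params_per_expert
active_params = active_experts_per_token * params_per_expert
print(f"total:  {total_params / 1e12:.2f}T parameters")  # ~1.76T
print(f"active: {active_params / 1e12:.2f}T per token")  # ~0.44T
```

The point of the sparsity: even if the total parameter count is huge, only a fraction of it runs for any given token.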
Routing between multiple models isn't easy: there's overhead in communicating between the experts and deciding where each example goes. In architectures like Switch Transformers (by researchers at Google), a gating network (typically a small neural network) produces a sparse distribution over the available experts. It might keep only the top-k highest-scoring experts, or use a softmax gating function that encourages the network to select only a few experts.
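Here's a minimal sketch of what top-k gating looks like in practice. This is PyTorch with toy dimensions; `d_model`, `n_experts`, and `k` are illustrative values, not GPT-4's real ones, and the per-token loop is for clarity rather than efficiency:

```python
import torch
import torch.nn.functional as F

d_model, n_experts, k = 512, 8, 2

gate = torch.nn.Linear(d_model, n_experts)  # gating network scores each expert
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)

def moe_forward(x):                               # x: (tokens, d_model)
    logits = gate(x)                              # (tokens, n_experts)
    topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep only the k best experts
    weights = F.softmax(topk_vals, dim=-1)        # renormalise over the chosen experts
    out = torch.zeros_like(x)
    for slot in range(k):
        for e in range(n_experts):
            mask = topk_idx[:, slot] == e         # tokens routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * experts[e](x[mask])
    return out

tokens = torch.randn(16, d_model)
print(moe_forward(tokens).shape)                  # torch.Size([16, 512])
```

Real implementations batch tokens per expert and shard experts across devices, which is exactly where the communication overhead comes from.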
Getting the load balance right is still tricky: you have to make sure a few specific experts aren't chosen far more often than the rest.
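One common fix is an auxiliary load-balancing loss. Here's a rough sketch of the Switch Transformers-style version (variable names are mine), which penalises routing distributions where a few experts absorb most of the traffic:

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(gate_logits, topk_idx, n_experts):
    """Encourage tokens to be spread evenly across experts."""
    # fraction of tokens actually dispatched to each expert (hard top-k choices)
    dispatch = F.one_hot(topk_idx.flatten(), n_experts).float().mean(dim=0)
    # mean router probability assigned to each expert (soft distribution)
    probs = F.softmax(gate_logits, dim=-1).mean(dim=0)
    # both terms equal 1/n_experts when load is perfectly balanced,
    # so the product is minimised by a uniform routing distribution
    return n_experts * torch.sum(dispatch * probs)
```

This term gets added to the main language-modelling loss with a small coefficient, so the router learns to spread load without being forced into a rigid assignment.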
Some other interesting facts and implications:
- GPT-4 costs roughly 10x as much as GPT-3.
- GLaM is Google's 1.2T-parameter model with 64 experts.
- There's also some analysis on unified scaling laws for routed language models.
- Ensembles of networks were some of the most powerful models in earlier eras of neural network deployment. Maybe that's still the case.
thanks for the callout! would appreciate [signal boost](https://twitter.com/swyx/status/1671183813190504448) on the source podcast :)