First Open Source Model Truly Better than GPT-3.5 & I Was Wrong Again

Europe Proves Me Wrong + What on Earth is MoE + Humane AI Pin is Doomed + Anthropic is Accelerating Backwards

Welcome to edition #30 of the “No Longer a Nincompoop with Nofil” newsletter.

Here’s the tea ☕️ 

  • I was wrong…again! 🥲 

  • GPT-3.5 level open source models are here 🤯 


Never doubt the EU

I doubted. I doubted the EU’s ability to regulate. I was wrong. The EU managed to reach a deal to pass their AI Act. I’ve mentioned some of the details in previous newsletters so I won’t get into it here, but it’s important to note that this deal is more like a political handshake. Technical teams still have to go and hash out the details of the bill, which might be practically impossible, but what needs doing will get done, one way or another.

There is strong opposition from France though. Why? Probably because they have one of the best AI labs in the world.

I must apologise

When Mistral was first formed and raised over $100M pre-product, I, alongside many others, dismissed it as nothing but hype. I acknowledged the expertise of the founding team and knew that they were world class - but I just didn’t have enough faith.

They’ve proven everyone wrong, including me. Mistral released their new MoE (Mixture of Experts) model called Mixtral 8×7B. It’s better than Meta’s Llama 2 and on many benchmarks is better than GPT-3.5 as well. Early hands-on testing has shown that, in many cases, it really is on par with or better than GPT-3.5.

This means my prediction was correct. We have an open source model as good as, if not better than, GPT-3.5 by the end of the year. Let’s talk about it.

Mixture of Experts (MoE)???

An MoE model has a layer of smaller neural networks called experts that work together to interpret input data. A gating network outputs probabilities for which expert is best suited to handle each piece of input, with each expert specialising in certain tasks. Mixtral only uses 2 experts at a time when giving responses, so it runs like a 12B model.

This is going to be an extremely simplified and ultra high level description. There’s a lot going on so I’ll keep it basic. To put it simply, an AI model can have an MoE layer. The MoE layer contains “experts”. What are experts? An expert is its own sub-model or neural network. This can even be its own MoE (!), but let’s not mindf*** ourselves here.

The idea behind the MoE model is that different parts of input data may require different processing strategies, and that it’s more efficient to use smaller, specialised models to handle each part of the data rather than using a single, larger model for the entire input. By dividing the input data among multiple smaller “expert” models, the MoE model can lead to better performance on a wide variety of tasks.

So Mixtral has 8 experts, hence the name 8×7B. Does this mean there are eight 7B-parameter models in its MoE layer? Unfortunately, it’s not that simple. Without going into details, some parameters are shared amongst experts. So the total parameter count of Mixtral is actually 45B, not 8 × 7B = 56B.
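To make the shared-parameter point concrete, here is a back-of-the-envelope sketch in Python. It assumes a deliberately simplified model in which the total is "shared parameters plus 8 experts" and the running cost is "shared parameters plus 2 experts", and it plugs in the rounded 45B and 12B figures from this newsletter. The variable names and the resulting split are illustrative assumptions, not Mistral’s published breakdown.

```python
# Rounded figures quoted in this newsletter, in billions of parameters.
TOTAL_PARAMS = 45.0    # everything: shared layers plus all 8 experts
ACTIVE_PARAMS = 12.0   # what actually runs when 2 experts are used
NUM_EXPERTS = 8
EXPERTS_USED = 2

# Assumed simplified model (not Mistral's official breakdown):
#   total  = shared + NUM_EXPERTS  * per_expert
#   active = shared + EXPERTS_USED * per_expert
# Subtracting the two equations isolates the per-expert size.
per_expert = (TOTAL_PARAMS - ACTIVE_PARAMS) / (NUM_EXPERTS - EXPERTS_USED)
shared = ACTIVE_PARAMS - EXPERTS_USED * per_expert


def active_params(experts_used):
    """Parameters touched per input under the simplified model above."""
    return shared + experts_used * per_expert


print(per_expert)        # 5.5
print(shared)            # 1.0
print(active_params(3))  # 17.5
```

Under these assumptions each expert is closer to 5.5B than 7B, with about 1B shared. The real numbers will differ since the inputs are rounded, but it shows why "8×7B" does not add up to 56B.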

Okay, so when you talk to Mixtral, your input is split up and given to certain experts. How is this done?

Experts in an MoE model are selected based on their suitability for handling input data. Something called a gating network takes the input data and outputs a set of probabilities indicating which expert would be best suited for handling certain data. The input data is then routed to the appropriate experts based on these probabilities. Mixtral only routes to 2 experts for any given token, meaning it’s running about 12B parameters at a time. This allows MoE models to have very fast inference.
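The routing described above can be sketched in a few lines of plain Python. This is a toy illustration, not Mixtral’s actual code: the gate is a simple dot product, the "experts" are hypothetical stand-ins that just scale their input, and the names `route_token`, `make_expert` and `gate_weights` are made up for this example.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(token, gate_weights, experts, top_k=2):
    """Send one token vector to its top_k experts and blend their outputs."""
    # Gating network: one score per expert (a plain dot product here).
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Keep only the top_k highest-scoring experts.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalise the chosen experts' scores so their weights sum to 1.
    weights = softmax([scores[i] for i in chosen])
    # Only the chosen experts run; the others are skipped entirely.
    output = [0.0] * len(token)
    for w, i in zip(weights, chosen):
        expert_out = experts[i](token)
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, chosen, weights


def make_expert(scale):
    """Hypothetical stand-in for an expert network: just scales its input."""
    return lambda vec: [scale * v for v in vec]


random.seed(0)
dim, num_experts = 4, 8
experts = [make_expert(i + 1) for i in range(num_experts)]
gate_weights = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

output, chosen, weights = route_token(token, gate_weights, experts, top_k=2)
print("experts used:", chosen)
print("mixing weights:", [round(w, 3) for w in weights])
```

The thing to notice is that 6 of the 8 experts never run for this token, which is where the speed comes from. Changing `top_k` changes how many experts are consulted.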

It is rumoured that GPT-4 is also an MoE model. George Hotz said in an interview that GPT-4 is an MoE model with 8 experts, each comprising 220 billion parameters. If that’s true, it adds up to roughly 1.76 trillion parameters, about 10x GPT-3’s 175B, so the memes about it being 10x the size of GPT-3 were kinda not far off.


Let’s talk about the actual model.

  • It has a context length of 32k tokens

  • It can speak English, French, Italian, German and Spanish

  • It has a total of 45B parameters (GPT-3.5 is believed to have 175B, like GPT-3)

You might be wondering, why is this important? Like yes, it’s open source, but GPT-4 is still better, so what’s in it for me? A few points.

  • I’ll include this for better or worse - the base model is entirely uncensored. You can ask it anything and get it to do anything. Impersonate anyone, give instructions on anything - absolutely anything. You can ask it how to make meth.

  • Remember how I explained that in an MoE model, two experts are consulted to form an output? What if three were consulted instead? Since it’s open source we can test this. Turns out asking three experts might actually be better.

  • We can combine it with multiple other models to make it even better. Is this a fair comparison to GPT-3.5? No. But it doesn’t need to be! We’ll use whatever we can, everything the community has developed together, to beat proprietary models. That’s the beauty of open source.

  • Never, ever, ever worry about your data. You can trust a company with your data, until you can’t. Do any of you use Dropbox? Well, OpenAI might have already seen your data now too. Many people have been automatically opted-in to sharing data with OpenAI for Dropbox’s AI features. With open source, you can run entire applications on your laptop without an internet connection. Your data stays with you and goes nowhere.

These are just a few of the reasons why I’m currently testing a migration of my current projects off OpenAI to Mistral, and in future to other open source models. There’s too much upside and potential. I might not set up the entire infrastructure to run everything locally (very expensive, still considering) but I probably won’t need to, considering how quickly providers are undercutting each other on pricing. It’s genuinely a race to the bottom.

This is the absolute cheapest pricing I’ve seen. Mistral themselves have been undercut by over 70% in a few days…

You might ask - why change to Mistral when the model is only just better than GPT-3.5? What if my use cases all require GPT-4? Well, Mistral is cooking! Their models are tiny, small and medium. Medium? Yep, Mistral-Medium.

It’s good. Much better than 3.5. I’ve been testing it myself and comparing it to GPT-4. I particularly like its writing style; it somehow sounds less robotic than GPT-4. Mind you, I’m not even using a system prompt. The question here is, is it good enough? Good enough for what? For whatever your use case is. For a lot of people, it already might be. If you want to know if it’s good enough for you, email me!

Will they open source it? Unlikely. I mean even the current “open source” isn’t really open source but I won’t open that can of worms right now.

I have no doubt that next year we’ll have at least two open source models as good as the current GPT-4. For most people - that’s good enough. Could I be wrong? Of course. But I’m quite confident about this one tbh. We’re not slowing down anytime soon.

Short Digest

People have already been improving Mixtral.

  • Improvements for longer context tasks [Link]

  • The team themselves rolled out a new version of their instruct model [Link]

Is Mistral using OpenAI outputs to train their models? Yes [Link]. So why can’t we use theirs?

On release, Mistral had in their TOS that you were not allowed to use their models to train or improve other models. They got rid of it pretty fast when called out on it [Link].

There is an entire paragraph in this very newsletter written by Mistral-Medium. The first time I’ve ever used AI to write something for a newsletter. Can you guess which one? Email me your guess!

There’s so much more to talk about. Excited for the next few newsletters. You can subscribe to make sure you get all of my latest newsletters here.


As always, thanks for reading ❤️

Written by a human named Nofil
