Does OpenAI's new model think?
It has been a while! I’ve been head down working and building AI projects (have to pay the bills).
I’ve learnt more about how to build software with AI in the last three months than I did in the many months before. There is a lot I want to share, and I will very soon.
For now, you can try playing with this AI content creator tool I made.
You upload a number of text or markdown files, and it extracts the tone of voice of the author. You can then use that tone of voice to generate different types of content.
I made this tool relatively quickly and it is definitely nowhere near as polished as I’d like it to be. But, it works. We already use it with clients.
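If you're curious what a pipeline like this looks like under the hood, here's a very rough sketch of the general idea (not the tool's actual code). The `call_llm` helper, function names, and prompts are just placeholders for whatever model API you'd plug in.

```python
# Rough sketch of a tone-of-voice pipeline (illustrative only).
from pathlib import Path

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your preferred chat-completion client here.
    raise NotImplementedError

def extract_tone(sample_dir: str) -> str:
    """Read uploaded .txt/.md samples and ask a model to describe the author's tone."""
    samples = "\n\n---\n\n".join(
        p.read_text(encoding="utf-8")
        for p in sorted(Path(sample_dir).iterdir())
        if p.suffix in {".txt", ".md"}
    )
    return call_llm(
        "Describe this author's tone of voice (vocabulary, sentence length, humour, formality) "
        f"based on the following writing samples:\n\n{samples}"
    )

def generate_content(tone_profile: str, brief: str) -> str:
    """Generate a new piece of content that follows the extracted tone profile."""
    return call_llm(
        f"Write the following piece in this tone of voice:\n{tone_profile}\n\nBrief: {brief}"
    )
```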
Feel free to try it out and let me know what you think.
Here’s the tea 🍵
OpenAI’s new o1 model 🤖
How intelligent is AI really? 🤔
What’s been happening?
OpenAI released their new o1 model. I’m sure you’ve heard all about it by now so I won’t go into the details. Actually, there aren’t many details I can go into considering OpenAI doesn’t really tell us much anymore.
We don’t know how big the model is or what kind of techniques they used to make it.
But there are a few things I’d like to discuss.
Reasoning
Although we haven’t got the full release of the o1 model, we have two versions to go by:
o1-preview
o1-mini
The main thing you’ll notice when using either model is the “thinking” it does before it responds. This “thinking” process can last anywhere from a few seconds to several minutes.
This process is, unfortunately, shrouded in mystery. The summary you see in the interface is not the actual “thinking” process the model goes through.
Any attempts to jailbreak the model and get a view of what the “thinking” logs look like result in a permanent ban.
That didn’t stop people from doing it anyway [Link].
OpenAI doesn’t want us to see how the model thinks. Probably because it’s the most “aligned” model they’ve released, not that that’s a good thing.
So, the big question here is, does this “thinking” process help?
It does. But we already knew this.
When a model self-reflects, it tends to start fixing its own mistakes. You can re-create this behaviour with a loop and a system prompt on other models, as sketched below.
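Here’s a minimal sketch of that kind of self-reflection loop. To be clear, this is not how o1 works internally (nobody outside OpenAI knows that); `call_llm` is just a stand-in for any chat model you can call.

```python
# Minimal self-reflection loop (a sketch, not how o1 actually works).

def call_llm(system: str, user: str) -> str:
    # Placeholder: swap in any chat-completion API you have access to.
    raise NotImplementedError

def answer_with_reflection(question: str, rounds: int = 2) -> str:
    """Answer, then repeatedly critique and revise that answer."""
    system = "Think step by step. When asked to review an answer, look for mistakes and fix them."
    answer = call_llm(system, question)
    for _ in range(rounds):
        critique = call_llm(
            system,
            f"Question: {question}\nAnswer: {answer}\nReview this answer and list any mistakes.",
        )
        answer = call_llm(
            system,
            f"Question: {question}\nPrevious answer: {answer}\nCritique: {critique}\n"
            "Write an improved final answer.",
        )
    return answer
```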
What these results also suggest is that o1-mini and o1-preview are simply two better models.
You might wonder: what if we made GPT-4 use the same amount of inference compute?
Would it work as well as the o1?
Meaning, if GPT-4 thought the way o1 did, would it be just as smart?
Okay, so this won’t work, buuuut, what if it just thought for longer?
Would the o1 be even smarter if it spent more time thinking?
More thinking = more intelligence?
Nope.
Dammit.
What’s rather interesting is that the way it’s wrong changes depending on how long it “thinks”.
For lower thinking times, its guesses are more spread out. For higher thinking times, it tends to gravitate to a single wrong answer (7).
Previous studies have shown that when asked for a random number, LLMs gave the number 7 more than any other number. As far as I understand, we don’t really know why this is.
What this suggests is that the model can sometimes “think” for a long time and reinforce the wrong answer to itself.
The key here is to find the sweet spot where the model “thinks” for just long enough to land on the right answer, and no longer. Think for too long and it might talk itself out of the right answer; too short and it might never get there.
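If you wanted to hunt for that sweet spot yourself, the rough recipe is a sweep: run a small eval set at different “thinking” budgets and see where accuracy peaks. The `ask_with_budget` knob below is hypothetical, since (as of writing) the o1 API doesn’t give you a direct way to cap its thinking, but it shows the shape of the experiment.

```python
# Sketch of sweeping a "thinking budget" to find the accuracy sweet spot (hypothetical knob).

def ask_with_budget(question: str, budget: int) -> str:
    # Hypothetical: a call that caps how many "thinking" tokens the model may use.
    raise NotImplementedError

def accuracy_at_budget(eval_set: list[tuple[str, str]], budget: int) -> float:
    """Fraction of (question, gold answer) pairs answered correctly at a given budget."""
    correct = sum(ask_with_budget(q, budget).strip() == gold for q, gold in eval_set)
    return correct / len(eval_set)

def find_sweet_spot(eval_set, budgets=(256, 512, 1024, 2048, 4096)):
    """Sweep the budgets and return the best one, plus the full score table."""
    scores = {b: accuracy_at_budget(eval_set, b) for b in budgets}
    return max(scores, key=scores.get), scores
```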
Also, from people’s experience using them, although both models are reportedly the same size, it’s becoming clear that o1-mini is a much better model for STEM tasks.
This makes sense considering there is a lot more STEM data in the o1-mini’s dataset.
So, OpenAI has developed a method that lets the AI self-reflect and think about its response before answering, reducing the mistakes it makes.
But, how impressive is this really?
Is it really better at reasoning?
Not really.
Testing has shown that the o1 is better than GPT-4o in domains that are more common on the internet. When faced with problems that are less commonly found online, it makes the same mistakes as 4o.
For a model to be considered “more intelligent”, we’d have to see it display such attributes across all domains. The suggestion here is that the o1 might simply be better at reinforcing what it already knows.
But, there’s a bigger question to be asked here.
How smart is o1 really?
As well as writing this newsletter, I’m also the Co-Founder of Avicenna, an AI + Brand consultancy building at the forefront of AI.
Intelligence
With the new “thinking” process, the o1 model is definitely less prone to mistakes. There is no doubt about this.
But the question is -
Is this because of actual reasoning being done by the model?
Or is it because it has more time to find connections in its training data?
The point here is: are models getting so good that they’re actually thinking?
There is a lot of debate on this topic. Lots of research papers and blogs and talks arguing both sides.
What is intelligence?
What is reasoning?
I’m not here to argue for either side, but I will leave you with a very interesting paper that I think sums up the situation quite well.
Researchers wanted a way to see if models were actually reasoning or if they were simply sophisticated pattern matchers.
They tested a number of models on math word problems, and the o1 models did really well, with scores of up to 95%.
Researchers then added random words to the same questions and tested the models again.
A human who could answer the questions initially could still answer them afterwards, because the added words have no impact on the meaning of the question.
How did the AI models do?
For o1-preview and o1-mini, performance dropped by 17.5% and 29% respectively.
We’re not quite there, yet.
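To make the setup concrete, here’s a toy version of that kind of test. The question below is a simplified riff on the paper’s kiwi example, and `ask_model` is just a placeholder, but the idea is the same: splice in a clause that changes nothing and see if the answer survives.

```python
# Toy version of the "add irrelevant details" robustness test (illustrative only).

def ask_model(question: str) -> str:
    # Placeholder for whichever model you want to probe.
    raise NotImplementedError

base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does he have?")
perturbed = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday, "
             "but five of them were a bit smaller than average. "
             "How many kiwis does he have?")  # the extra clause changes nothing: 44 + 58 = 102

for name, question in [("base", base), ("perturbed", perturbed)]:
    answer = ask_model(question)
    print(name, "correct" if "102" in answer else f"got: {answer}")
```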
There are some caveats to this research paper that need to be addressed though.
The authors make basically zero effort to provide any kind of definition of “reasoning”. How can we know models don’t reason if we don’t even know what reasoning is?
Furthermore, looking at the “random words” added, some don’t seem so arbitrary.
Are smaller kiwis counted as kiwis? I could see how this could possibly confuse even a human.
There are a lot of semantic arguments that could be made here, but I’ll leave it at that for now.
You can read more about the research paper in this thread [Link]. Here is the paper [Link]. This is also a very interesting review of the paper [Link].
Conclusion
The final question to be asked is:
Why would someone use o1 over Claude 3.5 Sonnet?
Sonnet is much cheaper, and much faster.
o1 might be marginally better in terms of reasoning, but most people aren’t asking PhD-level questions all the time.
Also, Anthropic just released an update to Sonnet meaning the latest model is called Claude 3.5 Sonnet (New)… They need to fire whoever is naming these models.
Point being, with the latest update, at least on the benchmarks, the model is unbelievably good.
For coding, Claude 3.5 Sonnet is the best model on the planet. Just look at how much better it is at Minecraft [Link].
In all seriousness though, OpenAI themselves tested the o1 and other models on agentic tasks.
The o1 did not do well at all. In fact, OpenAI can’t even beat their own older models…
I think ultra powerful models like the o1 will be used by people in very specific domains.
PhDs working on very deep and sophisticated problems will get more use out of o1-type models.
Terence Tao calls the o1 “a mediocre, but not completely incompetent, graduate student” (which is actually pretty good?). It’s very interesting to read what one of the world’s best in a domain like math thinks about AI.
More on the o1:
The o1 is the first AI model to beat PhD-level scholars on GPQA, one of the hardest question benchmarks around. Scholars scored ~70% and o1 scored 78%, with 93% in physics [Link]
Various threads on how the o1 works (technical) [Link] [Link]
Some cancer researchers say that the o1 is comparable to an outstanding PhD student in biomedical sciences [Link]
Remember when OpenAI had the board coup because they thought releasing more powerful AI was going to harm humanity? The coup happened because of an early preview of the o1 model [Link]. The board thought it was so dangerous that they tried to oust Altman. In retrospect, it looks silly considering the o1 is out in the wild and nothing has happened. People said the same about GPT-4 and 4o, and nothing happened then either.
There is a lot more I’ll be sharing over the next few months.
Stay tuned.
As always, Thanks for Reading ❤️
Written by a human named Nofil