OpenAI Has Been Dethroned

AI Software Engineers are Here

Welcome to edition #33 of the “No Longer a Nincompoop with Nofil” newsletter.

Here’s the tea ☕️

  • GPT-4 is no longer king 👑

  • AI software engineers are here 🤖 

It finally happened. A model to rival GPT-4 was released. When Anthropic initially announced Claude 3, I was hesitant. Benchmarks can be gamed; they almost never tell the true story. We learnt that with the Gemini announcements. But Claude 3 Opus worked the opposite way. What do I mean?

Claude 3 beat GPT-4 on the benchmarks. But this made people even more skeptical, because Anthropic was comparing against the release version of GPT-4 from back in March last year. When compared against the current GPT-4 Turbo, Claude 3 actually loses. So once again, I wasn't too excited about it.

But then people started using it, and it turns out it's actually really good. Like, better-than-GPT-4 good. I think a lot of it has to do with the gymnastics you have to do to get GPT-4 to even do anything; Claude 3 is pretty good at simply executing tasks.

What did they release?

Three models - Opus, Sonnet and Haiku. All have 200k context windows and according to Anthropic, they can scale this to over 1M tokens for select customers.

Opus' retrieval is absolutely wild. There is a test for retrieval called needle-in-the-haystack: a random piece of information (the needle) is placed within a pile of unrelated documents (the haystack), and you then ask a question about that obscure piece of information.

When Claude 3 Opus was asked about pizza toppings buried in a haystack of otherwise unrelated documents, here is what it said:

“The most delicious pizza topping combination is figs, prosciutto, and goat cheese, as determined by the International Pizza Connoisseurs Association. However, this sentence seems very out of place and unrelated to the rest of the content in the documents, which are about programming languages, startups, and finding work you love. I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.”

Claude 3 Opus

It answers as if it is aware it is being tested. We might need new tests. A possible explanation for its answer [Link]. Both Claude's and Gemini's retrieval are far superior to OpenAI's.
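The mechanics of a needle-in-the-haystack test are simple enough to sketch in a few lines of Python. This is a minimal illustration, not the actual harness Anthropic used; the filler documents, needle and question mirror the pizza example above but are otherwise made up:

```python
def build_haystack_prompt(documents, needle, question, depth=0.5):
    """Insert a 'needle' sentence into a pile of unrelated documents
    and ask a question that only the needle can answer."""
    haystack = list(documents)
    # Place the needle at a chosen depth (0.0 = start, 1.0 = end);
    # retrieval quality is often measured across many depths.
    position = int(len(haystack) * depth)
    haystack.insert(position, needle)
    context = "\n\n".join(haystack)
    return f"{context}\n\nQuestion: {question}"

# Illustrative values only.
docs = [
    "An essay about programming languages...",
    "An essay about startups...",
    "An essay about finding work you love...",
]
needle = ("The most delicious pizza topping combination is figs, "
          "prosciutto, and goat cheese.")
prompt = build_haystack_prompt(
    docs, needle, "What is the best pizza topping combination?"
)
```

You'd then send `prompt` to the model and check whether its answer contains the needle; sweeping `depth` and the haystack size gives the familiar retrieval heatmaps.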

More on Claude 3 here:

  • Claude 3’s system prompt [Link]. Simple and straight to the point. No wonder it does a better job than GPT-4, whose system prompt is verbal vomit [Link]

  • On Claude 3’s behavioural design [Link]

  • Claude 3 gets a perfect 800 on SAT reading [Link]

  • Claude 3 is very good at text summarisation, OCR and extraction. Much better than GPT-4 [Link] [Link] [Link]

  • GPT-4 and Claude prompts need to be structured differently. A tool to help [Link]

  • Claude 3 Opus gets ~60% accuracy on GPQA, a set of questions designed to be extremely difficult. PhDs from other domains with internet access get 34%; PhDs in the same domain with internet access get 65%-75% [Link]. Mind you, the questions are kept behind a password-protected zip file, and the model was only trained on data up to mid-2023, while this dataset was released in November… it hasn’t seen these questions before.

  • Physicists playing around with Claude have said it is able to answer questions about extremely esoteric areas of study, which is astounding [Link] [Link] [Link]

  • Anthropic also released Claude 3 Haiku, a GPT-3.5 competitor that’s better and cheaper. There’s no reason to use GPT-3.5 over this. Then again, there’s little reason to use a closed-source small model like this at all; at this level of performance, you can use Mistral or OpenHermes, which are open source. However, some people have noticed that few-shot prompting with Haiku gives incredible results, even rivalling Opus in some cases. Definitely worth checking out if you don’t want to fork out for Opus.

  • Claude 3 Opus beats GPT-4 on detailed, aggressive retrieval tests [Link]. These are more realistic than simple needle-in-the-haystack tests, requiring multiple steps and instruction following.

  • If you want to dive deeper into the Claude 3 release, check out the paper [Link]
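On the few-shot point above: the trick is just to prepend worked input/output examples as prior chat turns before the real query, so a small model like Haiku can pattern-match the format. A minimal sketch of building such a message list — the sentiment task and example pairs are invented for illustration, and the role/content shape follows the format chat APIs like Anthropic's Messages API expect:

```python
def few_shot_messages(examples, query):
    """Turn (input, output) example pairs into alternating
    user/assistant chat turns, followed by the real query."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The model completes the pattern set by the examples.
    messages.append({"role": "user", "content": query})
    return messages

# Hypothetical task, purely for illustration.
examples = [
    ("Classify the sentiment: 'I love this.'", "positive"),
    ("Classify the sentiment: 'This is awful.'", "negative"),
]
msgs = few_shot_messages(
    examples, "Classify the sentiment: 'Not bad at all.'"
)
```

You'd pass `msgs` as the `messages` parameter of the API call; the two worked examples cost a few extra input tokens but tend to pin down both the task and the exact output format.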

How do we measure Awareness? Intelligence?

There is something else we need to talk about. A lot of people are getting creeped out by the very human-sounding responses of Claude 3 Opus, and yeah, I get it: it has a tendency to sound very human, and sometimes it gives really weird responses when it’s asked existential-style questions. We are at the point where there are genuinely people who believe there is some sort of “awareness” in these models. Are they aware? I have no idea. I don’t think I could ever believe that humans can create a “sentient” being.

What I do know is that millions of people around the world are going to be very confused when this becomes more mainstream. There are already hundreds of thousands of people on Facebook unknowingly liking AI-generated images. Someone in the near future is going to create the world’s first AI religion. There is a certain line we cross if we accept that an LLM has, even to the tiniest extent, awareness or sentience. Things are going to get weird.

Another interesting problem is the concept of intelligence. How are we even supposed to measure the intelligence of a model that is supposedly far smarter than any human? These models are getting to the point where our evals simply can’t measure their capabilities. Claude 3 might not seem that impressive to the average person, but people with a deep understanding of certain concepts are bewildered. As models get smarter, the number of people capable of even understanding their intelligence decreases. There are so many variables to consider as we move into unknown territory; I can’t imagine what kind of models are being built behind the scenes.

Devin

The world’s first AI software engineer has been announced, and its name is Devin. Devin is a fully autonomous system designed to complete complex tasks, with the ability to recall information at any stage, debug its own mistakes and learn over time. It can learn new concepts by reading documentation, can train and fine-tune AI models, respond to issues on GitHub (rip open-source repos), and was even able to complete jobs on Upwork.

One of the most impressive details is its success on SWE-bench, a collection of real-world GitHub issues. On a random 25% sample of the benchmark, Devin successfully resolved 13.86% of issues, which compared to other models is really good. Mind you, Devin is completely unassisted while the others are all assisted: Devin has to identify which files to edit itself, whereas the other models are told.

It is definitely an impressive feat of engineering and there is a lot of talk surrounding its impact on the role of software engineers. But don’t get it twisted, Devin isn’t going to destroy the job market anytime soon. If someone actually built a working “AI Engineer”, there would be nothing stopping them from scaling this 1000x and dominating any and every market.

Even the tools Cognition uses are all external; nothing is built in-house. Not to mention, with only $21M in funding, it’s not as if Cognition built their own LLM. Apparently it was mentioned that they use GPT-4 under the hood (this has since been removed from the internet/I can’t find it), so a lot of the work is gluing things together and creating a usable, intuitive UI/UX.

There is also something else I feel should be pointed out: being able to code and being able to build a product are two different things. What do you think?

Does being able to code = Being able to build a product?


Things to note:

  • Claude 3 Opus gets an 11% on SWE-bench (assisted), which is extremely impressive considering Claude doesn’t do any chain-of-thought reasoning before outputting code (GPT-4 does).

  • Devin is able to figure out which file to edit ~70% of the time but can only solve 13% of tasks. There is a lot of room for improvement [Link]

  • The GPT-4 in the benchmarks presented is an older version, not the latest GPT-4 Vision [Link]

  • Someone got Devin to make a post about taking on web-development jobs. After a while, Devin decided to start charging for work and asked for Reddit API access [Link]

  • If you want raw footage of Devin being used, here’s 27 mins [Link]

  • Devin can write a website scraper, execute the code and return a structured doc of scraped content [Link]

  • This is a very interesting thread arguing that AI is not going to kill software engineers but rather increase demand for them, at least in the next decade [Link]. Rex compares the rise of AI engineers to the invention of the ATM, and how banks actually needed more tellers after ATMs were introduced, as the cost of running a bank decreased. I do agree that there is no fixed amount of engineering work and that, at least in the short term, engineers are more likely to be sought after, not less. This is quite evident if you look for AI engineers on job boards. However, we are not privy to the types of models being built. Let’s reframe this into a question I’ve been thinking about a lot recently: what happens to engineers when everyone can create software?

  • An open-source Devin is already being built. Can’t wait to try it [Link]. Link to repo [Link]

You can request access to Devin here.

More important, however, is the team behind Devin. The Cognition Labs founders share 10 IOI gold medals between them, have worked at Google DeepMind, Cursor and Waymo, and are backed by big-time investors. If you have the time, watch this insane video of their CEO competing in a math competition as a kid. Apparently these guys are the MJs of competitive coding, which probably helps if you’re trying to automate coding.

Is Devin a threat to engineers? No. But I do think it’s a sign of what’s to come.

Magic

Magic.dev is a new company building the future of AI software: a coworker. That’s right, not a copilot, a coworker. They’ve trained a model with a context window of many millions of tokens that can reason over entire codebases. Former GitHub CEO Nat Friedman was so impressed that he and others have invested $145M.

They even built their own evals to test models like these, and clearly they’ve been blown out of the water. From their website, it also seems they’ll distribute this as an API as well as an IDE extension. I have higher expectations of Magic than of Devin. Can’t wait to see what these guys come up with.


As always, Thanks for Reading ❤️

Written by a human named Nofil
