No Longer a Nincompoop
Open Source Code Llama & Falcon Signal a Bright Future

What is HumanEval + The End of Prompt Engineering + GPT-4 is Your Next Co-Founder

Nofil Khan
September 11, 2023

Welcome to edition #24 of the “No Longer a Nincompoop with Nofil” newsletter.

The future of this newsletter

Firstly, thank you all for supporting this newsletter. In terms of economics, simply writing these newsletters, unfortunately, will not pay the bills. So here’s the schedule moving forward:

1 free newsletter every fortnight
1 paid newsletter every week

If you want more research like this, on every important advancement, the craziest new research, tools and insights in AI, subscribe here. It is the most extensively researched AI news on the internet. Here’s a preview.

If you’re not sold after the first month, you can even get a refund, as long as you think it’s not worth the $5. Next one goes out tomorrow.

P.S. I’ve also launched Time x Money, a consulting firm specialising in transforming businesses with AI. If you want to learn how AI can transform your business, or want tools built to do so, feel free to reply to this email or email me at [email protected].

Here’s the tea ☕

Open source will soon be good enough 🔓
How open source metrics are measured 🧐
Prompt engineering is dead 💀
GPT-4 is more innovative than you 📱

Open source surges

As you already know, Meta previously released Llama 2, which, for all intents and purposes, is the best open source model available on the market. When it was released, I spoke to someone who told me how weak its coding abilities were. I told them to wait a few weeks and we’d have a model trained to code. That model has arrived.

There are three versions of Code Llama: the base model, a model specialised for Python and a model specialised for instruction following. Each comes in three sizes: 7B, 13B and 34B.

Out of the box, the 34B model is already better than GPT3.5 at HumanEval pass@1 and is slowly approaching GPT4. What’s staggering here is the size difference. OpenAI hasn’t published the size of GPT3.5, but it is widely assumed to be in the region of GPT-3’s 175B parameters, compared to Code Llama’s 34B. Smaller models with better training datasets go a long way, and there is even speculation that GPT4 could be smaller, not bigger, than 3.5. You might be wondering, what is HumanEval? What is this metric we’re using to judge whether Code Llama is better than GPT4?

Simply put, HumanEval is an evaluation benchmark used to measure the correctness of AI generated code. OpenAI created it back in 2021 and now everyone uses it to test open source models. It consists of 164 hand-written Python problems, each checked with unit tests, and pass@1 is the share of problems the model solves on its first attempt. I won’t go into the technical details, but you can read this thread for more, or you can read the actual research paper.
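For the curious, here is a rough sketch of how a pass@k score can be computed. It uses the unbiased estimator from OpenAI’s 2021 paper that introduced HumanEval: generate n samples per problem, count the c samples that pass the unit tests, and estimate the chance that at least one of k samples would pass. The function names and the example numbers are mine, purely for illustration, and this is not the official evaluation harness.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed every unit test,
    k: how many attempts the model is allowed.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up results for four problems, ten samples each (illustrative only).
example_results = [(10, 10), (10, 5), (10, 0), (10, 1)]
print(benchmark_score(example_results, k=1))
```

With k=1 this reduces to the fraction of samples that pass, averaged over problems, which is why pass@1 is usually read as "how often the first answer is right".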

One of the most fascinating things from this paper that might go unnoticed by many is that Meta has shown that models trained on AI generated data perform better. A lot better. Unnatural Code Llama performs significantly better than the rest and it was trained on prompts created by Meta’s own Llama model. So Meta used an inferior LLM (Llama vs GPT4) to generate data for their coding LLM that is more powerful than GPT3.5 and creeping up on GPT4. I’m not saying AI is going to be training AI anytime soon, but this is very promising for future research and development.

Sorry…

Okay so, I haven’t been entirely honest with you. You see, the beauty of open source is that someone can take one thing and turn it into another, or drastically improve it. Did you really think the internet wouldn’t fine-tune Code Llama to be even better? Phind has already fine-tuned Code Llama to beat GPT4 on HumanEval. You know what’s hilarious? They trained their model for three hours. Three hours…

A simple breakdown for you to understand where we stand.

GPT4 achieved 67% on HumanEval in their official report in March. Meta trained their Unnatural Code Llama model on 15k programming problems and got to 62%. Phind trained their model on 80k problems for a meagre 3 hours and got to 73.8%.

Open source is accelerating at an unprecedented rate.

UAE is all in AI

You might be thinking, this is just for coding. As an overall model, open source is still severely lacking compared to GPT4. Well, not for long. Falcon180B was released just a few days ago and beats both Llama 2 and GPT3.5 on 13 benchmarks. The model was trained on 3.5 trillion tokens, and early discussions and research are showing there is significant room for improvement.

The Falcon model comes out of the UAE and is a testament to how serious the region is about AI. Do I think the model will be the foundation of open source in the future? It’s possible, but I think they have a long way to go. Meta’s Llama models are significantly smaller (70B vs 180B), yet their performance is only slightly lower. It’s much easier to fine-tune, download and play around with smaller, more optimised models. At this point in time, Meta is the king of open source and they aren’t slowing down either.
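To put the size gap in perspective, here is a back-of-the-envelope calculation of how much memory the weights alone take up. It is a minimal sketch that assumes memory is simply parameter count times bytes per parameter. It ignores activations, the KV cache and optimiser state, so real requirements are higher, especially for fine-tuning. The function name and the precision table are my own.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str = "fp16") -> float:
    """Approximate size of the weights in GB (1 GB = 10**9 bytes).

    One billion parameters at one byte each is one GB, so the result
    is just billions of parameters times bytes per parameter.
    """
    return params_billions * BYTES_PER_PARAM[precision]


for name, size in [("Code Llama 34B", 34), ("Llama 2 70B", 70), ("Falcon 180B", 180)]:
    print(f"{name}: {weights_gb(size):.0f} GB in fp16, {weights_gb(size, 'int4'):.0f} GB in int4")
```

By this rough count, a 70B model needs about 140 GB for its weights in fp16, while a 180B model needs about 360 GB, which is the practical reason smaller models are so much easier to download and tinker with.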

New reports suggest Meta is already training the next Llama model (obviously) and they expect it to be several times better than Llama 2. I don’t doubt this. Meta has the talent, infrastructure, money and expertise to make very powerful LLMs.

It’s becoming quite clear that we will very soon have open source models that far surpass the current GPT4. By this time next year, you’ll probably have current state-of-the-art models running on your phone. AI is going to engulf every aspect of our lives. The world is going to change drastically this decade. We are not ready.

RIP prompt engineering

New research shows that LLM-optimised prompts can perform up to 50% better than human-written prompts.

I remember when prompt engineering first started gaining attention on the internet. People were heralding it as the “job of the future” and releasing frameworks, guides and how-to’s that attracted millions of views. It never made sense to me. Not then, and not now. Finally, the research can put it to rest.

Just so we’re on the same page, let’s try to understand what prompt engineering is. You prompt the model, and it delivers an output. If it gives an undesired output, you “engineer” the prompt till it works as intended. Simply put, it’s manipulation of language. You know what else is shockingly good at manipulating language? LLMs! The very model you’re prompting. I hope you didn’t pay for a prompt engineering certificate.
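If you want to picture what "letting the model engineer the prompt" looks like, here is a toy sketch of the loop: propose new prompts based on the best ones so far, score them, keep the winners, repeat. This is a generic illustration, not the method from the research above. Both `toy_propose` and `toy_score` are hypothetical stand-ins I made up so the example runs on its own. In practice the proposer would be a call to an LLM and the scorer would measure accuracy on a set of test questions. The hint phrases are placeholders too, and I’m not claiming they improve real prompts.

```python
from typing import Callable, List, Tuple

Scored = Tuple[float, str]  # (score, prompt)


def optimise_prompt(
    seed: str,
    propose: Callable[[List[Scored]], List[str]],
    score: Callable[[str], float],
    rounds: int = 3,
    keep: int = 2,
) -> Scored:
    """Repeatedly ask `propose` for new prompts based on the best so far."""
    history: List[Scored] = [(score(seed), seed)]
    seen = {seed}
    for _ in range(rounds):
        # Show the proposer only the highest-scoring prompts found so far.
        top = sorted(history, reverse=True)[:keep]
        for candidate in propose(top):
            if candidate not in seen:
                seen.add(candidate)
                history.append((score(candidate), candidate))
    return max(history)


# Hypothetical stand-ins so the sketch runs without any model.
HINTS = ["Think step by step.", "Show your working.", "Answer with a number only."]


def toy_propose(top: List[Scored]) -> List[str]:
    """Pretend proposer: extend each good prompt with one unused hint."""
    return [f"{prompt} {hint}" for _, prompt in top for hint in HINTS if hint not in prompt]


def toy_score(prompt: str) -> float:
    """Pretend scorer: the fraction of hints present in the prompt."""
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


best_score, best_prompt = optimise_prompt("Solve the problem.", toy_propose, toy_score)
print(best_score, best_prompt)
```

The human in this loop only supplies the starting prompt and the way of scoring. Everything in between is search, which is exactly the part a model can do for itself.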

How original are you?

The WSJ published a study comparing the ability of AI and humans to come up with innovative ideas. The results? Of the 40 ideas that made the top 10%, 35 came from GPT-4 and only 5 from humans.

Now I’m not going to sit here and wonder why they had GPT-4 compete with MBAs, but this should come as no surprise. Why? A few thoughts.

What is an idea? What is innovation? What makes an idea innovative? Human beings derive much of our inspiration from our own experiences, our learnings, our mistakes as well as our studies and what we read and hear. Even after all of this, the amount of information we can recall and synthesise new ideas from is but a fraction of what an LLM holds. LLMs contain the thoughts, ideas, emotions and experiences of tens of millions of people. They can derive inspiration from so much more than we ever could.

Humans lost because they were never meant to win. The very concept of idea generation is directly correlated with the knowledge and experience someone has. No human can ever beat an LLM at holding information.

If you ever need thoughts on an idea or anything of the sort, do yourself a favour - try asking GPT-4. You might be surprised how well it works.

Want more?

If you’d like more info on what’s happening in the world of AI, you definitely want to read my premium newsletters. For $5/month you’ll get the most extensively researched AI news on the internet. I’ll even give you a refund, no questions asked, if you don’t think it’s worth it.