Comparing Llama and GPT: Open-Source Versus Closed-Source AI Development


Patryk Szczygło

Updated Nov 22, 2024 • 14 min read

As it stands, GPT-4 is the king of general-purpose large language models. But for building specialized LLM-based products, Llama 2 might prove superior, matching or even beating GPT-4's factual accuracy on narrow tasks at a fraction of the cost.

In the introductory paper for Llama 2, Meta itself admits that these models “still lag behind other models like GPT-4.”

The tricky thing is, it's hard to say exactly why GPT-4 dominates. No one outside of OpenAI knows the details of how it's built, because it's a closed-source model. The same caveat applies when comparing Llama 2 with GPT-3.5: each has distinct strengths suited to different AI projects.

This is where Meta's Llama family differs the most from OpenAI's GPT. Meta releases their models as open source, or at least kind of open source, while GPTs are closed. This difference in openness significantly impacts how you work with and build products upon each.

Understanding these distinctions is crucial for organizations aiming to use their data effectively with AI tools. By examining the fundamental differences between these models, companies can make informed decisions that align with their strategic goals.

Introduction to Large Language Models

Large language models (LLMs) are a type of artificial intelligence (AI) designed to process and understand human language. These models are trained on vast amounts of text data, allowing them to learn patterns and relationships within language. LLMs have numerous applications, including natural language processing, language translation, and text generation. By leveraging the power of large language models, businesses can enhance customer service, automate content creation, and improve language translation services. The ability of these models to understand and generate human-like text makes them invaluable tools in various industries.

Definition and Importance of Large Language Models

A large language model is a type of AI model that is trained on a massive dataset of text to generate human-like language. These models are important because they enable computers to understand and generate human language, which has numerous applications in fields such as customer service, language translation, and content creation. By learning from extensive text data, large language models can perform complex tasks like summarization, advanced reasoning, and natural language generation. Their ability to handle multiple languages and understand context makes them essential for modern AI applications.

Brief Overview of Llama and GPT

Llama and GPT are two popular large language models developed by Meta and OpenAI, respectively. Llama is an open-source model, while GPT is a proprietary model. Both models have been trained on vast amounts of text data and have demonstrated impressive capabilities in natural language understanding and generation. Llama’s open-source nature allows for greater customization and flexibility, making it a preferred choice for developers looking to fine-tune models for specific tasks. On the other hand, GPT models, particularly GPT-4, are known for their advanced reasoning and ability to handle complex tasks, albeit with more restrictive usage terms.

Key similarities and differences: GPT 4 vs Llama 2

Similarities of Llama and GPT models

  • Both are Large Language Models (LLM) based on the Transformer architecture.

  • They work in tokens: numbers that represent words or chunks of text. The data they were trained on is also tokenized. Llama 2 was trained on 2 trillion tokens; speculation puts GPT-4's training set at around 13 trillion. This training enables them to predict the next token.

  • We don't know exactly what data these models were trained on; the information isn't public.

  • Their performance depends on their number of parameters, also called weights. The smallest Llama 2 has 7 billion, considered the smallest size at which a model can do useful things; the largest has 70 billion. The precise number of GPT-4 parameters is unknown, but speculation puts it between 1 and 2 trillion.

  • Another shared trait is the context window, which determines how much of your input the model can take in at one time. It's 4,096 tokens for base Llama 2 and 8,192 for base GPT-4. However, Llama 2 can be extended to 32,000 tokens, and GPT-4 also has a 32,768-token version (which isn't publicly available yet). The context window significantly impacts a model's ability to handle complex queries and provide context-aware responses.
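To make those parameter counts less abstract, here is a rough back-of-the-envelope estimate of Llama 2 7B's size, based on its published architecture (hidden size 4096, 32 layers, 11008-dimensional feed-forward, 32000-token vocabulary). It's a simplified sketch that ignores minor terms such as normalization weights:

```python
def estimate_llama_params(vocab=32000, hidden=4096, layers=32, ffn=11008):
    """Rough parameter count for a Llama-style transformer.

    Ignores small contributions (RMSNorm weights; rotary position
    embeddings have no parameters), so it slightly undercounts.
    """
    embeddings = vocab * hidden      # input token embedding table
    lm_head = hidden * vocab         # output projection back to the vocabulary
    attention = 4 * hidden * hidden  # Q, K, V, O projection matrices
    mlp = 3 * hidden * ffn           # gate, up, and down projections (SwiGLU)
    per_layer = attention + mlp
    return embeddings + lm_head + layers * per_layer

total = estimate_llama_params()
print(f"{total / 1e9:.2f}B parameters")  # ≈ 6.74B
```

The estimate lands at roughly 6.74 billion, which is why the model is marketed as "7B".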

Difference #1 - Llama is an open source model, GPT is proprietary

OpenAI's research used to be available to all, but the increasing power of their GPT models convinced them to shut the world out. When asked why, the company's co-founder and Chief Scientist Ilya Sutskever said:

“We were wrong. Flat out, we were wrong. If you believe, as we do, that at some point, AI — AGI — is going to be extremely, unbelievably potent, then it just does not make sense to open-source. It is a bad idea... I fully expect that in a few years it's going to be completely obvious to everyone that open-sourcing AI is just not wise.”

Mark Zuckerberg is of the opposite opinion:

“Open source drives innovation because it enables many more developers to build with new technology [...] It also improves safety and security because when software is open, more people can scrutinize it to identify and fix potential issues.”

But there's a small issue here. While closer to being open than GPT-4, Llama 2 isn't open source to the full extent. One researcher who analyzed how open Meta's models are stated:

“Meta using the term ‘open source' for this is positively misleading: There is no source to be seen, the training data is entirely undocumented, and beyond the glossy charts the technical documentation is really rather poor.”

Luckily, unless your plan was to recreate Llama 2 in its entirety, it's not really a problem for you. You still get benefits that OpenAI doesn't provide:

  • The ability to download the model, interact with it directly, and host it wherever you want,

  • Access to weights and no extra payment for the option to fine-tune the model.

Weights determine the output of an LLM. Having access to them is helpful both from a research perspective, and when you're building a product and want to fine-tune them to provide a different output than the base model.

Difference #2 - Llama is customizable, GPT is convenient

Llama 2 is the first reliable model that is free to use for commercial purposes (with some limitations, for example if your app exceeds 700 million monthly active users).

To start working with it, you need to fill out a form. After being approved, you can choose and download a model from Hugging Face. With a strong enough computer, you should be able to run the smallest version of Llama 2 locally.

As for the bigger ones, you’ll need access to machines built with AI in mind, the most convenient way being cloud services like Amazon SageMaker.

To customize Llama 2, you can fine-tune it for free – well, kind of for free, because fine-tuning can be difficult, costly, and compute-hungry, particularly if you want to do full-parameter fine-tuning of large models.

One caveat on model size: larger models like Llama 2 70B and GPT-4 excel in summarization tasks with high factual accuracy, whereas smaller models often struggle due to issues like ordering bias and weaker performance in specialized contexts.

If that’s not the case, there are ways to fine-tune Llama models on a single GPU, or platforms like Gradient that automate this for you.
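Those single-GPU approaches typically rely on parameter-efficient methods such as LoRA, which freezes the base weights and trains only a small low-rank update. Real implementations live in libraries like Hugging Face PEFT; the function below just illustrates the arithmetic behind why this is so much cheaper:

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable values in a LoRA update W + B @ A, where
    A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * d_in + d_out * rank

# One 4096x4096 attention projection from a Llama-2-7B-sized model:
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {lora / full:.2%}")
```

With rank 8, you train about 65 thousand values per matrix instead of almost 17 million, under 0.4% of the original, which is what makes fine-tuning on a single GPU feasible.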

Fine-tuning can produce fascinating results. This is how you can get a model that outperforms GPT-4 at a specific niche task, for example SQL generation or functional representation.

Code Llama is a good example. It’s a specialized Llama 2 model additionally trained on 500 billion tokens of code data. With some additional fine-tuning, it was able to beat GPT-4 in the HumanEval programming benchmark.

When it comes to working with OpenAI's models, you need to get an OpenAI API key and be prepared to pay every month for the tokens you've used. Using their models is more restrictive:

  • You can’t download them or host them yourself, but on the plus side it means you don’t need to worry about where and how it’s hosted.

  • You can’t fine-tune GPT-4 yet, only GPT-3.5 and a couple of other models for now.

  • Pricing is set per 1,000 tokens (~750 words); for GPT-4 with an 8K context window, it's currently $0.03 per 1K tokens of input and $0.06 per 1K tokens of output.
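At those rates, estimating a monthly bill is simple arithmetic. The prices below are the article's quoted GPT-4 8K rates; the usage volume is a made-up example:

```python
def gpt4_8k_cost(input_tokens, output_tokens,
                 in_rate=0.03, out_rate=0.06):
    """Monthly API cost in USD at per-1K-token rates."""
    return input_tokens / 1000 * in_rate + output_tokens / 1000 * out_rate

# e.g. 1M input tokens (~750K words) and 500K output tokens in a month:
print(f"${gpt4_8k_cost(1_000_000, 500_000):.2f}")  # $60.00
```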

How does this translate to the costs of building a product on Llama 2 versus GPT-4?

As one experiment shows, if you need a model for summarizing text:

  • You'll pay about 18 times more to use GPT-4 than the biggest Llama 2 to achieve similar performance.

  • Llama 2 70B will cost 10% more than GPT-3.5, but the performance difference is worth the extra 10%.

Depending on the use case, it might turn out that you don’t even need the 70 billion-parameter Llama 2, and that 13 or 7 billion will suffice. If so, your expenses will drop even more:

  • With more parameters, the model can process, learn from, and generate more data – but also requires more computational and memory resources, i.e. it’s more expensive to run.

  • It’s also more expensive to fine-tune a model with more parameters, or retrain it with recent data.
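A quick rule of thumb for the "more expensive to run" point: the memory needed just to hold the weights is roughly parameter count times bytes per parameter, which depends on numeric precision. This sketch ignores activation memory and the KV cache, which add on top:

```python
def weight_memory_gb(params_billion, bytes_per_param=2.0):
    """Approximate GPU memory (GB) for model weights alone.
    fp16/bf16 = 2 bytes per parameter, int8 = 1, 4-bit ~ 0.5."""
    return params_billion * bytes_per_param

print(weight_memory_gb(7))        # 14.0 GB: Llama 2 7B in fp16
print(weight_memory_gb(70, 0.5))  # 35.0 GB: Llama 2 70B quantized to 4-bit
```

This is why the 7B model fits on a single consumer GPU while the 70B model needs data-center hardware or aggressive quantization.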

Training Data for Large Language Models

Training data plays a crucial role in the development of large language models. The quality and quantity of the training data can significantly impact the model’s capabilities and performance. Large language models rely on diverse and extensive datasets to learn the intricacies of human language. The better the training data, the more accurate and reliable the model’s outputs will be. This is why sourcing high-quality training data from various domains is essential for building effective language models.

The Role of Training Data in AI Development

Training data is used to teach AI models to recognize patterns and relationships within language. The data is typically sourced from various places, including books, articles, and websites. The quality of the training data is critical, as it can affect the model’s ability to understand and generate human language accurately. High-quality training data ensures that the model can perform tasks such as natural language processing, language translation, and text generation with high accuracy. Inadequate or biased training data can lead to severe ordering bias issues and reduce the model’s effectiveness in real-world applications.

Benchmarks and comparisons of Llama and GPT models

Remember that benchmarks are tricky. Task complexity plays a crucial role in evaluating the capabilities of language models, especially in how well they manage intricate tasks. A great result on a benchmark doesn't necessarily mean the model will perform better for your use case. Plus, with the many versions of each model out there, apples-to-apples comparisons are hard. Take these results with a grain of salt.

HumanEval

A carefully curated set of 164 programming challenges created by OpenAI to evaluate code generation models.

  • GPT-4: 67.0% (or as much as 91% in a new study that added a new reinforcement learning method to it)

  • Llama 2: 29.9%, however a fine-tuned Code Llama achieved 73.8%
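HumanEval scores a model by actually executing its generated code against unit tests and counting pass/fail. Here is a toy version of that check; the `generated` string below is a hypothetical model completion, not real model output:

```python
def passes_tests(candidate_src, entry_point, tests):
    """Execute model-generated source and run input/output checks,
    mimicking how HumanEval scores a completion as pass or fail."""
    namespace = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        fn = namespace[entry_point]
        for args, expected in tests:
            if fn(*args) != expected:
                return False
        return True
    except Exception:
        return False                     # crashes count as failures

# A hypothetical completion for a HumanEval-style task:
generated = "def add(a, b):\n    return a + b\n"
print(passes_tests(generated, "add", [((1, 2), 3), ((0, 0), 0)]))  # True
```

The real benchmark aggregates this over 164 problems and multiple samples per problem (the pass@k metric).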

MMLU

This challenge consists of 57 general-knowledge tasks covering elementary mathematics, US history, computer science, law, and more. It tests world knowledge and problem solving.

  • GPT-4: 86.4%

  • GPT-3.5: 70%

  • Llama 2 70B: 68.9%

LegalBench

Here, the challenge is all about legal reasoning tasks, based on a dataset prepared with law practitioners. Below are averaged scores from 5 different tasks.

  • GPT-4: 77.32%

  • GPT-3.5: 64.9%

  • Llama 2 13B: 50.6%

HellaSwag

HellaSwag evaluates the common sense of models with questions that are trivial for humans.

  • GPT-4: 95.3%

  • Llama 2 70B: 85.3%

AgentBench

A unique benchmark that evaluates LLMs as autonomous agents across different environments, such as an operating system, a database, a knowledge graph, or a digital card game. The numbers represent an overall score: a weighted average across all environments.

  • GPT-4: 4.41

  • GPT-3.5 Turbo: 2.55

  • Llama 2 13B: 0.55

Winogrande

Here, the models have to tackle 44,000 common-sense problems.

  • GPT-4: 87.5%

  • Llama 2 70B: 80.2%

Just scratching the surface

The Llama and GPT families of models represent the two sides of the AI development coin – open source and closed.

Both are top of their class, but they're far from the only two alternatives you have to choose from. In this article, I mainly wanted to use these models to explain the differences between open and closed AI development.

Hopefully it has helped you decide which approach is better for you. If you're still not sure, we can provide additional guidance.
