How to Choose Your LLM — Scott Harrison

Why this is hard now

With the number of popular large language models expanding into the hundreds of thousands, it becomes difficult to decide which to use. From reliable cloud-based LLMs to inexpensive local open-source models, there are many factors that need to be considered.

A large language model is a prediction machine that takes an input and predicts what words will come next. To increase the accuracy of these models, researchers give the model billions of examples, have it predict the next word, then adjust the model's weights based on how accurate the prediction was. For example, if you gave a model the sentence "the sky is" it could respond with "red." You would then identify that prediction as incorrect, adjust the model weights, and repeat until it is consistently correct. After repeating this process billions of times the model becomes extremely accurate.

OpenAI's GPT-4 model released in 2023 was reportedly trained on 13 trillion tokens, or roughly 10 trillion words.

Tokens and context windows

Tokens, or tokenization, is the algorithmic process of converting text into digestible numbers tied to specific words. In GPT-4 the sentence "Hello world" would be tokenized into the numbers [9906, 1917]. A common metric for LLMs is the "context window," which is the amount of text it can see at once. The original GPT-4 had a context window of 8,192 tokens, while OpenAI's most recent model GPT-5.4 has a context window of 1 million tokens, roughly the same length as the entire Lord of the Rings trilogy.

The rule of thumb is that one token is approximately three quarters of a word in English.

Frontier models

The most well-known models from companies such as OpenAI, Anthropic, Google, and others are considered frontier models. These companies spend hundreds of millions of dollars developing their model's architecture and using extremely expensive data centers to refine model weights. They share some information but for the most part keep their architecture and model weights secret in order to host their models on the cloud and charge for their use.

Bleeding-edge frontier models today include OpenAI GPT-5.4, Anthropic Claude Opus 4.6, Google Gemini 3.1 Pro, and xAI Grok 4.20. These are the models most people are comparing when they talk about the current frontier.

Choosing which cloud model to use

Artificial Analysis comparison chart for leading large language models — Artificial Analysis comparison chart used as one input for evaluating model capability versus price.

Artificial Analysis does a great job of comparing different metrics for various models. The image above shows price on the x-axis and their intelligence index on the y-axis. Models in the green top-left quadrant have a higher efficiency of intelligence to price. However, this chart does not account for specific skills, niche topic training, or the tools each model has access to. It should be one piece of your evaluation, not the only factor you use to determine which model to use.

How to interact with cloud-based models

There are several different ways to use a cloud model, and the best option depends on what you are trying to do.

Chat interface: This is what most people use. You go to chatgpt.com, type into a box, and get a response. It has a really easy-to-use interface with built-in tools such as file uploads, speech to text, and conversation history. Typically there is a free tier with limits and paid options that give you more usage and access to more powerful models.

API (Application Programming Interface): This is a developer data access point. Instead of typing in a chat box, you create a program that sends requests and gets a response back as raw data. This is how companies build AI into their own products. Whenever you talk to a website's customer support chatbot, they pay per token used.

Coding integration: This newer category combines computer operating systems and programming interfaces with LLMs. You can use IDE assistants such as Cursor that live inside your code editor, or terminal tools such as Anthropic's Claude Code. These tools can read files, write code, run tests, and fix code automatically.

AI coding assistant running inside a code editor — IDE assistants like Cursor keep the model inside the editor and expose code, diffs, and feedback loops directly.

Claude Code running in a PowerShell terminal — Terminal agents like Claude Code move the workflow closer to the shell, where the model can inspect files, run commands, and iterate quickly.

Third-party application: Various companies pay for API access to one or more frontier models to build their own product around it. Companies such as Perplexity, Notion, and more.

Local open-source models

The other main category of LLMs is open-source models. There are many people in the world who desire to pay less, have full control over their data, and have the technical ability to run LLMs themselves, so they look to open-source models and as a community further train and develop them.

One of the most famous open-source models is DeepSeek. On January 27, 2025, DeepSeek released its R1 reasoning model that delivered performance comparable to OpenAI's model, claimed to be trained for only $5.6 million compared to the hundreds of millions being spent by US AI researchers. NVIDIA's stock dropped 17% in a single day, which was the largest single-day loss for any company in US stock market history. This destroyed the assumption that models required billions in compute power from hardware, undermining the reasoning for the AI infrastructure super cycle. This same company and others were accused in February 2026 by Anthropic of distillation, a technique to train a model using the outputs of another model instead of using actual data. The USA and China are in an intense trade war over GPUs, with President Donald Trump limiting the export of leading NVIDIA GPUs to China, forcing them to compete with limited hardware resources.

Choosing open-source models

You have to think about computer storage for the LLM itself. In theory the best model will have the most parameters, but it will require lots of storage space.

Smaller models (7B to 13B parameters) require 1 to 8 GB
Mid-range models (30B to 70B parameters) require 20 to 45 GB
Frontier-scale models (100B+ parameters) require hundreds of GB

Hardware requirements for local LLMs

You will need a decent computer in order to run local LLMs at an effective rate. Older, lower-compute setups are able to run smaller LLMs, but they are not as smart, do not have as many features, and have lower token context windows. You want a computer that does not hold you back.

A good external guide for tiers of hardware is Recommended Hardware for Running LLMs Locally.

Leading-edge open-source models currently include DeepSeek V3.2, Zhipu GLM-5, Moonshot AI Kimi K2.5, and Alibaba Qwen 3.5.

What I use right now

Personally, I am a big fan of Anthropic's Claude Opus 4.6. I pay for their Pro subscription and use it the most through the desktop app. I'm looking forward to using their new computer use tool that will enable it to mimic the OpenClaw system. Prior to Claude, I paid for OpenAI's ChatGPT for over a year. I had a good experience and do miss their image generation tool and its creative personality, but I have had such an enjoyable experience with Claude.

I can't afford the Claude API, but I have been using Google Gemini because they give students free credits. I've been using Gemini 3.1 Pro and have learned it does not have the agentic skills like Claude. Gemini is better used as a researcher or deep thinker and does not have the precision that makes me trust Claude so much.

Conclusion

There are new large language models created and improved every day, so it is important to spend the time picking what fits your needs. They all have different prices, personalities, and skills that can exponentially help or hurt your productivity. I challenge you to try out different models for different tasks. I constantly find myself using different models within the same project. Spend some time understanding your project and the capabilities of different models to figure out which is best for you.

Continue reading: OpenRouter Rankings and Artificial Analysis Model Comparison.