The `llm` Command Line Tool Is Quietly Becoming Enterprise AI Infrastructure

Q: What did llm 0.26 add, and why does it matter?

Released on 27 May 2025, llm 0.26 added tool support — the ability to let a model call any capability you can express as a Python function. Tools let a model run real code, query data, or perform calculations instead of guessing at the answer. Mechanically, this is the same loop that powers "AI agents," delivered inside a Unix-style command-line utility rather than a hosted platform.

Q: How do you use tools in llm?

You load a named tool from a plugin with -T (or --tool), or paste ad-hoc Python directly with --functions. The --td / --tools-debug flag shows you exactly which tools the model called and what they returned. In the Python library, the chain() method runs the full loop: it spots the model's tool requests, executes them, and feeds the results back to the model automatically.

Q: Which models and tools does the plugin ecosystem support?

There are two layers. Model-provider plugins teach llm how to talk to a backend; at the 0.26 announcement, seven of these supported tools: OpenAI, Anthropic, Gemini, Mistral, Ollama, llama-server, and GitHub Models. Separately, tool plugins provide the capabilities models can call — early ones included llm-tools-simpleeval (math), llm-tools-quickjs (sandboxed JavaScript), llm-tools-sqlite (read-only SQL), llm-tools-datasette (remote Datasette queries), and llm-tools-exa (web search).

Q: Does llm keep up with new frontier models?

Yes, and its changelog is the evidence. It added o3-mini with a reasoning_effort control in version 0.21 (January 2025), the gpt-5 / gpt-5-mini / gpt-5-nano family in 0.27 (August 2025), the gpt-5.4 family in 0.29 (March 2026), and gpt-5.5 in 0.31 (April 2026). New models slot in as new plugin code rather than requiring a rewrite.

There is a particular kind of software that never trends, never raises a round, and never makes a keynote — and then one day you notice it has become load-bearing. The Unix pipe is like that. So is curl. I think Simon Willison's llm command-line tool is heading for the same fate, and most executives steering AI budgets in 2026 have never heard of it.

That should change. Not because llm is the next big platform — it is the opposite of a platform — but because the way it is designed answers a question every organization is currently paying consultants to answer: how do you wire large language models into real work without chaining yourself to a single vendor?

What `llm` actually is

llm is a command-line program and Python library for talking to large language models from your terminal. You install it, point it at a model, and pipe text in and out the way you would with any other Unix utility. Willison started it in 2023 as a small companion to Datasette, his open-source tool for publishing and exploring data in SQLite. It has grown into one of the most quietly capable pieces of AI tooling in the open-source ecosystem.

The reason it matters to a technical audience is not the basic prompting. It is the architecture underneath: llm treats every model provider through a uniform plugin interface. OpenAI models are built into the core tool; Anthropic, Google Gemini, and a model running locally on your own hardware through Ollama each come as a plugin behind that same interface. Your scripts talk to a stable interface; the plugin handles the vendor-specific mess behind it. Swap the plugin, keep the script.

That is a boring-sounding design decision. It is also exactly the decision most enterprise AI integrations get wrong, and the cost of getting it wrong compounds for years.

The release that changed the project: tools

For its first two years, llm was essentially a very good remote control for text generation. That ceiling broke on 27 May 2025, when Willison shipped llm 0.26. His changelog entry is characteristically understated: "Tool support is finally here!"

Here is why that line is a bigger deal than it sounds. A raw language model can only produce text — it cannot look anything up, run a calculation, or query your database; it can only predict what those answers might be. Tool use fixes this: you describe a capability to the model as a function, and when the model decides it needs that capability, it emits a structured request, your code runs the real function, and the result goes back into the conversation. The model stops guessing and starts calling.

With 0.26, llm gained the ability to grant any tool-capable model access to any tool you can express as a Python function. The ergonomics are deliberately Unix-flavored:

-T / --tool loads a named tool that a plugin has registered.
--functions lets you paste raw Python directly on the command line for one-off tools.
--td / --tools-debug prints exactly what the model called and what came back, so you can see the machinery work.

So a one-liner like this turns a text generator into something that can actually do arithmetic correctly:

llm -T simple_eval 'Calculate 1234 * 4346 / 32414 and square root it' --td

The Python library mirrors the CLI through a chain() method that, in Willison's description, "knows how to spot returned tool call requests, execute them and then prompt the model again with the results" — the full call-execute-respond loop, in one method.

This is the agentic pattern, stripped of the marketing. An "AI agent" is, mechanically, a model in a loop with tools. llm 0.26 put that loop into a command-line utility you can pipe into a shell script — a very different posture from buying an agent platform.

The two-layer plugin ecosystem (most write-ups miss this)

The piece worth understanding precisely is that llm has two distinct kinds of plugin, and conflating them obscures what the project is doing.

Layer one: model-provider plugins. These teach llm how to talk to a given vendor or runtime. By the 0.26 announcement, Willison noted that seven of these had gained tool support: OpenAI, Anthropic, Gemini, Mistral, Ollama, llama-server, and GitHub Models. That spread is the headline. Tool use is not an OpenAI feature you rent — it is a capability that works the same way across frontier APIs and models running on your own machine.

Layer two: tool plugins. These are the capabilities the models get to call. The launch set was small and pointed: llm-tools-simpleeval (safe math), llm-tools-quickjs (a sandboxed JavaScript interpreter), llm-tools-sqlite (read-only SQL against a local database), and llm-tools-datasette (queries against a remote Datasette instance), with llm-tools-exa for web search following close behind.

Read those two lists together and the design intent is obvious. Any of seven model backends can be handed any registered tool. You can run a local Ollama model that queries your SQLite database through a read-only tool, with no data leaving your laptop — or point the identical command at a frontier API when the task is hard enough to justify it. The capability and the model are decoupled. That decoupling is the whole game.

Tracking the frontier in real time

The second thing llm does well is keep pace with the model release cycle without drama, and the changelog reads like a timeline of the last eighteen months of AI:

0.21 (Jan 2025): added o3-mini and a reasoning_effort option — letting you dial how hard a reasoning model thinks.
0.26 (May 2025): tool support.
0.27 (Aug 2025): the gpt-5, gpt-5-mini, and gpt-5-nano family, the same week they were relevant.
0.29 (Mar 2026): gpt-5.4, gpt-5.4-mini, gpt-5.4-nano.
0.31 (Apr 2026): gpt-5.5, plus a verbosity control for GPT-5-class models.

A solo open-source project shipping frontier model support on the same cadence as the labs that release the models is not a small thing. It is the practical proof that the plugin abstraction holds up under pressure — new models slot in as new plugin code, not as a rewrite.

Build it, then write it down

None of this would have the reach it does without the other half of Willison's method: he documents everything, in public, constantly. His weblog has become one of the most reliable practitioner records of the LLM era precisely because he ships a thing and then writes up what he learned — a habit he is open about crediting for his career. As he has put it, "Most of the jobs I've had in my career can be attributed at least partially to my blog." The advice underneath it is almost aggressively simple: do the work, then write about the work.

For a developer, that is a productivity tip. For an executive, it is a signal about how to evaluate tools like this. The changelog, the release notes, and the design rationale are all out in the open. You do not need a vendor briefing to assess whether llm's abstraction fits your stack — you can read the source and the reasoning behind it in an afternoon.

Tom's take: buy capability, rent models

Here is the part worth sitting with.

The expensive mistake I keep seeing in 2026 is organizations hardcoding a single model vendor into the core of their AI workflows, then discovering the switching cost the hard way when prices move, a better model ships, or a compliance requirement forces some traffic onto local hardware. The model layer is the fastest-moving, least-durable part of your stack. Welding your business logic directly to it is like pouring your foundation around a rental car.

llm embodies the opposite instinct, and it is the right one regardless of whether you ever run Willison's tool in production. Treat model providers as interchangeable plugins behind a stable interface. Express your real capabilities — query this database, call this internal API, run this calculation — as tools that any model can invoke. Then the question "which model?" becomes a procurement decision you can revisit every quarter, not an architecture decision you are stuck with for years. When gpt-5.5 shipped, an llm user adopted it by changing a model name. That is the agility you are actually buying when you insist on a clean abstraction.

You do not need a platform team to learn this lesson. You need to internalize that the durable part of an AI system is the boundary between your logic and your models — and that a single developer, building small tools in the open on top of SQLite and a plugin interface, has already drawn that boundary in a way most enterprise roadmaps are still catching up to.

It rarely trends. It just quietly becomes the thing everything else is built on.

Frequently asked questions

What is the llm command-line tool?

It is an open-source command-line program and Python library, created by Simon Willison in 2023, for interacting with large language models from your terminal. It treats each model provider — OpenAI, Anthropic, Google Gemini, local models via Ollama, and others — as a swappable plugin, so the same scripts and commands work across vendors. It originated as a companion to his Datasette project.

What did llm 0.26 add, and why does it matter?

Released on 27 May 2025, llm 0.26 added tool support — the ability to let a model call any capability you can express as a Python function. Tools let a model run real code, query data, or perform calculations instead of guessing at the answer. Mechanically, this is the same loop that powers "AI agents," delivered inside a Unix-style command-line utility rather than a hosted platform.

How do you use tools in llm?

You load a named tool from a plugin with -T (or --tool), or paste ad-hoc Python directly with --functions. The --td / --tools-debug flag shows you exactly which tools the model called and what they returned. In the Python library, the chain() method runs the full loop: it spots the model's tool requests, executes them, and feeds the results back to the model automatically.

Which models and tools does the plugin ecosystem support?

There are two layers. Model-provider plugins teach llm how to talk to a backend; at the 0.26 announcement, seven of these supported tools: OpenAI, Anthropic, Gemini, Mistral, Ollama, llama-server, and GitHub Models. Separately, tool plugins provide the capabilities models can call — early ones included llm-tools-simpleeval (math), llm-tools-quickjs (sandboxed JavaScript), llm-tools-sqlite (read-only SQL), llm-tools-datasette (remote Datasette queries), and llm-tools-exa (web search).

Does llm keep up with new frontier models?

Yes, and its changelog is the evidence. It added o3-mini with a reasoning_effort control in version 0.21 (January 2025), the gpt-5 / gpt-5-mini / gpt-5-nano family in 0.27 (August 2025), the gpt-5.4 family in 0.29 (March 2026), and gpt-5.5 in 0.31 (April 2026). New models slot in as new plugin code rather than requiring a rewrite.

What should an executive take away from this?

The model layer is the fastest-moving, least-durable part of an AI stack, so hardcoding a single vendor into your core workflows is a costly bet. llm's design — model providers as interchangeable plugins, capabilities expressed as portable tools — shows how to keep "which model?" a procurement decision you can revisit, rather than an architecture decision you are stuck with. You can adopt that principle regardless of whether you ever run llm itself.