Agent, schmagent (but tools are cool)
Thomas Ptacek says everybody Should Write An Agent, because It's Incredibly Easy.
You can indeed have your very own agent in a matter of a few dozen (pretty-printed!) lines of Python plus API access to a tool-call-capable LLM, per Simon Willison’s definition: “an LLM agent runs tools in a loop to achieve a goal”.
Thomas illustrates this with a chat client for the OpenAI Responses API that lets GPT-5 get local ping times for a target, if that helps it reply to the user.
I built an agent too. I swapped OpenAI’s API and SDK out in favour of Ollama’s, in front of Qwen3 models. Now I, too, have a chat client that can run code on my computer and respond autonomously when an LLM needs some ping data. I’m a modern Prometheus!
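For concreteness, here’s roughly what that looks like: a minimal sketch assuming the `ollama` Python package and its attribute-style responses (the field names for the tool-result message vary a little between client versions, and the helper and schema below are illustrative rather than my exact code).

```python
import subprocess

import ollama


def run_ping(host: str) -> str:
    """Ping a host four times and return the raw output."""
    result = subprocess.run(
        ["ping", "-c", "4", host],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout or result.stderr


# JSON schema describing the tool, in the OpenAI-style format Ollama accepts.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "ping",
        "description": "Measure round-trip times to a host with the system ping utility.",
        "parameters": {
            "type": "object",
            "properties": {
                "host": {"type": "string", "description": "Hostname or IP address to ping"},
            },
            "required": ["host"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You can run tools to check network connectivity."},
    {"role": "user", "content": "how's my connectivity to google?"},
]

# The loop: send the conversation, run any tool calls the model makes, feed the
# results back, and repeat until the model answers in natural language instead.
while True:
    response = ollama.chat(model="qwen3:1.7b", messages=messages, tools=TOOLS)
    messages.append(response.message)
    if not response.message.tool_calls:
        print(response.message.content)
        break
    for call in response.message.tool_calls:
        if call.function.name == "ping":
            output = run_ping(call.function.arguments["host"])
        else:
            output = f"unknown tool: {call.function.name}"
        messages.append({"role": "tool", "name": call.function.name, "content": output})
```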
It was simple to implement. But now I had a lot of questions. What’s happening on the other side of the API boundary, to make it so simple on this side of it? How does a model “make a tool call”? How do I design the best ping tool?
There was a lot of scenery in between me and a proper appreciation of how incredibly easy it was. I’m writing this to share the scenery.
Wait: are tools in a loop really agentic?
That question is backwards, isn’t it? It’s chasing AI buzzwords. But I’m suitably impressed by the fact that my little Python script, talking to qwen3:1.7b on my 16GB Mac, can answer “how’s my connectivity to google?” without asking me to ping google myself.
So looking at it frontwards, agent seems like a perfectly serviceable word for this—this being the combination of the Python client, the API and backend, and the LLM. And tool use is the key.
…did you notice that we haven’t defined “tool”?
Tool use is an API feature
Here’s what I know after making a minimal agent with Ollama and Qwen3.
- I put into an API request payload: JSON schemas for my custom functions, a system prompt, and a list of messages.
- I get back a message with separate fields for natural language and tool calls.
- And the tool calls, if they come, always come in the same format and refer to one of the functions I described in a schema.
So in this setting, we’re talking about an API feature: a universal adapter to allow API customers to plumb in custom functions to their LLM clients for the model to invoke.
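Concretely, the exchange with Ollama’s /api/chat endpoint looks roughly like this on the wire (the shapes follow Ollama’s chat API; the values are illustrative):

```python
import requests

# A chat request carrying a tool schema, a system prompt, and the messages so far.
request_body = {
    "model": "qwen3:1.7b",
    "stream": False,
    "messages": [
        {"role": "system", "content": "You can run tools to check network connectivity."},
        {"role": "user", "content": "how's my connectivity to google?"},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "ping",
            "description": "Measure round-trip times to a host.",
            "parameters": {
                "type": "object",
                "properties": {"host": {"type": "string"}},
                "required": ["host"],
            },
        },
    }],
}

reply = requests.post("http://localhost:11434/api/chat", json=request_body).json()

# The reply keeps natural language and tool calls in separate fields, roughly:
# reply["message"] == {
#     "role": "assistant",
#     "content": "",                       # the natural-language part, empty here
#     "tool_calls": [{"function": {"name": "ping",
#                                  "arguments": {"host": "google.com"}}}],
# }
```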
OpenAI announced “function calling”, the API feature, in June 2023. It was enough of a game changer that Anthropic had a beta version of its own (“tool use”) to go along with Claude 2.1 by November.
It’s all a bit shadows-on-the-cave-wall. I may need to know how this works if I want to do something more sophisticated than ping.
Apropos that: if you’ve been putting it off, I highly recommend taking the time to watch Andrej Karpathy’s primer, Deep Dive into LLMs like ChatGPT. It doesn’t demand deep concentration, and just doing the first half at 1.5x speed can save hours of wading through hype and slop.
Tool use is baked into model weights
Behind the API is a model that knows how, given a string of tokens—the context—that contains schemas, messages, and a system prompt, to emit—when it makes sense to!—a string of tokens containing a list of tool_calls that the backend can isolate and pass to my client. Which the client can, in turn, recognise as valid function calls.
A model gets good at following this pattern through training—specifically, “supervised fine-tuning” (SFT)—on full examples of that behaviour.
It’s similar to how a base model—a “token-level internet document simulator”—can learn to role-play consistently as an assistant having a conversation with us, but with more rigorous constraints on the output structure.
The point is that if I’m using a tool-calling API feature, the ability to participate in this API feature is baked into the model weights. If I control the system end-to-end, I can fine-tune for different structured output, given different contextual clues.
For example, if my agent will have occasion to get information from the web a lot, I can consider fine-tuning for a specialised contract that kicks off some web-search machinery directly, rather than squeezing it through the generic tool-use interface first.
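Purely as a sketch of what that could mean, one supervised fine-tuning record for such a contract might look like the example below. The <web_search> tag, the roles, and the overall format are inventions for illustration, not anything Qwen or Ollama actually use.

```python
# One hypothetical SFT record teaching a bespoke web-search contract: the
# assistant emits a <web_search> block directly instead of a generic tool call.
# Everything about this format is invented for illustration.
sft_example = {
    "messages": [
        {"role": "user", "content": "Who won the 2022 World Cup final?"},
        {"role": "assistant", "content": "<web_search>2022 World Cup final winner</web_search>"},
        {"role": "search_results", "content": "Argentina beat France on penalties after a 3-3 draw..."},
        {"role": "assistant", "content": "Argentina won, beating France on penalties after a 3-3 draw."},
    ],
}
```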
In the spirit of the tools-in-a-loop criterion for…agency? agenticness?, that would be a tool, too.
Tools are code an LLM can ask to run
LLMs are good at mimicking syntax conventions. Tools are code that an LLM can invoke by emitting structured output with a pre-agreed syntax.
Tools let us use something LLMs are good at to stop using them for things they suck at.
Clarifying the context
I still couldn’t picture how the context gets assembled from all the stuff in the API payload in order to elicit this tool-calling behaviour.
We can’t look at how OpenAI’s backend composes this final string, but we can see how Ollama does it: it processes the API request through a template (here’s the one for qwen3:1.7b) based on what Ollama contributors know about what the model expects—e.g. from docs.
The Qwen3 template uses everything to add structure: JSON, Markdown headings, and XML tags, plus dedicated tokens (e.g. <|im_start|> and <|im_end|>) that encode the conversation format unambiguously.
If the user supplies schemas in a tools field, they’re included as part of the system prompt, along with instructions for their use, including the format to use for a function call:
```
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
```
The model reliably follows this pattern, so the backend can package the tool_calls into a list, inside a message, in an API response.
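Assembled, the rendered context the model actually sees looks roughly like this (abridged, and lightly paraphrased from the template):

```
<|im_start|>system
You can run tools to check network connectivity.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "ping", "description": "...", "parameters": {...}}}
</tools>

For each function call, return a json object with function name and arguments
within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call><|im_end|>
<|im_start|>user
how's my connectivity to google?<|im_end|>
<|im_start|>assistant
```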
Unix utilities are not tool-shaped
A cool ping demo tool notwithstanding, the Platonic ideal of an LLM tool call doesn’t map 1:1 to the Platonic ideal of a Unix shell utility. Nobody told me it did! I bring it up because I might have missed the mismatch if I hadn’t tried to give my agent a ps tool.
In contrast to utilities running in a Unix shell, LLM tools (at least the implementation we’re talking about here) don’t compose. The model can call multiple tools in one turn, but they don’t feed into one another. Each tool does its own thing, and then we can tell the model about everything that happened.
ps is confident that it’s not alone: that, as a human trying not to drown in hundreds of lines of ps aux output, I can pipe it into head or grep or awk. If I want my ps LLM tool to return the first 10 lines of ps output, I need to build truncation into it—there’s no pipeline operator to connect it to a head tool. And with that, my tool ceases to be ps-shaped.
That’s all I’m saying. Good LLM tools are analogous to the whole shell command when you hit Enter, whether that’s a single Unix utility or a pipeline, and whether you use existing binaries under the hood or write all new code.
But are they good inside tools?
Too many variables! What I know is that LLMs are really good at them. It’s enough to tell a model, “here’s a tool that runs BSD ps with the args you provide” for the model to (a) know when to call that tool, and (b) rummage around in its weights and pull out ps -o pid,pcpu,etime,user,command -r fully formed.
I gave my ps tool a single parameter that’s an array of args, and let the model freestyle them all. My agent can accept a function call with name='ps', arguments={'args': ['-o', 'pid,pcpu,etime,user,command', '-r']}. The model has to work backward from the complete command to fit it into my tool, but even the 600-million-parameter Qwen3 model does that part fine.
I like this a lot! I don’t have to clutter my context with a lot of individual ps options. They’re already trained-in. I bet I could add an option to truncate the output, without losing the advantage of LLMs being able to spit out whole Unixy incantations.
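Roughly, that schema looks like the sketch below, with a hypothetical max_lines option bolted on for the truncation idea (the “I bet I could” part, not something the tool has):

```python
import subprocess

# Sketch of the ps tool: a single args-array parameter the model can freestyle,
# plus a hypothetical max_lines option for truncating the output.
PS_TOOL = {
    "type": "function",
    "function": {
        "name": "ps",
        "description": "Run BSD ps with the given arguments and return its output.",
        "parameters": {
            "type": "object",
            "properties": {
                "args": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Arguments to pass to ps, e.g. ['-o', 'pid,pcpu,command', '-r']",
                },
                "max_lines": {
                    "type": "integer",
                    "description": "Return at most this many lines of output",
                },
            },
            "required": ["args"],
        },
    },
}


def run_ps(args: list[str], max_lines: int = 10) -> str:
    result = subprocess.run(["ps", *args], capture_output=True, text=True, timeout=10)
    return "\n".join(result.stdout.splitlines()[:max_lines])
```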
But maybe it’s actually more efficient if every tool is semantically named, like get_topk_processes, with, in turn, well-defined and semantic options. (Built with Unix binaries, or not.)
And that’s not to mention the great-power/great-responsibility option: making the tool a shell and letting the model straight-up propose any pipeline it wants.
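For contrast, sketches of those two other points on the spectrum; the parameters for get_topk_processes are guesses rather than a spec, and run_shell is exactly as dangerous as it sounds:

```python
import subprocess

# A semantically named tool with well-defined options (parameter names are guesses).
GET_TOPK_PROCESSES = {
    "type": "function",
    "function": {
        "name": "get_topk_processes",
        "description": "List the k processes currently using the most CPU or memory.",
        "parameters": {
            "type": "object",
            "properties": {
                "k": {"type": "integer", "description": "How many processes to return"},
                "sort_by": {"type": "string", "enum": ["cpu", "memory"]},
            },
            "required": ["k"],
        },
    },
}


# The anything-goes alternative: a shell tool that runs whatever pipeline the
# model proposes. Great power, zero sandboxing.
def run_shell(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr
```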
Prompt engineering is real, or at least stuck to the wall
No training corpus has my custom functions in it, obviously.
The model knows the tool-calling pattern. But for success with a given custom function, it depends on my schema, my system prompt, and my other prompts to figure out when to call the function, which args to pass, and how to interpret its output.
Claude and GPT-5 deal very well with sloppy contexts, but the less overpowered the model, the more prompt engineering matters. qwen3:0.6b can get a ping tool to run, but needs me to add the .com to google. With a pointed system prompt, it can get past that hurdle. Some of the time.
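The nudge I mean is nothing fancier than something along these lines (the exact wording is the part you end up fiddling with):

```python
SYSTEM_PROMPT = (
    "You have a ping tool. When the user names a bare service like 'google', "
    "ping its canonical domain (google.com) rather than asking them for a hostname."
)
```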
I’m definitely not attached to the term “prompt engineering”. “Context engineering” is massively important, but, I think, too broad a term. I’m talking about the part of that that’s just trying different wordings to nudge an LLM to emit the magic tokens. The very nature of that activity strikes me as more absurd than the name.
In any case, the “prompt engineering” farfalle flew across the room years ago and the wall is studded with petrified bowties. It’ll be a job to scrape them all off now.
Complicated > mysterious
Tool use is the key to self-service reality checks for LLMs, and stochastic spice for automations.
If that’s at all spooky to you, like it was to me, I agree: build an agent! Just know, going in, that it’s hard to do ping and walk away.
Once you’ve seen how to manage a simple message history, and got one tool call to work, a lot of other stuff comes into view. You start having neat ideas for more specialised, parsimonious, autonomous, secure agents. Some of those ideas are chores, like validation and error handling. Incredibly easy becomes more complicated when you dot your i’s and cross your t’s, but even the tedium feels compelling when all there was before was mystery.