Claude got scarily good at faking deterministic output
2025-03-12 in Posts
The other day, I accidentally got Claude 3.7 Sonnet to generate a thousand-line, Req-based Elixir API client from the OpenAPI spec for Fly Machines.
I was exploring the options for structuring such a module, more out of curiosity than any urgent need, and asked Claude on a whim if it could produce functions for all the endpoints. It did. Forty-odd functions, complete with typespecs and detailed docstrings. It hit its message-length limit and I had to ask it to carry on in a new message. It picked up and finished without incident.
The output looks solid. It looks consistent. Claude doesn’t even seem to have slipped in an extra request parameter borrowed from some other popular API.
I was really impressed that Claude sustained such an uncanny impersonation of a dumb but reliable Python script for that long—despite the vaudeville hook dragging it offstage partway through the act.
Also: this is terrible. Now I can sneeze and a nondeterministic black box barfs out a thousand lines of independent functions and docs oozing respectability and discipline? I’m going to be so tempted to trust this!
If I’d planned an exhaustive client for this API, I’d have asked my machine buddy for help writing a boring Python script (or Mix task) to convert the OpenAPI spec directly, reproducibly, and with a lot less electricity.
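For a sense of scale, that boring script really is small. Here’s a minimal sketch of the idea (hypothetical names throughout; it ignores path and query parameters, auth, request bodies, and everything else a real generator would handle) that walks an OpenAPI spec and emits one Req-based Elixir stub per endpoint:

```python
def snake_case(operation_id):
    """Turn an operationId like 'Machines_list' into an Elixir-style name."""
    return operation_id.replace("-", "_").lower()

def generate_elixir_stubs(spec):
    """Emit one Elixir function stub per (path, method) in an OpenAPI spec dict.

    This is a toy: it only reads operationId and summary, and punts on
    parameters, auth, and schemas. A real generator would handle all of those.
    """
    stubs = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            fallback = f"{method}_{path.strip('/').replace('/', '_')}"
            name = snake_case(op.get("operationId", fallback))
            summary = op.get("summary", "")
            stubs.append(
                f'@doc """\n{summary}\n"""\n'
                f"def {name}(client, opts \\\\ []) do\n"
                f'  Req.request(client, [method: :{method}, url: "{path}"] ++ opts)\n'
                f"end\n"
            )
    return "\n".join(stubs)

if __name__ == "__main__":
    # Tiny inline spec standing in for the real Fly Machines OpenAPI document.
    spec = {
        "paths": {
            "/v1/machines": {
                "get": {"operationId": "Machines_list", "summary": "List machines"}
            }
        }
    }
    print(generate_elixir_stubs(spec))
```

Same input, same output, every time—which is the whole point.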
I should note that the docstrings I ended up with are richer than what’s in the spec. Some of the example material identifiably originated in Fly.io developer docs. So, there’s that.
Now that I have all that code
This may be the first time an LLM tool has bitten off more than I can chew without losing its own bearings. The vibes were practically telepathic.
I don’t approve of using an LLM for tasks that need repeatability and can be done in seconds on a CPU. I can’t be sure there are no surprises until I’ve checked every function, and if I ask again, there’s no guarantee any deficiencies will be the same ones.
But here’s the module, fully formed. What to do?
I do want to talk to a subset of this API from my Elixir apps, and this gives me a starting point for incorporating the schema validation I’ve been playing with.
I could “ship” it and fix it if I ever find out it’s broken—the stakes couldn’t be lower. But all those untried functions, all that unexamined documentation in my module—it was stressing me out.
I took the easy way out: the reusable moving parts and a couple of endpoint functions go into a module, and the rest get banished into a slush file outside my project. I don’t have to look at them, but I can grab them as needed, if they ever are. Instant relief.
More things I think about LLMs, March 2025 edition
As always, the quality of the output of an LLM chat tool is entangled with the quality of my side of the conversation. It’s all moving targets and YMMV.
- The Elixir ecosystem abounds with high-signal, well-structured documentation. Claude’s grounding in Elixir has improved immensely in the past year. In my opinion, between the two we’ve hit a tipping point for a coding-while-learning momentum boost.
- Once the shiny, clean module appeared, I immediately got Claude to start messing it up by asking it a question about Req options. I think questions like “do I have to do X in order for Y to happen?” can be leading if there’s a weakness around Y in the model or the context. Claude is a bit too much of a people pleaser, and my lightweight attempts to counter that with project instructions and preferences haven’t cured it.
- It has happened, though, with both Claude and ChatGPT, that I’ve proposed one thing and the tool has correctly advised me to use a different approach.
- I do still wade into boggy ground with Claude 3.7, ask it to bite off more than it can chew and end up in a cycle of ineffectual “corrections”, or ask it something it doesn’t have a good grounding in and end up trying to find docs for nonexistent functions to help me implement some antipattern.
- At the end of February, I was surprised when Claude 3.5 Sonnet and I managed a very similar failure of attention navigating a bureaucracy of pattern-matching function heads. We both missed that the salient clause returned a value instead of handing it off to the next one in line. For all I know, this wouldn’t happen with 3.7. Claude can be so ultra-agreeable that I almost think I planted that blind spot by telling it what I couldn’t find.