Clueless wiki search with LLMs

2026-06-29 in ai

I thought I'd try semantic search on my playground wiki using an embedding model, but as it turns out, you have to plan how to embed stuff to suit your use case, which takes a lot more concentration than I was planning to allocate. Instead, I decided to give some local LLMs a search tool and ask them to find the right thing.

THIS IS NOT A TUTORIAL, unless you want to build local search that takes multiple seconds to return, and half the time skips all the strings you typed in your query.

The job

Unfortunately, before I continue, I must explain that in my wiki, just like in TiddlyWiki, the basic unit of content is called a tiddler. If you know TiddlyWiki, pretend it's a TiddlyWiki.

Right. Now that's out of the way: the task is to answer a user's query about wiki contents, by way of a conventional tool call that invokes a search function.

Today, the query is Do we have any tiddlers about werewolves?

And the model should call the provided search tool with some appropriate search terms, and tell the user about the tiddler entitled Easter Egg, which contains this text: This tiddler's really about werewolves.

The setup

The setup is so unoptimised as to be, in effect, engineered for failure.

Here's the system prompt:

You are a Stickleback search engine.
Choose at least five search terms related to the user's query.
Call the `multi_search` tool with your search terms.

Stickleback is the name of the wiki project. This is an unhelpful detail to include.

The schema for the multi_search tool was provided in each call to the model. It takes a list of strings and returns the titles of tiddlers containing each, with the snippet where the string was found. I used either Claude or Codex to write the tool schema and the JS function it invokes, and haven't inspected it very closely. It probably has too many parameters.

multi_search tool schema

"function": {
  "name": "multi_search",
  "description": "Run several case-insensitive substring searches in one call. Takes a list of query terms and returns an object mapping each query to its matched tiddler titles plus a trimmed snippet of context around the match — use this instead of calling `search` once per term.",
  "parameters": {
    "type": "object",
    "properties": {
      "queries": {
        "type": "array",
        "items": {
          "type": "string"
        },
        "description": "List of search query strings; each is matched independently against tiddler titles and body text. A single string is also accepted."
      },
      "limit": {
        "type": "integer",
        "description": "Max results per query (default 10)"
      },
      "system": {
        "type": "boolean",
        "description": "Include $:/ system tiddlers (default: false)"
      },
      "snippetLength": {
        "type": "integer",
        "description": "Characters of context to return around each match (default 160, min 20)"
      }
    },
    "required": [
      "queries"
    ]
  }
}

I use Ollama on my laptop to serve models. No reason beyond that it's user-friendly and amenable to swapping models frequently.

I sent "think": false with all API requests.

Ollama defaults the context window to 4096 tokens on my system. That's plenty for this exercise, and I didn't change it.

Aside from that: model templates and inference parameters are Ollama's defaults for the respective model.

This is a decision, because model providers do tend to suggest parameters to use for specific tasks.

A drawer full of small models

I tried at least one run with each of the following models, all of which are nominally tool-call capable and run OK on my 16GB Mac. These are Ollama's names for them, except the teensy LFM I got from Hugging Face. Regardless of naming, Ollama's versions seem often to be smaller quants.

hf.co/LiquidAI/LFM2.5-230M-GGUF:q8_0 (247 MB)
qwen3:0.6b (523MB)
LiquidAI/lfm2.5-1.2b-instruct:q4_k_m (731MB)
qwen3.5:0.8b (1.0GB)
LiquidAI/lfm2.5-1.2b-instruct:q8_0 (1.2GB)
phi4-mini:3.8b (2.5GB)
qwen3:4b (2.5GB)
qwen3:4b-instruct (2.5GB)
nemotron-3-nano:4b (2.8GB)
olmo-3:7b-instruct (4.5GB)
lfm2.5:8b (5.2GB)
qwen3:8b (5.2GB)
gemma4:e4b (9.6GB) (Not sure, but I think this is the instruction-tuned version)

Failure modes

I really have to emphasise that I wouldn't condemn any of these models based on what they did with what I gave them here. What I find more interesting:

Even under this rickety arrangement, some models succeeded in finding my Easter Egg, and some of the time that was by the use of what I considered a somewhat reasonable set of search terms.
Patterns---or, let's say trends--did emerge, which I could use to get closer to an actual working search.

LFM2.5-8B-A1B (albeit the Q4_K_M quant on Ollama) was the most-often-most-successful model, though it fell on its face more of the time than not.

It nicely illustrated several of the common "failure" modes. I was relatively happy to see it come up with this:

{
  "queries": [
    "werewolf",
    "werewolves",
    "lycanthrope",
    "shape-shifter",
    "wolf legend"
  ],
  "limit": 10,
  "snippetLength": 160
}

But on two additional runs it did this:

"queries": [
    "werewolf",
    "werewolf folklore",
    "werewolf legend",
    "werewolf stories",
    "werewolf mythology"
  ],

and on yet another run:

"queries": [
    "werewolf"
  ],

So that's:

Ignoring the instruction to come up with five search terms
Searching for werewolf but not werewolves, which was the actual relevant string in the user query
Padding out its term list by adding different words to the same initial word, or generating otherwise hyper-specific phrases
Also, if I'm being fussy: selecting search terms that were legitimately related but unlikely to have been used. Lycanthropy is a popular one.

This last one is a problem to which a larger model isn't a complete solution. I just had a chat with Claude Sonnet, which advised me that if the term werewolves wasn't literally in the wiki, then there's a good chance lycanthropy would be. Something to think about when prompting for "good" search terms.

Some other quirks, usually from sub-4-billion-parameter models:

Emitting malformed tool calls
Answering from pure hallucination---that is, without even trying to use the tool
Running several tool call turns before answering---but for the same single term each time

For all I know, all the above are surmountable.

Vibes

I haven't yet got any local models doing real work for me, but I've played most with Qwen models, with some poking at Phi4 Mini and Gemma. They may have distinguishing characteristics, but I think of these all as "vanilla". Phi4 Mini stands out only because I've struggled to get it to make tool calls.

I'm now really intrigued by the LFM models from LiquidAI. Some of the best output I had was from LFM2.5-8B-A1B. And at the tiny end, where Qwen3:0.6b and Qwen3.5:0.8b are almost coherent, and often generate a valid tool call, the even teensier LFM2.5-230M is...a smidge more coherent, but so far I haven't had it fail to structure a tool call.

Olmo 3 Instruct does have a different vibe than any of the others, and I'm interested to play with it more.

Cost and benefit

Performance numbers from my machine aren't that meaningful, but for a general sense: with my corpus of 10 000 entries, execution time for the LLM-search-LLM sandwich is dominated by the two inference calls. With Qwen3-0.6B or LFM2.5-230M, these can still add up to less than a second; it's a handful of seconds with LFM2.5-1.2B-Instruct, Qwen3-4B-Instruct, or Gemma 4 E4B.

With my wiki's search bar, I can pull up my Easter Egg tiddler before I finish typing werewol, so to earn its keep, an LLM would have to do better than match a word in singular and plural.

Embeddings-based semantic search seems like the grown-up approach. That being said: if the corpus changes fast, you can do LLM + a search tool at any moment and not wait for an indexing process. Or you can, if the LLM search is not dog residue.

Particles index