Local-LLM refinement

total-recall is 100% deterministic heuristics by default — zero model calls, zero network egress. The local-LLM refinement layer adds a second pass that improves machine-name extraction accuracy, vocabulary definitions, and per-project narrative summaries.

Auto-setup (zero config)

On first install the plugin bootstrap automatically:

Downloads the ollama binary (~38 MB, no sudo, into the plugin data dir).
Pulls the default model qwen3.5:2b (~2.7 GB).
Starts a localhost ollama daemon.

All of this runs detached in the background — the spawning Claude Code session does not wait for it and does not need to stay open. A one-time banner in your first session announces that setup is in progress.

No action is needed on your part. Refinement activates automatically on the next rebuild after the model is ready.
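
To check whether background setup has finished, you can ask the daemon which models it has. The sketch below is an illustration, not part of total-recall: `model_is_listed` is a hypothetical helper name, the default endpoint http://localhost:11434 is assumed, and the text match against ollama's `/api/tags` response is a rough check rather than real JSON parsing.

```shell
# Hypothetical helper (not shipped with total-recall).
# Reads an /api/tags JSON response on stdin and succeeds if the given
# model tag appears as a "name" value.
model_is_listed() {
  # Strip whitespace so the match works for compact or pretty-printed JSON.
  tr -d '[:space:]' | grep -qF "\"name\":\"$1\""
}

# Usage (needs a running daemon; endpoint is the documented default):
#   curl -s http://localhost:11434/api/tags | model_is_listed "qwen3.5:2b" \
#     && echo "model ready"
```

If the command prints nothing, either the daemon is not up yet or the model pull is still in progress.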

Privacy

Your transcripts never leave the machine. The model runs on-device via ollama. Cloud APIs are deliberately not supported — they would break the no-reupload guarantee that is a core design constraint of total-recall. The ollama daemon listens on localhost:11434 only.

Refinement runs on the cold path only (during total-recall index --rebuild). The heuristic baseline is always the fallback; refinement is additive only.
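
If you want to confirm the localhost-only claim on your own machine, one approach is to list the listening sockets on port 11434 and flag any that are not loopback addresses. This is a sketch under assumptions: `is_loopback_addr` is a hypothetical helper, and the usage line relies on the Linux `ss` tool, which is not available on every platform.

```shell
# Hypothetical helper: succeeds if an "address:port" string is a loopback bind.
is_loopback_addr() {
  case "$1" in
    127.*:*|"[::1]":*|localhost:*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on Linux (assumes iproute2's ss; column 4 is the local address):
#   ss -ltnH 'sport = :11434' | awk '{print $4}' | while read -r addr; do
#     is_loopback_addr "$addr" || echo "WARNING: listening on $addr"
#   done
```

No output from the loop means every listener on that port is bound to a loopback address.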

What	Heuristic baseline	With qwen3.5:2b
Machine-name extraction	Pattern-based NER	Precision 1.0, Recall 1.0
Vocabulary definitions	Terms listed, no definitions	~60% definition coverage
Project narratives	Not generated	Short accurate summaries
Runtime (machines pass)	~3s	~19s

These numbers come from a head-to-head evaluation of qwen3.5:2b against gemma4:e2b (definition coverage ~20%) and qwen3.5:4b. qwen3.5:2b won on all three axes at the 2B size; larger models can improve definition coverage further.

Env vars

Env var	Default	Effect
`TOTAL_RECALL_LLM_PROVIDER`	`auto`	`none` disables the entire LLM layer (no download, no daemon, no refinement). `ollama` forces the ollama code path.
`TOTAL_RECALL_LLM_MODEL`	`qwen3.5:2b`	Model tag to use. Any model you have pulled with `ollama pull` works.
`TOTAL_RECALL_LLM_REFINE_TEXT`	`1`	Set to `0` to disable vocab/narrative refinement while keeping machine-name extraction.
`TOTAL_RECALL_LLM_BASE_URL`	`http://localhost:11434`	Ollama API endpoint.
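
To see which values would apply in your current shell, the defaults from the table can be mirrored with ordinary parameter expansion. This is only a sketch: `effective_llm_config` is a hypothetical helper, and it assumes that an empty variable is treated the same as an unset one, which the plugin itself may or may not do.

```shell
# Hypothetical helper: prints each setting, falling back to the documented
# default when the variable is unset or empty.
effective_llm_config() {
  echo "provider=${TOTAL_RECALL_LLM_PROVIDER:-auto}"
  echo "model=${TOTAL_RECALL_LLM_MODEL:-qwen3.5:2b}"
  echo "refine_text=${TOTAL_RECALL_LLM_REFINE_TEXT:-1}"
  echo "base_url=${TOTAL_RECALL_LLM_BASE_URL:-http://localhost:11434}"
}

effective_llm_config
```

With none of the variables set, this prints the four defaults listed in the table.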

Disable entirely

export TOTAL_RECALL_LLM_PROVIDER=none

No download, no daemon, no refinement — pure heuristics, same as v0.8 and earlier.

Use a larger model

export TOTAL_RECALL_LLM_MODEL=qwen3.5:4b   # more RAM, slower, higher coverage
ollama pull qwen3.5:4b
total-recall index --rebuild

Any model you have already pulled will work; the default qwen3.5:2b is the validated sweet spot for speed vs quality on a typical developer machine.

Troubleshooting

"ollama not found" after install

The bootstrap installs ollama into the plugin data dir, not /usr/local/bin. If your shell does not pick it up automatically, run /total-recall:llm-setup to retry. Alternatively, set TOTAL_RECALL_LLM_PROVIDER=none to opt out of the LLM layer.

Daemon not starting

ollama serve &   # or: systemctl --user start ollama
curl http://localhost:11434/api/tags   # should return JSON

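If the port is already in use by a system ollama install, point TOTAL_RECALL_LLM_BASE_URL at that daemon. The value below is the default address; change the host or port if your system install listens elsewhere: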

export TOTAL_RECALL_LLM_BASE_URL=http://localhost:11434
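
The manual curl check above can be wrapped so that it follows whatever TOTAL_RECALL_LLM_BASE_URL is set to. A minimal sketch, assuming curl is installed; `probe_ollama` is a hypothetical name, and the 3-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: reports whether an ollama daemon answers at the
# configured base URL (falls back to the documented default).
probe_ollama() {
  base_url="${TOTAL_RECALL_LLM_BASE_URL:-http://localhost:11434}"
  # -s: quiet, -f: fail on HTTP errors, --max-time: do not hang on a dead port
  if curl -sf --max-time 3 "${base_url}/api/tags" >/dev/null; then
    echo "daemon reachable at ${base_url}"
  else
    echo "no response from ${base_url}" >&2
    return 1
  fi
}

# Usage:
#   probe_ollama || echo "start the daemon or run /total-recall:llm-setup"
```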

Disk space

The model download is ~2.7 GB. Check available space before letting the bootstrap run on a tight disk:

df -h "${CLAUDE_PLUGIN_DATA:-$HOME/.local/share}/total-recall/"

To abort the download and opt out: set TOTAL_RECALL_LLM_PROVIDER=none and remove the .llm_provisioning lockfile from the plugin data dir.
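
The disk check can also be scripted so that it compares free space against the model size. This is a rough sketch: `available_kb` and `has_room_for_model` are hypothetical helpers, and the threshold simply converts the ~2.7 GB figure above to KiB without any safety margin.

```shell
# Approximate model size in KiB (~2.7 GB, taken from the figure above).
REQUIRED_KB=$((2700 * 1024))

# Hypothetical helper: free space, in KiB, on the filesystem holding path $1.
# df -P guarantees one line per filesystem; column 4 is "Available".
available_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Hypothetical helper: succeeds if $1 (KiB) is enough for the model download.
has_room_for_model() {
  [ "$1" -ge "$REQUIRED_KB" ]
}

# Usage (path expression copied from the df command above):
#   dir="${CLAUDE_PLUGIN_DATA:-$HOME/.local/share}/total-recall/"
#   has_room_for_model "$(available_kb "$dir")" || echo "not enough free space"
```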

Manual setup

If auto-provisioning fails for any reason, the /total-recall:llm-setup slash command runs the same steps interactively and reports errors.