Does a Code Assistant Need Large Models?
A curious engineer's journey from an OpenAI research paper to building a fully local, multi-agent coding assistant — and benchmarking it against Claude.
The Question That Started It All
For a long time, I was puzzled by a fundamental question: how can an LLM — which is essentially just predicting the next token — write correct code? Coding is inherently logical. Logic shouldn't emerge from statistical word prediction, or so I thought.
So I did what any curious engineer would do: I asked GPT and Claude. That conversation led me to a landmark paper, *Evaluating Large Language Models Trained on Code* by OpenAI. One result stood out immediately: the 12B Codex model solves 28.8% of HumanEval problems on the first attempt, and 70.2% when allowed 100 samples per problem.
This answered the first part of my question: a 12B model trained on code is good enough. But it still did not explain why logic can emerge from token prediction.
The answer was simpler than I expected. A programming language is just another language — but with far fewer keywords and a strict, unambiguous grammar. Code on GitHub and Stack Overflow always appears with surrounding context: problem statement, comments, variable names, tests. As long as an LLM has learned that mapping, it can generate code that fits the context. Logic is just a very regular sub-language of human writing.
This realisation led to a second thought: if programming is just another language with fewer words, a smaller and more specialised model should be sufficient.
Building the Local Code Assistant
I chose Ollama as the inference backend — it runs quantised models locally with a clean API — and started with the Qwen 2.5 Coder family (7B, 14B, and 32B). Rather than a chat interface, I wanted an agent that could actually write files, edit them, and run shell commands — the things that matter for real development work.
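To make the setup concrete, here is a minimal sketch of a single persona call against Ollama's local REST API; the persona prompt is illustrative, not the assistant's actual configuration:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def ask(model: str, system: str, user: str) -> str:
    """Send one non-streaming chat request to a locally running Ollama model."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
    }
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# Illustrative persona prompt; the real system prompts differ per agent.
print(ask("qwen2.5-coder:7b", "You are a code reviewer.",
          "Review: def add(a, b): return a - b"))
```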
I also believe strongly in specialisation. A single all-knowing model tends to be average at everything. Instead, I defined distinct personas with different instructions, each doing one thing well.
Three Execution Modes
1. Interactive mode — a standard REPL where you can ask questions, request edits, and work iteratively. The assistant maintains session history and can be resumed across sessions.
2. Pipeline mode — you hand it a requirement document and walk away. The full 7-phase flow, from intent classification and planning through implementation, review, fixes, and testing, runs sequentially.
3. Quick mode — a single, fast, no-tools response for questions like "what does git reflog do?"
Because only one model needs to be in RAM at a time in pipeline mode, this works on a 32 GB machine with no dedicated GPU VRAM. Each phase loads its model, runs, then releases memory before the next phase begins.
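One way to get that release behaviour is Ollama's `keep_alive` parameter: setting it to `0` asks the server to unload the model as soon as the response is returned. A minimal sketch, with a hypothetical two-phase pipeline standing in for the real seven:

```python
import json
import urllib.request

# Hypothetical phase -> model mapping; the real pipeline has seven phases.
PHASES = [("plan", "qwen2.5-coder:7b"), ("implement", "qwen2.5-coder:14b")]

def run_phase(model: str, prompt: str) -> str:
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": 0,  # unload the model as soon as the response is returned
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

for name, model in PHASES:
    print(name, "->", run_phase(model, f"Phase: {name}. Summarise your task.")[:80])
```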
Architecture Deep Dive
The assistant is built around four interlocking systems: a multi-agent core, a RAG retrieval layer, an AST symbol index, and a layered configuration engine.
Multi-Agent Core
- **Architect** — plans the approach, writes acceptance criteria, and classifies incoming intent (conversational vs implementation vs complex). Stays on a small, fast model (7b) even when the implementer is upgraded.
- **Implementer** — writes and edits code using tool calls: `write_file`, `edit_file`, `read_file`, `run_shell`. The heaviest persona; it benefits most from a larger model (14b or 32b).
- **Reviewer** — reads the generated code and produces structured findings. The implementer then gets one more pass to fix the issues before tests run.
- **Tester** — runs acceptance criteria against the implementation, up to three rounds. Each failure feeds back into the implementer for a targeted fix. Inspired directly by the pass@k insight from the Codex paper.
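The tester's retry loop is simple to sketch. A minimal version with stand-in helpers (`run_tests` and `ask_implementer` are hypothetical placeholders for the real harness and persona call):

```python
MAX_ROUNDS = 3  # up to three test rounds, as described above

def run_tests(code: str, criteria: list[str]) -> list[str]:
    """Hypothetical harness: return the list of criteria that fail."""
    return [c for c in criteria if c not in code]  # stand-in check for the sketch

def ask_implementer(prompt: str) -> str:
    """Hypothetical call into the implementer persona (see the Ollama sketch above)."""
    return prompt  # placeholder

def test_fix_loop(code: str, criteria: list[str]) -> str:
    for round_no in range(1, MAX_ROUNDS + 1):
        failures = run_tests(code, criteria)
        if not failures:
            print(f"all criteria passed in round {round_no}")
            return code
        # Each failure becomes a targeted fix request for the implementer.
        report = "\n".join(f"FAILED: {c}" for c in failures)
        code = ask_implementer(f"Fix these failing criteria:\n{report}\n\n{code}")
    return code  # best effort after three rounds
```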
RAG — Retrieval-Augmented Generation
Before answering any substantive query, the assistant embeds the question with
nomic-embed-text and retrieves the top-K semantically relevant chunks
from a local ChromaDB vector store. This means the model always has
real project context — actual function signatures, file contents, module structure —
injected into its prompt, rather than relying on what it learned during training.
```
/index src/           # embed your codebase into the RAG index
/index src-tauri/src  # works with any language
```
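Under the hood this is a standard embed-and-query loop. A sketch using Ollama's embeddings endpoint and the ChromaDB Python client; the collection name, chunk IDs, and sample content are illustrative:

```python
import json
import urllib.request

import chromadb  # pip install chromadb

def embed(text: str) -> list[float]:
    """Embed text with nomic-embed-text via Ollama's embeddings endpoint."""
    req = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": "nomic-embed-text", "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

client = chromadb.PersistentClient(path=".rag-index")    # local vector store
chunks = client.get_or_create_collection("code_chunks")  # illustrative name

# Indexing: one embedding per chunk of source text.
doc = "pub struct TerminalState { ... }"
chunks.add(ids=["state.rs:1"], embeddings=[embed(doc)], documents=[doc])

# Retrieval: top-K chunks semantically closest to the question.
hits = chunks.query(query_embeddings=[embed("where is terminal state kept?")],
                    n_results=3)
print(hits["documents"])
```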
AST Symbol Index
RAG retrieves semantically similar chunks of text, but sometimes you need
structural answers: "what functions exist in state.rs?" or
"where is TerminalState defined?" For this, the assistant builds a
lightweight symbol table using tree-sitter — supporting Python,
JavaScript, TypeScript, and Rust — stored in a local SQLite database (~1 MB for
large codebases).
At session start, a compact outline is injected into context automatically:
```text
# Symbol Map [Rust: 67 · TypeScript: 5 · Python: 8]
## src-tauri/src/state.rs [Rust]
pub struct TerminalState :6 · impl TerminalState → [new, update, reset] :13
## src-tauri/src/commands/mod.rs [Rust]
execute_command(...) :65 · register_commands(...) :12
```
The model also has a find_symbols tool it can call mid-session for
targeted structural queries — complementing the semantic RAG search.
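As a rough sketch of how a symbol table like this can be built with the py-tree-sitter bindings (recent versions, where `Parser` accepts a `Language` directly): this toy version walks only Python function definitions, while the real indexer covers four languages and persists to SQLite.

```python
from tree_sitter import Language, Parser  # pip install tree-sitter tree-sitter-python
import tree_sitter_python as tspython

parser = Parser(Language(tspython.language()))

def python_symbols(source: bytes) -> list[tuple[str, int]]:
    """Return (function_name, line_number) pairs from Python source."""
    tree = parser.parse(source)
    symbols = []
    stack = [tree.root_node]
    while stack:  # simple depth-first walk over the syntax tree
        node = stack.pop()
        if node.type == "function_definition":
            name = node.child_by_field_name("name")
            symbols.append((name.text.decode(), node.start_point[0] + 1))
        stack.extend(node.children)
    return symbols

print(python_symbols(b"def embed(text):\n    pass\n"))  # [('embed', 1)]
```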
Web Tools
Two tools give the model access to live information when local context isn't enough:
- `fetch_url` — fetches and parses any URL using Python's stdlib (`urllib` + `html.parser`). No API key, always available. Useful for reading documentation, GitHub issues, or Stack Overflow answers.
- `web_search` — performs a web search using either Serper API (fast, structured JSON) or DuckDuckGo (free, no key required). Toggled by `web_search_enabled = true` in config. Results are injected as context before the model responds.
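Since `fetch_url` is described as pure stdlib, a tool in that spirit can be sketched in a few lines; the implementation details here are my assumption, not the assistant's actual code:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def fetch_url(url: str, timeout: int = 10) -> str:
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    return "\n".join(extractor.parts)

print(fetch_url("https://example.com")[:200])
```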
Config-Driven Design
Everything is driven by a layered configuration system — the same assistant can run on a 16 GB laptop with a 7B model today and a 128 GB workstation with a 70B model tomorrow without changing a line of code.
Settings resolve across three layers: `CA_*` environment variables (e.g. `CA_IMPLEMENTER_MODEL=qwen2.5-coder:32b`), a per-project `ca.config`, and a machine-level `~/.code-assistant/config.toml`. Sensitive settings — feedback storage, session directories, API keys — are enforced at machine scope and silently blocked from appearing in per-project config files. You cannot accidentally commit credentials.
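A sketch of how the layered resolution can work; the env > project > machine precedence is my assumption of the conventional order, and treating `ca.config` as TOML is a guess:

```python
import os
import tomllib  # Python 3.11+
from pathlib import Path

def load_layer(path: Path) -> dict:
    """Read one config layer, or an empty dict if the file is absent."""
    return tomllib.loads(path.read_text()) if path.exists() else {}

def resolve_config() -> dict:
    machine = load_layer(Path.home() / ".code-assistant" / "config.toml")
    project = load_layer(Path("ca.config"))   # assumed TOML for this sketch
    merged = {**machine, **project}           # project keys shadow machine keys
    # Assumed: CA_* environment variables win over both file layers.
    for key, value in os.environ.items():
        if key.startswith("CA_"):
            merged[key.removeprefix("CA_").lower()] = value
    return merged

# CA_IMPLEMENTER_MODEL=qwen2.5-coder:32b would surface as implementer_model here.
print(resolve_config().get("implementer_model"))
```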
Testing the Efficacy — Benchmarks
Inspired by the pass@k methodology from the Codex paper, I built a benchmark harness that runs both the local code-assistant and the Claude API against identical requirement documents, then compares the results side-by-side.
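For reference, the unbiased pass@k estimator from the Codex paper, given n generated samples of which c pass:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated, c: samples that pass, k: attempt budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=100, c=29, k=1))  # 0.29
```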
Three requirements were tested, ranging in complexity:
- req_01 — Python calculator with REPL and expression parsing
- req_02 — REST API for a todo web application (FastAPI + SQLite)
- req_03 — Log analyser CLI with multi-format parsing, aggregation, and alerting
Benchmark Results
| Task | Runner | Model | Time (s) | API Calls | Lines Written | Tests Passed | Syntax Errors |
|---|---|---|---|---|---|---|---|
| Calculator | code-assistant | 7b + 14b | 1,790 | 21 | 71 | 0 | 1 |
| Calculator | Claude API | claude-sonnet-4-6 | 1,137 | 24 | 2,235 | 218 | 0 |
| Todo API | code-assistant | 7b + 14b | 2,163 | 23 | 117 | 0 | 0 |
| Todo API | Claude API | claude-sonnet-4-6 | 2,310 | 41 | 2,803 | 4 | 0 |
| Log Analyser | code-assistant | 7b + 14b | 1,362 | 17 | 166 | 0 | 0 |
| Log Analyser | Claude API | claude-sonnet-4-6 | 5,771 | 55 | 5,851 | 329 | 0 |
The test-passing gap narrows considerably when the 32B model is used — larger models follow tool-use instructions far more reliably and write tests that actually compile and run. The architecture is already in place; it just needs a smarter model behind it.
Testing on a Real Project — Intelligent Terminal
Portability is a long-standing pain point in software development. The Unix terminal is rich and powerful; Windows Command Prompt falls short; PowerShell changed the entire command structure. Git Bash and MinGW work but are heavy installs for what is essentially a compatibility shim.
So I started building something I call Intelligent Terminal — a cross-platform terminal where every command is implemented from scratch in Rust, giving identical behaviour on macOS, Linux, and Windows. Eventually it will connect to a local LLM so you can say "list all hidden directories by size and filter those matching a pattern" and it just works.
This project became the real test bed for code-assistant.
First Test — Validating Existing Commands
I asked the assistant to write a Python script that would execute every implemented
command with its --help flag, capture the output, and compare it against
the requirement document.
It produced the script correctly, but when I ran it from the repository root it failed with a "file not found" error: the script assumed a different working directory. Once I navigated to the correct path and re-ran it, the output matched the requirements exactly. The logic was correct; the working-directory assumption was not. A lesson noted.
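A validation script along those lines might look like this; the binary path, command list, and requirement-file layout are all illustrative, and anchoring paths to the script's own location avoids the working-directory trap I hit:

```python
import subprocess
from pathlib import Path

ROOT = Path(__file__).resolve().parent                 # anchor to the script itself
BINARY = ROOT / "target/debug/intelligent-terminal"    # hypothetical binary path
COMMANDS = ["ls", "cat", "grep"]                       # illustrative subset

def help_text(command: str) -> str:
    """Run `<binary> <command> --help` and capture stdout + stderr."""
    result = subprocess.run([str(BINARY), command, "--help"],
                            capture_output=True, text=True, timeout=30)
    return result.stdout + result.stderr

for cmd in COMMANDS:
    out = help_text(cmd)
    # Hypothetical requirement layout: one expected flag per line.
    flags = (ROOT / f"requirements/{cmd}.txt").read_text().split()
    missing = [f for f in flags if f not in out]
    print(f"{cmd}: {'OK' if not missing else 'missing: ' + ' '.join(missing)}")
```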
Second Test — Implementing nslookup
The real challenge was asking the assistant to implement the nslookup
command in Rust — a moderately complex task with multiple flags, option parsing,
and DNS query logic.
Run 1: The 14B model printed the Rust code as a markdown block without calling a single file-writing tool. Nothing was written to disk. This was a known limitation of smaller models — they "explain" instead of "act."
Run 2: After fixing the flag parsing bug (the --req-file
flag was passed with a single dash, causing the CLI parser to read it as
-r eq-file — a session resume attempt rather than a file load), the 14B
model was upgraded to 32B. The 32B model correctly used tool calls, wrote the file,
and ran cargo build.
Run 3: The build failed. The model had generated clap
argument parser code with duplicate short flags: -d was used for both
debug and ndots; -r for both
recurse and retry. Clap rejects this at runtime.
The shell output was truncated by the tool, so the model never saw the actual
compiler error and eventually lost track of the filename — attempting to edit a
file called ns.rs that did not exist.
I switched to interactive mode, manually resolved the build errors, and also fixed two Rust-specific issues that the model had not caught:
```rust
use clap::Command;

// Model wrote (panics on bad args):
let matches = Command::new("nslookup").get_matches_from(args);

// Correct (won't panic on bad args):
let matches = match Command::new("nslookup").try_get_matches_from(args) {
    Ok(m) => m,
    Err(e) => return Err(e.to_string()),
};
```
Honest Assessment
The assistant reduced my coding effort by roughly 70%. The remaining 30% was troubleshooting — reading compiler errors, fixing edge cases, and correcting the occasional hallucinated filename. For someone comfortable reading Rust, that trade-off is extremely worthwhile. The architecture and boilerplate were generated correctly; only the fine-grained logic needed human intervention.
Ironically, this tool was built using Claude. I am using Claude to create a tool that can eventually replace Claude for me — which reminds me of a tweet where someone said "it's time to replace GitHub" and GitHub replied asking them to share the GitHub link.
Lessons Learned
Model size matters for tool use. The 7B and 14B models often describe what to do in markdown rather than calling the appropriate tool. The 32B model reliably uses tools. This aligns with the pass@k finding: bigger models are not just smarter, they are more disciplined at following structured instructions.
Truncated tool output breaks the feedback loop. If a compiler error is cut off, the model cannot fix the bug it cannot see. The tool must surface the end of the output (where errors appear), not the beginning.
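The fix is mechanical: clip from the front and keep the tail. A minimal sketch:

```python
def clip_tool_output(output: str, limit: int = 4000) -> str:
    """Keep the END of long tool output, where compiler errors usually live."""
    if len(output) <= limit:
        return output
    return f"...[output truncated, showing last {limit} chars]...\n" + output[-limit:]
```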
Specialised agents outperform a generalist. Separating Architect, Implementer, Reviewer, and Tester into distinct personas with different system prompts produces noticeably better output than asking a single agent to do everything. Each persona has a focused objective and fewer distractions.
Infrastructure beats intelligence, sometimes. RAG, AST indexing, layered config, and per-project memory dramatically improve the quality of local model output — not by making the model smarter, but by giving it better context. A 14B model with good context often outperforms a 32B model working blind.
Try It
Both projects are open source. The code-assistant is built to be forked and configured for your own hardware and preferred models.