AI · Open Source · Developer Tools · Rust

Does a Code Assistant Need Large Models?

A curious engineer's journey from an OpenAI research paper to building a fully local, multi-agent coding assistant — and benchmarking it against Claude.

Tushar Saurabh · March 2026 · 12 min read

The Question That Started It All

For a long time, I was puzzled by a fundamental question: how can an LLM — which is essentially just predicting the next token — write correct code? Coding is inherently logical. Logic shouldn't emerge from statistical word prediction, or so I thought.

So I did what any curious engineer would do: I asked GPT and Claude. That conversation led me to a landmark paper — Evaluating Large Language Models Trained on Code by OpenAI. One result stood out immediately.

Key Finding — Codex Paper (2021)
For a 12-billion parameter model trained on code, the percentage of problems solved increased from 28.8% with a single sample to 77% when 100 samples were generated and evaluated against unit tests. A model that knows how to test, and what to test, converges to correct code through iteration.

This answered the first part of my question: a 12B model trained on code is good enough. But it still did not explain why logic can emerge from token prediction.

The answer was simpler than I expected. A programming language is just another language — but with far fewer keywords and a strict, unambiguous grammar. Code on GitHub and Stack Overflow always appears with surrounding context: problem statement, comments, variable names, tests. As long as an LLM has learned that mapping, it can generate code that fits the context. Logic is just a very regular sub-language of human writing.

This realisation led to a second thought: if programming is just another language with fewer words, a smaller and more specialised model should be sufficient.

The Practical Motivation
I can currently afford Claude, but what if pricing changes? The best tools should remain accessible. I wanted a coding assistant that runs entirely on local hardware — no API keys, no subscription, no data leaving my machine.

Building the Local Code Assistant

I chose Ollama as the inference backend — it runs quantised models locally with a clean API — and started with the Qwen 2.5 Coder family (7B, 14B, and 32B). Rather than a chat interface, I wanted an agent that could actually write files, edit them, and run shell commands — the things that matter for real development work.
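To make "an agent that can actually write files" concrete, here is a minimal sketch of the pattern: register an OpenAI-style tool schema with Ollama's chat endpoint and dispatch whatever tool calls the model emits. The schema, model tag, and dispatch logic are my own simplification, not the project's exact code.

import requests  # assumes a local Ollama server on the default port

OLLAMA = "http://localhost:11434/api/chat"

# One OpenAI-style tool schema; the real assistant exposes several more.
WRITE_FILE = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Create or overwrite a file with the given content",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}

def write_file(path: str, content: str) -> str:
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(content)
    return f"wrote {len(content)} bytes to {path}"

resp = requests.post(OLLAMA, json={
    "model": "qwen2.5-coder:14b",
    "messages": [{"role": "user", "content": "Create hello.py that prints 'hi'"}],
    "tools": [WRITE_FILE],
    "stream": False,
}).json()

# Dispatch whatever tool calls came back (smaller models may emit none).
for call in resp["message"].get("tool_calls", []):
    if call["function"]["name"] == "write_file":
        print(write_file(**call["function"]["arguments"]))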

I also believe strongly in specialisation. A single all-knowing model tends to be average at everything. Instead, I defined distinct personas with different instructions, each doing one thing well.

Three Execution Modes

1. Interactive mode — a standard REPL where you can ask questions, request edits, and work iteratively. The assistant maintains session history and can be resumed across sessions.

2. Pipeline mode — you hand it a requirement document and walk away. The full 7-phase flow runs sequentially:

Architect → Implementer → Reviewer → Implementer (fix) → Tester ×3 → Docs

3. Quick mode — a single, fast, no-tools response for questions like "what does git reflog do?"

Because only one model needs to be in RAM at a time in pipeline mode, this works on a 32 GB machine with no dedicated GPU. Each phase loads its model, runs, then releases memory before the next phase begins.
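A rough sketch of that sequencing, using Ollama's keep_alive parameter to evict each model as soon as its phase finishes. The persona list, prompts, and file name below are placeholders rather than the project's actual configuration.

import requests  # talks to a local Ollama server

OLLAMA = "http://localhost:11434/api/chat"

# Placeholder phases: (persona, model, system prompt); the real pipeline has seven.
PHASES = [
    ("architect",   "qwen2.5-coder:7b",  "Plan the approach and write acceptance criteria."),
    ("implementer", "qwen2.5-coder:14b", "Implement the plan; output code only."),
    ("reviewer",    "qwen2.5-coder:7b",  "Review the code and list concrete issues."),
]

def run_phase(model: str, system: str, user: str) -> str:
    resp = requests.post(OLLAMA, json={
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "stream": False,
        "keep_alive": 0,   # unload this model immediately so the next one fits in RAM
    })
    return resp.json()["message"]["content"]

context = open("requirement.md").read()
for persona, model, system in PHASES:
    print(f"--- {persona} ({model}) ---")
    context = run_phase(model, system, context)   # each phase feeds the next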

Architecture Deep Dive

The assistant is built around four interlocking systems: a multi-agent core, a RAG retrieval layer, an AST symbol index, and a layered configuration engine.

Multi-Agent Core

🏛️ Architect
Plans the approach, writes acceptance criteria, and classifies incoming intent (conversational vs implementation vs complex). Stays on a small, fast 7B model even when the implementer is upgraded.

⚙️ Implementer
Writes and edits code using tool calls: write_file, edit_file, read_file, run_shell. The heaviest persona, and the one that benefits most from a larger model (14B or 32B).

🔍 Reviewer
Reads the generated code and produces structured findings. The implementer then gets one more pass to fix the issues before tests run.

🧪 Tester
Runs acceptance criteria against the implementation, up to three rounds. Each failure feeds back into the implementer for a targeted fix. Inspired directly by the pass@k insight from the Codex paper.

RAG — Retrieval-Augmented Generation

Before answering any substantive query, the assistant embeds the question with nomic-embed-text and retrieves the top-K semantically relevant chunks from a local ChromaDB vector store. This means the model always has real project context — actual function signatures, file contents, module structure — injected into its prompt, rather than relying on what it learned during training.

/index src/          # embed your codebase into the RAG index
/index src-tauri/src # works with any language
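Under the hood this is an embed-and-query loop. A minimal sketch, assuming Ollama's embedding endpoint and a persistent ChromaDB collection; the collection name, chunking, and store path are my choices, not necessarily the project's.

import requests
import chromadb  # persistent local vector store

EMBED = "http://localhost:11434/api/embeddings"

def embed(text: str) -> list[float]:
    # nomic-embed-text served locally by Ollama
    resp = requests.post(EMBED, json={"model": "nomic-embed-text", "prompt": text})
    return resp.json()["embedding"]

client = chromadb.PersistentClient(path=".rag-index")
collection = client.get_or_create_collection("codebase")

def index_chunk(chunk_id: str, text: str, path: str) -> None:
    collection.add(ids=[chunk_id], embeddings=[embed(text)],
                   documents=[text], metadatas=[{"path": path}])

def retrieve(question: str, k: int = 5) -> list[str]:
    hits = collection.query(query_embeddings=[embed(question)], n_results=k)
    return hits["documents"][0]   # top-K chunks, ready to inject into the prompt

# Usage: context = "\n\n".join(retrieve("how does session resume work?"))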

AST Symbol Index

RAG retrieves semantically similar chunks of text, but sometimes you need structural answers: "what functions exist in state.rs?" or "where is TerminalState defined?" For this, the assistant builds a lightweight symbol table using tree-sitter — supporting Python, JavaScript, TypeScript, and Rust — stored in a local SQLite database (~1 MB for large codebases).

At session start, a compact outline is injected into context automatically:

# Symbol Map [Rust: 67 · TypeScript: 5 · Python: 8]

## src-tauri/src/state.rs [Rust]
pub struct TerminalState :6 · impl TerminalState → [new, update, reset] :13

## src-tauri/src/commands/mod.rs [Rust]
execute_command(...) :65 · register_commands(...) :12

The model also has a find_symbols tool it can call mid-session for targeted structural queries — complementing the semantic RAG search.
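A compressed sketch of how such an index can be built: parse each file with tree-sitter, keep the named top-level definitions, and write them to SQLite. The table schema and the tree_sitter_languages helper are my assumptions; the real extraction also handles impl blocks, methods, and classes as shown in the outline above.

import sqlite3
from pathlib import Path
from tree_sitter_languages import get_parser  # prebuilt grammars for many languages

# Node types that count as definitions (Python and Rust shown; extend per language).
DEFS = {"function_definition", "class_definition", "function_item", "struct_item"}

db = sqlite3.connect("symbols.db")
db.execute("CREATE TABLE IF NOT EXISTS symbols (name TEXT, kind TEXT, path TEXT, line INT)")

def index_file(path: Path, lang: str) -> None:
    tree = get_parser(lang).parse(path.read_bytes())
    for node in tree.root_node.children:              # top-level nodes only
        if node.type not in DEFS:
            continue
        name = node.child_by_field_name("name")
        if name is not None:
            db.execute("INSERT INTO symbols VALUES (?, ?, ?, ?)",
                       (name.text.decode(), node.type, str(path), node.start_point[0] + 1))
    db.commit()

# find_symbols-style lookup: "where is TerminalState defined?"
def find_symbols(query: str) -> list[tuple]:
    return db.execute("SELECT name, path, line FROM symbols WHERE name LIKE ?",
                      (f"%{query}%",)).fetchall()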

Web Tools

Two tools give the model access to live information when local context isn't enough:

  • fetch_url — fetches and parses any URL using Python's stdlib (urllib + html.parser). No API key, always available. Useful for reading documentation, GitHub issues, or Stack Overflow answers.
  • web_search — performs a web search using either Serper API (fast, structured JSON) or DuckDuckGo (free, no key required). Toggled by web_search_enabled = true in config. Results are injected as context before the model responds.
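Since fetch_url is described as pure stdlib, it can be as small as the sketch below; the User-Agent string and the way visible text is collected are my own choices.

from html.parser import HTMLParser
from urllib.request import Request, urlopen

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def fetch_url(url: str, max_chars: int = 8000) -> str:
    req = Request(url, headers={"User-Agent": "code-assistant/0.1"})
    html = urlopen(req, timeout=20).read().decode("utf-8", errors="replace")
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)[:max_chars]   # keep the injected context small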

Config-Driven Design

Everything is driven by a layered configuration system — the same assistant can run on a 16 GB laptop with a 7B model today and a 128 GB workstation with a 70B model tomorrow without changing a line of code.

1. CLI flags / runtime: highest priority, overrides everything
2. CA_* env vars: e.g. CA_IMPLEMENTER_MODEL=qwen2.5-coder:32b
3. ca.config: project-level TOML, auto-generated on first launch
4. ~/.code-assistant/config.toml: machine-level defaults for all projects
5. Built-in defaults: sized conservatively for a 32 GB CPU machine

Sensitive settings — feedback storage, session directories, API keys — are enforced at machine scope and silently blocked from appearing in per-project config files. You cannot accidentally commit credentials.
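To make the precedence concrete, here is a minimal sketch of how one setting can be resolved across those layers. The key name, default value, and file locations mirror the list above but are illustrative, not the project's actual code.

import os
import tomllib            # Python 3.11+; use the tomli package on older versions
from pathlib import Path

DEFAULTS = {"implementer_model": "qwen2.5-coder:14b"}   # layer 5: built-in defaults

def load_toml(path: Path) -> dict:
    return tomllib.loads(path.read_text()) if path.exists() else {}

def resolve(key: str, cli_value: str | None = None) -> str:
    if cli_value is not None:                                   # layer 1: CLI flag
        return cli_value
    env = os.environ.get(f"CA_{key.upper()}")                   # layer 2: CA_* env var
    if env:
        return env
    project = load_toml(Path("ca.config"))                      # layer 3: project TOML
    machine = load_toml(Path.home() / ".code-assistant" / "config.toml")  # layer 4
    return project.get(key) or machine.get(key) or DEFAULTS[key]          # layer 5

print(resolve("implementer_model"))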

Testing the Efficacy — Benchmarks

Inspired by the pass@k methodology from the Codex paper, I built a benchmark harness that runs both the local code-assistant and the Claude API against identical requirement documents, then compares the results side-by-side.
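For reference, the pass@k metric in that paper is computed with an unbiased estimator rather than by literally picking k samples; a direct translation:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the Codex paper: 1 - C(n-c, k) / C(n, k),
    the probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 100 samples with 30 passing: pass@1 = 0.30, pass@10 ≈ 0.98
print(pass_at_k(100, 30, 1), pass_at_k(100, 30, 10))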

Three requirements were tested, ranging in complexity:

  • req_01 — Python calculator with REPL and expression parsing
  • req_02 — REST API for a todo web application (FastAPI + SQLite)
  • req_03 — Log analyser CLI with multi-format parsing, aggregation, and alerting

Benchmark Results

Task          Runner           Model               Time (s)   API Calls   Lines Written   Tests Passed   Syntax Errors
Calculator    code-assistant   7b + 14b            1,790      21          71              0              1
Calculator    Claude API       claude-sonnet-4-6   1,137      24          2,235           218            0
Todo API      code-assistant   7b + 14b            2,163      23          117             0              0
Todo API      Claude API       claude-sonnet-4-6   2,310      41          2,803           4              0
Log Analyser  code-assistant   7b + 14b            1,362      17          166             0              0
Log Analyser  Claude API       claude-sonnet-4-6   5,771      55          5,851           329            0

Reading the Numbers
The local models (7b + 14b) consistently used fewer API calls and finished faster on simpler tasks — but produced significantly less code and no passing tests. The Claude API produced comprehensive implementations with full test suites, but at the cost of 6–14× more tokens and much longer runtimes on complex tasks. Crucially: the local assistant has zero per-token cost and runs entirely offline.

The test-passing gap narrows considerably when the 32B model is used — larger models follow tool-use instructions far more reliably and write tests that actually compile and run. The architecture is already in place; it just needs a smarter model behind it.

Testing on a Real Project — Intelligent Terminal

Portability is a long-standing pain point in software development. The Unix terminal is rich and powerful; Windows Command Prompt falls short; PowerShell changed the entire command structure. Git Bash and MinGW work but are heavy installs for what is essentially a compatibility shim.

So I started building something I call Intelligent Terminal — a cross-platform terminal where every command is implemented from scratch in Rust, giving identical behaviour on macOS, Linux, and Windows. Eventually it will connect to a local LLM so you can say "list all hidden directories by size and filter those matching a pattern" and it just works.

This project became the real test bed for code-assistant.

First Test — Validating Existing Commands

I asked the assistant to write a Python script that would execute every implemented command with its --help flag, capture the output, and compare it against the requirement document.

It produced the script correctly. When I ran it from the root directory, it failed with a "file not found" error — the instruction pointed at the wrong working directory. Once I navigated to the correct path and re-ran, the output matched the requirements exactly. The logic was correct; the working directory assumption was not. A lesson noted.
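For context, such a validation script is only a few lines; the sketch below uses placeholder paths and command names rather than the ones the assistant actually generated.

import subprocess

BINARY = "./target/debug/intelligent-terminal"   # placeholder binary path
COMMANDS = ["ls", "cat", "grep"]                 # placeholder command list

for cmd in COMMANDS:
    result = subprocess.run([BINARY, cmd, "--help"],
                            capture_output=True, text=True, timeout=30)
    print(f"=== {cmd} ===")
    print(result.stdout or result.stderr)
    # ...each help block is then diffed against the requirement document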

Second Test — Implementing nslookup

The real challenge was asking the assistant to implement the nslookup command in Rust — a moderately complex task with multiple flags, option parsing, and DNS query logic.

Run 1: The 14B model printed the Rust code as a markdown block without calling a single file-writing tool. Nothing was written to disk. This was a known limitation of smaller models — they "explain" instead of "act."

Run 2: I fixed the flag-parsing bug (the --req-file flag had been passed with a single dash, so the CLI parser read it as -r eq-file, a session-resume attempt rather than a file load) and upgraded the implementer from 14B to 32B. The 32B model correctly used tool calls, wrote the file, and ran cargo build.

Run 3: The build failed. The model had generated clap argument parser code with duplicate short flags: -d was used for both debug and ndots; -r for both recurse and retry. Clap rejects this at runtime. The shell output was truncated by the tool, so the model never saw the actual compiler error and eventually lost track of the filename — attempting to edit a file called ns.rs that did not exist.

I switched to interactive mode, manually resolved the build errors, and also fixed two Rust-specific issues that the model had not caught:

// Model wrote:
let matches = Command::new("nslookup").get_matches_from(args);

// Correct (won't panic on bad args):
let matches = match Command::new("nslookup").try_get_matches_from(args) {
    Ok(m) => m,
    Err(e) => return Err(e.to_string()),
};

Honest Assessment

The assistant reduced my coding effort by roughly 70%. The remaining 30% was troubleshooting — reading compiler errors, fixing edge cases, and correcting the occasional hallucinated filename. For someone comfortable reading Rust, that trade-off is extremely worthwhile. The architecture and boilerplate were generated correctly; only the fine-grained logic needed human intervention.

Ironically, this tool was built using Claude. I am using Claude to create a tool that can eventually replace Claude for me — which reminds me of a tweet where someone said "it's time to replace GitHub" and GitHub replied asking them to share the GitHub link.

Lessons Learned

Model size matters for tool use. The 7B and 14B models often describe what to do in markdown rather than calling the appropriate tool. The 32B model reliably uses tools. This aligns with the pass@k finding: bigger models are not just smarter, they are more disciplined at following structured instructions.

Truncated tool output breaks the feedback loop. If a compiler error is cut off, the model cannot fix the bug it cannot see. The tool must surface the end of the output (where errors appear), not the beginning.
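One way to keep that loop intact is to truncate from the front and preserve the tail; a minimal sketch:

def truncate_output(text: str, limit: int = 4000) -> str:
    """Keep the end of long tool output, where compiler errors usually appear."""
    if len(text) <= limit:
        return text
    return "[... earlier output truncated ...]\n" + text[-limit:]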

Specialised agents outperform a generalist. Separating Architect, Implementer, Reviewer, and Tester into distinct personas with different system prompts produces noticeably better output than asking a single agent to do everything. Each persona has a focused objective and fewer distractions.

Infrastructure beats intelligence, sometimes. RAG, AST indexing, layered config, and per-project memory dramatically improve the quality of local model output — not by making the model smarter, but by giving it better context. A 14B model with good context often outperforms a 32B model working blind.


Try It

Both projects are open source. The code-assistant is built to be forked and configured for your own hardware and preferred models.