Benchmarking AI Code Generation – 5x More Tests Fixed with Codebase Context

We tested LLMs with and without structured repo knowledge. Here's how adding context changed the results.

Related: Aspect Code — the problem we're solving and why context matters.

If you're going to claim "AI writes better code with structured codebase context," you should probably prove it.

This post describes:

  • how the benchmark for Aspect Code was set up
  • what models and repos were used
  • what “KB-augmented” actually means in practice
  • and the main metrics we looked at

What we were trying to measure

The core question:

If you give an AI coding model a structured understanding of a repo, in the same format Aspect Code uses in production, does it:

  • fix more tests,
  • avoid breaking existing behavior, and
  • touch less code while doing it?

To test that, we ran each task in two modes:

Baseline mode

The prompt contained:

  • a small project overview
  • the task description
  • a set of source files that a simple heuristic considered “relevant”
  • the test command / expectations (described in natural language)

KB-augmented mode (Aspect Code)

The prompt contained everything from baseline, plus:

  • the same markdown KB files Aspect Code generates in a real repo:
    • architecture.md, deps.md, flows.md, hotspots.md, symbols.md, findings_top.md
  • the same “assistant instructions” content that the extension writes to:
    • .github/copilot-instructions.md
    • .cursor rules files
    • CLAUDE.md (or equivalent)

In other words, KB mode is literally:

baseline prompt + the knowledge base and usage instructions that Aspect Code would normally generate and hand to your assistant.

No other differences:

  • same model
  • same task
  • same code extraction logic
  • same harness
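
To make that concrete, here is a minimal sketch of what the prompt builder does; the function and parameter names are illustrative, not the harness's real API. The only branch is whether the KB and instruction files get appended:

    from pathlib import Path

    def build_prompt(
        task: str,
        overview: str,
        relevant_files: dict[str, str],
        test_notes: str,
        kb_files: list[Path] | None = None,
        instruction_files: list[Path] | None = None,
    ) -> str:
        """Baseline prompt, optionally extended with the Aspect Code KB and
        assistant-instruction content. Everything else stays identical."""
        parts = [overview, f"## Task\n{task}", f"## Tests\n{test_notes}"]
        for path, source in relevant_files.items():
            parts.append(f"## File: {path}\n{source}")
        # KB-augmented mode: append the same markdown the extension generates.
        for extra in (kb_files or []) + (instruction_files or []):
            parts.append(f"## Context: {extra.name}\n{extra.read_text()}")
        return "\n\n".join(parts)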

Repos and tasks

The benchmarks used real open-source templates, not synthetic toy projects:

  1. FastAPI backend (Python)
    A typical FastAPI project layout with API routes, services, models, auth, config, and tests. Based on an OSS FastAPI template.

  2. Next.js + Prisma app (TypeScript)
    A Next.js + Prisma boilerplate with API routes, a Prisma schema, auth/session logic, and tests. Also based on an OSS starter.

For each repo, there were 15 tasks. They’re the kind of things you’d see in a real backlog.

Examples from the FastAPI project:

  • Refactor items service into read/write layers
  • Add response caching for expensive queries
  • Soft delete items instead of hard delete
  • Add retry mechanism for external service calls
  • Add CSV export endpoint

Examples from the Next.js + Prisma project:

  • Add URL-friendly slugs to posts
  • Implement comments on posts
  • Add rate limiting to login endpoint
  • Add CSV export endpoint for posts
  • Add draft/published status for posts

For each task, we added or updated tests so that:

  • Some tests were already passing for related existing behavior.
  • New or modified tests failed before any changes.
  • If the task was implemented correctly, all relevant tests passed afterwards.

The tests are the source of truth: if they pass, the task is considered solved.
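
As an example, the soft-delete task in the FastAPI repo might be gated by a test like this. This is a hypothetical sketch: the client, db_session, and item fixtures, the route paths, and the deleted_at field are illustrative, not the template's actual names.

    # test_items_soft_delete.py: hypothetical task-gating test
    def test_deleted_item_is_hidden_but_retained(client, db_session, item):
        # Fails before the change: DELETE currently removes the row entirely.
        response = client.delete(f"/api/items/{item.id}")
        assert response.status_code == 200

        # The deleted item should no longer be served by the API...
        assert client.get(f"/api/items/{item.id}").status_code == 404

        # ...but the row should still exist, only flagged as deleted.
        db_session.refresh(item)
        assert item.deleted_at is not None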


Models and why we chose them

We benchmarked two very different models:

  • Anthropic: Claude Opus 4.5 (SOTA, more expensive, higher capability)
  • OpenAI: gpt-4.1 (strong, but generally more cost-friendly and common in production)

The goal wasn’t just “provider A vs provider B”. It was:

  • SOTA model (Opus) – what happens when Aspect Code sits under a top-end model.
  • High-end but more “practical” model (4.1) – what happens for teams who don’t always run the most expensive tier.

For each combination of:

  • repo (FastAPI, Next.js+Prisma)
  • provider (Anthropic, OpenAI)
  • mode (baseline, KB-augmented)

…we ran all 15 tasks. So every task was attempted under four conditions:

  1. Anthropic / baseline
  2. Anthropic / Aspect Code KB
  3. OpenAI / baseline
  4. OpenAI / Aspect Code KB

All runs used:

  • temperature = 0.0
  • identical base system prompt
  • identical task prompts, except for the KB content in the KB mode
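
Laid out as code, the full run matrix is small: two repos, two providers, two modes, 15 tasks. A sketch (the identifiers are illustrative):

    from itertools import product

    REPOS = ["fastapi_backend", "nextjs_prisma"]
    PROVIDERS = ["anthropic", "openai"]          # Claude Opus 4.5 / gpt-4.1
    MODES = ["baseline", "aspect_code_kb"]
    TASKS_PER_REPO = 15

    runs = [
        {"repo": repo, "provider": provider, "mode": mode, "task": task_id,
         "temperature": 0.0}                     # fixed for every run
        for repo, provider, mode in product(REPOS, PROVIDERS, MODES)
        for task_id in range(1, TASKS_PER_REPO + 1)
    ]
    assert len(runs) == 2 * 2 * 2 * 15           # 120 individual runs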

The harness

The benchmark harness lives under .aspect_code_bench/ and follows a simple A/B pattern: compare baseline vs aspect_code_kb for each model and repo.

For each (task, model, mode):

  1. Reset repo

    • Hard reset to a known clean commit.
    • Clear any artifacts from previous runs.
  2. Run pre-tests

    • Run the task’s test command.
    • Record how many tests pass and fail before changes.
  3. Build prompt

    Baseline:

    • short project context
    • task description
    • “relevant” source files
    • summary of the test setup

    KB mode:

    • everything above
    • plus the KB markdown files
    • plus the assistant-instruction content that the extension normally generates
  4. Call the model

    • Send the prompt to Anthropic (Claude Opus 4.5) or OpenAI (gpt-4.1).

    • Expect code blocks annotated with file paths, e.g.:

      // filepath: pages/api/posts/export.ts
      export default async function handler(req, res) {
        // ...
      }
  5. Apply the response

    • Parse the code blocks.
    • Overwrite the corresponding files on disk.
    • If multiple blocks target the same file, apply them in order.
  6. Run post-tests

    • Run tests again.
    • Compare with pre-tests:
      • tests_fixed = tests that went from failing → passing
      • tests_broken = tests that went from passing → failing
      • net change = post_passed - pre_passed (i.e. tests_fixed - tests_broken)
  7. Record metrics

    For each run we log:

    • tests_fixed / tests_broken
    • whether all target tests now pass
    • total tokens used (input + output)
    • response code lines
    • lines added to the repo
    • files touched
    • whether the run was “catastrophic” (large drop in previously passing tests)

All of this is automated. The only manual parts are writing the tasks/tests and letting the extension generate the KB + instruction files.
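
Condensed into code, the per-run loop looks roughly like this. It's a sketch, not the real harness: the filepath-comment parsing, the test-output parsing, and the function names are simplified stand-ins.

    import re
    import subprocess
    from pathlib import Path

    FILEPATH_RE = re.compile(r"^\s*(?://|#)\s*filepath:\s*(\S+)", re.MULTILINE)
    BLOCK_RE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

    def apply_response(repo: Path, response: str) -> list[str]:
        """Write each annotated code block over the file it names, in order."""
        touched = []
        for block in BLOCK_RE.findall(response):
            match = FILEPATH_RE.search(block)
            if not match:
                continue  # ignore blocks that don't name a target file
            target = repo / match.group(1)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(block)
            touched.append(match.group(1))
        return touched

    def run_tests(repo: Path, test_cmd: list[str]) -> set[str]:
        """Return the ids of passing tests (simplified: parse verbose output)."""
        result = subprocess.run(test_cmd, cwd=repo, capture_output=True, text=True)
        return set(re.findall(r"^(\S+) PASSED", result.stdout, re.MULTILINE))

    def run_task(repo: Path, base_commit: str, test_cmd: list[str],
                 prompt: str, call_model) -> dict:
        # 1. Reset to a known clean commit and clear leftover artifacts.
        subprocess.run(["git", "reset", "--hard", base_commit], cwd=repo, check=True)
        subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)
        pre = run_tests(repo, test_cmd)              # 2. pre-tests
        response = call_model(prompt)                # 3-4. build prompt + call model
        touched = apply_response(repo, response)     # 5. apply the response
        post = run_tests(repo, test_cmd)             # 6. post-tests
        return {                                     # 7. record metrics
            "tests_fixed": len(post - pre),
            "tests_broken": len(pre - post),
            "files_touched": len(set(touched)),
        }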


The knowledge base the model sees

In KB mode, the extra content in the prompt is exactly what Aspect Code produces in a real repo:

  1. KB files under .aspect/ (or a similar hidden dir)

    • architecture.md – directories, layers, and how requests flow
    • deps.md – high-level dependencies, hubs, and cycles
    • hotspots.md – high-impact or frequently problematic files
    • flows.md – typical request/response paths through the app
    • symbols.md – where key functions, handlers, and types live
    • findings_top.md – cross-file issues that are worth paying attention to
    • findings.md – details about issues and architectural rules

    This gives the model a prioritized view of what's wrong and where, so it can avoid breaking things or reintroducing the same problematic patterns.

  2. Assistant instruction files

    • Copilot instructions (.github/copilot-instructions.md)
    • Cursor rules/KB references
    • Claude instructions (CLAUDE.md-style file)

For the benchmark, we take that same content and paste it directly into the KB-mode prompt. In production, your assistant typically pulls this via tool calls, editor integration, or file reads; here we shortcut straight to “include it in the prompt” because:

  • it makes the experiment easier to automate and compare
  • it keeps the only variable as “has KB + instructions vs doesn’t”

The KB does not include task-specific solutions. It only describes the structure, flows, and risk areas of the repo in a way an LLM can read once instead of rediscovering from scratch.


Metrics we tracked

Per task, per mode, we recorded:

  • pre/post test counts
  • tests_fixed and tests_broken
  • overall status (solved, failed, error, catastrophic)
  • input and output tokens
  • response code lines
  • lines added to the repo
  • number of files touched

Then we aggregated:

  • total net tests fixed
  • total tokens across runs
  • total response code lines
  • total files touched
  • how often KB runs were better / same / worse
  • catastrophic failure counts

“Net tests fixed” is:

  • tests that went from failing → passing (positive)
  • minus tests that went from passing → failing (negative)
  • summed across all tasks and runs

It’s not perfect, but it’s a reasonable proxy for “did we fix more than we broke?”
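
A tiny worked example with made-up numbers (not from the benchmark):

    # Illustrative only: three tests passing before, four after.
    pre_passing  = {"t1", "t2", "t3"}
    post_passing = {"t1", "t3", "t4", "t5"}          # t2 regressed; t4 and t5 got fixed

    tests_fixed  = len(post_passing - pre_passing)   # 2
    tests_broken = len(pre_passing - post_passing)   # 1
    net_tests_fixed = tests_fixed - tests_broken     # +1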


Results: what changed with Aspect Code

Here are the combined results across both repos and both models.

1. Net tests fixed

Across all runs:

  • Baseline (no KB): 26 net tests fixed
  • With Aspect Code KB: 133 net tests fixed

So with Aspect Code enabled, we saw roughly 5× the net test improvement of the baseline.


2. Token usage, code size, and files touched

Across all runs:

  • Token usage

    • Baseline: 159,026 tokens
    • With Aspect Code: 137,799 tokens
      → ~13.35% fewer tokens
  • Response code lines

    • Baseline: 21,046 code lines
    • With Aspect Code: 17,652 code lines
      → ~16.13% fewer code lines
  • Files touched

    • Baseline: 150 files
    • With Aspect Code: 135 files
      → 10% fewer files touched

So with the KB:

  • more tests were fixed,
  • while writing less code,
  • and modifying fewer files.

3. Task-level comparisons

Comparing baseline vs KB per task (same repo + model):

  • KB better: 19 tasks
  • Same outcome: 36 tasks
  • Baseline better: 5 tasks

“Better” means “more tests passing after the run.”

So most of the time, Aspect Code either helped or didn’t change the outcome. There were a few cases where baseline did better, but they were the minority.


4. Catastrophic failures

We tracked “catastrophic” runs, where the model broke a substantial share of previously passing tests.
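
The exact threshold isn't the point of the comparison; as a rough sketch, the flag looks something like this (the 50% cutoff below is an assumption for illustration, not necessarily what the harness uses):

    def is_catastrophic(pre_passed: int, post_passed: int,
                        min_drop_ratio: float = 0.5) -> bool:
        """Flag runs that wipe out a large share of previously passing tests.

        The 0.5 ratio is illustrative; the real harness may use a different rule.
        """
        if pre_passed == 0:
            return False
        dropped = max(pre_passed - post_passed, 0)
        return dropped / pre_passed >= min_drop_ratio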

Across all runs:

  • Baseline catastrophic events: 7
  • Aspect Code KB catastrophic events: 2

Asymmetry:

  • Cases where baseline was catastrophic and Aspect Code wasn’t: 5
  • Cases where Aspect Code was catastrophic and baseline wasn’t: 0

So adding structured repo knowledge made it less likely for the model to turn a mostly-working codebase into a failing one.


How fair is this?

Both modes:

  • use the same repo commit
  • use the same tasks and tests
  • use the same model and parameters
  • go through the same harness and patch application logic

The only difference is that KB mode:

  • receives the KB markdown files generated by the extension, and
  • receives the assistant-instruction content that Aspect Code would normally write to your repo for Copilot/Cursor/Claude.

That’s basically the same help you’d give a new developer on a team:

  • architecture docs
  • dependency map
  • notes on risky areas
  • “read this before you start editing” guidelines

The KB doesn’t reveal the answer. It just removes a lot of blind trial-and-error and lets the model:

  • pick the right files
  • respect layering
  • be more conservative around high-impact modules

Limitations

Obvious caveats:

  • Only two OSS templates (FastAPI, Next.js+Prisma).
  • 15 tasks per repo.
  • Tasks are curated, not pulled from a random production backlog.
  • Net tests fixed is a coarse metric (tasks with more tests contribute more).

So we treat these as early, directional results, not a final generalization across all projects.

The interesting part is the pattern:

  • more net tests fixed
  • fewer catastrophic failures
  • less code, fewer files touched

…all from adding the same structured repo knowledge that Aspect Code uses in production.


Where this goes from here

Next steps are straightforward:

  • add more repos with different shapes (monoliths, services, etc.)
  • iterate on which KB content is most useful and what’s noise
  • improve the analysis engine and findings feeding into the KB
  • test more agent-style flows (multi-step planning, tool calls, etc.)

If you’re already using AI to write code and want to see how it behaves on your own repos with this kind of context, that’s exactly what the Aspect Code alpha is for.