Benchmarking AI Code Generation – From –24 to +25 Net Tests with Codebase Context

I tested LLMs with and without structured repo knowledge on FastAPI and Django. Here's how adding context changed the results.


Related: Aspect Code — the problem I'm solving and why context matters.

Open Source Benchmark: github.com/asashepard/Aspect-Bench

If you're going to claim "AI writes better code with structured context," you should probably prove it.

The question: Does giving an AI model structured repo knowledge (architecture, dependencies, flows) help it fix more tests, break less code, and work more efficiently?

The test: Run 15 tasks on FastAPI (greenfield) and Django (brownfield) repos, comparing baseline prompts vs prompts with Aspect Code's knowledge base.

The specific repos used for this benchmark: github.com/fastapi/full-stack-fastapi-template, github.com/djangopackages/djangopackages.


Methodology

Each task ran in two modes:

  • Baseline: Task description + relevant source files + test expectations
  • Aspect KB: Baseline + the 3-file knowledge base Aspect Code generates:
    • architecture.md — high-risk hubs, entry points, directory layout, circular deps
    • map.md — symbol index with signatures, data models, conventions
    • context.md — module clusters, critical flows, external integrations
    • Plus assistant instructions (.github/copilot-instructions.md, Cursor rules, CLAUDE.md)

Same model, same task, same code — the only variable is whether the KB is included.
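
To make the two modes concrete, here is a minimal sketch of how the prompts could be assembled. The build_prompt helper, the section headers, and the omission of the assistant-instruction files are illustrative assumptions, not Aspect Code's actual harness:

```python
from pathlib import Path

# The three KB files described above; assistant-instruction files are omitted for brevity.
KB_FILES = ["architecture.md", "map.md", "context.md"]

def build_prompt(task: str, source_files: list[Path], test_expectations: str,
                 kb_dir: Path | None = None) -> str:
    """Assemble one prompt; kb_dir=None is baseline mode, a path enables Aspect KB mode."""
    parts = [f"## Task\n{task}"]
    for src in source_files:
        parts.append(f"## Source file: {src}\n{src.read_text()}")
    parts.append(f"## Test expectations\n{test_expectations}")
    if kb_dir is not None:  # the only difference between the two modes
        for name in KB_FILES:
            parts.append(f"## Knowledge base: {name}\n{(kb_dir / name).read_text()}")
    return "\n\n".join(parts)
```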

Test repos:

  1. FastAPI (greenfield) — clean architecture, modern patterns
  2. Django (brownfield) — legacy patterns, tighter coupling, realistic complexity

15 tasks per repo: Typical backlog items like refactoring services, adding caching, soft deletes, CSV exports, rate limiting, etc.

Models:

  • Claude Sonnet 4 — fast, cost-effective baseline
  • Claude Opus 4.5 — SOTA, higher capability

Each task ran under four conditions (2 models × 2 modes), giving 2 repos × 2 models × 2 modes = 8 configurations across the benchmark. Temperature = 0.0, with identical prompts except for the KB content.
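
For concreteness, the per-repo 2 models × 2 modes grid might look like the sketch below, reusing the hypothetical build_prompt helper from the earlier sketch. The model IDs and the token accounting are assumptions, not the benchmark's actual code:

```python
import itertools
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODELS = {
    "Sonnet 4": "claude-sonnet-4-20250514",  # model IDs are assumptions
    "Opus 4.5": "claude-opus-4-5",
}
MODES = ("baseline", "aspect_kb")

def run_task(task, source_files, test_expectations, kb_dir):
    """Run one task under the 2 models x 2 modes grid for its repo."""
    results = {}
    for (label, model_id), mode in itertools.product(MODELS.items(), MODES):
        prompt = build_prompt(task, source_files, test_expectations,
                              kb_dir=kb_dir if mode == "aspect_kb" else None)
        resp = client.messages.create(
            model=model_id,
            max_tokens=8192,
            temperature=0.0,  # deterministic-leaning decoding, as in the benchmark setup
            messages=[{"role": "user", "content": prompt}],
        )
        results[(label, mode)] = {
            "output": resp.content[0].text,
            "tokens": resp.usage.input_tokens + resp.usage.output_tokens,
        }
    return results
```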

Metrics tracked:

  • Net tests: (tests fixed) − (tests broken)
  • Catastrophic failures: Runs where the AI introduced errors that prevented tests from running at all (syntax errors, import failures, etc.)
  • Efficiency: Total tokens, LOC, files touched
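
As a sketch of how the net-tests metric could be computed from test results captured before and after the model's change (the dict-of-statuses format and helper name are assumptions, not the benchmark's actual harness):

```python
def net_tests(before: dict[str, str], after: dict[str, str]) -> int:
    """Net tests = (tests fixed) - (tests broken).

    `before` and `after` map test ids to "passed"/"failed" from runs of the
    suite before and after applying the model's change.
    """
    fixed = sum(1 for test, status in before.items()
                if status == "failed" and after.get(test) == "passed")
    broken = sum(1 for test, status in before.items()
                 if status == "passed" and after.get(test) == "failed")
    return fixed - broken

# Example: one previously failing test now passes, one previously passing test breaks.
before = {"test_a": "failed", "test_b": "passed", "test_c": "passed"}
after = {"test_a": "passed", "test_b": "failed", "test_c": "passed"}
assert net_tests(before, after) == 0  # 1 fixed - 1 broken
```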

Results

The Big Picture: Net Tests by Configuration

| Configuration        | FastAPI (Greenfield) | Django (Brownfield) |
|----------------------|----------------------|---------------------|
| Baseline / Sonnet 4  | −31                  | −28                 |
| Baseline / Opus 4.5  | −2                   | −22                 |
| Aspect KB / Sonnet 4 | +17                  | −20                 |
| Aspect KB / Opus 4.5 | +28                  | −3                  |

The pattern is clear:

  • Baseline models consistently break more tests than they fix (negative net tests)
  • Aspect Code KB flips the script on FastAPI — both models go from negative to positive
  • Django is harder — even the best config only gets to −3, but that's still a huge improvement over baselines of −22 and −28

FastAPI (Greenfield): Full Results

| Configuration        | Net Tests | Tasks Improved | Tasks Regressed | Catastrophic | Avg Tokens/Run | Avg LOC/Run |
|----------------------|-----------|----------------|-----------------|--------------|----------------|-------------|
| Baseline / Sonnet 4  | −31       | 5              | 7               | 5            | 3,428          | 301         |
| Baseline / Opus 4.5  | −2        | 6              | 4               | 2            | 4,952          | 465         |
| Aspect KB / Sonnet 4 | +17       | 7              | 2               | 2            | 3,304          | 285         |
| Aspect KB / Opus 4.5 | +28       | 8              | 3               | 1            | 2,901          | 261         |

Key observations:

  • Sonnet 4 baseline is brutal: −31 net tests, 5 catastrophic failures
  • Aspect KB transforms Sonnet 4: From −31 → +17, catastrophic failures drop from 5 → 2
  • Opus 4.5 + Aspect KB wins: +28 net tests, only 1 catastrophic failure, and 41% fewer tokens than baseline Opus

Django (Brownfield): Full Results

| Configuration        | Net Tests | Tasks Improved | Tasks Regressed | Catastrophic | Avg Tokens/Run | Avg LOC/Run |
|----------------------|-----------|----------------|-----------------|--------------|----------------|-------------|
| Baseline / Sonnet 4  | −28       | 3              | 6               | 3            | 2,871          | 235         |
| Baseline / Opus 4.5  | −22       | 4              | 6               | 2            | 3,510          | 302         |
| Aspect KB / Sonnet 4 | −20       | 3              | 4               | 1            | 3,001          | 232         |
| Aspect KB / Opus 4.5 | −3        | 4              | 4               | 0            | 3,425          | 286         |

Key observations:

  • Django is genuinely harder. Even the best config (Opus + KB) ends at −3 net tests
  • But Aspect KB still helps dramatically: Opus goes from −22 → −3, and catastrophic failures drop to zero
  • The KB prevents catastrophic failures even when it can't achieve net-positive results

Efficiency Gains

Overall efficiency improvements with Aspect KB:

| Metric          | Sonnet 4 | Opus 4.5 |
|-----------------|----------|----------|
| Token reduction | ~4%      | ~41%     |
| LOC reduction   | ~5%      | ~44%     |

The more capable the model, the more it benefits from structured context.
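
These percentages can be reproduced from the per-run averages in the FastAPI table above; a quick arithmetic check, with a reduction helper introduced purely for illustration:

```python
def reduction(baseline: float, with_kb: float) -> float:
    """Percentage reduction relative to the baseline run."""
    return (baseline - with_kb) / baseline * 100

# Per-run averages from the FastAPI table above.
print(round(reduction(3428, 3304)))  # Sonnet 4 tokens -> 4
print(round(reduction(4952, 2901)))  # Opus 4.5 tokens -> 41
print(round(reduction(301, 285)))    # Sonnet 4 LOC    -> 5
print(round(reduction(465, 261)))    # Opus 4.5 LOC    -> 44
```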


Catastrophic Failures: The Safety Story

"Catastrophic" = a run where the AI introduced errors that broke the test harness entirely (syntax errors, import failures, missing dependencies). Tests couldn't even execute.

| Configuration        | FastAPI | Django | Total |
|----------------------|---------|--------|-------|
| Baseline / Sonnet 4  | 5       | 3      | 8     |
| Baseline / Opus 4.5  | 2       | 2      | 4     |
| Aspect KB / Sonnet 4 | 2       | 1      | 3     |
| Aspect KB / Opus 4.5 | 1       | 0      | 1     |

Opus 4.5 + Aspect KB had only 1 catastrophic failure across 30 tasks. Baseline Opus had 4, and baseline Sonnet had 8.
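
For reference, a minimal sketch of how a run could be flagged as catastrophic, assuming the harness applies the model's change and then asks pytest to collect the suite. The helper name and the collect-only heuristic are assumptions, not the benchmark's actual classification code:

```python
import subprocess

def is_catastrophic(repo_dir: str) -> bool:
    """Flag a run as catastrophic if the suite can no longer even be collected."""
    proc = subprocess.run(["pytest", "--collect-only", "-q"],
                          cwd=repo_dir, capture_output=True, text=True)
    # pytest exits with 0 when collection succeeds; collection errors caused by
    # syntax errors, import failures, or missing dependencies surface as a
    # non-zero exit before any test runs.
    return proc.returncode != 0
```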


Limitations

  • Limited scope: 2 repos, 15 tasks each — representative but not exhaustive
  • Single-shot prompting: Real usage involves iteration, tool calls, multi-turn conversation
  • Curated tasks: Designed to be tractable, not random production work
  • Coarse metrics: Net tests weights every test equally, so tasks with larger test suites dominate the totals

The benchmark tests the core hypothesis (does context help?), but real usage with iteration and human feedback will likely perform better.

In rare cases, including the Aspect Code KB led the LLM to produce no code on the first pass and instead ask for clarification. That's a good outcome: the KB helps the agent avoid hallucinating that it already has the right answer.


Takeaways

  1. Baseline LLMs break more than they fix — negative net tests in every baseline configuration
  2. Context flips the outcome — on FastAPI, net tests went from −31 (worst baseline) to +28 (best Aspect KB configuration)
  3. Brownfield is harder — Django improved from −28 (worst baseline) to −3 (best Aspect KB configuration) but stayed negative
  4. Better models benefit more — Opus 4.5 + KB: 41% token reduction; Sonnet 4 + KB: 4%
  5. KB acts as a guardrail — catastrophic failures dropped 75% with Opus overall

Opus improved by a greater margin with the Aspect KB than Sonnet did, suggesting that future models may benefit even more from structured knowledge as context.

Opus also seems better at interpreting the meaning of the context it's given, and that shows up in the implementation: on the greenfield repo, Opus made more surgical and more effective edits.

That last point is particularly exciting to me, because one of the main issues I've experienced is AI simply making too many changes, adding thousands of unnecessary lines of code.

Even though Opus made better use of the extra context, Sonnet improved as well, which suggests that Aspect Code's structured codebase context is broadly helpful to an LLM-augmented workflow regardless of model capability.

On other runs not included in this benchmark, I observed similar results with different programming languages and different LLM providers (more tests passing, fewer tokens and lines of code, fewer regressions and catastrophic breaks).


The Philosophy

In conclusion, while this benchmark is limited in that it doesn't simulate a full agentic workflow, it provides meaningful evidence for the core hypothesis: structured codebase knowledge makes AI agent outputs better and safer.

Future benchmarking tests (once Aspect Code is larger and can afford them!) may include SWE-bench Verified, App Bench, and other AI coding benchmarks, either existing ones or extensions of this one.

The Aspect Code KB isn't a linting report or a list of issues to fix. It's structured around three principles:

  • Defensive guardrails: architecture.md highlights "load-bearing walls", the high-risk hubs with many dependents that the model should treat carefully, informed by static analysis
  • Contextual density: map.md provides symbol signatures and call graphs so the model can make surgical edits without reading every file
  • Flow awareness: context.md shows how modules connect, where requests flow, and which files change together

The goal is to give the model just enough structure to stay out of trouble, without overwhelming it with noise.

Aspect Code is still lacking true real-world data; at the time of writing, I'm the only user! Once I've finished preparing the VS Code extension, I'll be running a small pilot cohort.