Benchmarking AI Code Generation – 5x More Tests Fixed with Codebase Context

We tested LLMs with and without structured repo knowledge. Here's how adding context changed the results.

Related: Aspect Code — the problem we're solving and why context matters.

If you're going to claim "AI writes better code with structured codebase context," you should probably prove it.

This post describes:

  • how the benchmark for Aspect Code was set up
  • what models and repos were used
  • what “KB-augmented” actually means in practice
  • and the main metrics we looked at

What we were trying to measure

The core question:

If you give an AI coding model a structured understanding of a repo, in the same format Aspect Code uses in production, does it:

  • fix more tests,
  • avoid breaking existing behavior, and
  • touch less code while doing it?

To test that, we ran each task in two modes:

Baseline mode

The prompt contained:

  • a small project overview
  • the task description
  • a set of source files that a simple heuristic considered “relevant”
  • the test command / expectations (described in natural language)

KB-augmented mode (Aspect Code)

The prompt contained everything from baseline, plus:

  • the same markdown KB files Aspect Code generates in a real repo:
    • architecture.md, deps.md, flows.md, hotspots.md, symbols.md, findings_top.md
  • the same “assistant instructions” content that the extension writes to:
    • .github/copilot-instructions.md
    • .cursor rules files
    • CLAUDE.md (or equivalent)

In other words, KB mode is literally:

baseline prompt + the knowledge base and usage instructions that Aspect Code would normally generate and hand to your assistant.

No other differences:

  • same model
  • same task
  • same code extraction logic
  • same harness
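
To make that concrete, here is a minimal sketch of what the prompt builder does; the function and parameter names are illustrative, not the harness's real API. The only branch is whether the KB and instruction files get appended:

    from pathlib import Path

    def build_prompt(
        task: str,
        overview: str,
        relevant_files: dict[str, str],
        test_notes: str,
        kb_files: list[Path] | None = None,
        instruction_files: list[Path] | None = None,
    ) -> str:
        """Baseline prompt, optionally extended with the Aspect Code KB and
        assistant-instruction content. Everything else stays identical."""
        parts = [overview, f"## Task\n{task}", f"## Tests\n{test_notes}"]
        for path, source in relevant_files.items():
            parts.append(f"## File: {path}\n{source}")
        # KB-augmented mode: append the same markdown the extension generates.
        for extra in (kb_files or []) + (instruction_files or []):
            parts.append(f"## Context: {extra.name}\n{extra.read_text()}")
        return "\n\n".join(parts)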

Repos and tasks

The benchmarks used real open-source templates, not synthetic toy projects:

  1. FastAPI backend (Python)
    A typical FastAPI project layout with API routes, services, models, auth, config, and tests. Based on an OSS FastAPI template.

  2. Next.js + Prisma app (TypeScript)
    A Next.js + Prisma boilerplate with API routes, a Prisma schema, auth/session logic, and tests. Also based on an OSS starter.

For each repo, there were 15 tasks. They’re the kind of things you’d see in a real backlog.

Examples from the FastAPI project:

  • Refactor items service into read/write layers
  • Add response caching for expensive queries
  • Soft delete items instead of hard delete
  • Add retry mechanism for external service calls
  • Add CSV export endpoint

Examples from the Next.js + Prisma project:

  • Add URL-friendly slugs to posts
  • Implement comments on posts
  • Add rate limiting to login endpoint
  • Add CSV export endpoint for posts
  • Add draft/published status for posts

For each task, we added or updated tests so that:

  • Some tests were already passing for related existing behavior.
  • New or modified tests failed before any changes.
  • If the task was implemented correctly, all relevant tests passed afterwards.

The tests are the source of truth: if they pass, the task is considered solved.
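
As an example, the soft-delete task in the FastAPI repo might be gated by a test like this. This is a hypothetical sketch: the client, db_session, and item fixtures, the route paths, and the deleted_at field are illustrative, not the template's actual names.

    # test_items_soft_delete.py: hypothetical task-gating test
    def test_deleted_item_is_hidden_but_retained(client, db_session, item):
        # Fails before the change: DELETE currently removes the row entirely.
        response = client.delete(f"/api/items/{item.id}")
        assert response.status_code == 200

        # The deleted item should no longer be served by the API...
        assert client.get(f"/api/items/{item.id}").status_code == 404

        # ...but the row should still exist, only flagged as deleted.
        db_session.refresh(item)
        assert item.deleted_at is not None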


Models and why we chose them

We benchmarked two very different models:

  • Anthropic: Claude Opus 4.5 (SOTA, more expensive, higher capability)
  • OpenAI: gpt-4.1 (strong, but generally more cost-friendly and common in production)

The goal wasn’t just “provider A vs provider B”. It was:

  • SOTA model (Opus) – what happens when Aspect Code sits under a top-end model.
  • High-end but more “practical” model (4.1) – what happens for teams who don’t always run the most expensive tier.

For each combination of:

  • repo (FastAPI, Next.js+Prisma)
  • provider (Anthropic, OpenAI)
  • mode (baseline, KB-augmented)

…we ran all 15 tasks. So every task was attempted under four conditions:

  1. Anthropic / baseline
  2. Anthropic / Aspect Code KB
  3. OpenAI / baseline
  4. OpenAI / Aspect Code KB

All runs used:

  • temperature = 0.0
  • identical base system prompt
  • identical task prompts, except for the KB content in the KB mode
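
Laid out as code, the full run matrix is small: two repos, two providers, two modes, 15 tasks. A sketch (the identifiers are illustrative):

    from itertools import product

    REPOS = ["fastapi_backend", "nextjs_prisma"]
    PROVIDERS = ["anthropic", "openai"]          # Claude Opus 4.5 / gpt-4.1
    MODES = ["baseline", "aspect_code_kb"]
    TASKS_PER_REPO = 15

    runs = [
        {"repo": repo, "provider": provider, "mode": mode, "task": task_id,
         "temperature": 0.0}                     # fixed for every run
        for repo, provider, mode in product(REPOS, PROVIDERS, MODES)
        for task_id in range(1, TASKS_PER_REPO + 1)
    ]
    assert len(runs) == 2 * 2 * 2 * 15           # 120 individual runs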

The harness

The benchmark harness lives under .aspect_code_bench/ and follows a simple A/B pattern: compare baseline vs aspect_code_kb for each model and repo.

For each (task, model, mode):

  1. Reset repo

    • Hard reset to a known clean commit.
    • Clear any artifacts from previous runs.
  2. Run pre-tests

    • Run the task’s test command.
    • Record how many tests pass and fail before changes.
  3. Build prompt

    Baseline:

    • short project context
    • task description
    • “relevant” source files
    • summary of the test setup

    KB mode:

    • everything above
    • plus the KB markdown files
    • plus the assistant-instruction content that the extension normally generates
  4. Call the model

    • Send the prompt to Anthropic (Claude Opus 4.5) or OpenAI (gpt-4.1).

    • Expect code blocks annotated with file paths, e.g.:

      // filepath: pages/api/posts/export.ts
      export default async function handler(req, res) {
        // ...
      }
  5. Apply the response

    • Parse the code blocks.
    • Overwrite the corresponding files on disk.
    • If multiple blocks target the same file, apply them in order.
  6. Run post-tests

    • Run tests again.
    • Compare with pre-tests:
      • tests_fixed = tests that went from failing → passing
      • tests_broken = tests that went from passing → failing
      • net change = post_passed - pre_passed (i.e. tests_fixed - tests_broken)
  7. Record metrics

    For each run we log:

    • tests_fixed / tests_broken
    • whether all target tests now pass
    • total tokens used (input + output)
    • response code lines
    • lines added to the repo
    • files touched
    • whether the run was “catastrophic” (large drop in previously passing tests)

All of this is automated. The only manual parts are writing the tasks/tests and letting the extension generate the KB + instruction files.
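
Condensed into code, the per-run loop looks roughly like this. It's a sketch, not the real harness: the filepath-comment parsing, the test-output parsing, and the function names are simplified stand-ins.

    import re
    import subprocess
    from pathlib import Path

    FILEPATH_RE = re.compile(r"^\s*(?://|#)\s*filepath:\s*(\S+)", re.MULTILINE)
    BLOCK_RE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

    def apply_response(repo: Path, response: str) -> list[str]:
        """Write each annotated code block over the file it names, in order."""
        touched = []
        for block in BLOCK_RE.findall(response):
            match = FILEPATH_RE.search(block)
            if not match:
                continue  # ignore blocks that don't name a target file
            target = repo / match.group(1)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(block)
            touched.append(match.group(1))
        return touched

    def run_tests(repo: Path, test_cmd: list[str]) -> set[str]:
        """Return the ids of passing tests (simplified: parse verbose output)."""
        result = subprocess.run(test_cmd, cwd=repo, capture_output=True, text=True)
        return set(re.findall(r"^(\S+) PASSED", result.stdout, re.MULTILINE))

    def run_task(repo: Path, base_commit: str, test_cmd: list[str],
                 prompt: str, call_model) -> dict:
        # 1. Reset to a known clean commit and clear leftover artifacts.
        subprocess.run(["git", "reset", "--hard", base_commit], cwd=repo, check=True)
        subprocess.run(["git", "clean", "-fd"], cwd=repo, check=True)
        pre = run_tests(repo, test_cmd)              # 2. pre-tests
        response = call_model(prompt)                # 3-4. build prompt + call model
        touched = apply_response(repo, response)     # 5. apply the response
        post = run_tests(repo, test_cmd)             # 6. post-tests
        return {                                     # 7. record metrics
            "tests_fixed": len(post - pre),
            "tests_broken": len(pre - post),
            "files_touched": len(set(touched)),
        }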


The knowledge base the model sees

In KB mode, the extra content in the prompt is exactly what Aspect Code produces in a real repo:

  1. KB files under .aspect/ (or a similar hidden dir)

    • architecture.md – directories, layers, and how requests flow
    • deps.md – high-level dependencies, hubs, and cycles
    • hotspots.md – high-impact or frequently problematic files
    • flows.md – typical request/response paths through the app
    • symbols.md – where key functions, handlers, and types live
    • findings_top.md – cross-file issues that are worth paying attention to
    • findings.md – details about issues and architectural rules

    This gives the model a prioritized view of what's wrong and where, so it can avoid breaking things or reintroducing the same problematic patterns.

  2. Assistant instruction files

    • Copilot instructions (.github/copilot-instructions.md)
    • Cursor rules/KB references
    • Claude instructions (CLAUDE.md-style file)

For the benchmark, we take that same content and paste it directly into the KB-mode prompt. In production, your assistant typically pulls this via tool calls, editor integration, or file reads; here we shortcut straight to “include it in the prompt” because:

  • it makes the experiment easier to automate and compare
  • it keeps the only variable as “has KB + instructions vs doesn’t”

The KB does not include task-specific solutions. It only describes the structure, flows, and risk areas of the repo in a way an LLM can read once instead of rediscovering from scratch.


Metrics we tracked

Per task, per mode, we recorded:

  • pre/post test counts
  • tests_fixed and tests_broken
  • overall status (solved, failed, error, catastrophic)
  • input and output tokens
  • response code lines
  • lines added to the repo
  • number of files touched

Then we aggregated:

  • total net tests fixed
  • total tokens across runs
  • total response code lines
  • total files touched
  • how often KB runs were better / same / worse
  • catastrophic failure counts

“Net tests fixed” is:

  • tests that went from failing → passing (positive)
  • minus tests that went from passing → failing (negative)
  • summed across all tasks and runs

It’s not perfect, but it’s a reasonable proxy for “did we fix more than we broke?”
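
A tiny worked example with made-up numbers (not from the benchmark):

    # Illustrative only: three tests passing before, four after.
    pre_passing  = {"t1", "t2", "t3"}
    post_passing = {"t1", "t3", "t4", "t5"}          # t2 regressed; t4 and t5 got fixed

    tests_fixed  = len(post_passing - pre_passing)   # 2
    tests_broken = len(pre_passing - post_passing)   # 1
    net_tests_fixed = tests_fixed - tests_broken     # +1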


Results: what changed with Aspect Code

Here are the combined results across both repos and both models.

1. Net tests fixed

Across all runs:

  • Baseline (no KB): 26 net tests fixed
  • With Aspect Code KB: 133 net tests fixed

So with Aspect Code enabled, we saw roughly 5× the net test improvement of the baseline.


2. Token usage, code size, and files touched

Across all runs:

  • Token usage

    • Baseline: 159,026 tokens
    • With Aspect Code: 137,799 tokens
      → ~13.35% fewer tokens
  • Response code lines

    • Baseline: 21,046 code lines
    • With Aspect Code: 17,652 code lines
      → ~16.13% fewer code lines
  • Files touched

    • Baseline: 150 files
    • With Aspect Code: 135 files
      → 10% fewer files touched

So with the KB:

  • more tests were fixed,
  • while writing less code,
  • and modifying fewer files.

3. Task-level comparisons

Comparing baseline vs KB per task (same repo + model):

  • KB better: 19 tasks
  • Same outcome: 36 tasks
  • Baseline better: 5 tasks

“Better” means “more tests passing after the run.”

So most of the time, Aspect Code either helped or didn’t change the outcome. There were a few cases where baseline did better, but they were the minority.


4. Catastrophic failures

We tracked “catastrophic” runs, where the model broke a substantial share of previously passing tests.
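
The exact threshold isn't the point of the comparison; as a rough sketch, the flag looks something like this (the 50% cutoff below is an assumption for illustration, not necessarily what the harness uses):

    def is_catastrophic(pre_passed: int, post_passed: int,
                        min_drop_ratio: float = 0.5) -> bool:
        """Flag runs that wipe out a large share of previously passing tests.

        The 0.5 ratio is illustrative; the real harness may use a different rule.
        """
        if pre_passed == 0:
            return False
        dropped = max(pre_passed - post_passed, 0)
        return dropped / pre_passed >= min_drop_ratio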

Across all runs:

  • Baseline catastrophic events: 7
  • Aspect Code KB catastrophic events: 2

Asymmetry:

  • Cases where baseline was catastrophic and Aspect Code wasn’t: 5
  • Cases where Aspect Code was catastrophic and baseline wasn’t: 0

So adding structured repo knowledge made it less likely for the model to turn a mostly-working codebase into a failing one.


How fair is this?

Both modes:

  • use the same repo commit
  • use the same tasks and tests
  • use the same model and parameters
  • go through the same harness and patch application logic

The only difference is that KB mode:

  • receives the KB markdown files generated by the extension, and
  • receives the assistant-instruction content that Aspect Code would normally write to your repo for Copilot/Cursor/Claude.

That’s basically the same help you’d give a new developer on a team:

  • architecture docs
  • dependency map
  • notes on risky areas
  • “read this before you start editing” guidelines

The KB doesn’t reveal the answer. It just removes a lot of blind trial-and-error and lets the model:

  • pick the right files
  • respect layering
  • be more conservative around high-impact modules

Limitations

Obvious caveats:

  • Only two OSS templates (FastAPI, Next.js+Prisma).
  • 15 tasks per repo.
  • Tasks are curated, not pulled from a random production backlog.
  • Net tests fixed is a coarse metric (tasks with more tests contribute more).

So we treat these as early, directional results, not a final generalization across all projects.

The interesting part is the pattern:

  • more net tests fixed
  • fewer catastrophic failures
  • less code, fewer files touched

…all from adding the same structured repo knowledge that Aspect Code uses in production.


Where this goes from here

Next steps are straightforward:

  • add more repos with different shapes (monoliths, services, etc.)
  • iterate on which KB content is most useful and what’s noise
  • improve the analysis engine and findings feeding into the KB
  • test more agent-style flows (multi-step planning, tool calls, etc.)

If you’re already using AI to write code and want to see how it behaves on your own repos with this kind of context, that’s exactly what the Aspect Code alpha is for.