Making AI Reliable on Real Codebases: The KB + Feedback Loop

Generic AI instructions plateau. I ran benchmarks on a FastAPI template and djangopackages to show how a repo-specific knowledge base plus a feedback loop transforms AI from a liability into a time-saver.

AI coding tools promise faster development. In practice, they often break more than they fix: wide diffs, failing tests, architecture drift, accidental refactors. Teams try adding instruction files or custom prompts, but the problems keep coming back.

I wanted to understand why, so I ran 15 benchmark tasks across FastAPI and Django with multiple configurations. The results were clear. The difference between AI that breaks your codebase and AI that improves it comes down to two things: a repo-specific knowledge base (KB) and a feedback loop that keeps instructions aligned with your stack.

This post shares what I found, and how Aspect Code turns these insights into a repeatable process.

Open Source Benchmark: github.com/asashepard/Aspect-Bench

Related: What is Aspect Code? · Previous benchmark results


The Problem: Generic Instructions Don't Scale

Most teams configure AI assistants with a static instruction file: "prefer small diffs," "follow our style guide," "don't break tests." These rules help, but they plateau quickly.

Why? Because generic instructions can't reference your specific architecture. They don't know which modules are load-bearing, where the coupling lives, or what conventions your team actually follows.

The result: AI that works fine on simple tasks but causes chaos on anything complex.


The Solution: KB + Feedback Loop

Aspect Code takes a different approach:

Generate KB → Apply policy → Run tasks → Observe failures → Update instructions → Regenerate
  1. Generate a KB from your repo (architecture, hubs, coupling, conventions)
  2. Apply a safety policy (bias toward small diffs, preserve invariants, ask when unsure)
  3. Run real tasks and measure outcomes
  4. Update instructions to fix the specific failures you observe
  5. Regenerate as the repo evolves

The key insight: instructions need to stay aligned with your KB and repo. Generic rules plateau; repo-specific rules compound.
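
To make the loop concrete, here is a minimal sketch of one pass, written in Python with stand-in helpers (generate_kb, run_tasks) that are placeholders for the tooling involved, not Aspect Code's actual API:

```python
from pathlib import Path

# Stand-in helpers: placeholders for KB generation and task execution,
# not Aspect Code's actual API.
def generate_kb(repo: Path) -> dict:
    # In practice this would analyze the repo's architecture, hubs, coupling, and conventions.
    return {"hubs": [], "load_bearing": [], "conventions": []}

def run_tasks(repo: Path, kb: dict, instructions: str) -> list[dict]:
    # Each result: {"task": str, "net_tests": int, "failure": str | None}
    return []

def feedback_pass(repo: Path, instructions_file: Path) -> None:
    kb = generate_kb(repo)                                           # 1. generate the KB from the repo
    instructions = instructions_file.read_text()                     # 2. safety policy + current rules
    results = run_tasks(repo, kb, instructions)                      # 3. run real tasks, measure outcomes
    failures = [r["failure"] for r in results if r.get("failure")]   # 4. observe the specific failures
    with instructions_file.open("a") as f:                           # 5. turn each failure into a rule
        for failure in failures:
            f.write(f"- Avoid a repeat of: {failure}\n")
    # Regenerate the KB and rerun this pass as the repo evolves.
```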


The Evidence

I ran 15 tasks on two real-world repos: a FastAPI production template and djangopackages, a mature Django codebase. Each task asked the AI to make a change, and I measured net tests: tests fixed minus tests broken. A positive number means the AI left the codebase better than it found it.
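
As a quick illustration of that metric (a sketch of the scoring, not the benchmark harness itself), each task compares the set of failing tests before and after the AI's change:

```python
def net_tests(failing_before: set[str], failing_after: set[str]) -> dict:
    """Score one task: tests fixed minus tests broken."""
    fixed = failing_before - failing_after   # tests that were failing and now pass
    broken = failing_after - failing_before  # tests that passed before and now fail
    return {"fixed": len(fixed), "broken": len(broken), "net": len(fixed) - len(broken)}

# Example: two previously failing tests now pass, one previously passing test breaks.
print(net_tests({"test_a", "test_b"}, {"test_c"}))
# {'fixed': 2, 'broken': 1, 'net': 1}
```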

Three configurations:

  • Baseline: Claude with no KB or custom instructions
  • KB v3: The KB from the previous benchmark, plus instructions I refined manually by running the benchmark repeatedly, observing failure patterns, and adding rules like "always check architecture.md before touching a hub" and "if the KB says a module is load-bearing, don't refactor it"
  • KB v4: A tighter, more precise KB, but using the same instructions (not updated to the new KB structure)

I also ran KB v4 with a drastically different instruction set. It performed worse than with the instruction set used in the results below, but it still significantly beat the baseline, which is strong evidence that the KB itself helps.

A Quick Proof That the KB Matters

As a sanity check, I ran a small ablation: I swapped in a KB from an unrelated repo (effectively scrambling the ground truth), and I also ran the same tasks with the KB removed entirely.

Both variants performed meaningfully worse than using the correct KB. In this small proof-of-concept run:

  • Baseline: -31 net tests, 5 catastrophic
  • Correct KB: +24 net tests, 1 catastrophic
  • KB removed: -4 net tests, 3 catastrophic
  • KB scrambled: -12 net tests, 2 catastrophic

In other words: it's not just “more prompt text” or “more rules” doing the work — having the right repo-specific knowledge base is what changes outcomes.
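
For reference, the ablation amounts to swapping what gets attached as the KB while holding the tasks and instructions fixed. A hypothetical sketch of the four variants (the paths are illustrative, not the real layout):

```python
# Hypothetical configuration for the ablation: same tasks, same instruction file,
# only the attached KB changes.
VARIANTS = {
    "baseline":     {"kb": None,                  "instructions": None},
    "correct_kb":   {"kb": "kb/fastapi_template", "instructions": "instructions.md"},
    "kb_removed":   {"kb": None,                  "instructions": "instructions.md"},
    "kb_scrambled": {"kb": "kb/unrelated_repo",   "instructions": "instructions.md"},
}
```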

FastAPI Template: From -31 to +24 Net Tests

| Configuration | Net Tests | Improved | Regressed | Catastrophic |
|---|---|---|---|---|
| Baseline / Sonnet 4 | -31 | 5 | 7 | 5 |
| KB v3 / Sonnet 4 | +17 | 7 | 2 | 2 |
| KB v4 / Sonnet 4 | +24 | 7 | 2 | 1 |
| Baseline / Opus 4.5 | -2 | 6 | 4 | 2 |
| KB v3 / Opus 4.5 | +28 | 8 | 3 | 1 |
| KB v4 / Opus 4.5 | +19 | 6 | 2 | 1 |

The transformation:

  • Sonnet 4: -31 → +17 with KB v3 and +24 with KB v4 (a 55-point swing, with catastrophic failures dropping from 5 to 1)
  • Opus 4.5: -2 → +28 with KB v3 and +19 with KB v4

Without Aspect Code, the AI broke 31 more tests than it fixed. With a KB and better instructions, it fixed 24 more than it broke. That's hours of debugging saved per task.

djangopackages: The Hardest Test

| Configuration | Net Tests | Improved | Regressed | Catastrophic |
|---|---|---|---|---|
| Baseline / Sonnet 4 | -28 | 3 | 6 | 3 |
| KB v3 / Sonnet 4 | -20 | 3 | 4 | 1 |
| KB v4 / Sonnet 4 | -18 | 5 | 6 | 2 |
| Baseline / Opus 4.5 | -22 | 4 | 6 | 2 |
| KB v3 / Opus 4.5 | -3 | 4 | 4 | 0 |
| KB v4 / Opus 4.5 | -15 | 4 | 5 | 1 |

djangopackages is a mature, convention-heavy Django codebase with years of accumulated patterns. It's exactly the kind of brownfield repo where AI tools typically cause the most damage, and where debugging AI-generated code can eat up entire afternoons.

The transformation:

  • Sonnet 4: -28 → -20 with KB v3 and -18 with KB v4
  • Opus 4.5: -22 → -3 with KB v3 and -15 with KB v4
  • Fewer catastrophic failures with a KB

The KB v3 → KB v4 drop on Opus (from -3 to -15) is instructive: I changed the KB but did not update the instructions. This proves the coupling is real. If we were to update the instructions for KB v4 with djangopackages, we could expect similar gains.

For now, though, KB v4 is still the better starting point. Paired with instructions originally developed alongside KB v3, the LLM tends to be cautious, which is exactly what we want both for effective coding in general and for follow-up tuning. The presence of the KB also disproportionately helps the smaller Sonnet 4, even without the tighter coupling.


Why This Works

1. The KB Provides Ground Truth

Generic instructions say "don't break things." A KB says exactly which modules are load-bearing, where the coupling lives, and what patterns your team actually uses.

The AI can't follow conventions it doesn't know about. The KB makes them explicit.

Notably, the KB v3 instructions I tested didn't include anything repo-specific by design. They were shaped by failure patterns, not by the repos themselves. The improvement would likely be even larger with instructions that directly reference repo-specific structure.
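
To make "ground truth" concrete, here is the kind of structure a repo-specific KB captures, sketched as a Python dict purely for illustration; the field names and paths are assumptions, not Aspect Code's actual KB format:

```python
# Illustrative only: the kind of repo knowledge generic instructions can't carry.
kb_excerpt = {
    "hubs": ["app/api/deps.py", "app/core/config.py"],           # high fan-in modules
    "load_bearing": ["app/core/security.py"],                     # do not refactor casually
    "coupling": {"app/models/user.py": ["app/crud/user.py", "app/schemas/user.py"]},
    "conventions": [
        "CRUD logic lives in app/crud/, never in route handlers",
        "settings are read through the Settings object, not os.environ",
    ],
}
```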

2. Refinement Creates Compound Improvement

Each pass through the tuning/feedback loop fixes specific failure modes because the instructions start referencing your KB's actual structure:

  • "Check architecture.md before touching a hub"
  • "The auth module is load-bearing; don't refactor it"
  • "This repo uses factory patterns; follow them"

These rules compound. The same failures stop happening.
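
One reason they compound is that, once the rules reference KB structure, they can be checked mechanically before a diff is accepted. A minimal sketch, assuming a KB dict with hypothetical load_bearing and hubs fields like the one above:

```python
# A small KB excerpt; fields are hypothetical.
kb = {"hubs": ["app/api/deps.py"], "load_bearing": ["app/core/security.py"]}

def check_diff_against_kb(changed_files: list[str], kb: dict) -> list[str]:
    """Flag proposed changes that KB-aligned instructions tell the AI to avoid."""
    warnings = []
    for path in changed_files:
        if path in kb.get("load_bearing", []):
            warnings.append(f"{path} is load-bearing: don't refactor, ask first")
        if path in kb.get("hubs", []):
            warnings.append(f"{path} is a hub: check architecture.md before editing")
    return warnings

print(check_diff_against_kb(["app/core/security.py", "app/api/routes/items.py"], kb))
# ["app/core/security.py is load-bearing: don't refactor, ask first"]
```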

3. Caution is a Feature

With a stricter policy, the AI sometimes declines to generate code, instead asking for clarification or flagging uncertainty.

Even without tight instruction-KB coupling, this is a win. Every time the AI says "I'm not sure" instead of generating broken code, you save the time you would have spent debugging. NCP also gives you high-confidence signals about where the KB needs improvement, turning failures into actionable feedback.
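
As a sketch of how those signals could be collected (the outcome format and topic tags are hypothetical), each declined task points at the part of the KB or instructions that needs more detail:

```python
from collections import Counter

# Illustrative outcomes: "declined" means the AI asked for clarification instead of
# generating code; "topic" is a hypothetical label for what it was unsure about.
outcomes = [
    {"task": "add rate limiting",  "status": "declined",  "topic": "middleware conventions"},
    {"task": "refactor auth deps", "status": "declined",  "topic": "load-bearing modules"},
    {"task": "add CSV export",     "status": "completed", "topic": None},
]

kb_gaps = Counter(o["topic"] for o in outcomes if o["status"] == "declined")
print(kb_gaps.most_common())   # the areas where the KB or instructions need more detail
```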


What Aspect Code Delivers

Aspect Code turns this process into a managed service:

| What You Get | How It Works |
|---|---|
| Repo-specific KB | Aspect Code analyzes your codebase and generates a knowledge base that captures architecture, conventions, and constraints |
| Data collection | You keep working on your actual tasks, noting where AI still doesn't understand something or goes wrong |
| Instruction alignment | We use the data collected to run a proven feedback loop and generate an updated instructions file for your repo |
| Measurable outcomes | You see the before/after: AI knows how to write better production code, codebase-breaking changes are avoided, and time is saved |

The goal: AI that writes code that fits your repo, and knows when to stop. The short, low-lift pilot applies this exact process to your repos.


Key Takeaways

  1. Generic instructions plateau. Without a repo-specific KB, AI tools hit a ceiling quickly.

  2. The KB provides ground truth. It makes your architecture, conventions, and constraints explicit to the AI.

  3. Instructions must match the KB + repo. The best results come from keeping instructions aligned with your KB and specific repo structure.

  4. Caution is a feature. The KB not only helps in general, but forces models to ask for clarification, which is safer than guessing wrong.


What's Next

A system that aligns instructions to a repo-specific KB may have the potential to push past the current ceiling on benchmarks like SWE-bench Verified, especially with multiple agents working in parallel. The core principle remains the same: if AI is at all similar to human developers, it performs better on a codebase once it has examined and understood its structure.

The goal is AI coding that improves with every task, adapts to your codebase, and knows its limits.