Week 7: Engine quality jump
Two prompt changes pushed must-detect recall from 86% to 93% and precision from 31% to 50%. Plus: eval baseline landed, CI wiring live.

This week was about making the engine measurably better — not by feel, but by numbers.
Eval baseline is in
We ran the full eval suite (7 fixtures, each a synthetic page with a deliberate UX bug) and recorded the first baseline. Every future prompt edit, model swap, or pipeline change now gets measured against this floor.
The CI workflow is a manual-trigger GitHub Action. Run it before merging anything that touches the analysis pipeline.
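To make "measured against this floor" concrete, here is a minimal sketch of a baseline check. It is not the actual CI code: the `Metrics` class, the `regressions` function, and the `tolerance` parameter are hypothetical names, and the values are this week's before/after figures standing in for a stored baseline file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    # Hypothetical container; field names mirror the metrics tracked in this post.
    recall_must_detect: float
    recall_all: float
    precision: float


def regressions(baseline: Metrics, candidate: Metrics, tolerance: float = 0.0) -> list[str]:
    """Return the metrics where the candidate fell below the baseline by more than `tolerance`."""
    failed = []
    for name in ("recall_must_detect", "recall_all", "precision"):
        if getattr(candidate, name) < getattr(baseline, name) - tolerance:
            failed.append(name)
    return failed


# This week's before/after figures, used as stand-ins for a recorded baseline and a new run.
baseline = Metrics(recall_must_detect=0.857, recall_all=0.690, precision=0.312)
candidate = Metrics(recall_must_detect=0.929, recall_all=0.857, precision=0.503)

failed = regressions(baseline, candidate)
print("regressed metrics:", failed)
```

A CI step built on this idea would exit non-zero whenever the returned list is not empty, so a change that lowers any metric is caught before merge.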
Two prompt changes, big impact
Change 1: Stop blocking application error pages.
The engine was classifying custom error pages (422, 404 with styled UI) as "bot protection" and refusing to analyze them. An error page is part of the product's UX — it should be reviewed for clarity, recovery actions, and tone. Fixed.
Change 2: Quality over quantity.
The engine was producing 10-16 findings per page when 3-6 high-signal ones are more useful. Generic HTML-structure complaints (missing landmarks, missing headings on a single-purpose page) are now suppressed unless they cause a real usability problem.
The numbers
| Metric | Before | After | Target |
|---|---|---|---|
| Recall (must-detect) | 85.7% | 92.9% | ≥ 90% ✅ |
| Recall (all) | 69.0% | 85.7% | ≥ 70% ✅ |
| Precision | 31.2% | 50.3% | ≥ 60% (not yet met) |
Total AI findings dropped from 45 to 25 across the same 7 fixtures: fewer findings, and a larger share of the ones that remain match the ground truth.
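For readers who want the definitions behind the table, the sketch below shows how recall and precision are conventionally computed. The function names and the example counts are assumptions for illustration; the post does not state the raw counts or whether the eval pools findings or averages per fixture.

```python
def recall(expected: set[str], detected: set[str]) -> float:
    """Share of expected issues that at least one finding matched."""
    if not expected:
        return 1.0  # assumption: nothing to find counts as full recall
    return len(expected & detected) / len(expected)


def precision(findings: list[str], correct: set[str]) -> float:
    """Share of reported findings that match a ground-truth issue."""
    if not findings:
        return 0.0  # assumption: no findings counts as zero precision
    return sum(f in correct for f in findings) / len(findings)


# Hypothetical fixture: three labelled issues, two detected, four findings reported.
print(recall({"a", "b", "c"}, {"a", "b"}))            # 2 of 3 expected issues found
print(precision(["f1", "f2", "f3", "f4"], {"f1", "f2"}))  # 2 of 4 findings correct

# One combination of counts that reproduces the must-detect percentages in the
# table (an illustration only, not counts taken from the eval run):
print(round(12 / 14 * 100, 1), round(13 / 14 * 100, 1))
```

Under these definitions, cutting low-value findings raises precision directly, because the denominator shrinks while the correct findings stay.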
Next week
Precision still has room to grow. The main lever is fixture keyword expansion — some "extra" findings are actually correct but don't match the ground-truth labels yet. That's a matcher improvement, not an engine problem.
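A small sketch of why keyword expansion moves precision without touching the engine. The post does not describe the real matcher, so the case-insensitive substring matching, the `matches` function, and the keyword lists here are all assumptions.

```python
def matches(finding_text: str, keywords: list[str]) -> bool:
    """True if any ground-truth keyword appears in the finding (case-insensitive)."""
    text = finding_text.lower()
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical finding and ground-truth keywords for one fixture.
finding = "The submit button gives no feedback after it is clicked"
narrow_keywords = ["loading state", "spinner"]
expanded_keywords = narrow_keywords + ["no feedback", "unresponsive"]

print(matches(finding, narrow_keywords))    # counted as an "extra" finding
print(matches(finding, expanded_keywords))  # counted as a correct finding
```

The finding is identical in both calls. Only the label set changed, which is why this work belongs in the matcher rather than the prompts.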