
How We Built a Way to Measure AI's Impact on Engineering — Without Fooling Ourselves

Ben Churchill · 10 min read · 15 January 2026 · Value & Outcomes · Technical Infrastructure

The phone call that started it

"We've spent six figures on AI coding tools. Everyone says they're faster. But are we actually shipping more value?"

The VP of Engineering sounded tired. Their company had rolled out GitHub Copilot, then added Cursor, then experimented with AI code review bots. Developers loved them. Anecdotes were glowing. But when the board asked "what's the ROI?" — silence.

They weren't alone. Every engineering leader we spoke to in late 2024 faced the same problem: AI tools promise productivity gains, but traditional metrics can't prove it.

Story points? Teams estimate faster now, so points-per-sprint looks flat even though they're shipping more. Lines of code? AI writes verbose code, so LOC goes up while quality goes sideways. Velocity charts? Meaningless when the work itself is changing.

We needed a measurement system that could:

  • Capture real outcomes, not proxies
  • Account for quality, not just speed
  • Adapt to strategy shifts (launch mode vs. hardening mode)
  • Work with imperfect data from real-world tools
  • Resist gaming, because smart people game smart metrics

This is the story of how we built it together.

    What we learned in the first month (mostly by being wrong)

    Attempt 1: "Just measure cycle time"

    We started simple. Pull request cycle time seemed obvious — if AI helps developers code faster, PRs should close faster.

    What actually happened: Cycle time did drop, but defect rates spiked. Developers were moving fast and breaking things. One team cut their PR time by 40% while their bug backlog doubled.

    Lesson: Speed without context is just speed. We needed a composite view.

    Attempt 2: "Track AI usage vs. output"

    We tried correlating Copilot acceptance rates with story completion. High AI usage should mean more stories shipped, right?

    What actually happened: The correlation was weak and noisy. Some developers used AI heavily for boilerplate but still got stuck on architecture decisions. Others barely used AI but were extremely productive because they were working in familiar codebases.

    Lesson: AI is a tool, not a magic wand. We needed to measure outcomes, not tool adoption.

    Attempt 3: "Let's survey the developers"

    Surely the people using AI daily could tell us if it was working?

    What actually happened: Developers felt more productive (and they probably were), but feelings don't satisfy a CFO asking about budget allocation. We needed numbers that could stand up to "prove it."

    Lesson: Qualitative insights are vital, but we needed quantitative guardrails alongside them.

    The breakthrough: Four truths, one score

    We were in a workshop room — whiteboards covered in failed metric sketches — when the engineering director said something that stuck:

    "I don't need one perfect number. I need to know: Are we faster? Are we shipping real value? Are we maintaining quality? And can I trust the plan?"

    We called it the ThoughtFox Development Performance Index (TDPI) — a proprietary framework we developed to answer exactly these questions. Four dimensions that, together, paint an honest picture:

    1. Speed — Are we getting code to customers faster?

    We settled on PR cycle time: from "ready for review" to "deployed." It's imperfect — doesn't capture thinking time — but it's measurable and meaningful. AI should help here by reducing boilerplate and speeding up reviews (when used for code explanation).

    2. Value — Are we shipping outcomes that matter?

    We tracked EPICs per week (big, meaningful features) and stories per quarter. This was the antidote to "busy work." If AI made teams faster but they just churned through trivial tasks, this would catch it.

    3. Quality — Are we creating problems for later?

    Defects per unit of work became our North Star. We normalised it (issues per story, or per point) so teams of different sizes could compare. If AI was injecting subtle bugs through unreviewed suggestions, this would show it.

    4. Predictability — Do we deliver what we commit to?

    Delivery accuracy: what we delivered divided by what we planned. AI should help here too — better code suggestions mean fewer "oh, that's harder than we thought" moments mid-sprint.
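The four raw measurements above can be sketched in a few lines of code. This is a minimal illustration, not the production pipeline: the data-class fields are hypothetical names for data you would pull from your own tools, and inverting cycle time and defect rate is our assumption so that every pillar reads "higher is better" before normalisation.

```python
from dataclasses import dataclass

@dataclass
class PeriodData:
    """Raw inputs for one team's reporting period (illustrative field names)."""
    pr_cycle_days: float    # mean time from "ready for review" to "deployed"
    epics_shipped: float    # big, meaningful features completed
    defects: int            # issues raised against shipped work
    stories_shipped: int    # units of work delivered
    stories_planned: int    # units of work committed at sprint start

def raw_pillars(d: PeriodData) -> dict:
    """Compute the four raw pillar metrics, oriented so higher is better."""
    return {
        # Speed: shorter cycle time is better, so invert it
        "speed": 1.0 / d.pr_cycle_days,
        # Value: meaningful features per period
        "value": d.epics_shipped,
        # Quality: defects per story, inverted (zero defects scores 1.0)
        "quality": 1.0 / (1.0 + d.defects / d.stories_shipped),
        # Predictability: delivered divided by planned
        "predictability": d.stories_shipped / d.stories_planned,
    }
```

A team that planned ten stories and shipped eight would score 0.8 on predictability, matching the "what we delivered divided by what we planned" definition above.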

    Why normalisation saved us from madness

    Early on, we tried comparing raw numbers across teams. Chaos.

  • Team A's "fast" PR cycle (1.2 days) looked worse than Team B's (2.1 days), but Team A worked on a legacy monolith while Team B had a greenfield microservice.
  • Team C shipped 10 stories in a sprint; Team D shipped 4. But Team D's stories were massive platform changes.

    The fix: We stopped comparing teams to each other and started comparing each team to their own past.

    We normalised every metric against that team's worst historical period (their "floor"). The worst period = 1.00. Anything above 1.00 = improvement. Suddenly, we could say: "Team A is 1.8x better than their worst quarter" without arguing about whether their work is "really" harder than Team D's.
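The floor normalisation is simple enough to fit in one function. A minimal sketch, assuming each metric has already been oriented so that higher is better (for something like cycle time, you would invert it first); the EPICs-per-week figures in the comment are illustrative:

```python
def normalise(current: float, floor: float) -> float:
    """Score a metric against the team's own worst historical period.

    The floor period scores exactly 1.00; anything above 1.00 is improvement.
    """
    if floor <= 0:
        raise ValueError("floor must be positive")
    return current / floor

# Worst quarter: 0.40 EPICs/week. This quarter: 0.72 EPICs/week.
score = normalise(0.72, 0.40)  # ~1.8x better than their worst quarter
```

Because every team divides by its own floor, a "1.8" means the same thing everywhere: 1.8x better than that team's worst period, with no cross-team comparison implied.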

    What actually changed on the ground

    The metric didn't improve teams. The conversations the metric enabled did.

    Month 2: The "fast lane" experiment

    TDPI showed Team Beta's Speed was lagging — PRs sat in review for days. We introduced a "fast lane": any PR under 50 lines with passing tests got 30-minute SLA reviews.

    Result: Speed score jumped from 1.2 to 1.9 in one sprint. Developers started writing smaller, AI-assisted PRs because they knew they'd get fast feedback. A virtuous cycle.

    Month 4: The quality wake-up call

    Team Gamma's Quality score dropped (1.3 to 0.9, meaning worse). Digging in, we found developers were accepting Copilot suggestions without reading them carefully — copying subtle bugs in error handling.

    Fix: We added an "AI-generated code" checkbox to PR templates and made reviewers specifically check those sections. Quality rebounded to 1.5 within three sprints.
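As a rough sketch of what that looked like in practice (the exact wording below is ours, not the team's template; `.github/pull_request_template.md` is GitHub's conventional location for a repository-wide PR template):

```markdown
## AI-assisted changes

- [ ] This PR contains AI-generated code
- [ ] AI-generated sections are flagged for the reviewer
- [ ] Error handling in AI-generated sections has been read line by line
```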

    Month 6: The predictability breakthrough

    Team Delta's Predictability was stuck at ~0.7 (delivering only 70% of planned work). The root cause? Mid-sprint scope creep. AI made coding feel so fast that product managers kept adding "just one more thing."

    Fix: We instituted a sprint freeze: no new work after day 2 unless something is dropped. Predictability climbed to 0.92. Ironically, they shipped more over the quarter because they finished things instead of starting ten.

    The weights: Strategy in four percentages

    We set default weights — Speed 20%, Value 30%, Quality 20%, Predictability 30% — but the real power came from adjusting them.

    Example: Launch mode. Client was releasing a new product. We shifted to Speed 25%, Value 35%, Quality 15%, Predictability 25%. This made trade-offs explicit. Everyone knew quality was temporarily de-prioritised, and we had a plan to re-weight it post-launch.

    Example: Platform hardening. Six months later, technical debt was biting. We shifted to Speed 15%, Value 20%, Quality 35%, Predictability 30%. TDPI still worked — it just measured what currently mattered.
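Mechanically, the composite score is just a weighted sum of the four normalised pillar scores. A minimal sketch using the weight sets quoted above (the pillar scores in `scores` are made-up illustrative values):

```python
def tdpi(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of the four normalised pillar scores."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must total 100%"
    return sum(scores[p] * weights[p] for p in weights)

default   = {"speed": 0.20, "value": 0.30, "quality": 0.20, "predictability": 0.30}
launch    = {"speed": 0.25, "value": 0.35, "quality": 0.15, "predictability": 0.25}
hardening = {"speed": 0.15, "value": 0.20, "quality": 0.35, "predictability": 0.30}

scores = {"speed": 1.2, "value": 1.5, "quality": 0.9, "predictability": 1.1}
# The same pillar scores yield a different headline number under each
# strategy: hardening mode penalises the weak quality score more heavily.
```

The assertion is the important guardrail: re-weighting only makes trade-offs explicit if the weights still sum to 100%.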

    The AI effect, quantified (what we learned after 12 months)

    After a year of measuring, we could finally answer the original question: "What's the ROI of AI tools?"

    Across five teams (anonymised composite data), the results were clear. Speed improved by 68%. Value improved by 140%. Quality improved by 52%. Predictability improved by 44%. The aggregate TDPI moved from 1.00 to 1.76 — a 76% improvement.

    But the real insights were in the decomposition:

  • Speed gains were real but modest. AI helped, but process changes (faster reviews, smaller PRs) mattered more.
  • Value gains were huge. Teams shipped 2.4x more EPICs. Why? AI eliminated grunt work, freeing time for high-impact features.
  • Quality improved despite speed increases. This surprised everyone. Theory: AI caught silly bugs (typos, missing null checks) that used to slip through.
  • Predictability improved. Fewer "this is harder than we thought" surprises. AI helped with unfamiliar libraries and frameworks.

    What we got wrong (and how we fixed it)

    Mistake 1: Over-explaining the math. Early scorecards were dense. Executives glazed over. Fix: We created a one-page dashboard — four pillar scores, one TDPI number, one trend arrow, two bullet points on what changed. That's it.

    Mistake 2: Comparing teams publicly. Some teams had higher TDPI scores. Others felt bad. Morale dipped. Fix: TDPI became team-private. Only aggregate, anonymised trends went to leadership. Teams competed with their own past, not each other.

    Mistake 3: Not documenting context. One quarter, Team Epsilon's Speed crashed. Panic. Turns out they'd onboarded three juniors. Context matters. Fix: Every score now has a context note field.

    Mistake 4: Changing definitions mid-flight. We tweaked the "defect" definition in month 5. Historical comparisons broke. Fix: Definitions are now versioned and locked per quarter.

    How to try this yourself (90-day pilot)

    Weeks 1-2: Set the foundation. Pick one team or product area. Define the four pillars in plain English. Agree on initial weights based on your current strategy. Publish a draft scorecard and get feedback.

    Weeks 3-6: Baseline and instrument. Pull data from existing tools (GitHub, Jira, Linear, whatever you use). Choose a "worst period." Calculate normalised scores for each pillar. Set your baseline TDPI = 1.00.

    Weeks 7-12: Improve and learn. Identify your lowest-scoring pillar. Pick one experiment to improve it. Recompute TDPI weekly. Hold a 15-minute "what moved?" meeting.

    Success criteria: Not "TDPI went up" (though that's nice). Success is: "We made a decision based on this data that we wouldn't have made otherwise."

    The real bottom line

    If you'd asked us two years ago "how do you measure AI's impact on engineering?", we'd have shrugged.

    Today, we have an answer — not a perfect one, but an honest one.

    The ThoughtFox Development Performance Index doesn't tell you whether to adopt AI tools. It tells you whether they're working. And when they're not, it shows you where to look.

    For the VP who called us, that was worth every hour we spent arguing about defect definitions in that whiteboard room.

    Want to pilot the ThoughtFox Development Performance Index with your team? Get in touch — we'll help you set it up, and if it doesn't make decisions clearer within 90 days, we'll help you turn it off.

    Turn insight into action

    Let's talk about how these ideas apply to your organisation.