The Evolution of Agentic QA
A condensed, public version of a conversation with Joe Sturgess for the Katalon blog. The interview ran through twelve questions; below are the ones that keep coming up in my inbox.
How I'd define agentic QA at a bar
Agentic QA is where the agent holds the product's intent — not a test script — and decides what to verify against it. The QE's job moves from writing steps to defining what "quality" even means for this product.
OpenAI gave this pattern a name in February: harness engineering. The team's job isn't to produce the output anymore — it's to build the system that lets agents do the work reliably. In QA terms, you're not writing test cases. You're building the harness that lets an agent decide which tests matter, run them, and know when the result should be trusted.
The harness is the product. The agent just runs inside it.
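To make that concrete, here's roughly what a harness skeleton looks like in code. This is a minimal sketch, not anyone's actual product; every name in it is invented for illustration. The shape is the point: the harness owns test selection, execution, and the trust decision, and the agent is just a callable plugged into it.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    step: str
    passed: bool
    reasoning: str            # the agent's explanation, kept for audit
    needs_human: bool = False

# Hypothetical harness skeleton. The agent is a plug-in; the harness decides
# what to test, runs it, and decides whether to trust the result.
class Harness:
    def __init__(self, intent: str, agent: Callable[[str, str], Verdict]):
        self.intent = intent  # plain-English definition of "quality"
        self.agent = agent

    def select_flows(self, changed_files: list[str]) -> list[str]:
        # Placeholder policy: verify everything touched by the change.
        return [f"flow covering {f}" for f in changed_files]

    def trusted(self, v: Verdict) -> bool:
        # The trust rule lives in the harness, not in the model: any verdict
        # without reasoning, or flagged by the agent itself, goes to a human.
        return bool(v.reasoning) and not v.needs_human

    def run(self, changed_files: list[str]) -> list[Verdict]:
        verdicts = [self.agent(flow, self.intent)
                    for flow in self.select_flows(changed_files)]
        for v in verdicts:
            if not self.trusted(v):
                print(f"ESCALATE to human review: {v.step}: {v.reasoning}")
        return verdicts
```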
It's not self-healing locators, AI-generated Selenium, or a chatbot that writes scripts for you. Those are old-process-with-AI-sprinkles. If the shape of your QA workflow didn't change, if your team still writes and maintains scripts at the end of the day, you have AI-assisted testing — not agentic QA.
Why the scripted model had to break
Scripts are a snapshot of a product that no longer exists. By the time the test is stable, the product has shipped three more versions. Your automation suite is always chasing a reality that's already moved on.
Then there's the math. Every new release adds more things to test — and you still maintain the scripts from the old features. That's compounding debt. After a few release cycles, your team spends more time keeping the automation green than actually testing the product.
Modern teams aren't asking for another tool to generate test scripts. They're asking whether their quality is actually good. Those are different problems. Agentic QA became inevitable not because AI got shiny — but because the scripted model stopped scaling to the real question.
The 25% coverage ceiling isn't a tools problem
Forrester calls it the "25% automation coverage ceiling." It's not about tools. It's about maintenance. Every new test you add costs you forever. At some point, the cost of keeping the suite healthy exceeds the value it provides, and you stop adding coverage. That's the ceiling.
Scripted tests plateau around 25% because maintenance cost catches up. Agentic flattens that cost so coverage keeps climbing.
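You can see the ceiling in a toy model. Every number below is an illustrative assumption, but the shape holds for any suite where each test carries perpetual upkeep:

```python
# Toy model of the coverage ceiling; all numbers are made up for illustration.
NEW_TESTS_PER_RELEASE = 50   # tests added each release
UPKEEP_HOURS_PER_TEST = 0.5  # hours per existing test per release (flaky fixes, locator churn)
BUDGET_HOURS = 400           # team hours available per release

suite_size = 0
for release in range(1, 21):
    suite_size += NEW_TESTS_PER_RELEASE
    upkeep_hours = suite_size * UPKEEP_HOURS_PER_TEST
    if upkeep_hours >= BUDGET_HOURS:
        print(f"Release {release}: {suite_size} tests need {upkeep_hours:.0f}h of upkeep."
              f" The whole budget is maintenance; coverage stops growing here.")
        break
```

With these invented numbers, maintenance swallows the entire budget by release 16. Change the constants and the date moves; the ceiling doesn't.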
Agentic changes the math in two ways:
- Intent, not instructions. You maintain a smaller set of definitions of what "good" and "broken" look like, and the agent reinterprets intent each time the product changes. Maintenance stops scaling linearly with features (there's a sketch of what such a definition looks like after this list).
- Repetitive work, offloaded. The 80% of regression your QEs don't want to do is exactly what the agent is good at. That frees the human team to focus on the parts that actually require judgment.
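Here's the kind of intent definition I mean; a hypothetical schema I'm sketching for this post, not a product feature. Notice what's absent: no selectors, no step lists, nothing that breaks when the UI moves.

```python
# Hypothetical intent definition: the only artifact the team maintains.
# The agent re-derives the concrete steps against the current UI on every run.
CHECKOUT_INTENT = {
    "flow": "guest checkout",
    "good_looks_like": [
        "a guest can pay with a valid card and reach an order confirmation",
        "the order total matches cart subtotal + shipping + tax",
        "a confirmation email is queued within a minute",
    ],
    "broken_looks_like": [
        "any price shown to the user changes between cart and payment",
        "payment succeeds but no order record exists",
        "an error page with no recovery path",
    ],
    "never_do": ["use real payment credentials", "place more than 3 test orders"],
}
```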
Trust is a loop, not a feature
We shipped an early version of our AI Test Runner where the agent was too eager to "complete the mission." Its goal was: run the test case to completion, successfully. The agent would work around the system — tolerating weirdness, retrying failing steps, taking creative paths — just to finish every step green. It was doing everything except testing.
The fix wasn't a smarter model. It was the harness. We rewrote the boundary:
Your job is not to complete the steps. Your job is to find where the product breaks intent. Be skeptical. If a step completes but the app isn't working as intended — flag it. If a step fails but the app is working — flag that too. Report anything suspicious, let a human decide.
You can't trust an agent you can't audit. Trust comes from the harness around it — not the model inside it.
Trust isn't a property the agent has. It's a loop you run around the agent, tightening the harness every pass.
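One way to encode that boundary in a harness; a sketch, simplified from the idea rather than lifted from our codebase. The key move is making completion and intent two separate signals that must agree:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    name: str
    completed: bool    # did the step finish?
    intent_met: bool   # does the app state match the product's intent?
    note: str = ""

def verdict(results: list[StepResult]) -> str:
    """Skeptical verdict: completion and intent must agree, or a human decides."""
    for r in results:
        if r.completed and not r.intent_met:
            return f"FLAG: '{r.name}' went green but intent looks broken. {r.note}"
        if not r.completed and r.intent_met:
            return f"FLAG: '{r.name}' failed but the app works; suspect the test. {r.note}"
    if all(r.completed for r in results):
        return "PASS"
    return "FAIL"
```

Our early bug was, in effect, shipping only the all-completed branch. The two FLAG branches are the trust loop.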
What the QE role becomes
Two things get more important: judgment and critical thinking.
Judgment is knowing when the agent's output is trustworthy and when it isn't. You stop being the person who executes steps and become the person who reviews reasoning. You're not grading a test run — you're evaluating whether the agent understood the problem it was trying to solve.
Critical thinking is the stress-testing instinct. The next wave of great QEs will be the people who can argue with the agent, find the gap in its reasoning, spot the failure mode it missed. Harness architects, not button-clickers.
The anti-pattern I want to flag bluntly: QEs asking the agent to generate Selenium faster. That's not agentic QA. That's just faster script-writing on the same broken treadmill. If the deliverable at the end of your week is still a pile of scripts, the tool changed but the model didn't.
The velocity gap is already here
Y Combinator says 25% of a recent startup batch have codebases that are 95% or more AI-generated. Google's CEO said over a quarter of new code at Google is AI-generated. The dev side of the equation has already changed.
Teams that don't close the gap land in one of two places:
- Testing becomes the new bottleneck, ten feet downstream from the old one. Dev ships a feature in two hours. QA takes three days. Engineering is idle waiting for sign-off on work that's already done.
- Quality debt compounds silently. Nothing visibly breaks for six months — then regressions cascade, and you spend a quarter firefighting.
Every week the dev line climbs and the QA line doesn't, the gap between them fills with invisible quality debt.
Peter Pang put it well: "Fast AI without fast validation is fast-moving technical debt." The harness is the validation layer. If you don't build it, the damage compounds invisibly.
Context is the harness
When a human writes code, intent is embedded in the code — in structure, naming, comments, commit history. When code is written from a natural-language prompt, the intent lives in the prompt, which is often disposable. The spec evaporates. The code exists, but the why doesn't.
So on the QA side, the defence is to wrap the agent in as much context as you can get: code, design, requirements, user stories, bug history, release notes, support tickets. All of it. The more context, the better the agent defines intent and verifies whether reality matches.
Agentic QA lives or dies on context. Starve the agent of design and requirements — you get a guesser. Feed it everything — you get a tester.
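Mechanically, context assembly is a harness job with an explicit inventory. A sketch, with placeholder file paths standing in for real integrations; the design point is that missing context is surfaced, never silently skipped:

```python
from pathlib import Path

# Placeholder loaders; each stands in for a real integration
# (repo, design tool, issue tracker, support desk).
CONTEXT_SOURCES = {
    "code_diff":     lambda: Path("change.diff").read_text(),
    "requirements":  lambda: Path("docs/requirements.md").read_text(),
    "user_stories":  lambda: Path("docs/stories.md").read_text(),
    "bug_history":   lambda: Path("docs/known_bugs.md").read_text(),
    "release_notes": lambda: Path("CHANGELOG.md").read_text(),
}

def assemble_context() -> str:
    parts, missing = [], []
    for name, load in CONTEXT_SOURCES.items():
        try:
            parts.append(f"## {name}\n{load()}")
        except OSError:
            missing.append(name)  # starved sources get reported, not dropped
    if missing:
        parts.append("## WARNING\nAgent is guessing about: " + ", ".join(missing))
    return "\n\n".join(parts)
```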
Where this is 18 months from now
The teams that embraced it will look like small, senior teams with enormous leverage. The QE-to-dev ratio drops, but the role level goes up — harness architects, not button-clickers. Quality becomes continuous, not a gate. Interestingly, QA stops being a function and starts looking more like a discipline — the same way SRE happened to ops.
The teams that didn't will look like the bottleneck everyone routes around. Engineering ships to staging, the AI-native teams ship to prod, and the old QA org is left gatekeeping a process no one waits for.
The gap is widening, but it's not a death sentence. Eighteen months is enough time to build a harness, move your team up the value stack, and be in the first group. It's not enough time if you wait twelve months and then start.
Step zero, for any leader moving now
You don't need a transformation programme. You need one experiment, and a paragraph.
Pick one repetitive, low-risk flow — a regression suite everyone ignores, a pre-release smoke test, a boring checklist — and put an agent on it for a week. Don't rip out your framework. Don't retrain the team. Watch what the agent catches, what it misses, and what your team does with the output. You'll learn more in five days than in five months of planning.
Before you point the agent at anything — write down what "quality" means for that flow in plain English. Not test cases. Not acceptance criteria. A paragraph that says: "this is what good looks like, and this is what broken looks like." If you can't write that paragraph, no agent will pass the test — because you haven't told it what the test is.
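For concreteness, here's what that paragraph might look like for a hypothetical login smoke test, inlined as the string the harness would hand to the agent. Every detail in it is invented:

```python
# An invented example of the quality paragraph for a login smoke test.
# This string is the entire spec the agent gets; no test cases sit behind it.
QUALITY = """\
Good looks like: an existing user signs in within a few seconds, lands on
their own dashboard, and sees their own data. A wrong password gets a clear
error, and one mistake never locks the account.

Broken looks like: sign-in succeeds but shows someone else's data, any silent
failure or blank page, or a password-reset email that never arrives.
"""
```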
Ninety percent of the difficulty of agentic QA isn't the agent. It's realising nobody ever wrote down what "quality" actually meant, and the agent exposes that gap immediately.
The whole experiment fits on a napkin. The point isn't scale — it's exposing the paragraph you couldn't write.
One flow. One paragraph. One week. That's step zero.
The agent isn't the product. The harness around it is. Build the harness — and start this week.