I Gave Claude Code My Entire QA Job
Claude Code shipped workflows. So I gave it my entire QA job.
Not "help me write a test case." The whole loop: read the requirement, design the coverage, run it on the platform, file the defect, report back with evidence. The kind of thing a QE (quality engineer) does in a day, compressed into one skill you trigger with a sentence.
Here's the whole thing. 🧵
The setup
A QE's real job isn't writing test scripts. It's a sequence of decisions:
- What changed, and what's the risk?
- What's the smallest set of tests that covers that risk?
- Which way do I actually run them?
- Did it pass — and can I prove it?
Every one of those is judgment. None of them is "type the Selenium." So the question I cared about wasn't "can AI write a test"; it was "can AI hold the decisions and drive the platform?"
Turns out: yes, if you give it the workflow instead of the keystrokes.
The trick: don't automate the clicks, automate the decision tree
The old way to "automate QA" was to record clicks. Brittle, and it skips the only part that matters — the thinking.
The new way: write the decision tree down once, hand it to the agent, and let it route every task through it.
The whole skill is built on one decision — which lane runs this test:
New behavior,     ┌──────────────────────────────────┐
no code yet,   ─► │ LANE A: Manual + Run with AI     │
exploratory       │ design cases → AI executes them  │
                  └──────────────────────────────────┘
Automation        ┌──────────────────────────────────┐
already in     ─► │ LANE B: Automated / TestCloud    │
the repo          │ schedule the suite on the grid   │
                  └──────────────────────────────────┘
Team lives        ┌──────────────────────────────────┐
in code,       ─► │ LANE C: Playwright               │
wants CI          │ real .spec.ts → results upload   │
                  └──────────────────────────────────┘
That's it. That's the brain. Everything else — requirement analysis, ISTQB coverage, suite building, reporting — hangs off this one branch.
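Written down as code, that branch is tiny. Here's a minimal sketch, not the skill's actual implementation: the `TaskSignals` fields, the function name, and the order the checks run in are all assumptions I'm making for illustration.

```typescript
// Hypothetical lane router. Field names and rule order are illustrative
// assumptions, not the real skill's schema.
type QaLane = "A" | "B" | "C";

interface TaskSignals {
  automationExists: boolean;        // is there already automation for this area in the repo?
  teamWantsPlaywrightInCi: boolean; // does the team live in code and want CI gating?
}

interface LaneDecision {
  lane: QaLane;
  reason: string;
}

function routeLane(signals: TaskSignals): LaneDecision {
  // New behavior with no code yet: design cases and let the AI execute them.
  if (!signals.automationExists) {
    return { lane: "A", reason: "No automation yet: manual cases + Run with AI" };
  }
  // Code-first team that wants CI: run real .spec.ts files and upload results.
  if (signals.teamWantsPlaywrightInCi) {
    return { lane: "C", reason: "Team lives in code: Playwright specs in CI" };
  }
  // Automation already in the repo: schedule the existing suite on the grid.
  return { lane: "B", reason: "Existing automation: schedule the suite" };
}
```

Returning a `reason` alongside the lane is deliberate: the agent should be able to say why it picked a lane, not just which one.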
What it actually does, start to finish
I type: "We shipped a new product-filter on the storefront. Cover it."
The agent:
Reads the intent. Pulls the requirement (Jira/Azure sync), or analyzes the one-liner. Spits out personas, main flow, alternate flows, negative flows, risk areas.
Designs the coverage — risk-based, not maximal. It picks real ISTQB techniques: equivalence partitions for filter values, boundary analysis for the price slider, a decision table for filter + stock + sort, state transitions for the result list. Then it writes a coverage note — what it's testing, what it's deliberately skipping, and why. The "why" is the part juniors skip and seniors live by.
Routes to a lane. New behavior, no automation yet → Lane A. It drafts human-readable cases, imports them into Katalon True Platform, links them to the requirement for traceability, builds a suite, picks the environment for the application under test (AUT), creates a manual run, and kicks off Run with AI, where the platform's agent executes the cases against the live site.
Reports with evidence. Pass/fail/blocked counts. Each failure with the verbatim error, a screenshot, a trace. No "looks broken." If it can't produce the proof, it downgrades the claim. Then it offers to file the defect against the failed result.
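That downgrade rule is simple enough to sketch. Treat this as an illustration under assumptions: the result shape, and the rule that a failure only counts as "confirmed" when it has a verbatim error plus at least one artifact, are mine and not the platform's data model.

```typescript
// Hypothetical result shape; not the platform's actual schema.
type RunStatus = "pass" | "fail" | "blocked";

interface CaseResult {
  caseId: string;
  status: RunStatus;
  errorText?: string;      // verbatim error captured from the run
  screenshotPath?: string;
  tracePath?: string;
}

interface RunSummary {
  counts: Record<RunStatus, number>;
  confirmedFailures: string[];  // failures backed by evidence
  unverifiedFailures: string[]; // failures reported with a downgraded claim
}

// Assumed rule: proof means the verbatim error plus a screenshot or a trace.
function hasEvidence(r: CaseResult): boolean {
  const hasArtifact = Boolean(r.screenshotPath || r.tracePath);
  return Boolean(r.errorText) && hasArtifact;
}

function summarizeRun(results: CaseResult[]): RunSummary {
  const summary: RunSummary = {
    counts: { pass: 0, fail: 0, blocked: 0 },
    confirmedFailures: [],
    unverifiedFailures: [],
  };
  for (const r of results) {
    summary.counts[r.status] += 1;
    if (r.status === "fail") {
      // No proof, weaker claim: the failure is listed but flagged as unverified.
      const bucket = hasEvidence(r) ? summary.confirmedFailures : summary.unverifiedFailures;
      bucket.push(r.caseId);
    }
  }
  return summary;
}
```

Only entries in `confirmedFailures` would be candidates for the "file a defect?" offer; the unverified ones get reported as needing a rerun or a closer look.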
One sentence in. A traceable, executed, evidenced run out.
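Back in the coverage step, boundary analysis for the price slider is the easiest technique to show in code. A rough sketch: the slider range (0 to 500) and the step of 1 are invented for the example, since the real values depend on the storefront.

```typescript
interface BoundaryCase {
  value: number;
  expectValid: boolean; // true when the value lies inside [min, max]
}

// Boundary value analysis for an inclusive numeric range [min, max]:
// probe just outside, on, and just inside each boundary.
function boundaryCases(min: number, max: number, step = 1): BoundaryCase[] {
  if (min >= max) throw new RangeError("min must be less than max");
  if (step <= 0) throw new RangeError("step must be positive");
  const candidates = [min - step, min, min + step, max - step, max, max + step];
  return candidates
    .filter((v, i) => candidates.indexOf(v) === i) // drop duplicates on narrow ranges
    .sort((a, b) => a - b)
    .map((value) => ({ value, expectValid: value >= min && value <= max }));
}

// Hypothetical slider range, for illustration only.
const priceSliderCases = boundaryCases(0, 500);
// values: -1, 0, 1, 499, 500, 501 (only -1 and 501 are expected to be rejected)
```

Six values instead of five hundred is the "risk-based, not maximal" idea in miniature.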
The part I'm proud of: Playwright as a first-class lane
Most "AI QA" demos stop at the manual lane. But half the teams I talk to live in code — they want Playwright, cross-browser, CI gating merges.
Katalon has a real integration for this (@katalon/playwright-reporter): you run your actual @playwright/test specs anywhere, and the results — status, duration, screenshots, videos, traces, browser metadata — upload straight into True Platform's Test Runs.
So I made it Lane C. Same skill. The agent will:
npm install --save-dev @katalon/playwright-reporter
# wire the reporter into playwright.config.ts
KATALON_API_KEY=… KATALON_PROJECT_ID=… npx playwright test # runs + uploads
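The "wire the reporter" step might look roughly like this. It's a sketch, not copied from Katalon's docs: I'm assuming the package can be referenced by name like any custom Playwright reporter and that it reads `KATALON_API_KEY` and `KATALON_PROJECT_ID` from the environment. Check the package README for its real options before relying on this.

```typescript
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  reporter: [
    ["list"], // keep readable console output for local runs
    // Assumption: registered by package name with no options, with
    // credentials supplied through the environment variables shown above.
    ["@katalon/playwright-reporter"],
  ],
  use: {
    // Standard Playwright settings so failures carry artifacts worth uploading.
    screenshot: "only-on-failure",
    video: "retain-on-failure",
    trace: "retain-on-failure",
  },
});
```

The `use` block is plain Playwright: without it, a failed spec has no screenshot, video, or trace to attach.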
And the best bit is the promotion pattern:
Explore the live site with a browser → cover new behavior fast in Lane A → once the happy paths are stable, port them to Lane C Playwright specs so CI gates every future merge.
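What counts as "stable" is a judgment call. One way to pin it down, purely as an assumption for this sketch, is "a happy-path case that passed its last three Lane A runs in a row." The `CaseRunRecord` shape and the threshold of three are hypothetical.

```typescript
// Hypothetical run history for a single Lane A case.
interface CaseRunRecord {
  caseId: string;
  isHappyPath: boolean;
  recentOutcomes: ("pass" | "fail" | "blocked")[]; // oldest first
}

// Assumed promotion rule: happy path + N consecutive passes at the end of the history.
function readyForPromotion(record: CaseRunRecord, requiredPasses = 3): boolean {
  if (requiredPasses < 1) throw new RangeError("requiredPasses must be at least 1");
  if (!record.isHappyPath) return false;
  const recent = record.recentOutcomes.slice(-requiredPasses);
  return recent.length === requiredPasses && recent.every((o) => o === "pass");
}
```

A case that fails this check stays in Lane A as a one-off; a case that passes it becomes a candidate for a Playwright spec.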
One product. Three lanes. One place to read the results. The agent knows when to use which — and tells you when it's recommending a promotion vs a one-off.
What surprised me
Honesty beats capability. The most useful thing I wrote into the skill wasn't a feature — it was the boundary table. "You cannot create requirements in Katalon — they sync from Jira. Here's the workaround." "You cannot guarantee Run-with-AI finishes — report the blocked state and the exact fixture that's missing." An agent that knows what it can't do is worth ten that bluff.
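A boundary table like that can be plain data the agent consults before it promises anything. A sketch with the two entries quoted above: the action keys and field names are hypothetical, and the first workaround is left as a placeholder because its text isn't reproduced here.

```typescript
interface CapabilityBoundary {
  limitation: string;
  whatToDoInstead: string;
}

// Hypothetical action keys; the entries mirror the two boundaries quoted above.
const capabilityBoundaries: Record<string, CapabilityBoundary> = {
  "create-requirement": {
    limitation: "Requirements cannot be created in Katalon; they sync from Jira.",
    whatToDoInstead: "<workaround text from the skill goes here>",
  },
  "guarantee-run-with-ai-completion": {
    limitation: "Run with AI is not guaranteed to finish.",
    whatToDoInstead: "Report the blocked state and the exact fixture that is missing.",
  },
};

// Returns the boundary for an action, or undefined when none is recorded.
function checkBoundary(action: string): CapabilityBoundary | undefined {
  if (!Object.prototype.hasOwnProperty.call(capabilityBoundaries, action)) {
    return undefined;
  }
  return capabilityBoundaries[action];
}
```

The point of the lookup is the order of operations: check the table first, then either act or say plainly what can't be done and what to do instead.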
The decision tree is the product. OpenAI called it harness engineering; in QA it's the same move. You're not writing tests anymore. You're building the harness that lets an agent decide which tests matter, run them, and know when the result can be trusted.
Fewer, stronger tests. Left alone, an agent will happily generate 200 shallow cases. The ISTQB guardrails (prefer fewer strong tests, mark P0 for revenue/checkout/data-loss) are what make the output something a QE would actually sign off on.
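Those two guardrails fit in a few lines. A sketch only: the `CandidateCase` shape, the "P1" fallback, and the idea of a hard cap on suite size are assumptions I'm adding to make the rule concrete.

```typescript
type CasePriority = "P0" | "P1";

interface CandidateCase {
  title: string;
  riskAreas: string[]; // tags such as "checkout" or "search"
}

// Risk areas that force P0, taken from the guardrail above.
const P0_RISK_AREAS = ["revenue", "checkout", "data-loss"];

function assignPriority(c: CandidateCase): CasePriority {
  const touchesP0 = c.riskAreas.some((area) => P0_RISK_AREAS.indexOf(area) !== -1);
  return touchesP0 ? "P0" : "P1";
}

// Assumed trimming rule: keep every P0 case, then fill what's left of the cap
// with the remaining cases in their original order.
function trimSuite(cases: CandidateCase[], maxCases: number): CandidateCase[] {
  const p0 = cases.filter((c) => assignPriority(c) === "P0");
  const rest = cases.filter((c) => assignPriority(c) !== "P0");
  const remainingSlots = Math.max(0, maxCases - p0.length);
  return p0.concat(rest.slice(0, remainingSlots));
}
```

P0 cases are never dropped by the cap, so the suite can shrink without losing the tests that protect revenue, checkout, or data.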
Why workflows change this
Before workflows, this was a prompt you re-typed and re-tuned every time. Now it's a skill: written once, versioned, triggered by intent. The agent doesn't improvise the process — it follows it, and improvises only the judgment inside each step. That's the difference between a clever demo and something you'd let near a release.
I gave Claude Code my QA job. It didn't take it. It gave me back the eight hours I spent on the mechanical 80% — and handed me the 20% that's actually judgment.
That's the trade I'll take every time.
The skill is built on Katalon True Platform's MCP toolset + the Playwright reporter. If you want the structure — the lane router, the ISTQB coverage guide, the capability-boundary table — say the word and I'll open-source the scaffold.