# I Gave Claude Code My Entire QA Job

> Claude Code shipped workflows. So I wrote one that takes any testing task — a vague Slack message, a Jira story, a 'just check the checkout' — and runs it end to end on Katalon True Platform. Manual, automated, and Playwright. One skill. Three lanes. Here's the whole thing.

Published: 2026-05-29 · Reading time: 4 min · Tags: qa, ai, claude-code, agentic-qa, katalon, playwright
Canonical URL: https://huytieu.com/blog/i-gave-claude-code-my-qa-job/
Author: Huy Tieu (huytieu.com)

---

Claude Code shipped workflows. So I gave it my entire QA job.

Not "help me write a test case." The whole loop: read the requirement, design the coverage, run it on the platform, file the defect, report back with evidence. The kind of thing a QE does in a day, compressed into one skill you trigger with a sentence.

Here's the whole thing. 🧵

---

## The setup

A QE's real job isn't writing test scripts. It's a sequence of decisions:

1. *What changed, and what's the risk?*
2. *What's the smallest set of tests that covers that risk?*
3. *Which way do I actually run them?*
4. *Did it pass — and can I prove it?*

Every one of those is judgment. None of them is "type the Selenium." So the question I cared about wasn't "can AI write a test" — it's **"can AI hold the decisions and drive the platform?"**

Turns out: yes, if you give it the workflow instead of the keystrokes.

---

## The trick: don't automate the clicks, automate the decision tree

The old way to "automate QA" was to record clicks. Brittle, and it skips the only part that matters — the thinking.

The new way: write the *decision tree* down once, hand it to the agent, and let it route every task through it.

The whole skill is built on one decision — **which lane runs this test:**

```
   New behavior,    ┌──────────────────────────────────┐
   no code yet,  ─► │ LANE A — Manual + Run with AI    │
   exploratory      │ design cases → AI executes them  │
                    └──────────────────────────────────┘
   Automation       ┌──────────────────────────────────┐
   already in    ─► │ LANE B — Automated / TestCloud   │
   the repo         │ schedule the suite on the grid   │
                    └──────────────────────────────────┘
   Team lives    ─► ┌──────────────────────────────────┐
   in code,         │ LANE C — Playwright              │
   wants CI         │ real .spec.ts → results upload   │
                    └──────────────────────────────────┘
```

That's it. That's the brain. Everything else — requirement analysis, ISTQB coverage, suite building, reporting — hangs off this one branch.

---

## What it actually does, start to finish

I type: *"We shipped a new product-filter on the storefront. Cover it."*

The agent:

**Reads the intent.** Pulls the requirement (Jira/Azure sync), or analyzes the one-liner. Spits out personas, main flow, alternate flows, negative flows, risk areas.

**Designs the coverage — risk-based, not maximal.** It picks real ISTQB techniques: equivalence partitions for filter values, boundary analysis for the price slider, a decision table for filter + stock + sort, state transitions for the result list. Then it writes a *coverage note* — what it's testing, what it's deliberately skipping, and why. The "why" is the part juniors skip and seniors live by.

**Routes to a lane.** New behavior, no automation yet → **Lane A.** It drafts human-readable cases, imports them into Katalon True Platform, links them to the requirement for traceability, builds a suite, picks the AUT environment, creates a manual run, and kicks off **Run with AI** — the platform's agent executes the cases against the live site.

**Reports with evidence.** Pass/fail/blocked counts. Each failure with the verbatim error, a screenshot, a trace. No "looks broken." If it can't produce the proof, it downgrades the claim. Then it offers to file the defect against the failed result.

One sentence in. A traceable, executed, evidenced run out.

---

## The part I'm proud of: Playwright as a first-class lane

Most "AI QA" demos stop at the manual lane. But half the teams I talk to **live in code** — they want Playwright, cross-browser, CI gating merges.

Katalon has a real integration for this (`@katalon/playwright-reporter`): you run your actual `@playwright/test` specs anywhere, and the results — status, duration, **screenshots, videos, traces**, browser metadata — upload straight into True Platform's Test Runs.

So I made it Lane C. Same skill. The agent will:

```bash
npm install --save-dev @katalon/playwright-reporter
# wire the reporter into playwright.config.ts
KATALON_API_KEY=… KATALON_PROJECT_ID=… npx playwright test   # runs + uploads
```

And the best bit is the **promotion pattern**:

> Explore the live site with a browser → cover new behavior *fast* in Lane A → once the happy paths are stable, port them to Lane C Playwright specs so CI gates every future merge.

One product. Three lanes. One place to read the results. The agent knows when to use which — and tells you when it's recommending a promotion vs a one-off.

---

## What surprised me

**Honesty beats capability.** The most useful thing I wrote into the skill wasn't a feature — it was the boundary table. *"You cannot create requirements in Katalon — they sync from Jira. Here's the workaround."* *"You cannot guarantee Run-with-AI finishes — report the blocked state and the exact fixture that's missing."* An agent that knows what it *can't* do is worth ten that bluff.

**The decision tree is the product.** OpenAI called it harness engineering; in QA it's the same move. You're not writing tests anymore. You're building the harness that lets an agent decide which tests matter, run them, and know when the result can be trusted.

**Fewer, stronger tests.** Left alone, an agent will happily generate 200 shallow cases. The ISTQB guardrails — *prefer fewer strong tests, mark P0 for revenue/checkout/data-loss* — are what make the output a QE would actually sign off on.

---

## Why workflows change this

Before workflows, this was a prompt you re-typed and re-tuned every time. Now it's a skill: written once, versioned, triggered by intent. The agent doesn't improvise the process — it *follows* it, and improvises only the judgment inside each step. That's the difference between a clever demo and something you'd let near a release.

I gave Claude Code my QA job. It didn't take it. It gave me back the eight hours I spent on the mechanical 80% — and handed me the 20% that's actually judgment.

That's the trade I'll take every time.

---

*The skill is built on Katalon True Platform's MCP toolset + the Playwright reporter. If you want the structure — the lane router, the ISTQB coverage guide, the capability-boundary table — say the word and I'll open-source the scaffold.*