Manual usability testing is slow, expensive, and easy to skip when a deadline looms. The /tool-ux-study skill spawns a coordinated team of AI tester agents that each log in as a different persona, test the application under different themes and viewports, and report back — while a lead agent acts as UX research facilitator, observing sessions, probing for clarity, and synthesizing findings into a research-grade report.
Picture this: you’ve just shipped a feature. Your automated tests pass. Your CI pipeline is green. But nobody has actually sat down and tried to use the thing as a real person would. Someone who doesn’t know where the proposals section lives. Someone on a phone with a slow connection. Someone who relies on a keyboard because they can’t use a mouse. Manual usability testing is the part of the process most teams skip, not because they don’t care, but because it takes time and coordination you rarely have.
AI-driven testing is not a replacement for real users sitting in front of your application. Nothing replaces that. But when the alternative is no usability testing at all, which is the reality for most sprints, having AI testers go through your flows is better than shipping blind.
The /tool-ux-study skill runs a full think-aloud usability study using real browsers. Each tester agent uses playwright-cli to open your application in an actual browser, navigate pages, click buttons, fill forms, and take screenshots of what they see. They read the rendered HTML the same way a real user would. This isn’t mocked or simulated. The testers interact with your running application, across multiple personas, themes, viewports, and accessibility scenarios. All from a single command.
This is part of the agentic dev workflow series. If you’re new here, that first post covers the foundation: persistent memory, session heartbeat, and how skills plug into the workflow.
The skills shown here reflect my stack and conventions at the time of writing. They improve over time as the workflow learns from daily use. Your project will have different tools, different security concerns, different quality bars. These are examples of what’s possible, not prescriptions. Fork them, adjust them, or use them as inspiration for your own.
The lead agent is a facilitator, not just a coordinator
The skill isn’t a test runner that collects pass/fail counts. The lead AI agent takes the role of a UX research facilitator, the researcher who sits next to a participant in a usability lab, watches them struggle, and asks “Tell me what you were thinking when you clicked there.” That distinction shapes everything about how the skill works.
As tester reports arrive, the facilitator reads them with attention to emotional arcs. Did a tester start confident and hit a wall at the proposals section? Did someone who came in skeptical warm up once they found the events view? The facilitator looks for friction patterns that cross multiple testers. Three separate people struggling to find the same section isn’t three bugs: it’s an information architecture problem. It synthesizes those patterns into narrative themes instead of lists of isolated observations.
The facilitator also sends targeted follow-up probes to individual testers. At most one follow-up per tester, max three or four total per run. Enough to clarify ambiguity without slowing the parallel work down. The questions are open-ended and non-leading: “You gave that task a 2/5 confidence score. Was it that you couldn’t find the proposals section, or was the page itself confusing once you got there?” That’s the difference between a checklist and a research session.
The final report tells the story of what happened when real-ish people tried the platform. It reads like research findings, not a spreadsheet. That’s not the AI doing extra work for the sake of it; it’s the only format that actually helps a product team decide what to fix first.
Four triggers for probing
The facilitator doesn’t probe randomly. It fires a follow-up only when one of four specific situations occurs:
- Ambiguous failure. The tester reported a task as FAIL but didn’t explain what happened clearly.
- Pattern confirmation. You’re seeing the same friction across multiple testers and want to confirm it with a specific one.
- Interesting contradiction. This tester’s experience contradicts others in a way worth exploring.
- Unexplored critical path. The tester skipped something important for their persona without explanation.
That bounded probing keeps cost manageable and keeps the facilitator focused on signal, not noise.
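Putting the triggers together with the caps described earlier (at most one follow-up per tester, three or four total per run), the decision can be sketched roughly like this. This is an illustrative simplification, not the skill's actual implementation, and the field names are assumptions:

```python
# The four trigger types that justify a follow-up probe
PROBE_TRIGGERS = (
    "ambiguous_failure",        # FAIL reported without a clear explanation
    "pattern_confirmation",     # same friction seen across multiple testers
    "contradiction",            # experience conflicts with other testers
    "unexplored_critical_path", # persona-critical task skipped silently
)

def should_probe(report, trigger, probes_sent, max_total=4):
    """Fire a follow-up only on a named trigger, within per-tester and per-run caps."""
    if trigger not in PROBE_TRIGGERS:
        return False
    if probes_sent.get(report["tester_id"], 0) >= 1:  # at most one per tester
        return False
    return sum(probes_sent.values()) < max_total       # three or four total per run
```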
The test matrix: personas, themes, viewports
Run it with a single command:
```
/tool-ux-study 5
```

That spawns five regular testers plus one accessibility tester by default. Or be explicit about exactly what you want:
```
/tool-ux-study --user=3 --duser=2 --model=sonnet --review-model=opus
```
Before spawning testers, the skill needs to know who your users are and what they do. It checks three sources in order:
- `ux-study.json` in your project root, where you define personas, test users, and the application URL explicitly
- Project documentation like your README, PRDs, and design docs, from which it generates personas based on your application’s purpose and features
- Asking you directly: “What does this application do? Who uses it? What are the key features?”
The generated personas are presented for your approval before any agents spawn. You can add, remove, or adjust them.
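For reference, a minimal `ux-study.json` might look something like the sketch below. The exact field names here are assumptions for illustration, not the skill's actual schema:

```json
{
  "url": "http://localhost:3000",
  "personas": [
    {
      "name": "The Newcomer",
      "background": "First visit, deciding whether to join the platform",
      "tasks": ["Find the events view", "Locate the proposals section"]
    }
  ],
  "testUsers": [
    { "login": "newcomer@example.com", "password": "test-password" }
  ]
}
```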
Each persona has a background, a realistic scenario, and concrete tasks. For an e-commerce site, you might get “The Bargain Hunter” who compares prices and reads reviews. For a SaaS dashboard, “The New Employee” completing onboarding for the first time. For a community platform, “The Newcomer” exploring whether to join. These aren’t abstract test cases. They’re people trying to accomplish something, tailored to your specific application.
The skill then builds a test matrix. Each tester agent gets a unique slot in a grid of persona, theme, and viewport. Regular agent i gets the light theme if i % 2 == 0, dark theme otherwise. The first half get desktop (1440x900), the second half get mobile (375x812). Accessibility testers continue the alternating pattern after the regular testers.
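That assignment rule can be sketched in a few lines of Python. This is an illustration of the grid logic described above, not the skill's actual code:

```python
def build_matrix(personas, a11y_personas):
    """Assign each tester a unique (persona, theme, viewport) slot."""
    matrix = []
    n = len(personas)
    half = (n + 1) // 2  # first half (rounded up) gets desktop
    for i, persona in enumerate(personas):
        matrix.append({
            "persona": persona,
            "theme": "light" if i % 2 == 0 else "dark",
            "viewport": "1440x900" if i < half else "375x812",
        })
    # Accessibility testers continue the alternating theme pattern,
    # indexed after the regular testers; they run at desktop size.
    for j, persona in enumerate(a11y_personas, start=n):
        matrix.append({
            "persona": persona,
            "theme": "light" if j % 2 == 0 else "dark",
            "viewport": "1440x900",
        })
    return matrix
```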
Example matrix for --user=5 --duser=2:
| # | Persona | Theme | Viewport |
|---|---|---|---|
| 0 | The Newcomer | light | desktop |
| 1 | The Regular User | dark | desktop |
| 2 | The Power User | light | desktop |
| 3 | The Mobile-First User | dark | mobile |
| 4 | The Skeptic | light | mobile |
| 5 | Screen Reader User (Blind) | dark | desktop |
| 6 | Motor Impairment (Keyboard Only) | light | desktop |
The skill prints this table to the terminal before any agents spawn. You can see exactly what’s about to run and stop it if something looks wrong. Combined with the theme and viewport grid, each agent is testing a genuinely distinct slice of the product.
Accessibility testing is built in, not bolted on
By default, every run includes at least one accessibility tester. The five disability personas cover the main categories of impairment that WCAG 2.2 addresses.
Screen Reader User (Blind). Navigates using only keyboard and screen reader output. After each page loads, this tester runs a snapshot to examine the accessibility tree: heading hierarchy, alt text on images, ARIA labels on interactive elements, whether dynamic content changes get announced via live regions. Maps get special attention. Is there a text alternative, or is the location information only available visually?
Keyboard-Only User (Motor Impairment). Cannot touch the mouse. Every interactive element must be reachable by Tab, every modal dismissible with Escape. The tester checks for keyboard traps, visible focus indicators, and whether skip-to-content links exist. Mobile navigation via hamburger menu gets tested with keyboard alone.
Low Vision User (Partial Sight). Tests at 200% and 400% browser zoom. Does the layout reflow cleanly, or does content overflow and require horizontal scrolling? Placeholder text, disabled states, and subtle UI elements get checked for contrast. Icon-only buttons need to communicate their purpose without hover tooltips.
Cognitive Disability User. Complex interfaces are the problem here. This tester checks navigation consistency across pages, whether labels are clear and jargon-free, whether error messages explain what went wrong and how to fix it, and whether multi-step processes show progress. Recovery from mistakes matters: can you undo an action, go back without losing form data?
Deaf/Hard of Hearing User. Checks whether any feedback is audio-only. Notification banners, form validation, alert states all need visual equivalents. If the platform links to video content, does it indicate whether captions are available?
Each accessibility tester produces a standard tester report with an added ## Accessibility Barriers section that maps each finding to the WCAG criterion it violates. That section feeds directly into the WCAG expert review in phase 6.
How observations flow during the run
The facilitator doesn’t wait for all testers to finish before doing anything useful. As reports arrive, it reads them, appends to observation notes, and checks for emerging patterns every two or three completions.
Screenshot reading is deliberately selective. Looking at every screenshot from every tester would cost a lot of tokens and provide diminishing returns. A smooth session (score 8+, no critical bugs, no major friction) doesn’t need visual review. The facilitator reads screenshots only when a tester reports a critical or serious bug, or when a tester describes being stuck on a specific page. That keeps total screenshot reads around 15-25 for a typical run instead of over a hundred.
Every few completions, you get a research-style briefing:
```
Progress: 4/12 testers reporting in.

Emerging findings:
- Navigation to proposals is a consistent pain point — 3 of 4 testers struggled.
  Sophie: "I have no idea where proposals are, I've been clicking around for a while"
- Dark mode contrast issues on multiple pages (groups map, settings sidebar).
- Dashboard onboarding getting positive reactions — newcomers feel welcomed.

Waiting on 8 more testers...
```
Not just “4 done, 8 waiting.” The emerging findings at this point are already actionable. You can stop the run early if you’ve seen enough, or let it finish for full coverage.
When all testers have reported, the facilitator runs affinity mapping: clustering observations into themes across the full dataset. A bug reported on three different pages by three different testers might share one root cause. Friction around navigation on both desktop and mobile is an information architecture issue, not a viewport bug. The affinity map makes those connections explicit before the final report gets written.
The final REPORT.md reads like research findings. It opens with an executive summary, then narrates the themes that emerged, quotes from testers woven in, followed by a consolidated bug list sorted by severity. Individual tester reports live alongside it, one file per tester, so anyone who wants to trace a finding back to its source can.
Phase 6: expert reviews
After the tester reports and REPORT.md are written, three expert reviewer agents run in parallel. These are separate AI agents, each reading the full report and individual tester files, then applying a specialist lens.
The Product Manager reviewer reads the test findings against the product documentation. It identifies requirements that are working, areas not meeting their spec, features that were built but aren’t in the docs, and features documented as planned that haven’t been built yet. It also flags documentation improvements: specific files that need updating, what’s wrong in each one, and which tester findings revealed the discrepancy.
The UX Designer reviewer reads the tester reports and digs into the frontend source. It analyzes information architecture, navigation patterns, error handling, dark mode implementation quality, and mobile design. It produces design recommendations sorted by effort: quick wins, medium-effort improvements, and things that need a bigger redesign conversation.
The WCAG Expert reviewer reads the accessibility tester reports and maps every barrier to a WCAG 2.2 AA success criterion. It also cross-references the regular tester reports for accessibility-adjacent issues, including contrast problems, touch target sizes, and keyboard navigation gaps that non-disabled testers happened to notice. The output is a structured compliance assessment across all four WCAG principles: Perceivable, Operable, Understandable, Robust.
If you run with --duser=0, the WCAG review is skipped. It only makes sense if accessibility testers actually ran.
After all three reviews finish, the facilitator compiles a product backlog. One PBI per issue, no duplicates. If the same problem appears in the tester report, the PM review, and the UX review, it becomes one backlog item with links to all three sources. Priority maps from severity: Critical bugs become P0, High become P1. WCAG failures follow the same rule. Each PBI includes acceptance criteria that are specific and testable.
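The compilation step amounts to deduplicating findings by issue and taking the highest priority seen across sources. A rough sketch follows; the medium and low mappings are my assumptions, since the skill only specifies Critical → P0 and High → P1, and the field names are illustrative:

```python
SEVERITY_TO_PRIORITY = {"critical": "P0", "high": "P1",
                        "medium": "P2", "low": "P3"}

def compile_backlog(findings):
    """One PBI per issue: findings sharing an issue key merge into one item."""
    backlog = {}
    for f in findings:
        item = backlog.setdefault(f["issue"], {"sources": [], "priority": "P3"})
        item["sources"].append(f["source"])  # link every source that reported it
        priority = SEVERITY_TO_PRIORITY[f["severity"]]
        if priority < item["priority"]:      # "P0" sorts before "P1", etc.
            item["priority"] = priority
    return backlog
```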
The backlog ends with a sprint planning suggestion: which items to take in sprints 1, 2, and 3 based on expected impact on the overall readiness score. P0 items go in sprint 1 by default; from there the facilitator suggests sequencing based on effort estimates and theme clustering.
The report site
After all reports, reviews, and the backlog are written, the skill scaffolds a VitePress site in the timestamped output directory. Browse it with:
```
cd test-reports/theme-test-20260322-143012 && npm start
```
The sidebar lists every individual tester report. The index page shows hero stats: total testers, average readiness score, bug counts by severity, accessibility tester count. Expert reviews link from the main report. The backlog is browsable with priority grouping.
This matters more than it sounds. A markdown file in a directory is hard to navigate when you have twelve tester reports, three expert reviews, and a backlog. The generated site turns the output into something you can sit down with as a team and actually work through, not something you forward over Slack and hope someone reads.
The site gets regenerated fresh each run. Old runs accumulate in their own timestamped directories, so you can compare this sprint’s report against last sprint’s without anything overwriting anything.
Model configuration
The skill supports three model tiers for tester agents, set independently from the reviewer agents. The --model flag controls only the tester agents. The lead facilitator always runs on whatever model you started the session with. Reviewer agents inherit the parent model unless you set --review-model explicitly.
Haiku is the cheapest option and works well for regular regression testing after merges. Sonnet is the default and provides a good balance. Opus produces richer tester narratives if you want the most detailed output for a critical release.
Costs vary widely depending on your application’s complexity, how many pages testers visit, and how many screenshots get reviewed. Start with a small run (/tool-ux-study 3) to get a feel for what a session costs for your project before scaling up.
What the skill handles vs. what the AI figures out
This is worth being direct about. The skill defines the structure: spawn N testers, assign personas from the list, alternate themes and viewports, send the tester prompt template, wait for reports, probe on triggers, affinity-map at the end, spawn three reviewers in parallel, generate a backlog with these rules, scaffold a VitePress site with this layout.
What the AI brings is the judgment inside that structure. The facilitator decides which arriving reports warrant a follow-up probe and what question to ask. It decides whether three similar friction moments share a root cause or are coincidentally similar symptoms. The WCAG reviewer decides how to interpret an ambiguous accessibility snapshot against a specific criterion. The UX reviewer notices when a navigation pattern that works in isolation creates confusion at the cross-feature level.
The skill sets up the conditions for that judgment to be applied consistently and repeatedly. Without the structure, the AI would wander. Without the AI, the structure is just a checklist.
That’s the honest description of what this kind of skill does. The scaffolding is explicit and version-controlled. The intelligence runs inside it.
Running it
Prerequisites before invoking:
- Your application running locally
- Test user accounts available (via `ux-study.json`, seed data, or self-registration)
- `playwright-cli` installed
Start small. Three testers plus one accessibility tester is enough to surface the most significant issues in a first run:
```
/tool-ux-study 3
```
If browsers are left over from a previous run, clear them first:
```
playwright-cli close-all
```
Once you’ve seen the report format and have a feel for what surfaces, scale up to a full run for more coverage. Running with --model=haiku for regular regression testing after merges keeps the recurring cost low without giving up coverage entirely.
The output directory is timestamped, so runs accumulate. Running the same study across sprints lets you track how findings evolve: which P0s got fixed, which friction patterns persist, whether the accessibility score is moving in the right direction.
What it doesn’t replace
An AI agent navigating a browser isn’t a person. It doesn’t have the physical fatigue of someone who just tried four other platforms today. It doesn’t have the cognitive load of someone who’s simultaneously thinking about dinner. It can simulate the keyboard-only experience, but it isn’t actually using a screen reader with real voice output and building mental models from what it hears.
The WCAG reviewer flags this explicitly in its output. It tells you which findings couldn’t be verified with AI agents and suggests specific follow-up tests with real assistive technology users. Some WCAG criteria require human judgment that no amount of accessibility tree snapshots can substitute for.
What it does replace: the gap where no usability testing happened at all. That’s the default state for most teams most of the time. Getting structured feedback from ten personas across two themes, two viewports, and five disability categories, plus three expert reviews and a prioritized backlog, in a single afternoon changes what’s possible. Not perfect. Not a replacement for a real research program. But a lot better than shipping and hoping.