Status: canonical / adopted 2026-04-15 Scope: every PR that touches src/token_world/dashboard/ Companion: sim-quality-rubric.md

Dashboard QA Checklist¶

Why this exists¶

Session 4 shipped a dashboard with a scroll-reset bug. Every poll cycle called outer.clear() on the tick-stream container, which destroyed the scroll offset on every refresh. I never caught it because I treated playwright_take_screenshot as the whole validation step. A screenshot is a still photograph of a live UI; the bug lives in the interaction.

The discipline is simple: a dashboard panel is not shipped until it has been used. Used means scrolled, clicked, resized, waited through, opened and re-opened. This doc is the required QA pass gate that turns "I looked at it" into "I used it."

If the checklist feels long — good. Every item corresponds to a real-world bug we missed or nearly missed. Cutting the list is how regressions ship.

Section 1 — Required Checks¶

Every dashboard PR must pass all nine before the panel is marked "shipped." Evidence is a screenshot, a log line, or a pytest assertion; "I looked and it seemed fine" does not count.

1. Initial render¶

Why: most visible layout bugs (overflow, clipped cells, wrong aspect ratio) show up as soon as the page loads at a real viewport.
How: launch token-world dashboard <slug>, navigate with Playwright at both 1280x800 and 1920x1080. Capture a fullPage screenshot at each. Compare header, left column, right column — no element should be clipped or overflowing.

2. Scroll preservation across poll cycles¶

Why: this is the Session 4 regression. A naïve outer.clear() in a ui.timer callback nukes scroll, focus, and text selection every refresh. The panel looks correct in screenshots but is unusable the moment the user tries to read it.
How: scroll to ~50% depth inside every scrollable region (tick stream, graph canvas, property drawer body, causal chain results). Wait at least 3× the poll interval (poll is 2s; wait ≥6s). Assert the scroll offset has not changed. Run one cycle at default 1280x800, one at 1920x1080.

3. Focus and text-selection preservation¶

Why: users type node ids into the causal-chain input. If focus is lost every 2s, the field is unusable. Same for highlighting tick payload text in the card expansion.
How: click into every text input, wait 3× poll. Focus must persist. Highlight a run of text in a read-only region; wait 3× poll. Selection must persist.

4. Interactive elements¶

Why: every click-target is a state-change vector the dashboard must handle without breakage. "Expand tick card," "click node in graph," "trace property" — each has its own DOM path that poll may or may not survive.
How: click every button, expansion arrow, node button, and dropdown at least once. For each triggered state, capture a screenshot named after the state (expanded-tick-042.png, drawer-cottage.png, etc.). If anything fails to respond, log the console and add it to the PR review.

Why: the graph-canvas drawer is the worst offender for poll-induced destruction, because its content comes from a select on a node id that re-evaluates on every _rebuild.
How: open the property drawer for a specific node. Wait 3× poll. Drawer must stay open with identical content (same header, same JSON body). Repeat with a second node to confirm the state isn't pinned to the first id by accident.

6. Polling no-op¶

Why: if _rebuild fires on every tick even when nothing changed, every other check here is at risk. The positive test is "during an idle window, the DOM-mutation log stays empty."
How: instrument whichever _rebuild / refresh() callback drives the panel with a log line that fires only when it actually changes the DOM. Point the dashboard at a stable universe (no active runner). Leave the browser idle for 30s. Expect zero log lines. Non-zero means poll is rebuilding pointlessly and killing interaction.

7. Empty / missing universe graceful degradation¶

Why: a stack trace at the first 500 breaks trust. The dashboard must render a friendly banner at every failure boundary.
How: point the dashboard at a non-existent slug (token-world dashboard does-not-exist). CLI should exit 1 before NiceGUI starts. Point it at an empty but valid universe (0 ticks); every panel must render a placeholder, nothing crashes. Delete the universe.db mid-session; graph panel surfaces an error, other panels keep polling without wedging.

8. Rendering correctness¶

Why: this catches escape-leakage, mis-coloured nodes, missing edges — the class of bugs where the DOM is "right" but the mapping from graph state to DOM is wrong. See §A3 and §A4 in MORNING-HANDOFF.md.
How: inspect every label for literal <, [, & leakage. Cross-reference node colours against docs/guides/dashboard.md (agent=blue, container=amber, etc.). Walk the graph: for every node property whose value is a known node id (located_in, contains, held_by, …), there MUST be a corresponding rendered edge, or the property must be explicitly excluded with a documented reason. Property drawer contents must exactly match kg.query(<node>, <prop>) for the live graph.

9. Automated test¶

Why: manual QA is a guardrail, not a gate. The only QA that scales is executable QA. Section 2 below is the Playwright routine this test embodies.
How: the above 1-8 are encoded as tests/test_dashboard/test_qa_interactive.py. It runs the Playwright routine against a fixture universe and a synthetic poll cycle, asserts each check. The test is a required CI gate — a PR that regresses any of 1-8 fails CI, not just human review.

Section 2 — Playwright MCP Routine¶

This is the concrete recipe for Section 1's checks. It uses the mcp__playwright__* tools available in-session. Run it as a fresh sequence each time — no "still have the browser open from earlier" assumptions.

Setup¶

Launch the dashboard in a background bash: uv run token-world dashboard <slug> --port 8080 --no-show
Wait 2s for the server to bind (NiceGUI startup is fast but not instant).

Routine¶

# 1. Open + viewport sweep
mcp__playwright__browser_navigate(url="http://localhost:8080")
mcp__playwright__browser_resize(width=1280, height=800)
mcp__playwright__browser_take_screenshot(filename="initial-1280.png", fullPage=true)
mcp__playwright__browser_resize(width=1920, height=1080)
mcp__playwright__browser_take_screenshot(filename="initial-1920.png", fullPage=true)

# 2. Snapshot the DOM for element refs
mcp__playwright__browser_snapshot()

# 3. Scroll + wait + re-snapshot (checks 2, 6)
#    For each scrollable region, scroll to mid-depth, wait 3x poll,
#    snapshot again, assert scroll offset unchanged.
mcp__playwright__browser_evaluate(function="() => { document.querySelector('#tick-stream').scrollTop = 400; }")
# wait >= 6s
mcp__playwright__browser_wait_for(time=7)
mcp__playwright__browser_evaluate(function="() => document.querySelector('#tick-stream').scrollTop")
# ^ expect 400, not 0

# 4. Interactive elements (check 4)
#    For each button / expansion / node click-target:
mcp__playwright__browser_click(element="tick card 042 expander", ref="<ref from snapshot>")
mcp__playwright__browser_wait_for(text="<expected content after expansion>")
mcp__playwright__browser_take_screenshot(filename="expanded-042.png")

# 5. Drawer persistence (check 5)
mcp__playwright__browser_click(element="graph node: cottage", ref="<ref>")
mcp__playwright__browser_wait_for(time=7)
mcp__playwright__browser_snapshot()
# ^ drawer should still show cottage properties

# 6. Focus preservation (check 3)
mcp__playwright__browser_click(element="causal-chain node input", ref="<ref>")
mcp__playwright__browser_type(element="...", ref="...", text="mira")
mcp__playwright__browser_wait_for(time=7)
mcp__playwright__browser_evaluate(function="() => document.activeElement.id")
# ^ expect the input's id, not body

# 7. Final full-page capture
mcp__playwright__browser_take_screenshot(filename="final.png", fullPage=true)

# 8. Close
mcp__playwright__browser_close()

Notes¶

browser_snapshot before browser_click: click needs an element ref which you extract from the accessibility-tree snapshot. Don't try to click by CSS selector alone.
browser_resize twice: responsiveness bugs only show at exactly one of the two viewports. Both must be screenshotted.
browser_wait_for(time=N) vs browser_wait_for(text=...): use text when you're waiting for a specific UI change (content to appear after a click). Use time for "let the poll cycle through at least twice." 7s covers a 2s poll comfortably.
browser_evaluate: the only way to read scrollTop, focus target, CSS state directly. Keep the JS one-liner.
Screenshots are evidence, not validation: a screenshot proves what the DOM looked like at that instant; it cannot prove that scroll was preserved across a refresh. That's what browser_evaluate + an assertion is for.

Section 3 — End-of-Build User Pass¶

After the Playwright routine passes, there's still one gap: the routine only checks what it was told to check. Novel bugs — the "would a real person expect this?" class — slip through any finite checklist.

The antidote is a 5-minute user-mode cooldown at the end of every UI build session.

The switch¶

Builder mode (what you were in): "does the code I wrote do what I thought?" The fixation is on your own change. Shorthand, clipped labels, and truncations read as "rendered, fine."
User mode (what you switch to): "does the output make sense to someone who has never seen this code?" The fixation is on the artefact. Every shorthand is a question: "would a real person expect that here?"

The routine¶

Close every editor tab.
Launch the dashboard fresh (token-world dashboard <slug>).
Open the browser manually — do NOT use Playwright, do NOT automate.
Set a 5-minute timer.
Click around. Scroll. Open drawers. Type into the causal-chain input. Resize the window. Read every truncated label and ask: is this enough for a new user?
Write down every surprise in a scratch note. Even "huh, that's weird" counts.
At timer end: each scratch-note item becomes either a follow-up ticket or a fix before the PR merges. None are dismissed silently.

Why 5 minutes¶

Short enough that it's cheap.
Long enough that you stop skimming and actually use the thing.
Forces the switch from "is this correct?" to "is this good?"

Enforcement¶

Hard to automate. Instead: every dashboard PR description must include a ## User Pass section that lists at least three observations from the cooldown. "Nothing surprising" is a valid entry only if the reviewer can independently verify; by default, expect at least one item of friction per pass.

In autonomous overnight sessions, the orchestrator spawns a dedicated QA subagent after the build subagent reports done. The QA subagent runs the Playwright routine (Section 2) and then writes a user-pass report. The orchestrator does not mark the build shipped until the QA subagent's report lands in docs/quality/runs/.

Section 4 — When to Gate¶

Mandatory pass¶

PR touches src/token_world/dashboard/ (any file):
Section 1 checks 1-8 must pass (evidence in PR description).
Section 2 Playwright routine must run against the branch build.
Section 1 check 9 (automated test) must be green in CI.
Section 3 user pass: at least three observations in PR description.
PR adds a new panel: treat the panel as a separate required sweep of all nine checks. The existing panels' test coverage does not extend to the new panel.
PR changes the poll interval or _rebuild cadence: Section 1 check 6 (polling no-op) is load-bearing; re-run it fresh.

Optional (but advised)¶

PR touches shared NiceGUI utilities but no panel directly: run Section 2 against an untouched panel as a smoke check. Polling helpers frequently regress everyone at once.
Monthly rhythm: run the full Playwright routine against master once a month independent of any PR, and file the results under docs/quality/runs/YYYY-MM-DD/. Catches drift that no individual PR triggered.

Non-gates (explicitly)¶

Docs-only PRs touching docs/guides/dashboard.md alone — no panel QA needed.
Dashboard CLI flag additions (e.g. --port, --no-dark) that don't touch panel code — smoke test the flag; skip the full sweep.

Escalation¶

If a check fails and the fix is non-trivial (≥ 1 tick of engineering work), the PR must either: - carry the fix, or - land behind a feature flag that disables the regressed surface until the fix is in, or - be reverted.

"We'll fix it next sprint" is not an allowed option for checks 1-8. Shipping a poll-regression dashboard into overnight runs means every subsequent debugging session starts from bad telemetry.

sim-quality-rubric.md — the sibling rubric for "is the simulation healthy?" Simulation-quality gates apply to overnight runs; dashboard QA gates apply to the panel PRs that surface those runs. Both are required; neither is sufficient.
MORNING-HANDOFF.md §K1, §K3, §K4 — the process-gap discussion this checklist codifies.
docs/guides/dashboard.md — user-facing dashboard guide. When a check here fails, the fix likely updates both files.