Status: canonical / adopted 2026-04-15 Scope: every PR that touches src/token_world/dashboard/ Companion: sim-quality-rubric.md
Dashboard QA Checklist¶
Why this exists¶
Session 4 shipped a dashboard with a scroll-reset bug. Every poll cycle called outer.clear() on the tick-stream container, which destroyed the scroll offset on every refresh. I never caught it because I treated playwright_take_screenshot as the whole validation step. A screenshot is a still photograph of a live UI; the bug lives in the interaction.
The discipline is simple: a dashboard panel is not shipped until it has been used. Used means scrolled, clicked, resized, waited through, opened and re-opened. This doc is the required QA pass gate that turns "I looked at it" into "I used it."
If the checklist feels long — good. Every item corresponds to a real-world bug we missed or nearly missed. Cutting the list is how regressions ship.
Section 1 — Required Checks¶
Every dashboard PR must pass all nine before the panel is marked "shipped." Evidence is a screenshot, a log line, or a pytest assertion; "I looked and it seemed fine" does not count.
1. Initial render¶
- Why: most visible layout bugs (overflow, clipped cells, wrong aspect ratio) show up as soon as the page loads at a real viewport.
- How: launch
token-world dashboard <slug>, navigate with Playwright at both1280x800and1920x1080. Capture a fullPage screenshot at each. Compare header, left column, right column — no element should be clipped or overflowing.
2. Scroll preservation across poll cycles¶
- Why: this is the Session 4 regression. A naïve
outer.clear()in aui.timercallback nukes scroll, focus, and text selection every refresh. The panel looks correct in screenshots but is unusable the moment the user tries to read it. - How: scroll to ~50% depth inside every scrollable region (tick stream, graph canvas, property drawer body, causal chain results). Wait at least 3× the poll interval (poll is 2s; wait ≥6s). Assert the scroll offset has not changed. Run one cycle at default
1280x800, one at1920x1080.
3. Focus and text-selection preservation¶
- Why: users type node ids into the causal-chain input. If focus is lost every 2s, the field is unusable. Same for highlighting tick payload text in the card expansion.
- How: click into every text input, wait 3× poll. Focus must persist. Highlight a run of text in a read-only region; wait 3× poll. Selection must persist.
4. Interactive elements¶
- Why: every click-target is a state-change vector the dashboard must handle without breakage. "Expand tick card," "click node in graph," "trace property" — each has its own DOM path that poll may or may not survive.
- How: click every button, expansion arrow, node button, and dropdown at least once. For each triggered state, capture a screenshot named after the state (
expanded-tick-042.png,drawer-cottage.png, etc.). If anything fails to respond, log the console and add it to the PR review.
5. Drawer / modal state persistence¶
- Why: the graph-canvas drawer is the worst offender for poll-induced destruction, because its content comes from a
selecton a node id that re-evaluates on every_rebuild. - How: open the property drawer for a specific node. Wait 3× poll. Drawer must stay open with identical content (same header, same JSON body). Repeat with a second node to confirm the state isn't pinned to the first id by accident.
6. Polling no-op¶
- Why: if
_rebuildfires on every tick even when nothing changed, every other check here is at risk. The positive test is "during an idle window, the DOM-mutation log stays empty." - How: instrument whichever
_rebuild/refresh()callback drives the panel with a log line that fires only when it actually changes the DOM. Point the dashboard at a stable universe (no active runner). Leave the browser idle for 30s. Expect zero log lines. Non-zero means poll is rebuilding pointlessly and killing interaction.
7. Empty / missing universe graceful degradation¶
- Why: a stack trace at the first 500 breaks trust. The dashboard must render a friendly banner at every failure boundary.
- How: point the dashboard at a non-existent slug (
token-world dashboard does-not-exist). CLI should exit 1 before NiceGUI starts. Point it at an empty but valid universe (0 ticks); every panel must render a placeholder, nothing crashes. Delete the universe.db mid-session; graph panel surfaces an error, other panels keep polling without wedging.
8. Rendering correctness¶
- Why: this catches escape-leakage, mis-coloured nodes, missing edges — the class of bugs where the DOM is "right" but the mapping from graph state to DOM is wrong. See §A3 and §A4 in
MORNING-HANDOFF.md. - How: inspect every label for literal
<,[,&leakage. Cross-reference node colours againstdocs/guides/dashboard.md(agent=blue, container=amber, etc.). Walk the graph: for every node property whose value is a known node id (located_in,contains,held_by, …), there MUST be a corresponding rendered edge, or the property must be explicitly excluded with a documented reason. Property drawer contents must exactly matchkg.query(<node>, <prop>)for the live graph.
9. Automated test¶
- Why: manual QA is a guardrail, not a gate. The only QA that scales is executable QA. Section 2 below is the Playwright routine this test embodies.
- How: the above 1-8 are encoded as
tests/test_dashboard/test_qa_interactive.py. It runs the Playwright routine against a fixture universe and a synthetic poll cycle, asserts each check. The test is a required CI gate — a PR that regresses any of 1-8 fails CI, not just human review.
Section 2 — Playwright MCP Routine¶
This is the concrete recipe for Section 1's checks. It uses the mcp__playwright__* tools available in-session. Run it as a fresh sequence each time — no "still have the browser open from earlier" assumptions.
Setup¶
- Launch the dashboard in a background bash:
uv run token-world dashboard <slug> --port 8080 --no-show - Wait 2s for the server to bind (NiceGUI startup is fast but not instant).
Routine¶
# 1. Open + viewport sweep
mcp__playwright__browser_navigate(url="http://localhost:8080")
mcp__playwright__browser_resize(width=1280, height=800)
mcp__playwright__browser_take_screenshot(filename="initial-1280.png", fullPage=true)
mcp__playwright__browser_resize(width=1920, height=1080)
mcp__playwright__browser_take_screenshot(filename="initial-1920.png", fullPage=true)
# 2. Snapshot the DOM for element refs
mcp__playwright__browser_snapshot()
# 3. Scroll + wait + re-snapshot (checks 2, 6)
# For each scrollable region, scroll to mid-depth, wait 3x poll,
# snapshot again, assert scroll offset unchanged.
mcp__playwright__browser_evaluate(function="() => { document.querySelector('#tick-stream').scrollTop = 400; }")
# wait >= 6s
mcp__playwright__browser_wait_for(time=7)
mcp__playwright__browser_evaluate(function="() => document.querySelector('#tick-stream').scrollTop")
# ^ expect 400, not 0
# 4. Interactive elements (check 4)
# For each button / expansion / node click-target:
mcp__playwright__browser_click(element="tick card 042 expander", ref="<ref from snapshot>")
mcp__playwright__browser_wait_for(text="<expected content after expansion>")
mcp__playwright__browser_take_screenshot(filename="expanded-042.png")
# 5. Drawer persistence (check 5)
mcp__playwright__browser_click(element="graph node: cottage", ref="<ref>")
mcp__playwright__browser_wait_for(time=7)
mcp__playwright__browser_snapshot()
# ^ drawer should still show cottage properties
# 6. Focus preservation (check 3)
mcp__playwright__browser_click(element="causal-chain node input", ref="<ref>")
mcp__playwright__browser_type(element="...", ref="...", text="mira")
mcp__playwright__browser_wait_for(time=7)
mcp__playwright__browser_evaluate(function="() => document.activeElement.id")
# ^ expect the input's id, not body
# 7. Final full-page capture
mcp__playwright__browser_take_screenshot(filename="final.png", fullPage=true)
# 8. Close
mcp__playwright__browser_close()
Notes¶
browser_snapshotbeforebrowser_click: click needs an elementrefwhich you extract from the accessibility-tree snapshot. Don't try to click by CSS selector alone.browser_resizetwice: responsiveness bugs only show at exactly one of the two viewports. Both must be screenshotted.browser_wait_for(time=N)vsbrowser_wait_for(text=...): usetextwhen you're waiting for a specific UI change (content to appear after a click). Usetimefor "let the poll cycle through at least twice." 7s covers a 2s poll comfortably.browser_evaluate: the only way to read scrollTop, focus target, CSS state directly. Keep the JS one-liner.- Screenshots are evidence, not validation: a screenshot proves what the DOM looked like at that instant; it cannot prove that scroll was preserved across a refresh. That's what
browser_evaluate+ an assertion is for.
Section 3 — End-of-Build User Pass¶
After the Playwright routine passes, there's still one gap: the routine only checks what it was told to check. Novel bugs — the "would a real person expect this?" class — slip through any finite checklist.
The antidote is a 5-minute user-mode cooldown at the end of every UI build session.
The switch¶
- Builder mode (what you were in): "does the code I wrote do what I thought?" The fixation is on your own change. Shorthand, clipped labels, and truncations read as "rendered, fine."
- User mode (what you switch to): "does the output make sense to someone who has never seen this code?" The fixation is on the artefact. Every shorthand is a question: "would a real person expect that here?"
The routine¶
- Close every editor tab.
- Launch the dashboard fresh (
token-world dashboard <slug>). - Open the browser manually — do NOT use Playwright, do NOT automate.
- Set a 5-minute timer.
- Click around. Scroll. Open drawers. Type into the causal-chain input. Resize the window. Read every truncated label and ask: is this enough for a new user?
- Write down every surprise in a scratch note. Even "huh, that's weird" counts.
- At timer end: each scratch-note item becomes either a follow-up ticket or a fix before the PR merges. None are dismissed silently.
Why 5 minutes¶
- Short enough that it's cheap.
- Long enough that you stop skimming and actually use the thing.
- Forces the switch from "is this correct?" to "is this good?"
Enforcement¶
Hard to automate. Instead: every dashboard PR description must include a ## User Pass section that lists at least three observations from the cooldown. "Nothing surprising" is a valid entry only if the reviewer can independently verify; by default, expect at least one item of friction per pass.
In autonomous overnight sessions, the orchestrator spawns a dedicated QA subagent after the build subagent reports done. The QA subagent runs the Playwright routine (Section 2) and then writes a user-pass report. The orchestrator does not mark the build shipped until the QA subagent's report lands in docs/quality/runs/.
Section 4 — When to Gate¶
Mandatory pass¶
- PR touches
src/token_world/dashboard/(any file): - Section 1 checks 1-8 must pass (evidence in PR description).
- Section 2 Playwright routine must run against the branch build.
- Section 1 check 9 (automated test) must be green in CI.
- Section 3 user pass: at least three observations in PR description.
- PR adds a new panel: treat the panel as a separate required sweep of all nine checks. The existing panels' test coverage does not extend to the new panel.
- PR changes the poll interval or
_rebuildcadence: Section 1 check 6 (polling no-op) is load-bearing; re-run it fresh.
Optional (but advised)¶
- PR touches shared NiceGUI utilities but no panel directly: run Section 2 against an untouched panel as a smoke check. Polling helpers frequently regress everyone at once.
- Monthly rhythm: run the full Playwright routine against
masteronce a month independent of any PR, and file the results underdocs/quality/runs/YYYY-MM-DD/. Catches drift that no individual PR triggered.
Non-gates (explicitly)¶
- Docs-only PRs touching
docs/guides/dashboard.mdalone — no panel QA needed. - Dashboard CLI flag additions (e.g.
--port,--no-dark) that don't touch panel code — smoke test the flag; skip the full sweep.
Escalation¶
If a check fails and the fix is non-trivial (≥ 1 tick of engineering work), the PR must either: - carry the fix, or - land behind a feature flag that disables the regressed surface until the fix is in, or - be reverted.
"We'll fix it next sprint" is not an allowed option for checks 1-8. Shipping a poll-regression dashboard into overnight runs means every subsequent debugging session starts from bad telemetry.
Related¶
- sim-quality-rubric.md — the sibling rubric for "is the simulation healthy?" Simulation-quality gates apply to overnight runs; dashboard QA gates apply to the panel PRs that surface those runs. Both are required; neither is sufficient.
MORNING-HANDOFF.md§K1, §K3, §K4 — the process-gap discussion this checklist codifies.docs/guides/dashboard.md— user-facing dashboard guide. When a check here fails, the fix likely updates both files.