Playtesting

The autonomous playtest framework runs an AI player agent through your game for up to N turns, logs every LLM call, and produces a diagnostic report.

Running a Playtest

uv run python scripts/playtest.py --game lost-island --turns 20

| Flag | Default | Description |
| --- | --- | --- |
| --game | (required) | Game ID (directory name under games/) |
| --turns | 20 | Maximum turns |
| --player-name | Alex | Player character name |
| --stop-on-error | off | Stop on first error |
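
For readers who want to see how these flags and defaults fit together, here is a minimal argparse sketch that mirrors the table. It is an illustration only: the actual parsing code in scripts/playtest.py may be organized differently, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the real script)."""
    parser = argparse.ArgumentParser(prog="playtest.py")
    parser.add_argument("--game", required=True,
                        help="Game ID (directory name under games/)")
    parser.add_argument("--turns", type=int, default=20, help="Maximum turns")
    parser.add_argument("--player-name", default="Alex",
                        help="Player character name")
    # A store_true flag is off unless it is passed on the command line.
    parser.add_argument("--stop-on-error", action="store_true",
                        help="Stop on first error")
    return parser


# Only --game is given, so every other flag falls back to its default.
args = build_parser().parse_args(["--game", "lost-island"])
print(args.game, args.turns, args.player_name, args.stop_on_error)
```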

What Happens During a Playtest

flowchart TD
    A["Create save from\ngame template"] --> B["AI player agent\ngenerates input"]
    B --> C["Full turn pipeline\n(narrator → characters →\nmemory → game state)"]
    C --> D["Log all LLM calls"]
    D --> E{"More turns?"}
    E -->|Yes| B
    E -->|No| F["Generate report\n(scores, errors, beats)"]

Each turn exercises the complete pipeline (narrator, character agents, memory updates, and game state checks), exactly as in a real player session.
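
The loop in the flowchart can be sketched roughly as follows. This is a simplified illustration, not the framework's code: `run_playtest`, `PlaytestRun`, and the two callables are hypothetical stand-ins for the AI player agent and the turn pipeline, and the sketch assumes that a turn which raises does not count as completed.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PlaytestRun:
    """Hypothetical container for what a run records."""
    turns_completed: int = 0
    errors: list[str] = field(default_factory=list)


def run_playtest(max_turns: int,
                 generate_input: Callable[[int], str],
                 run_turn: Callable[[str], None],
                 stop_on_error: bool = False) -> PlaytestRun:
    run = PlaytestRun()
    for turn in range(max_turns):
        try:
            player_input = generate_input(turn)  # AI player agent (stubbed)
            run_turn(player_input)               # full turn pipeline (stubbed)
        except Exception as exc:
            run.errors.append(f"turn {turn}: {exc!r}")
            if stop_on_error:
                break
            continue
        run.turns_completed += 1
    return run


def fake_input(turn: int) -> str:
    return f"action {turn}"


def flaky_turn(player_input: str) -> None:
    # Simulate a failure on the third turn only.
    if player_input == "action 2":
        raise ValueError("empty narrator response")


report = run_playtest(5, fake_input, flaky_turn)
print(report.turns_completed, len(report.errors))  # prints: 4 1
```

With `stop_on_error=True` the same stubs would end the run at the failing turn, leaving two completed turns and one recorded error.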

Reading the Report

The report covers five areas:

  • Turns completed — did the playtest finish all requested turns or error out early?
  • Beats hit — which story beats were triggered during the run?
  • Chapters advanced — did the game progress beyond the starting chapter?
  • Errors — parse failures, empty responses, exceptions with tracebacks.
  • LLM Call Summary — per-agent stats: total calls, mean latency, parse success rate, token breakdown, retry counts.
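
To make the per-agent statistics concrete, the sketch below aggregates a list of logged calls into the kind of summary described above. The record keys (`agent`, `latency_s`, `parsed_ok`, `retries`) are assumptions made for illustration; the framework's actual log schema may differ.

```python
from collections import defaultdict
from statistics import mean


def summarize_calls(calls: list[dict]) -> dict[str, dict]:
    """Group hypothetical call records by agent and compute basic stats."""
    by_agent: dict[str, list[dict]] = defaultdict(list)
    for call in calls:
        by_agent[call["agent"]].append(call)

    summary = {}
    for agent, agent_calls in by_agent.items():
        summary[agent] = {
            "total_calls": len(agent_calls),
            "mean_latency_s": mean(c["latency_s"] for c in agent_calls),
            # Fraction of calls whose output parsed successfully.
            "parse_success_rate":
                sum(1 for c in agent_calls if c["parsed_ok"]) / len(agent_calls),
            "retries": sum(c["retries"] for c in agent_calls),
        }
    return summary


calls = [
    {"agent": "narrator", "latency_s": 1.0, "parsed_ok": True, "retries": 0},
    {"agent": "narrator", "latency_s": 3.0, "parsed_ok": False, "retries": 2},
    {"agent": "memory", "latency_s": 0.5, "parsed_ok": True, "retries": 0},
]
stats = summarize_calls(calls)
print(stats["narrator"])
```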

Quality Scoring

Each turn is scored on four dimensions:

| Dimension | Weight | Ideal |
| --- | --- | --- |
| Narration length | 0.3 | 150-300 words |
| YAML first attempt | 0.2 | Parsed without retry |
| Character personality | 0.3 | Response matches personality markers |
| Memory relevance | 0.2 | Updates reference actual turn events |

The composite score (0.0-1.0) appears in the report summary.
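
One plausible reading of the table is that the composite is a weighted sum of per-dimension scores, each in the range 0.0-1.0. The sketch below assumes exactly that; the dictionary keys and the clamping step are illustrative, and the framework may combine the dimensions differently.

```python
# Weights taken from the table above; the key names are hypothetical.
WEIGHTS = {
    "narration_length": 0.3,
    "yaml_first_attempt": 0.2,
    "character_personality": 0.3,
    "memory_relevance": 0.2,
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores, each clamped to [0.0, 1.0]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, scores[name]))
        total += weight * value
    return total


turn_scores = {
    "narration_length": 1.0,       # within the ideal word range
    "yaml_first_attempt": 0.0,     # needed a retry
    "character_personality": 0.5,  # partial match
    "memory_relevance": 1.0,
}
# 0.3*1.0 + 0.2*0.0 + 0.3*0.5 + 0.2*1.0 = 0.65
print(round(composite_score(turn_scores), 2))
```

Because the weights sum to 1.0, a turn that is ideal on every dimension scores 1.0 under this assumption.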

Edge Case Injection

The AI player agent injects stress-test inputs at configurable frequencies:

| Type | Default Rate | Example |
| --- | --- | --- |
| Direct strings | 5% | "ok", "yes", nonsense |
| Nonsensical input | 3% | Random characters, repeated text |
| Repeat action | 3% | Same action as previous turn |

Configure these rates via the PlaytestConfig fields.
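
As a rough illustration of rate-based injection, the sketch below maps a single random roll onto the three edge-case types using the default rates from the table. `EdgeCaseRates` and its field names are hypothetical and are not the real PlaytestConfig fields; the selection scheme (one roll per turn, at most one edge case) is also an assumption.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class EdgeCaseRates:
    """Hypothetical stand-in for the relevant PlaytestConfig fields."""
    direct_strings: float = 0.05
    nonsensical: float = 0.03
    repeat_action: float = 0.03


def pick_edge_case(rates: EdgeCaseRates, roll: float) -> Optional[str]:
    """Return the edge-case type selected by a roll in [0, 1), or None."""
    threshold = 0.0
    for kind, rate in (
        ("direct_strings", rates.direct_strings),
        ("nonsensical", rates.nonsensical),
        ("repeat_action", rates.repeat_action),
    ):
        threshold += rate
        if roll < threshold:
            return kind
    return None  # most turns: the agent plays normally


rates = EdgeCaseRates()
# In a real run the roll would come from a random source each turn.
print(pick_edge_case(rates, random.random()))
# Fixed rolls make the mapping visible.
print(pick_edge_case(rates, 0.04), pick_edge_case(rates, 0.5))
```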

Common Issues

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Empty narrator responses | System prompt too long | Trim world/chapter text |
| Malformed YAML | Model ignores format | Strengthen prompt example |
| Repetitive narration | Stale rolling summary | Check summary thresholds |
| Characters break voice | Vague personality | Add specific speech patterns |
| No beats hit | Beats too specific | Simplify beat phrases |
| Game stuck on chapter | Completion too strict | Loosen chapter YAML |

Golden Scenarios

Golden scenarios are scripted multi-turn behavioral tests with structural assertions. They sit between unit tests (too narrow) and full playtests (too broad): each one tests a single behavior in 3-5 turns.

Running Scenarios

uv run python scripts/run_golden.py                           # All scenarios
uv run python scripts/run_golden.py --scenario crash_opening   # Single scenario

Scenario Format

name: Crash Opening Sequence
description: First 3 turns produce coherent crash scene
game: lost-island
turns:
  - input: null          # null = opening narration
    expect:
      narrator_not_empty: true
  - input: "I look for survivors."
    expect:
      narrator_not_empty: true
      characters_responded_min: 1

Scenarios live in tests/golden_scenarios/.

Available Assertions

| Assertion | Type | Description |
| --- | --- | --- |
| narrator_not_empty | bool | Narration is non-empty |
| narrator_word_count_min | int | Minimum narration words |
| narrator_word_count_max | int | Maximum narration words |
| characters_responded_min | int | Min characters responded |
| characters_responded_max | int | Max characters responded |
| characters_responded_includes | list | Specific character IDs that must respond |
| beats_hit_any | bool | At least one beat hit |
| beats_hit_count_min | int | Minimum beats hit |

All assertions are structural (counts and presence), never checks on text content. This is what keeps results stable across model runs even though the generated wording varies.
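
A structural check of this kind can be sketched in a few lines. The function below evaluates an `expect` mapping against one turn's result and returns the names of the assertions that failed. The shape of `result` (keys `narration`, `characters_responded`, `beats_hit`) and the character ID `maya` are assumptions made for illustration, not the runner's actual data model.

```python
from typing import Any


def check_turn(expect: dict[str, Any], result: dict[str, Any]) -> list[str]:
    """Return the names of failed assertions (an empty list means pass)."""
    words = len(result["narration"].split())
    responded = result["characters_responded"]  # list of character IDs
    beats = result["beats_hit"]                 # list of beat IDs
    failures = []

    if expect.get("narrator_not_empty") and words == 0:
        failures.append("narrator_not_empty")
    if "narrator_word_count_min" in expect and words < expect["narrator_word_count_min"]:
        failures.append("narrator_word_count_min")
    if "narrator_word_count_max" in expect and words > expect["narrator_word_count_max"]:
        failures.append("narrator_word_count_max")
    if "characters_responded_min" in expect and len(responded) < expect["characters_responded_min"]:
        failures.append("characters_responded_min")
    if "characters_responded_max" in expect and len(responded) > expect["characters_responded_max"]:
        failures.append("characters_responded_max")
    if "characters_responded_includes" in expect:
        if set(expect["characters_responded_includes"]) - set(responded):
            failures.append("characters_responded_includes")
    if expect.get("beats_hit_any") and not beats:
        failures.append("beats_hit_any")
    if "beats_hit_count_min" in expect and len(beats) < expect["beats_hit_count_min"]:
        failures.append("beats_hit_count_min")
    return failures


result = {
    "narration": "Smoke rises from the wreck.",
    "characters_responded": ["maya"],
    "beats_hit": [],
}
print(check_turn({"narrator_not_empty": True, "characters_responded_min": 1}, result))
print(check_turn({"beats_hit_any": True}, result))
```

Nothing in the check inspects what the narration says, only how much of it there is and who responded.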

Existing Scenarios

| File | Tests |
| --- | --- |
| crash_opening | Opening narration and first interactions |
| maya_dialogue | Direct character interaction triggers a response |
| short_input | Minimal inputs are handled without crashing |
| adversarial_input | Meta-gaming and gibberish still produce valid narration |
| both_characters | Group interaction elicits multiple responses |

Writing New Scenarios

  • Keep to 3-5 turns per scenario.
  • Use structural assertions only (never assert on exact text).
  • Test one behavior per scenario.
  • Use null input for opening turns.
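
These guidelines are easy to check mechanically before committing a new scenario. The sketch below lints a scenario that has already been loaded into a dictionary (for example, from its YAML file). `lint_scenario` is a hypothetical helper, not part of the project, and it assumes that every scenario begins with the opening narration.

```python
# Assertion names from the "Available Assertions" table.
KNOWN_ASSERTIONS = {
    "narrator_not_empty",
    "narrator_word_count_min",
    "narrator_word_count_max",
    "characters_responded_min",
    "characters_responded_max",
    "characters_responded_includes",
    "beats_hit_any",
    "beats_hit_count_min",
}


def lint_scenario(scenario: dict) -> list[str]:
    """Return a list of guideline violations (an empty list means none found)."""
    problems = []
    turns = scenario.get("turns", [])
    if not 3 <= len(turns) <= 5:
        problems.append(f"expected 3-5 turns, found {len(turns)}")
    # Assumption: the first turn is the opening narration, so its input is null.
    if turns and turns[0].get("input") is not None:
        problems.append("first turn should use null input")
    for index, turn in enumerate(turns):
        for key in turn.get("expect", {}):
            if key not in KNOWN_ASSERTIONS:
                problems.append(f"turn {index}: unknown assertion '{key}'")
    return problems


draft = {
    "name": "Example draft",
    "game": "lost-island",
    "turns": [
        {"input": None, "expect": {"narrator_not_empty": True}},
        {"input": "I look around.", "expect": {"narration_contains": "beach"}},
    ],
}
for problem in lint_scenario(draft):
    print(problem)
```

The draft above trips two checks: it has only two turns, and `narration_contains` is a text-content assertion that is not in the supported list.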

See Also