Observability & Diagnostics

TheAct provides three layers of observability, from always-on to on-demand. All instrumentation is passive: it never changes prompts, never alters agent behavior, and adds no overhead unless you opt in.

Three-Layer Stack

graph TB
    subgraph "Layer 3: Context Profiler (on-demand)"
        P["profile_messages()\nToken allocation analysis"]
    end
    subgraph "Layer 2: Diagnostics Filesystem (opt-in)"
        D["diagnostics/turn-NNN/\nPer-agent prompts, responses, metadata"]
    end
    subgraph "Layer 1: Call Logging (always available)"
        L["LLMCallLog\nPer-call records: tokens, latency, parse result"]
    end
    L --> D
    L --> P

Each layer builds on the one below. Call logging captures raw data. The diagnostics filesystem writes that data (plus prompts and responses) to disk for inspection. The context profiler analyzes token allocation to find budget problems.

Call Logging

Every LLM call produces an LLMCallRecord with the following fields:

| Field | Description |
| --- | --- |
| timestamp | When the call started |
| agent | Which agent (narrator, character:maya, memory:maya, game_state, etc.) |
| turn | Turn number |
| prompt_tokens | Tokens in the prompt |
| thinking_tokens | Tokens used for model reasoning |
| content_tokens | Tokens in the response content |
| latency_ms | Wall-clock call time |
| finish_reason | Why the model stopped (stop, length) |
| parse_result | Success, or a failure type |
| parse_attempts | Parse attempts before success |
| retry_count | Full retries needed |
| temperature | Sampling temperature used |
| max_tokens | Max response token budget |
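The record can be pictured as a plain dataclass. This is a sketch built from the field list above; the actual definition in src/theact/llm/call_log.py may use different types or defaults:

```python
from dataclasses import dataclass

@dataclass
class LLMCallRecord:
    """One row per LLM call; field types here are assumptions."""
    timestamp: float        # when the call started (epoch seconds assumed)
    agent: str              # e.g. "narrator", "character:maya"
    turn: int
    prompt_tokens: int
    thinking_tokens: int
    content_tokens: int
    latency_ms: float
    finish_reason: str      # "stop" or "length"
    parse_result: str       # "success" or a failure type
    parse_attempts: int
    retry_count: int
    temperature: float
    max_tokens: int
```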

Usage

from theact.llm.call_log import LLMCallLog

call_log = LLMCallLog()
await run_turn(state, player_input, call_log=call_log)

call_log.summary()         # Aggregate stats
call_log.agent_summary()   # Stats by agent
call_log.dump_yaml(path)   # Write to YAML

The log is a flat list, not a nested tree. Post-turn agents run concurrently, so a flat structure avoids locking and ordering assumptions. Filtering by turn or agent is a one-liner.
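Because the log is flat, filtering is a plain list comprehension. A sketch with stand-in dict records (the real records are LLMCallRecord objects, so you would filter on attributes instead of keys):

```python
# Minimal stand-in records to illustrate filtering a flat call log.
records = [
    {"agent": "narrator", "turn": 1, "latency_ms": 800},
    {"agent": "character:maya", "turn": 1, "latency_ms": 450},
    {"agent": "narrator", "turn": 2, "latency_ms": 760},
]

# One-liners: no tree walking, no ordering assumptions.
narrator_calls = [r for r in records if r["agent"] == "narrator"]
turn_one = [r for r in records if r["turn"] == 1]
```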

Diagnostics Filesystem

Pass debug=True to run_turn() to write per-agent artifacts to disk:

diagnostics/turn-001/
  narrator/
    system_prompt.txt     # Raw system prompt
    user_message.txt      # User message
    raw_response.txt      # Full model output
    call_record.yaml      # Tokens, latency, parse result
  character:maya/
    ...
  memory:maya/
    ...
  game_state/
    ...
  summary.yaml            # Turn-level aggregates

Prompts and responses are plain text (readable with cat/less), metadata is YAML, and each agent gets its own directory so turns are easy to diff. These are files, not a database: the consumer is a human with Unix tools.
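The one-directory-per-agent convention means listing a turn's subdirectories lists its agents. A sketch that builds a miniature tree matching the layout above and walks it (the real tree may contain more files):

```python
from pathlib import Path
import tempfile

# Build a miniature diagnostics tree matching the layout above.
root = Path(tempfile.mkdtemp()) / "diagnostics" / "turn-001"
for agent in ("narrator", "character:maya", "game_state"):
    agent_dir = root / agent
    agent_dir.mkdir(parents=True)
    (agent_dir / "raw_response.txt").write_text("...")
(root / "summary.yaml").write_text("turn: 1\n")

# One directory per agent: the subdirectory names are the agent names.
agents = sorted(p.name for p in root.iterdir() if p.is_dir())
print(agents)   # ['character:maya', 'game_state', 'narrator']
```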

Error Taxonomy

Every YAMLParseError carries a failure_type field drawn from a canonical set of six types. Each type maps to a different fix strategy; lumping them together would hide whether the model is ignoring instructions or trying to comply and failing.

| Type | What Happened | Fix Direction |
| --- | --- | --- |
| empty_response | Model produced nothing | Check max_tokens; check for context overflow |
| no_yaml_block | Model wrote prose, no YAML | Strengthen the YAML format instruction in the prompt |
| invalid_yaml | YAML syntax is broken | repair_yaml_text fallback handles most cases; simplify the expected structure if persistent |
| wrong_schema | Valid YAML, wrong fields | Update the example in the prompt to match the expected schema |
| json_instead | Model output JSON, not YAML | Add an "Output YAML, not JSON" rule to the prompt |
| echo_prompt | Model echoed the prompt | Reduce prompt size; check for context overflow |

These types appear in call logs, diagnostics files, and playtest reports.
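Because the six types are disjoint, a failure histogram points directly at a fix. A sketch over hypothetical parse_result values (assumed here to be the type strings above):

```python
from collections import Counter

# Hypothetical parse results pulled from a call log.
parse_results = [
    "success", "success", "no_yaml_block",
    "success", "invalid_yaml", "no_yaml_block",
]

# Tally only the failures; the dominant type names the fix direction.
failures = Counter(r for r in parse_results if r != "success")
print(failures.most_common())   # [('no_yaml_block', 2), ('invalid_yaml', 1)]
```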

Context Profiler

Analyze token allocation for any agent's messages:

from theact.llm.profiler import profile_messages, format_profile

profile = profile_messages("narrator", messages, max_tokens_budget=2000)
print(format_profile(profile))

Prints a per-component breakdown of token usage plus the remaining headroom against the budget. Use it to find which part of a prompt is consuming the budget.

Prompt Linting

Automated tests in tests/test_prompt_lint.py enforce:

  • System prompts are 300 tokens or fewer (template form)
  • Rendered narrator prompts are 400 tokens or fewer (with real game data)
  • No orphan {placeholder} strings survive rendering
  • All agents have headroom >= 0 with real game data

These run as part of the standard test suite and catch budget regressions before they reach the model.
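The checks reduce to simple assertions. A sketch of the two cheapest ones, using a rough four-characters-per-token estimate and a regex for orphan placeholders (the real tests in tests/test_prompt_lint.py presumably use actual tokenizer counts):

```python
import re

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. A real lint uses a tokenizer.
    return len(text) // 4

def find_orphan_placeholders(rendered: str) -> list[str]:
    # Any surviving {placeholder} means rendering missed a template slot.
    return re.findall(r"\{[a-z_]+\}", rendered)

rendered = "You are the narrator. Scene: the harbor at dusk."
assert estimate_tokens(rendered) <= 400      # rendered narrator budget
assert find_orphan_placeholders(rendered) == []
```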

Key Files

| File | Contents |
| --- | --- |
| src/theact/llm/call_log.py | LLMCallRecord, LLMCallLog |
| src/theact/engine/diagnostics.py | Diagnostics filesystem writer |
| src/theact/llm/profiler.py | profile_messages(), format_profile() |
| src/theact/llm/parsing.py | YAMLParseError, ParseFailureType |
| tests/test_prompt_lint.py | Prompt budget enforcement |

See Also

  • Debugging — interactive debugging built on this instrumentation
  • Prompt Engineering — using diagnostics to fix prompts
  • Agents — the agent calls being observed