Phase 09: Observability & Diagnostics¶
1. Overview¶
Phases 01-05 built the turn engine, agents, and playtest framework. Prompt iteration is now the main activity, but it is currently guesswork: there is no structured data about what the LLM does, how many tokens it uses, or why structured output parsing fails. This phase adds passive instrumentation so that every future prompt change is data-driven.
After this phase is complete:

- Every LLM call logs structured metadata (agent, turn, tokens, latency, parse result, finish reason)
- Debug mode writes full prompt/response artifacts to a diagnostics filesystem
- Automated prompt lint tests catch budget violations and formatting errors
- Structured output failures are categorized into an explicit error taxonomy
- A context window profiler reports per-agent token allocation
- The playtest framework aggregates call-level data into actionable reports
What This Phase Does NOT Do¶
- No prompt changes. Prompts are untouched — this phase builds the tools to make prompt changes safe.
- No new agents or game mechanics.
- No UI changes to the CLI or web interface.
- No architecture changes. The turn flow remains: narrator -> characters -> post-turn (parallel).
Dependencies¶
- Phase 01 (data models, YAML I/O, save manager)
- Phase 02 (LLM client, streaming, structured output)
- Phase 03 (turn engine, agents, context assembly)
- Phase 05 (playtest framework)
2. Error Taxonomy¶
Before building logging, define the vocabulary for structured output failures. This taxonomy is used throughout call logging, diagnostics, and playtest reporting.
2.1 Failure Categories¶
Extend the existing src/theact/llm/errors.py with an enum:
# src/theact/llm/errors.py (add to existing file)
from enum import Enum
class ParseFailureType(str, Enum):
"""Categorization of structured output parse failures."""
success = "success"
no_yaml_block = "no_yaml_block" # Model didn't produce fenced YAML
invalid_yaml = "invalid_yaml" # Fenced block present but YAML syntax error
wrong_schema = "wrong_schema" # Valid YAML but missing required fields or wrong types
empty_response = "empty_response" # Model returned empty or whitespace-only content
echo_prompt = "echo_prompt" # Model echoed back part of the prompt
json_instead = "json_instead" # Model produced JSON instead of YAML
2.2 Classification Function¶
Add a classifier to src/theact/llm/parsing.py:
def classify_parse_failure(raw_content: str, error: Exception | None = None) -> ParseFailureType:
"""Classify a structured output failure into a category.
Called when YAML parsing fails to produce a descriptive category
for logging and diagnostics.
"""
if not raw_content or not raw_content.strip():
return ParseFailureType.empty_response
stripped = raw_content.strip()
# Check for JSON output
if stripped.startswith("{") or stripped.startswith("["):
return ParseFailureType.json_instead
# Check for echoed prompt (heuristic: starts with "You are" or "SETTING:")
prompt_indicators = ["You are", "SETTING:", "YOUR TASK:", "Output a YAML"]
if any(stripped.startswith(ind) for ind in prompt_indicators):
return ParseFailureType.echo_prompt
    # Check if there's a YAML fence at all
    if "```" not in raw_content:
        return ParseFailureType.no_yaml_block
    # Has a fence but parsing failed — invalid YAML syntax
    return ParseFailureType.invalid_yaml
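As a sanity check, the branch order can be exercised standalone. The sketch below is a condensed copy of the heuristics so it runs without the package (it omits the `success` member and the unfenced-YAML probe); `classify` here mirrors `classify_parse_failure` but is not the shipped code:

```python
from enum import Enum

class ParseFailureType(str, Enum):
    # Condensed copy of the taxonomy, for illustration only.
    empty_response = "empty_response"
    json_instead = "json_instead"
    echo_prompt = "echo_prompt"
    no_yaml_block = "no_yaml_block"
    invalid_yaml = "invalid_yaml"

PROMPT_INDICATORS = ("You are", "SETTING:", "YOUR TASK:", "Output a YAML")

def classify(raw: str) -> ParseFailureType:
    """Mirror of classify_parse_failure's branch order (sketch)."""
    if not raw or not raw.strip():
        return ParseFailureType.empty_response
    s = raw.strip()
    if s.startswith(("{", "[")):
        return ParseFailureType.json_instead
    if s.startswith(PROMPT_INDICATORS):
        return ParseFailureType.echo_prompt
    if "```" not in raw:
        return ParseFailureType.no_yaml_block
    return ParseFailureType.invalid_yaml

fenced_but_broken = "```" + "yaml\nmood: [\n" + "```"
assert classify("") is ParseFailureType.empty_response
assert classify('{"mood": "tense"}') is ParseFailureType.json_instead
assert classify("You are the narrator.") is ParseFailureType.echo_prompt
assert classify("mood: tense") is ParseFailureType.no_yaml_block
assert classify(fenced_but_broken) is ParseFailureType.invalid_yaml
```

Note that branch order matters: the echo check must run before the fence check, since an echoed prompt may itself contain a YAML example.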
2.3 Wire into Existing Parse Pipeline¶
Modify parse_yaml_response() and complete_structured() so that when a YAMLParseError is raised, the error includes the failure category:
# In YAMLParseError (src/theact/llm/parsing.py):
class YAMLParseError(Exception):
def __init__(self, message: str, raw_content: str, failure_type: ParseFailureType | None = None):
super().__init__(message)
self.raw_content = raw_content
self.failure_type = failure_type or classify_parse_failure(raw_content)
Verification¶
- classify_parse_failure() returns the correct category for: an empty string, a JSON string, a string with a "You are" prefix, a string with a yaml fence but broken syntax, and a string with valid YAML but no fence.
- Existing YAMLParseError callers continue to work (the new parameter is optional with a default).
3. Structured LLM Call Logging¶
3.1 Call Record Dataclass¶
Create src/theact/llm/call_log.py:
"""Structured logging for LLM API calls."""
from __future__ import annotations
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional
import yaml
from theact.llm.errors import ParseFailureType
@dataclass
class LLMCallRecord:
"""One LLM API call with all metadata."""
timestamp: str # ISO 8601
agent: str # e.g. "narrator", "character:maya", "memory:joaquin",
# "game_state", "summarizer", "player"
turn: int # turn number in the session
prompt_tokens: int # estimated tokens in the prompt
thinking_tokens: int # tokens consumed by <think> reasoning
content_tokens: int # tokens in the actual response content
latency_ms: int # wall-clock time for the call
finish_reason: str # "stop", "length", etc.
parse_result: str # ParseFailureType value ("success", "no_yaml_block", etc.)
parse_attempts: int # total attempts (1 = first try worked)
retry_count: int # number of retries (0 = no retries)
temperature: float # temperature used for this call
max_tokens: int # max_tokens budget for this call
@dataclass
class LLMCallLog:
"""Accumulates LLM call records for a session.
Safe for use from asyncio tasks (single-threaded event loop, so no locking is needed).
"""
records: list[LLMCallRecord] = field(default_factory=list)
def log(self, record: LLMCallRecord) -> None:
"""Append a call record."""
self.records.append(record)
def records_for_turn(self, turn: int) -> list[LLMCallRecord]:
"""Return all records for a given turn."""
return [r for r in self.records if r.turn == turn]
def records_for_agent(self, agent: str) -> list[LLMCallRecord]:
"""Return all records for a given agent prefix (e.g. 'character' matches 'character:maya')."""
return [r for r in self.records if r.agent == agent or r.agent.startswith(f"{agent}:")]
def summary(self) -> dict:
"""Aggregate stats across all records."""
if not self.records:
return {}
total = len(self.records)
return {
"total_calls": total,
"mean_latency_ms": sum(r.latency_ms for r in self.records) // total,
"parse_success_rate": round(
sum(1 for r in self.records if r.parse_result == "success") / total, 3
),
"total_prompt_tokens": sum(r.prompt_tokens for r in self.records),
"total_thinking_tokens": sum(r.thinking_tokens for r in self.records),
"total_content_tokens": sum(r.content_tokens for r in self.records),
"length_finishes": sum(1 for r in self.records if r.finish_reason == "length"),
"total_retries": sum(r.retry_count for r in self.records),
}
def agent_summary(self) -> dict[str, dict]:
"""Per-agent aggregate stats, keyed by agent name."""
agents: dict[str, list[LLMCallRecord]] = {}
for r in self.records:
agents.setdefault(r.agent, []).append(r)
result = {}
for agent, recs in sorted(agents.items()):
n = len(recs)
result[agent] = {
"calls": n,
"mean_latency_ms": sum(r.latency_ms for r in recs) // n,
"mean_thinking_tokens": sum(r.thinking_tokens for r in recs) // n,
"mean_content_tokens": sum(r.content_tokens for r in recs) // n,
"parse_success_rate": round(
sum(1 for r in recs if r.parse_result == "success") / n, 3
),
"length_finishes": sum(1 for r in recs if r.finish_reason == "length"),
}
return result
def dump_yaml(self, path: Path) -> None:
"""Write all records to a YAML file."""
data = [asdict(r) for r in self.records]
path.parent.mkdir(parents=True, exist_ok=True)
with open(path, "w") as f:
yaml.dump(data, f, default_flow_style=False, sort_keys=False)
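The aggregation and prefix-matching logic can be illustrated with a few hand-built records. `Rec` below is a trimmed stand-in for LLMCallRecord (same field names, fewer fields), not the real class:

```python
from dataclasses import dataclass

@dataclass
class Rec:
    # Trimmed stand-in for LLMCallRecord, for illustration only.
    agent: str
    turn: int
    latency_ms: int
    parse_result: str

records = [
    Rec("narrator", 1, 2400, "success"),
    Rec("character:maya", 1, 1800, "success"),
    Rec("memory:maya", 1, 1500, "invalid_yaml"),
]

def records_for_agent(recs: list[Rec], agent: str) -> list[Rec]:
    # Prefix match: "character" matches "character:maya".
    return [r for r in recs if r.agent == agent or r.agent.startswith(f"{agent}:")]

total = len(records)
summary = {
    "total_calls": total,
    "mean_latency_ms": sum(r.latency_ms for r in records) // total,
    "parse_success_rate": round(
        sum(1 for r in records if r.parse_result == "success") / total, 3
    ),
}
print(summary)  # {'total_calls': 3, 'mean_latency_ms': 1900, 'parse_success_rate': 0.667}
assert len(records_for_agent(records, "character")) == 1
```

The integer mean (`//`) intentionally drops fractional milliseconds; the success rate is rounded to three places to keep the YAML dump readable.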
3.2 Recording Calls in the Inference Layer¶
The complete() and stream() functions in src/theact/llm/inference.py do not know the agent name or turn number — those are higher-level concepts. Rather than threading a logger through the low-level inference functions, each agent function is responsible for creating a call record after its LLM call completes.
Add an optional call_log: LLMCallLog | None parameter to each agent function:
# Example for run_narrator in src/theact/agents/narrator.py:
async def run_narrator(
game: LoadedGame,
player_input: str,
llm_config: LLMConfig,
on_token: StreamCallback | None = None,
call_log: LLMCallLog | None = None, # NEW
turn: int = 0, # NEW
) -> NarratorOutput:
After the LLM call completes, create and log a record:
import time
from datetime import datetime, timezone
from theact.llm.tokens import estimate_tokens
start = time.monotonic()
# ... existing LLM call ...
elapsed_ms = int((time.monotonic() - start) * 1000)
if call_log is not None:
call_log.log(LLMCallRecord(
timestamp=datetime.now(timezone.utc).isoformat(),
agent="narrator",
turn=turn,
prompt_tokens=estimate_tokens(messages[0]["content"]) + estimate_tokens(messages[1]["content"]),
thinking_tokens=estimate_tokens(result.thinking) if hasattr(result, "thinking") else 0,
content_tokens=estimate_tokens(result.raw_content if hasattr(result, "raw_content") else result.content),
latency_ms=elapsed_ms,
finish_reason=result.finish_reason,
parse_result="success", # or classify_parse_failure() on exception
parse_attempts=result.attempts if hasattr(result, "attempts") else 1,
retry_count=max(0, (result.attempts if hasattr(result, "attempts") else 1) - 1),
temperature=NARRATOR_CONFIG.temperature or llm_config.default_temperature,
max_tokens=NARRATOR_CONFIG.max_tokens or llm_config.default_max_tokens,
))
On YAMLParseError, the catch block records a failure:
except YAMLParseError as e:
if call_log is not None:
call_log.log(LLMCallRecord(
# ... same fields ...
parse_result=e.failure_type.value if e.failure_type else "invalid_yaml",
# ...
))
Apply the same pattern to:

- run_character() — agent name character:{char_id}
- run_memory_update() — agent name memory:{char_id}
- run_game_state() — agent name game_state
- run_chapter_summary() and run_rolling_summary() — agent name summarizer
3.3 Wiring Through the Turn Engine¶
Add an optional call_log parameter to run_turn() in src/theact/engine/turn.py:
async def run_turn(
game: LoadedGame,
player_input: str,
llm_config: LLMConfig,
on_token: StreamCallback | None = None,
call_log: LLMCallLog | None = None, # NEW
) -> TurnResult:
Pass call_log and new_turn to each agent call. The call_log is optional — if None, no logging occurs, preserving backward compatibility.
3.4 Token Estimation for Records¶
Use the existing estimate_tokens() from src/theact/llm/tokens.py for prompt and content token counts. For thinking tokens, use the same function on the thinking text. When the API returns actual prompt_tokens or completion_tokens in the response, prefer those over estimates.
Add a helper to src/theact/llm/tokens.py:
def estimate_messages_content_tokens(messages: list[dict[str, str]]) -> int:
"""Estimate total content tokens across all messages (excluding overhead)."""
return sum(estimate_tokens(m.get("content", "")) for m in messages)
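For example, with the project's estimator stubbed as a chars/4 heuristic (an assumption for this sketch; the real estimate_tokens lives in src/theact/llm/tokens.py), the helper sums per-message content:

```python
def estimate_tokens(text: str) -> int:
    # Assumed stand-in: roughly 4 characters per token.
    return len(text) // 4

def estimate_messages_content_tokens(messages: list[dict[str, str]]) -> int:
    """Estimate total content tokens across all messages (excluding overhead)."""
    return sum(estimate_tokens(m.get("content", "")) for m in messages)

msgs = [
    {"role": "system", "content": "x" * 400},  # ~100 tokens under chars/4
    {"role": "user", "content": "y" * 80},     # ~20 tokens
]
print(estimate_messages_content_tokens(msgs))  # 120
```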
Verification¶
- LLMCallLog.log() appends records correctly.
- LLMCallLog.summary() computes correct aggregates from test records.
- LLMCallLog.agent_summary() groups by agent name.
- LLMCallLog.dump_yaml() writes valid YAML that can be read back.
- Passing call_log=None to all agent functions has no effect (backward compatible).
- Passing a real LLMCallLog produces one record per agent call during a turn.
4. Diagnostics Filesystem¶
4.1 Design¶
When run_turn() is called with debug=True, write detailed artifacts to disk for every agent call within that turn. The structure:
saves/<save-id>/diagnostics/
turn-001/
summary.yaml # aggregated turn stats
narrator/
system_prompt.txt # fully rendered system prompt
user_message.txt # fully rendered user message
raw_response.txt # complete model output (thinking + content)
parsed.yaml # successfully parsed structured output (if any)
call_log.yaml # structured call metadata
character-maya/
system_prompt.txt
user_message.txt
raw_response.txt
call_log.yaml
character-joaquin/
system_prompt.txt
user_message.txt
raw_response.txt
call_log.yaml
memory-maya/
system_prompt.txt
user_message.txt
raw_response.txt
parsed.yaml
call_log.yaml
game_state/
system_prompt.txt
user_message.txt
raw_response.txt
parsed.yaml
call_log.yaml
turn-002/
...
4.2 Diagnostics Writer¶
Create src/theact/engine/diagnostics.py:
"""Diagnostics filesystem writer for debug mode."""
from __future__ import annotations
from dataclasses import asdict
from pathlib import Path
import yaml
from theact.llm.call_log import LLMCallRecord
class DiagnosticsWriter:
"""Writes per-agent diagnostics artifacts to disk.
Created per-turn. Call write_agent() for each agent that runs,
then write_summary() at the end of the turn.
"""
def __init__(self, save_path: Path, turn: int):
self.turn_dir = save_path / "diagnostics" / f"turn-{turn:03d}"
self.turn_dir.mkdir(parents=True, exist_ok=True)
def write_agent(
self,
agent_dir_name: str,
messages: list[dict[str, str]],
raw_response: str,
thinking: str,
parsed_data: dict | None,
call_record: LLMCallRecord | None,
) -> None:
"""Write all artifacts for a single agent call."""
agent_dir = self.turn_dir / agent_dir_name
agent_dir.mkdir(parents=True, exist_ok=True)
# System prompt
system_msgs = [m for m in messages if m.get("role") == "system"]
if system_msgs:
(agent_dir / "system_prompt.txt").write_text(
system_msgs[0].get("content", ""), encoding="utf-8"
)
# User message(s)
user_msgs = [m for m in messages if m.get("role") == "user"]
if user_msgs:
user_text = "\n\n---\n\n".join(m.get("content", "") for m in user_msgs)
(agent_dir / "user_message.txt").write_text(user_text, encoding="utf-8")
# Raw response (thinking + content)
raw_parts = []
if thinking:
raw_parts.append(f"<think>\n{thinking}\n</think>\n")
raw_parts.append(raw_response)
(agent_dir / "raw_response.txt").write_text(
"\n".join(raw_parts), encoding="utf-8"
)
# Parsed YAML output
if parsed_data is not None:
with open(agent_dir / "parsed.yaml", "w") as f:
yaml.dump(parsed_data, f, default_flow_style=False, sort_keys=False)
# Call metadata
if call_record is not None:
with open(agent_dir / "call_log.yaml", "w") as f:
yaml.dump(asdict(call_record), f, default_flow_style=False, sort_keys=False)
def write_summary(self, call_records: list[LLMCallRecord]) -> None:
"""Write aggregated turn-level summary."""
if not call_records:
return
total_latency = sum(r.latency_ms for r in call_records)
total_prompt = sum(r.prompt_tokens for r in call_records)
total_thinking = sum(r.thinking_tokens for r in call_records)
total_content = sum(r.content_tokens for r in call_records)
parse_failures = [r for r in call_records if r.parse_result != "success"]
summary = {
"turn": call_records[0].turn,
"agent_count": len(call_records),
"total_latency_ms": total_latency,
"total_prompt_tokens": total_prompt,
"total_thinking_tokens": total_thinking,
"total_content_tokens": total_content,
"parse_failures": [
{"agent": r.agent, "type": r.parse_result} for r in parse_failures
],
"agents": [
{
"agent": r.agent,
"latency_ms": r.latency_ms,
"prompt_tokens": r.prompt_tokens,
"thinking_tokens": r.thinking_tokens,
"content_tokens": r.content_tokens,
"parse_result": r.parse_result,
"finish_reason": r.finish_reason,
}
for r in call_records
],
}
with open(self.turn_dir / "summary.yaml", "w") as f:
yaml.dump(summary, f, default_flow_style=False, sort_keys=False)
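A quick way to see the resulting layout is to drive the same path scheme by hand (in a temp directory; turn numbers are zero-padded to three digits so directories sort correctly):

```python
import tempfile
from pathlib import Path

save_path = Path(tempfile.mkdtemp())
turn = 7
turn_dir = save_path / "diagnostics" / f"turn-{turn:03d}"
for agent_dir_name in ("narrator", "character-maya"):
    agent_dir = turn_dir / agent_dir_name
    agent_dir.mkdir(parents=True, exist_ok=True)
    (agent_dir / "raw_response.txt").write_text("(model output)", encoding="utf-8")

print(turn_dir.name)  # turn-007
print(sorted(p.name for p in turn_dir.iterdir()))  # ['character-maya', 'narrator']
```

Colons in agent names ("character:maya") are replaced with hyphens in directory names, matching the layout in Section 4.1.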
4.3 Integration with Turn Engine¶
Add a debug: bool = False parameter to run_turn(). When True:
- Create a DiagnosticsWriter(game.save_path, new_turn)
- After each agent call, call writer.write_agent() with the messages, response, and record
- At the end of the turn, call writer.write_summary() with all records from that turn
The diagnostics writer needs each agent's rendered messages alongside its result. The simplest approach: have each agent function accept an optional _return_messages: bool = False internal flag and return its messages, or capture the messages in the turn engine before calling the agent. The cleaner approach is to capture messages in the turn engine by calling the context builder directly:
# In run_turn(), before calling run_narrator():
if debug:
diag = DiagnosticsWriter(game.save_path, new_turn)
narrator_messages = build_narrator_messages(game, player_input, llm_config)
Then after the agent completes:
if debug:
diag.write_agent(
agent_dir_name="narrator",
messages=narrator_messages,
raw_response=narrator_output.narration,
thinking="", # thinking from stream is not retained — see Section 4.4
parsed_data=result.data if not isinstance(result, Exception) else None,
call_record=turn_records[-1] if turn_records else None,
)
4.4 Capturing Thinking Tokens from Streaming¶
The narrator and character agents use streaming, which currently discards thinking tokens after display. To capture them for diagnostics, accumulate thinking chunks alongside content chunks in the agent functions:
# In run_narrator():
thinking_parts: list[str] = []
async for chunk in stream_iter:
if on_token and chunk.content:
await on_token(chunk.content)
if chunk.thinking:
thinking_parts.append(chunk.thinking)
full_thinking = "".join(thinking_parts)
Store full_thinking on the result or pass it directly to the diagnostics writer.
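The accumulation pattern can be run end to end against a fake stream. `Chunk` and `fake_stream` below are hypothetical stand-ins for the streaming client's chunk type and iterator, used only to show the consume loop:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Chunk:
    # Hypothetical stand-in for the streaming client's chunk type.
    content: str = ""
    thinking: str = ""

async def fake_stream():
    # Illustrative sequence: two thinking chunks, then visible content.
    for c in (Chunk(thinking="plan "), Chunk(thinking="the scene"), Chunk(content="The storm hits.")):
        yield c

async def consume() -> tuple[str, str]:
    thinking_parts: list[str] = []
    content_parts: list[str] = []
    async for chunk in fake_stream():
        if chunk.content:
            content_parts.append(chunk.content)
        if chunk.thinking:
            thinking_parts.append(chunk.thinking)
    return "".join(thinking_parts), "".join(content_parts)

full_thinking, full_content = asyncio.run(consume())
print(full_thinking)  # plan the scene
print(full_content)   # The storm hits.
```

Accumulating into a list and joining once avoids quadratic string concatenation over long thinking streams.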
4.5 CLI and Playtest Integration¶
Add a --debug flag to:

- scripts/playtest.py — enables the diagnostics filesystem for every turn
- The CLI (python -m theact) — enables diagnostics for the current session
When --debug is active, print a message at the end of each turn noting where that turn's diagnostics were written.
Verification¶
- DiagnosticsWriter.write_agent() creates the expected directory structure and files.
- DiagnosticsWriter.write_summary() produces valid YAML with correct aggregates.
- Running a turn with debug=True creates saves/<save>/diagnostics/turn-001/ with subdirectories for each agent.
- Each system_prompt.txt contains the fully rendered prompt (no {placeholders}).
- Each raw_response.txt contains the model's actual output.
- Running with debug=False (default) creates no diagnostics files.
5. Context Window Profiler¶
5.1 Design¶
A utility that computes and reports per-agent token allocation for a turn. This answers the question: "How close is each agent to its context window limit?"
Create src/theact/llm/profiler.py:
"""Context window profiler for per-agent token analysis."""
from __future__ import annotations
from dataclasses import dataclass
from theact.llm.tokens import estimate_tokens
@dataclass
class AgentProfile:
"""Token allocation for a single agent call."""
agent: str
system_prompt_tokens: int
user_message_tokens: int
total_prompt_tokens: int
max_tokens_budget: int # the agent's max_tokens setting
context_limit: int # the model's context window (8192)
headroom: int # context_limit - total_prompt - max_tokens_budget
# Breakdown of user message components (if available)
summary_tokens: int = 0 # rolling summary
conversation_tokens: int = 0 # recent conversation history
chapter_context_tokens: int = 0 # chapter beats, completion criteria
current_input_tokens: int = 0 # player input for this turn
# Actual response data (filled after the call completes)
actual_thinking_tokens: int = 0
actual_content_tokens: int = 0
def profile_messages(
agent: str,
messages: list[dict[str, str]],
max_tokens_budget: int,
context_limit: int = 8192,
) -> AgentProfile:
"""Profile a set of messages for token allocation."""
system_tokens = 0
user_tokens = 0
for msg in messages:
content = msg.get("content", "")
if msg.get("role") == "system":
system_tokens += estimate_tokens(content)
elif msg.get("role") == "user":
user_tokens += estimate_tokens(content)
total = system_tokens + user_tokens
headroom = context_limit - total - max_tokens_budget
return AgentProfile(
agent=agent,
system_prompt_tokens=system_tokens,
user_message_tokens=user_tokens,
total_prompt_tokens=total,
max_tokens_budget=max_tokens_budget,
context_limit=context_limit,
headroom=headroom,
)
def format_profile(profile: AgentProfile) -> str:
"""Format a profile as a human-readable string."""
bar_width = 40
pct_prompt = profile.total_prompt_tokens / profile.context_limit
pct_budget = profile.max_tokens_budget / profile.context_limit
pct_headroom = max(0, profile.headroom) / profile.context_limit
prompt_bar = int(pct_prompt * bar_width)
budget_bar = int(pct_budget * bar_width)
headroom_bar = bar_width - prompt_bar - budget_bar
bar = "#" * prompt_bar + "=" * budget_bar + "." * max(0, headroom_bar)
lines = [
f"Agent: {profile.agent}",
f" System prompt: {profile.system_prompt_tokens:>5} tokens",
f" User message: {profile.user_message_tokens:>5} tokens",
f" Total prompt: {profile.total_prompt_tokens:>5} tokens ({pct_prompt:.1%})",
f" Max tokens: {profile.max_tokens_budget:>5} tokens ({pct_budget:.1%})",
f" Headroom: {profile.headroom:>5} tokens ({pct_headroom:.1%})",
f" [{bar}] {profile.context_limit}",
]
if profile.headroom < 0:
lines.append(f" WARNING: Over budget by {-profile.headroom} tokens!")
if profile.actual_thinking_tokens or profile.actual_content_tokens:
lines.append(f" Actual thinking: {profile.actual_thinking_tokens:>5} tokens")
lines.append(f" Actual content: {profile.actual_content_tokens:>5} tokens")
ratio = profile.actual_thinking_tokens / max(profile.actual_content_tokens, 1)
lines.append(f" Thinking:content: {ratio:.1f}:1")
return "\n".join(lines)
def format_turn_profile(profiles: list[AgentProfile]) -> str:
"""Format all agent profiles for a complete turn."""
sections = ["=== CONTEXT WINDOW PROFILE ===", ""]
for p in profiles:
sections.append(format_profile(p))
sections.append("")
return "\n".join(sections)
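The headroom arithmetic deserves a worked example. With the chars/4 estimator assumption again standing in for estimate_tokens:

```python
def estimate_tokens(text: str) -> int:
    # Assumed chars/4 stand-in for theact's estimator.
    return len(text) // 4

def headroom(messages: list[dict[str, str]], max_tokens_budget: int, context_limit: int = 8192) -> int:
    # headroom = context_limit - total_prompt - max_tokens_budget
    total = sum(estimate_tokens(m.get("content", "")) for m in messages)
    return context_limit - total - max_tokens_budget

msgs = [
    {"role": "system", "content": "s" * 1200},  # ~300 tokens
    {"role": "user", "content": "u" * 8000},    # ~2000 tokens
]
print(headroom(msgs, max_tokens_budget=1024))  # 8192 - 2300 - 1024 = 4868
# An oversized budget drives headroom negative — the lint signal:
print(headroom(msgs, max_tokens_budget=7000))  # -1108
```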
5.2 Integration Points¶
The profiler can be used from:
- Diagnostics filesystem — when debug=True, include a context_profile.yaml in each turn's diagnostics directory with all agent profiles.
- Scripts — a --profile-context flag on scripts/diagnose_agent.py that outputs the profile before making the LLM call.
- Playtest reports — aggregate context profiles across turns to show how prompt sizes grow over a session (see Section 7).
5.3 User Message Breakdown¶
For the narrator agent, break down the user message into its components (summary, conversation, chapter context, player input). This requires collaboration with the context assembly code.
Add an optional profile: bool = False parameter to build_narrator_messages() that, when True, returns the message list alongside a breakdown dict:
def build_narrator_messages(
game: LoadedGame,
player_input: str,
llm_config: LLMConfig,
profile: bool = False,
) -> list[Message] | tuple[list[Message], dict[str, int]]:
"""Build narrator messages. If profile=True, also return token breakdown."""
# ... existing logic ...
if profile:
breakdown = {
"summary_tokens": estimate_tokens(game.state.rolling_summary),
"conversation_tokens": estimate_tokens(recent_text),
"chapter_context_tokens": estimate_tokens(chapter_context),
"current_input_tokens": estimate_tokens(player_input),
}
return messages, breakdown
return messages
Verification¶
- profile_messages() returns correct token counts for test messages.
- format_profile() produces readable output with a visual bar.
- Headroom correctly goes negative when prompt + max_tokens > context_limit.
- Profile integration with the diagnostics filesystem writes context_profile.yaml.
6. Prompt Linting¶
6.1 Design¶
Automated tests that catch prompt-level problems without making LLM calls. These run as part of the standard pytest suite.
Create tests/test_prompt_lint.py:
"""Prompt linting tests — catch prompt problems without LLM calls."""
import re
import pytest
from theact.agents.prompts import (
CHARACTER_SYSTEM,
CHAPTER_SUMMARY_SYSTEM,
GAME_STATE_SYSTEM,
MEMORY_UPDATE_SYSTEM,
NARRATOR_SYSTEM,
ROLLING_SUMMARY_SYSTEM,
)
from theact.llm.tokens import estimate_tokens
# --- Token budget tests ---
PROMPT_TOKEN_BUDGETS = {
"NARRATOR_SYSTEM": (NARRATOR_SYSTEM, 300),
"CHARACTER_SYSTEM": (CHARACTER_SYSTEM, 300),
"MEMORY_UPDATE_SYSTEM": (MEMORY_UPDATE_SYSTEM, 300),
"GAME_STATE_SYSTEM": (GAME_STATE_SYSTEM, 300),
"CHAPTER_SUMMARY_SYSTEM": (CHAPTER_SUMMARY_SYSTEM, 300),
"ROLLING_SUMMARY_SYSTEM": (ROLLING_SUMMARY_SYSTEM, 300),
}
@pytest.mark.parametrize("name,spec", PROMPT_TOKEN_BUDGETS.items())
def test_system_prompt_token_budget(name, spec):
"""System prompts must stay under their token budget.
Per CLAUDE.md: system prompts should stay under ~300 tokens.
This tests the TEMPLATE before variable substitution. Rendered
prompts will be slightly larger due to injected game content.
"""
template, budget = spec
# Strip format placeholders for estimation (they'll be replaced with real content)
stripped = re.sub(r"\{[^}]+\}", "PLACEHOLDER", template)
tokens = estimate_tokens(stripped)
assert tokens <= budget, (
f"{name} template is ~{tokens} tokens (budget: {budget}). "
f"Trim the prompt to fit."
)
# --- Orphan placeholder tests ---
def _render_narrator_prompt():
"""Render the narrator prompt with dummy values to check for orphan placeholders."""
return NARRATOR_SYSTEM.format(
world_setting="A tropical island.",
world_tone="Tense survival.",
world_rules="No magic.",
chapter_context="Chapter 1: The Crash",
active_characters="maya (Maya Chen), joaquin (Father Joaquin)",
)
def _render_character_prompt():
"""Render the character prompt with dummy values."""
return CHARACTER_SYSTEM.format(
name="Maya",
role="Engineer",
personality="Practical, direct.",
secret="She caused the crash.",
relationships="Player: cautious ally",
memory_block="No memories yet.",
)
def _render_memory_prompt():
return MEMORY_UPDATE_SYSTEM.format(name="Maya")
RENDERED_PROMPTS = {
"narrator": _render_narrator_prompt,
"character": _render_character_prompt,
"memory": _render_memory_prompt,
}
@pytest.mark.parametrize("name,renderer", RENDERED_PROMPTS.items())
def test_no_orphan_placeholders(name, renderer):
"""After rendering, no {placeholder} strings should remain."""
rendered = renderer()
orphans = re.findall(r"\{[a-z_]+\}", rendered)
assert not orphans, (
f"{name} prompt has orphan placeholders after rendering: {orphans}"
)
# --- YAML hint consistency tests ---
def test_narrator_yaml_hint_fields():
"""The narrator YAML example in the prompt must include all parsed fields."""
required_fields = ["narration", "responding_characters", "mood"]
for field in required_fields:
assert field in NARRATOR_SYSTEM, (
f"Narrator prompt missing YAML field '{field}' in example"
)
def test_memory_yaml_hint_fields():
"""The memory YAML example must include all parsed fields."""
required_fields = ["summary", "add", "remove", "update"]
for field in required_fields:
assert field in MEMORY_UPDATE_SYSTEM, (
f"Memory prompt missing YAML field '{field}' in example"
)
def test_game_state_yaml_hint_fields():
"""The game state YAML example must include all parsed fields."""
required_fields = ["chapter_complete", "new_beats"]
for field in required_fields:
assert field in GAME_STATE_SYSTEM, (
f"Game state prompt missing YAML field '{field}' in example"
)
# --- Rendered prompt budget tests ---
def test_rendered_narrator_prompt_budget():
"""Fully rendered narrator prompt should stay under 400 tokens.
The template is ~300 tokens; rendered with real game content
it should not exceed ~400.
"""
rendered = _render_narrator_prompt()
tokens = estimate_tokens(rendered)
assert tokens <= 400, (
f"Rendered narrator prompt is ~{tokens} tokens (budget: 400)"
)
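The two regexes used by the budget and orphan tests are worth seeing in isolation. The template string below is a dummy, not one of the real prompts:

```python
import re

template = "You are {name}, the narrator. Setting: {setting}."

# Budget test: neutralize placeholders before token estimation, since
# they will be replaced with real content at render time.
stripped = re.sub(r"\{[^}]+\}", "PLACEHOLDER", template)
assert "{" not in stripped

# Orphan test: a key the renderer never supplied survives rendering.
# Simulate a partial render here by substituting only {name}.
partially_rendered = template.replace("{name}", "Maya")
orphans = re.findall(r"\{[a-z_]+\}", partially_rendered)
print(orphans)  # ['{setting}']
```

The orphan pattern is intentionally narrower (`[a-z_]+`) than the budget pattern, so literal braces in YAML examples inside prompts do not trip false positives.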
6.2 Context Assembly Lint Tests¶
Add tests in tests/test_context_lint.py that verify context assembly with realistic game data stays within budgets:
"""Context assembly budget tests using realistic game fixtures."""
from theact.engine.context import (
build_character_messages,
build_game_state_messages,
build_memory_messages,
build_narrator_messages,
)
from theact.llm.config import (
CHARACTER_CONFIG,
GAME_STATE_CONFIG,
LLMConfig,
MEMORY_UPDATE_CONFIG,
NARRATOR_CONFIG,
)
from theact.llm.profiler import profile_messages
def test_narrator_context_fits_window(sample_game):
"""Narrator prompt + max_tokens must fit within 8192 context limit."""
llm_config = LLMConfig(api_key="test")
messages = build_narrator_messages(sample_game, "I search for water.", llm_config)
profile = profile_messages(
"narrator", messages,
NARRATOR_CONFIG.max_tokens or llm_config.default_max_tokens,
llm_config.context_limit,
)
assert profile.headroom >= 0, (
f"Narrator prompt ({profile.total_prompt_tokens} tokens) + "
f"max_tokens ({profile.max_tokens_budget}) exceeds context limit "
f"({profile.context_limit}). Headroom: {profile.headroom}"
)
# Similar tests for character, memory, game_state agents...
These tests use existing sample_game fixtures from tests/conftest.py.
Verification¶
- All prompt lint tests pass on the current codebase.
- Intentionally breaking a prompt (e.g., adding a {missing} placeholder) causes the orphan test to fail.
- Inflating a prompt beyond 300 tokens causes the budget test to fail.
7. Playtest Integration¶
7.1 Call Log in Playtest Runner¶
Modify PlaytestRunner in src/theact/playtest/runner.py to create an LLMCallLog and pass it through to run_turn():
class PlaytestRunner:
def __init__(self, config: PlaytestConfig) -> None:
self.config = config
self.logger = PlaytestLogger()
self.call_log = LLMCallLog() # NEW
self.player_agent = PlayerAgent(...)
async def run(self) -> PlaytestReport:
# ... existing code ...
result = await run_turn(
game, player_input,
llm_config=self.config.llm_config,
call_log=self.call_log, # NEW
)
7.2 Call Log Persistence¶
In _finalize(), write the call log to disk alongside the report:
def _finalize(self, run_start: float) -> PlaytestReport:
# ... existing code ...
# Write call log
out_path = Path(self.config.output_dir) / self.config.timestamp
self.call_log.dump_yaml(out_path / "llm_calls.yaml")
# ... existing report generation ...
Also flush the call log incrementally in the main loop (same pattern as logger.flush_to_disk()).
7.3 Report Enhancements¶
Add an "LLM Call Summary" section to the playtest report. Modify generate_report_markdown() in src/theact/playtest/report.py:
def generate_report_markdown(report: PlaytestReport) -> str:
# ... existing sections ...
# LLM Call Summary
if report.call_log_summary:
lines.append("## LLM Call Summary")
lines.append("")
lines.append("| Agent | Calls | Avg Latency | Avg Think Tok | Avg Content Tok | Parse Success | Length Finishes |")
lines.append("|-------|-------|-------------|---------------|-----------------|---------------|-----------------|")
for agent, stats in report.call_log_summary.items():
lines.append(
f"| {agent} | {stats['calls']} | {stats['mean_latency_ms']}ms "
f"| {stats['mean_thinking_tokens']} | {stats['mean_content_tokens']} "
f"| {stats['parse_success_rate']:.0%} | {stats['length_finishes']} |"
)
lines.append("")
totals = report.call_log_totals
if totals:
lines.append(
f"**Total tokens:** {totals['total_prompt_tokens']} prompt + "
f"{totals['total_thinking_tokens']} thinking + "
f"{totals['total_content_tokens']} content"
)
lines.append("")
# Parse failure breakdown
if report.parse_failure_breakdown:
lines.append("## Parse Failures")
lines.append("")
lines.append("| Type | Count |")
lines.append("|------|-------|")
for ftype, count in report.parse_failure_breakdown.items():
lines.append(f"| {ftype} | {count} |")
lines.append("")
7.4 New Report Fields¶
Add these fields to PlaytestReport in src/theact/playtest/report.py:
@dataclass
class PlaytestReport:
# ... existing fields ...
# LLM call data (populated from LLMCallLog)
call_log_summary: dict[str, dict] = field(default_factory=dict) # agent -> stats
call_log_totals: dict = field(default_factory=dict) # aggregate totals
parse_failure_breakdown: dict[str, int] = field(default_factory=dict) # type -> count
Populate them in generate_report():
```python
def generate_report(
    logger: PlaytestLogger,
    config: PlaytestConfig,
    game_title: str,
    total_duration: float,
    memory_final: dict[str, str] | None = None,
    call_log: LLMCallLog | None = None,  # NEW
) -> PlaytestReport:
    # ... existing code ...

    call_log_summary = {}
    call_log_totals = {}
    parse_failure_breakdown = {}
    if call_log:
        call_log_summary = call_log.agent_summary()
        call_log_totals = call_log.summary()
        # Count parse failures by type
        for r in call_log.records:
            if r.parse_result != "success":
                parse_failure_breakdown[r.parse_result] = (
                    parse_failure_breakdown.get(r.parse_result, 0) + 1
                )

    return PlaytestReport(
        # ... existing fields ...
        call_log_summary=call_log_summary,
        call_log_totals=call_log_totals,
        parse_failure_breakdown=parse_failure_breakdown,
    )
```
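For reference, the aggregation helpers that generate_report() relies on (summary(), agent_summary(), records_for_turn()) can be sketched as below. This is a minimal sketch assuming only the record fields used in this section; the real src/theact/llm/call_log.py (Section 3.1) may track additional fields.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class LLMCallRecord:
    agent: str
    turn: int
    prompt_tokens: int
    thinking_tokens: int
    content_tokens: int
    latency_ms: float
    parse_result: str = "success"
    finish_reason: str = "stop"


@dataclass
class LLMCallLog:
    records: list[LLMCallRecord] = field(default_factory=list)

    def records_for_turn(self, turn: int) -> list[LLMCallRecord]:
        return [r for r in self.records if r.turn == turn]

    def summary(self) -> dict:
        """Aggregate token totals across all calls."""
        return {
            "total_prompt_tokens": sum(r.prompt_tokens for r in self.records),
            "total_thinking_tokens": sum(r.thinking_tokens for r in self.records),
            "total_content_tokens": sum(r.content_tokens for r in self.records),
        }

    def agent_summary(self) -> dict[str, dict]:
        """Per-agent stats in the shape consumed by generate_report_markdown()."""
        out: dict[str, dict] = {}
        for agent in sorted({r.agent for r in self.records}):
            rs = [r for r in self.records if r.agent == agent]
            out[agent] = {
                "calls": len(rs),
                "mean_latency_ms": round(mean(r.latency_ms for r in rs)),
                "mean_thinking_tokens": round(mean(r.thinking_tokens for r in rs)),
                "mean_content_tokens": round(mean(r.content_tokens for r in rs)),
                "parse_success_rate": sum(r.parse_result == "success" for r in rs) / len(rs),
                "length_finishes": sum(r.finish_reason == "length" for r in rs),
            }
        return out
```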
7.5 Per-Turn Token Usage in TurnLog¶
Extend TurnLog in src/theact/playtest/logger.py to include actual token data:
```python
@dataclass
class TurnLog:
    # ... existing fields ...

    # Per-agent token data (from call log)
    agent_tokens: dict[str, dict] = field(default_factory=dict)
    # e.g. {"narrator": {"prompt": 340, "thinking": 680, "content": 320}, ...}
```
Populate this from the call log after each turn:
```python
# In PlaytestRunner, after run_turn():
turn_records = self.call_log.records_for_turn(turn_num)
agent_tokens = {}
for r in turn_records:
    agent_tokens[r.agent] = {
        "prompt": r.prompt_tokens,
        "thinking": r.thinking_tokens,
        "content": r.content_tokens,
        "latency_ms": r.latency_ms,
    }
```
Verification¶
- A 3-turn playtest with `call_log` enabled produces `llm_calls.yaml` with one record per agent call per turn.
- The playtest report markdown includes the "LLM Call Summary" table.
- Parse failure breakdown correctly counts failures by type.
- `TurnLog.agent_tokens` is populated for each turn.
8. Implementation Steps¶
Step 1: Error Taxonomy (Section 2)¶
Files to create/modify:
- Modify: src/theact/llm/errors.py — add ParseFailureType enum
- Modify: src/theact/llm/parsing.py — add classify_parse_failure(), update YAMLParseError
- Create: tests/test_error_taxonomy.py
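A minimal sketch of classify_parse_failure(), assuming the three ParseFailureType members from Section 2.1 and a fenced-YAML output contract. The enum is inlined here so the sketch is self-contained, and the fence regex is illustrative rather than the project's actual parser.

```python
import re
from enum import Enum

import yaml  # PyYAML — assumed available, since the project already does YAML I/O


class ParseFailureType(str, Enum):
    """Mirror of the enum added to src/theact/llm/errors.py in Section 2.1."""
    success = "success"
    no_yaml_block = "no_yaml_block"
    invalid_yaml = "invalid_yaml"


# Illustrative fence matcher: a ``` block, optionally tagged yaml/yml.
FENCE_RE = re.compile(r"```(?:yaml|yml)?\s*\n(.*?)```", re.DOTALL)


def classify_parse_failure(raw: str) -> ParseFailureType:
    """Classify a raw LLM response against the structured-output contract."""
    match = FENCE_RE.search(raw)
    if match is None:
        return ParseFailureType.no_yaml_block
    try:
        yaml.safe_load(match.group(1))
    except yaml.YAMLError:
        return ParseFailureType.invalid_yaml
    return ParseFailureType.success
```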
Verification:
Step 2: Call Log Module (Section 3.1)¶
Files to create:
- src/theact/llm/call_log.py — LLMCallRecord, LLMCallLog
- tests/test_call_log.py
Verification:
Step 3: Wire Call Logging into Agents (Section 3.2-3.3)¶
Files to modify:
- src/theact/agents/narrator.py — add call_log and turn params, record calls
- src/theact/agents/character.py — same
- src/theact/agents/memory.py — same
- src/theact/agents/game_state.py — same
- src/theact/agents/summarizer.py — same
- src/theact/engine/turn.py — add call_log param, pass to agents
Verification:
- Existing tests still pass (backward compatible).
- Manual test: run a turn with a real LLMCallLog, verify records are created.
Step 4: Diagnostics Filesystem (Section 4)¶
Files to create:
- src/theact/engine/diagnostics.py — DiagnosticsWriter
- tests/test_diagnostics.py
Files to modify:
- src/theact/engine/turn.py — add debug param, create writer, write artifacts
Verification:
Step 5: Context Window Profiler (Section 5)¶
Files to create:
- src/theact/llm/profiler.py — AgentProfile, profile_messages(), format_profile()
- tests/test_profiler.py
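A rough sketch of what profile_messages() computes, using a chars/4 stand-in for the real estimator in src/theact/llm/tokens.py and a per-role breakdown (the real AgentProfile likely breaks context into finer sections than role):

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # stand-in for the project's real estimator


def profile_messages(messages: list[dict]) -> dict[str, int]:
    """Estimated token count per message role, plus a total."""
    profile: dict[str, int] = {}
    for msg in messages:
        role = msg["role"]
        profile[role] = profile.get(role, 0) + estimate_tokens(msg["content"])
    profile["total"] = sum(v for k, v in profile.items() if k != "total")
    return profile


def format_profile(profile: dict[str, int]) -> str:
    """Render the profile as an aligned report with percentages."""
    total = profile["total"]
    lines = []
    for k, v in profile.items():
        if k != "total":
            lines.append(f"{k:>10}: {v:>6} tok ({v / total:.0%})")
    lines.append(f"{'total':>10}: {total:>6} tok")
    return "\n".join(lines)
```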
Verification:
Step 6: Prompt Linting Tests (Section 6)¶
Files to create:
- tests/test_prompt_lint.py
- tests/test_context_lint.py
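The lint tests can be built around a small pure function that returns a list of problems, which keeps assertions readable. A sketch with a placeholder budget and checks; the real budget values and format rules live with the project's prompts:

```python
PROMPT_TOKEN_BUDGET = 1500  # placeholder; the real budget lives with the prompts


def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # chars/4 stand-in for the real estimator


def lint_prompt(prompt: str, budget: int = PROMPT_TOKEN_BUDGET) -> list[str]:
    """Return a list of human-readable problems; empty means the prompt passes."""
    problems: list[str] = []
    tokens = estimate_tokens(prompt)
    if tokens > budget:
        problems.append(f"over budget: ~{tokens} > {budget} tokens")
    if "\t" in prompt:
        problems.append("contains tab characters")
    if not prompt.endswith("\n"):
        problems.append("missing trailing newline")
    return problems


def test_narrator_prompt_within_budget():
    # the real test would load the prompt from the project's prompt files
    prompt = "You are the narrator.\n"
    assert lint_prompt(prompt) == []
```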
Verification:
Step 7: Playtest Integration (Section 7)¶
Files to modify:
- src/theact/playtest/runner.py — add LLMCallLog, pass to run_turn(), persist
- src/theact/playtest/report.py — add call log fields to PlaytestReport, markdown section
- src/theact/playtest/logger.py — extend TurnLog with agent_tokens
Verification:
Step 8: Integration Test¶
Run a full integration check:

```shell
# Run all unit tests
uv run pytest tests/ -v

# Run lint/format
uv run prek run --all-files

# If LLM_API_KEY is available, run a short playtest with debug + call logging
uv run python scripts/playtest.py --game lost-island --turns 3 --debug
```
Verify:
- playtests/&lt;timestamp&gt;/llm_calls.yaml exists and contains structured records
- playtests/&lt;timestamp&gt;/report.md includes the "LLM Call Summary" section
- saves/playtest-&lt;timestamp&gt;/diagnostics/turn-001/ contains agent subdirectories
- Each agent subdirectory contains system_prompt.txt, user_message.txt, raw_response.txt
- turn-001/summary.yaml contains aggregated turn stats
9. Files Summary¶
New files¶
| File | Purpose |
|---|---|
| src/theact/llm/call_log.py | LLMCallRecord and LLMCallLog dataclasses |
| src/theact/llm/profiler.py | Context window profiler |
| src/theact/engine/diagnostics.py | Diagnostics filesystem writer |
| tests/test_error_taxonomy.py | Tests for ParseFailureType and classifier |
| tests/test_call_log.py | Tests for call log accumulation and aggregation |
| tests/test_diagnostics.py | Tests for diagnostics writer |
| tests/test_profiler.py | Tests for context window profiler |
| tests/test_prompt_lint.py | Prompt token budget and format lint tests |
| tests/test_context_lint.py | Context assembly budget tests |
Modified files¶
| File | Changes |
|---|---|
| src/theact/llm/errors.py | Add ParseFailureType enum |
| src/theact/llm/parsing.py | Add classify_parse_failure(), update YAMLParseError |
| src/theact/agents/narrator.py | Add call_log/turn params, record calls |
| src/theact/agents/character.py | Add call_log/turn params, record calls |
| src/theact/agents/memory.py | Add call_log/turn params, record calls |
| src/theact/agents/game_state.py | Add call_log/turn params, record calls |
| src/theact/agents/summarizer.py | Add call_log/turn params, record calls |
| src/theact/engine/turn.py | Add call_log/debug params, wire diagnostics |
| src/theact/engine/context.py | Optional profile breakdown in message builders |
| src/theact/llm/tokens.py | Add estimate_messages_content_tokens() |
| src/theact/playtest/runner.py | Create and pass LLMCallLog, persist call log |
| src/theact/playtest/report.py | Add call log summary fields and markdown section |
| src/theact/playtest/logger.py | Extend TurnLog with agent_tokens |
| scripts/diagnose_agent.py | Add --profile-context flag for token profiling |