# The Steward — Empirical Grounding for Roundtable Debates

## What It Is
The Steward is an ephemeral subagent with file tools (Read, Grep, Glob). It executes precise lookups against pre-registered files and returns citation-ready extracts. It does not theorize, argue, or persist between calls.
## Why It Exists
RT e81ad3251b80 (Vela Gate dualSignal Granularity) exposed the problem: 42 unverified file/path citations and 57 numeric claims across 5 agents, with 1 caught hallucination (Naomi fabricated "6 of 8 failures occurred on shards with centroid rank 1"). The Judge demanded the measurement 5 times; no agent could run it. The group converged on Option D using partially invented data.
The Steward prevents this by giving debaters a path to ground claims in real data, and giving the Judge a path to verify them.
## Architecture

```
Debater speaks with [[request: show me the gating logic]] tag
        |
Runner detects [[request:]] tag
        |
Runner fires Steward subagent async (Haiku with file tools)
        |
Debater goes to back of speaking queue
        |
Other agents continue speaking
        |
Steward returns extract → runner injects into debater's context
        |
Debater's turn comes back → speaks with evidence
```
The Steward is NOT a roster member. No CLAUDE.md persona, no MEMORY.json, no Therapist sessions, no spark balance. It exists only as an ephemeral subprocess with file tools.
## Invocation

Debaters include a request tag in their speak message, e.g. `[[request: show me the gating logic]]`. The tag content is natural language — debaters don't need structured queries. The Steward is smart enough to navigate the codebase and find what was asked for.
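Tag detection can be as simple as a regex scan of the speak message. A minimal sketch; `extract_requests` and `REQUEST_TAG` are illustrative names, not the runner's real API:

```python
import re

# Illustrative sketch: pull the natural-language body out of each
# [[request: ...]] tag in a speak message.
REQUEST_TAG = re.compile(r"\[\[request:\s*(.*?)\]\]", re.DOTALL)

def extract_requests(message: str) -> list[str]:
    """Return the body of every [[request: ...]] tag, stripped."""
    return [body.strip() for body in REQUEST_TAG.findall(message)]

msg = "The gate looks too strict. [[request: show me the gating logic]]"
print(extract_requests(msg))  # → ['show me the gating logic']
```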
## Defer-to-Back Queue

When an agent makes a request during the speak phase:

- Their message (with the `[[request:]]` tag) goes into the transcript immediately
- Runner fires the Steward subagent asynchronously
- Agent moves to the back of the speaking queue
- Other agents continue speaking (debate doesn't stall)
- As each Steward completes, results are injected into context
- When the deferred agent's turn comes back around, they have their data
## Intermission
If ALL remaining agents in the queue have deferred (everyone requested data, nobody left to speak), the runner pauses and waits for all Steward tasks to complete. Results are injected, then speaking resumes. This is an emergent state — "intermission" — not a planned phase.
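The defer-to-back loop and the emergent intermission can be sketched with asyncio. Everything here is an illustrative stand-in for the runner's real machinery: `run_steward`, `speak_phase`, and the transcript strings are assumptions, not the actual implementation:

```python
import asyncio
from collections import deque

async def run_steward(agent: str) -> str:
    await asyncio.sleep(0.01)            # stands in for the Haiku subagent call
    return f"extract for {agent}"

async def speak_phase(agents: list[str], wants_data: set[str]) -> list[str]:
    queue = deque(agents)
    pending: dict[str, asyncio.Task] = {}  # agent -> in-flight Steward task
    transcript: list[str] = []
    while queue:
        agent = queue.popleft()
        if agent in pending:             # deferred agent is back at the front
            if not pending[agent].done():
                # Nobody left to speak ahead of this agent: intermission.
                transcript.append("[intermission]")
                await asyncio.gather(*pending.values())
            transcript.append(f"{agent}: speaks with {pending.pop(agent).result()}")
        elif agent in wants_data:
            pending[agent] = asyncio.ensure_future(run_steward(agent))
            queue.append(agent)          # defer to the back of the queue
            transcript.append(f"{agent}: requested data, deferred")
        else:
            transcript.append(f"{agent}: speaks")
    return transcript

print(asyncio.run(speak_phase(["ana", "ben", "cara"], wants_data={"ben"})))
```

With three agents and one requester, the run hits the intermission because the Steward task has had no chance to complete before the deferred turn comes back around.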
## Budget
- 3 requests per agent per RT (total, not per-round)
- Forces strategic allocation: front-load to ground opening position, or save for rebuttal fact-checks
- Prevents filibuster (can't keep deferring forever)
- 1 active request per agent at a time
- Unused requests do NOT roll over
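A minimal sketch of how the runner might enforce these limits; the class and method names are illustrative, not the engine's API:

```python
# Illustrative budget tracker: 3 requests per agent per RT, at most one
# active request per agent at a time, no rollover between RTs (a fresh
# instance is created per RT).
class StewardBudget:
    def __init__(self, per_agent: int = 3):
        self.per_agent = per_agent
        self.remaining: dict[str, int] = {}
        self.active: set[str] = set()

    def try_acquire(self, agent: str) -> bool:
        left = self.remaining.setdefault(agent, self.per_agent)
        if left == 0 or agent in self.active:
            return False                 # budget spent, or request already in flight
        self.remaining[agent] = left - 1
        self.active.add(agent)
        return True

    def release(self, agent: str) -> None:
        """Called when the Steward result lands."""
        self.active.discard(agent)

b = StewardBudget()
print(b.try_acquire("marcus"))  # True: first request acquired
print(b.try_acquire("marcus"))  # False: only one active request at a time
```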
## Execution Paths

### Deterministic (no model call)

For simple lookups, the runner answers directly without an LLM call:

- JSON filters: count rows where `gateReason == 'dualSignal'` in `results.json`
- Value extraction: the value of `vectorFloorStrict` in `tuning.dart`
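The JSON-filter case reduces to a few lines of stdlib Python; `count_rows` is an illustrative name, not the engine's API:

```python
import json

# Illustrative sketch of the deterministic JSON-filter path: count rows
# in a JSON array file where a given field equals a given value.
def count_rows(path: str, field: str, value: str) -> int:
    with open(path) as f:
        rows = json.load(f)
    return sum(1 for row in rows if row.get(field) == value)

# e.g. count_rows("staging/smoke-expanded/results.json", "gateReason", "dualSignal")
```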
### Subagent (Haiku or Sonnet)

For anything requiring code comprehension or navigation:

- "Show me the relevanceGate function"
- "How does the centroid pre-filter interact with hub rescue?"
- Any fuzzy or multi-line request
The subagent gets Read, Grep, Glob tools — file access only. No Edit, no Write, no Bash.
## Private Injection

Steward results are private to the requesting agent and the Judge; they are not injected into all five agents' contexts, which saves tokens.
The requesting agent cites the relevant parts when they speak. Other agents see the citation in the speech. The between-rounds summarizer naturally compresses cited evidence into the debate state (~40 tokens vs 2000 raw).
Exception: Judge-dispatched verifications are shared to all (verdicts, not arguments).
## Result Format

```
[Research result for marcus | haiku | 4.2s]:

merger.dart:152-168 — relevanceGate function:

({bool passed, List<ScoredAtom> results, String reason}) relevanceGate(
  List<ScoredAtom> merged,
  double threshold, [
  bool relaxed = false,
  RetrievalTuning? tuning,
]) {
  if (merged.isEmpty) { return (..., reason: 'empty'); }
  final domainResult = domainGate(merged, relaxed, tuning);
  if (!domainResult.passed) return domainResult;
  return confidenceGate(domainResult.results, threshold, relaxed, tuning);
}
```
Properties:
- Always includes source file + line numbers
- Factual only — no interpretation
- Hard cap: 2000 tokens (truncate with [truncated] if exceeded)
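A sketch of the hard cap, assuming a rough 4-characters-per-token approximation in place of a real tokenizer:

```python
# Illustrative truncation for the 2000-token response cap. A real
# implementation would count model tokens; this stand-in assumes
# roughly 4 characters per token.
MAX_RESPONSE_TOKENS = 2000

def truncate_result(text: str, max_tokens: int = MAX_RESPONSE_TOKENS) -> str:
    max_chars = max_tokens * 4
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[truncated]"

print(truncate_result("short extract"))  # passes through unchanged
```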
## Citation Enforcement

The Judge enforces grounded claims via the existing moderation role:

- Any claim with a specific number/threshold/line-number that does NOT reference a Steward result or the briefing is scrutinized
- `HALT [{agent}]: fabricated data` for invented metrics
- Workers who cite Steward results accurately are recognized
- Workers who fabricate when they could have requested are penalized on accuracy
This is what makes the Steward not just available but required.
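One plausible first-pass screen, a heuristic sketch rather than the Judge's actual logic, flags sentences that contain a bare number but cite neither a Steward extract nor the briefing:

```python
import re

# Heuristic sketch only: a sentence with a number that mentions neither
# "Steward" nor "briefing" gets flagged for scrutiny. The real Judge
# applies judgment, not a regex.
NUMBER = re.compile(r"\b\d+(\.\d+)?\b")

def needs_scrutiny(sentence: str) -> bool:
    cited = "steward" in sentence.lower() or "briefing" in sentence.lower()
    return bool(NUMBER.search(sentence)) and not cited

print(needs_scrutiny("6 of 8 failures occurred on shards with centroid rank 1"))  # True
print(needs_scrutiny("Per the Steward extract, the threshold is 0.42"))           # False
```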
## What the Steward Does NOT Do
- No theorizing. Returns data, not opinions.
- No writing. Read-only file tools. Cannot modify files, run tests, or execute code.
- No cross-file inference. Each request navigates from the request to the answer. "Search everything" is bounded by registered files.
- No persistence. Dies after each request. No cache, no memory, no session continuity.
- No personality. No behavioral evolution, no Therapist sessions.
- Never guesses. If it can't find the answer, it says "Not found" and lists what it checked.
## Registered Files

The Steward can ONLY access files declared in the briefing's data pack:

```
## Steward-Registered Files

- staging/smoke-expanded/results.json
- staging/diagnostic-probe/*.json
- lib/core/retrieval/merger.dart
- lib/core/retrieval/tuning.dart
```

Briefings without this section → Steward disabled for that RT.
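The wildcard entry suggests glob-style matching; a sketch of the access check using stdlib `fnmatch` (the function name is illustrative):

```python
from fnmatch import fnmatch

# Illustrative access check against the briefing's data pack. The pattern
# list mirrors the example above.
REGISTERED = [
    "staging/smoke-expanded/results.json",
    "staging/diagnostic-probe/*.json",
    "lib/core/retrieval/merger.dart",
    "lib/core/retrieval/tuning.dart",
]

def is_registered(path: str) -> bool:
    return any(fnmatch(path, pattern) for pattern in REGISTERED)

print(is_registered("staging/diagnostic-probe/run-07.json"))  # True
print(is_registered("lib/core/secrets.dart"))                 # False
```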
## Parameters
| Parameter | Value | Location |
|---|---|---|
| Budget per agent | 3 requests/RT | config.json → steward.budget_per_agent |
| Timeout | 120 seconds | config.json → steward.timeout_seconds |
| Max response tokens | 2000 | config.json → steward.max_response_tokens |
| Haiku model | claude-haiku-4-5-20251001 | config.json → steward.haiku_model |
| Sonnet model | claude-sonnet-4-20250514 | config.json → steward.sonnet_model |
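Read together, the table's location column implies a `config.json` fragment along these lines (the surrounding file layout is an assumption):

```json
{
  "steward": {
    "budget_per_agent": 3,
    "timeout_seconds": 120,
    "max_response_tokens": 2000,
    "haiku_model": "claude-haiku-4-5-20251001",
    "sonnet_model": "claude-sonnet-4-20250514"
  }
}
```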
## Implementation

- `engine/steward_dispatch.py` — request parser, budget tracker, subagent spawner, result formatter
- Runner speak-phase loop modified for defer-to-back queue + async dispatch + intermission
- Judge CLAUDE.md updated with citation enforcement instructions
- All worker CLAUDE.md files updated with `[[request:]]` syntax documentation
- Steward log saved alongside the digest as `steward-log-{rt_id}.json`
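A log entry might look like the following; every field name and the token count are illustrative assumptions, with the remaining values drawn from the result-format example above:

```json
{
  "rt_id": "e81ad3251b80",
  "agent": "marcus",
  "request": "show me the gating logic",
  "path": "subagent",
  "model": "haiku",
  "elapsed_seconds": 4.2,
  "extract_source": "merger.dart:152-168",
  "response_tokens": 412
}
```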
## Success Criteria
- Zero fabricated metrics — debaters stop inventing numbers because requesting is cheaper than the accuracy penalty
- No context bloat — full files never enter shared context; only 2000-token surgical extracts, compressed to ~40 tokens by summarizer
- Empirically grounded convergence — RT outcomes reference real data
- Judge can verify — every cited number has a traceable extract in the Steward log
- No debate stalling — async dispatch + defer-to-back keeps the conversation moving