Architecture¶
Technical reference for Amatelier's internals. If you want a conceptual overview, read the README on GitHub first.
Repository Layout¶
amatelier/
├── engine/               Python orchestrators (runner, therapist, scorer, etc.)
│   └── migrations/       SQL schema migrations, applied in order
├── roundtable-server/    Live SQLite chat layer (db_client, server)
│   └── logs/             Gemini error log etc. (runtime, gitignored)
├── agents/               Per-agent dirs — CLAUDE.md + IDENTITY.md shipped;
│   │                     runtime state (MEMORY, behaviors, sessions) gitignored
│   ├── elena/
│   ├── marcus/
│   ├── clare/
│   ├── simon/
│   ├── naomi/
│   ├── judge/
│   ├── therapist/
│   ├── opus-therapist/
│   ├── opus-admin/
│   └── haiku-assistant/
├── protocols/            11 on-demand protocol docs
├── store/                Skill catalog, templates, ledger (runtime)
├── shared-skills/        Curated distilled skills (runtime)
├── tests/                Integration tests
├── tools/                watch_roundtable.py live viewer
├── benchmarks/           Leaderboard snapshots (runtime)
├── SKILL.md              Skill metadata + landing card for Claude Code
├── STEWARD.md            Steward design spec
├── config.json           Team roster, models, fees, thresholds
└── run-roundtable.sh     Convenience wrapper
Engine File Map¶
The engine is 16 Python modules, about 9,000 lines in total. Responsibilities, file by file:
Entry points¶
| File | Lines | Purpose |
|---|---|---|
| roundtable_runner.py | 1673 | Orchestrator. Opens RT, spawns agents, runs debate rounds, closes, triggers post-RT pipeline. |
| therapist.py | 1387 | Post-RT debrief runner. Interviews workers, writes memory/behaviors/sessions. |
Data layer¶
| File | Lines | Purpose |
|---|---|---|
| db.py | 249 | SQLite connection, migration runner, core message ops (speak, listen, cursor tracking). |
Scoring¶
| File | Lines | Purpose |
|---|---|---|
| judge_scorer.py | 492 | Runs the Judge LLM call to score each agent per RT on 4 dimensions. |
| scorer.py | 580 | Score aggregation, fee deduction, gate bonuses, Byzantine variance flags. |
| analytics.py | 783 | Growth analytics across RTs — per-agent trend analysis, leaderboard computation. |
Economy & skills¶
| File | Lines | Purpose |
|---|---|---|
| store.py | 831 | Skill store — purchases, delivery, boost application, request fulfillment. |
| distiller.py | 247 | Extracts skill candidates (CAPTURE / FIX / DERIVE) from RT transcripts. |
| backfill_distill.py | 270 | Retroactive distillation for historical digests. |
| classify_concepts.py | 173 | Five-axis taxonomy classifier for DERIVE skills → novel_concepts.json. |
Agent state¶
| File | Lines | Purpose |
|---|---|---|
| agent_memory.py | 867 | Structured MEMORY.json access — goals, session summaries, episode aging. |
| evolver.py | 485 | Applies therapist-proposed behavior changes; syncs skills_owned from ledger. |
Agent adapters¶
| File | Lines | Purpose |
|---|---|---|
| claude_agent.py | 449 | Spawns a Claude Code subprocess as a worker. Listens to DB, posts replies. |
| gemini_agent.py | 268 | Spawns the Gemini-backed worker (Naomi). |
| gemini_client.py | 179 | Thin wrapper over google-generativeai. |
Steward¶
| File | Lines | Purpose |
|---|---|---|
| steward_dispatch.py | 461 | Parses [[request:]] tags, spawns ephemeral file-access subagent, injects results. |
Import graph (simplified)¶
roundtable_runner ──┬─→ db
                    ├─→ steward_dispatch ──→ [spawns ephemeral claude -p subprocess]
                    ├─→ judge_scorer
                    ├─→ scorer ──→ analytics
                    ├─→ distiller ──→ classify_concepts
                    ├─→ store ──→ scorer
                    ├─→ evolver ──→ store
                    ├─→ agent_memory
                    ├─→ [spawns claude_agent subprocess per worker]
                    └─→ [spawns gemini_agent subprocess for Naomi]

claude_agent ──→ db
gemini_agent ──→ db, gemini_client

therapist ──┬─→ db
            ├─→ agent_memory
            ├─→ evolver
            └─→ store
The runner does not import worker agents directly — it spawns them as subprocesses that communicate via the SQLite DB.
Database Schema¶
SQLite at roundtable-server/roundtable.db. Four migrations, applied in order at startup by db._ensure_schema().
001_initial.sql¶
roundtables — one row per RT. id, topic, status (open/closed), participants (comma-separated), timestamps.
messages — append-only chat log. roundtable_id, agent_name, message, timestamp.
read_cursors — per-agent-per-RT read position. Agents use this to know which messages they've seen.
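The speak/listen/cursor pattern described above can be sketched in a few lines. This is a minimal illustration, not the project's db.py API; column defaults and helper names beyond the migrations' table names are assumptions:

```python
import sqlite3

# Minimal sketch of speak/listen with per-agent read cursors.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    roundtable_id INTEGER, agent_name TEXT, message TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP);
CREATE TABLE read_cursors (
    agent_name TEXT, roundtable_id INTEGER, last_seen_id INTEGER DEFAULT 0,
    PRIMARY KEY (agent_name, roundtable_id));
""")

def speak(rt_id, agent, text):
    db.execute("INSERT INTO messages (roundtable_id, agent_name, message) "
               "VALUES (?, ?, ?)", (rt_id, agent, text))

def listen(rt_id, agent):
    """Return messages this agent hasn't seen yet, then advance its cursor."""
    db.execute("INSERT OR IGNORE INTO read_cursors (agent_name, roundtable_id) "
               "VALUES (?, ?)", (agent, rt_id))
    cur = db.execute("SELECT last_seen_id FROM read_cursors "
                     "WHERE agent_name=? AND roundtable_id=?",
                     (agent, rt_id)).fetchone()[0]
    rows = db.execute("SELECT id, agent_name, message FROM messages "
                      "WHERE roundtable_id=? AND id>? ORDER BY id",
                      (rt_id, cur)).fetchall()
    if rows:
        db.execute("UPDATE read_cursors SET last_seen_id=? "
                   "WHERE agent_name=? AND roundtable_id=?",
                   (rows[-1][0], agent, rt_id))
    return rows

speak(1, "elena", "Opening position.")
speak(1, "marcus", "Counterpoint.")
print(len(listen(1, "judge")))  # both messages on first read
print(len(listen(1, "judge")))  # nothing new afterwards
```

Because the cursor is keyed per agent per RT, every subprocess can poll independently without coordination beyond the shared DB.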
002_scores_table.sql¶
scores — per-agent per-RT grades on 4 dimensions (novelty, accuracy, impact, challenge). total, reasoning, grand_insight (10-score citation), scored_by (which judge run). Foreign key to roundtables.
003_spark_ledger.sql¶
spark_ledger — immutable record of every spark transaction. Positive = earned, negative = fee or penalty. reason, category (scoring/fee/penalty/purchase/gate_bonus/venture), optional roundtable_id. Current balance = SUM(amount) WHERE agent_name = X.
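The balance rule above ("current balance = SUM(amount)") maps directly to one query. A sketch, not the project's actual code:

```python
import sqlite3

# Sketch of the spark_ledger balance rule: balance is never stored,
# it is always the sum of an agent's immutable ledger entries.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE spark_ledger (
    agent_name TEXT, amount INTEGER, reason TEXT,
    category TEXT, roundtable_id INTEGER)""")
entries = [
    ("naomi", 6, "RT score payout", "scoring", 12),
    ("naomi", -5, "entry fee", "fee", 13),
    ("naomi", 3, "gate bonus", "gate_bonus", 13),
]
db.executemany("INSERT INTO spark_ledger VALUES (?, ?, ?, ?, ?)", entries)

balance = db.execute(
    "SELECT COALESCE(SUM(amount), 0) FROM spark_ledger WHERE agent_name=?",
    ("naomi",)).fetchone()[0]
print(balance)  # 4
```

Keeping the ledger append-only means any balance dispute can be settled by replaying the rows.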
004_byzantine_flags.sql¶
Adds is_flagged and flagged_since_round columns to scores. Used by the variance detector (scorer.compute_variance_flags) to catch agents whose scores deviate too far from peers across multiple RTs — a suspicious-pattern signal, not a penalty.
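One plausible shape for a detector like scorer.compute_variance_flags is a z-score against the peer mean. The threshold, windowing, and signature below are assumptions; only the flag-don't-penalize intent comes from the description above:

```python
from statistics import mean, stdev

# Hypothetical sketch of a per-RT variance flag: an agent is flagged when
# its total deviates from the peer mean by more than k sample standard
# deviations. k=1.5 and the z-score form are assumptions, not the
# project's actual rule.
def compute_variance_flags(totals: dict[str, float], k: float = 1.5) -> set[str]:
    scores = list(totals.values())
    if len(scores) < 3:
        return set()          # too few peers to define a consensus
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return set()          # perfect agreement, nothing to flag
    return {a for a, t in totals.items() if abs(t - mu) / sigma > k}

flags = compute_variance_flags(
    {"elena": 5, "marcus": 4, "clare": 6, "simon": 5, "naomi": 24})
print(flags)  # {'naomi'}
```

A real detector would look across multiple RTs (hence flagged_since_round) rather than a single snapshot.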
Agent Lifecycle¶
The full journey of an agent across a single roundtable:
1. PRE-RT
runner reads agents/<name>/CLAUDE.md (operating rules)
IDENTITY.md (persona seed)
MEMORY.md (accumulated context, runtime)
MEMORY.json (structured: goals, skills_owned, sessions)
behaviors.json (therapist-evolved behavioral deltas)
2. ENTRY FEE
scorer deducts fee from spark_ledger (5/8/15 sparks by model class)
3. BOOSTS
store.apply_boosts_for_rt — purchased consumables fire (extra floor turns,
first-speaker slot, etc.)
4. SPAWN
runner launches engine/claude_agent.py or gemini_agent.py as a subprocess
in its own terminal. Agent connects to the DB as <agent_name>.
5. ROUNDS
runner posts "ROUND N: begin", then "YOUR TURN: <agent>". Agent listens
on the DB, generates a response using its assembled context (CLAUDE +
IDENTITY + MEMORY + behaviors + round state summary), posts via
db.speak. Judge also listens and intervenes (REDIRECT / GATE / HALT)
in real time.
6. STEWARD (optional, during any round)
If agent includes [[request: ...]] in a post, runner extracts the query,
spawns an ephemeral claude -p subagent with Read/Grep/Glob tools, runs
the lookup against files registered in the briefing, injects the result
as a runner message. Agent sees it on next turn.
7. CLOSE
Runner closes RT. Transcript available via db.get_transcript.
8. SCORING
judge_scorer runs the Judge LLM once with the full transcript. Emits
scores per agent. scorer.score persists to scores table, processes GATE
signals from judge_messages for bonuses.
9. DISTILLATION
distiller extracts 10–15 skill candidates from transcript via a
separate Sonnet call. classify_concepts adds taxonomy tags. DERIVE
skills appended to novel_concepts.json. Admin later curates the best
3–5 for shared-skills/index.json.
10. THERAPIST (post-RT debrief)
therapist.py iterates each worker. For each:
- Builds context from digest + memory + behaviors + session
- Runs 2–4 turn interview with Opus
- Parses therapist output for behavioral_deltas, memory_updates,
session_summary, trait adjustments
- Writes to behaviors.json, MEMORY.md, MEMORY.json,
sessions/<rt_id>.md
11. POST-RT CLEANUP
store.consume_boosts_after_rt — consumables fired during this RT are
marked as spent. age_bulletin_requests — open store requests age one step.
agent_memory.age_goals — tick active goals forward.
evolver.sync_skills_owned — refresh agents' skill list from ledger.
12. ANALYTICS
analytics.update — per-agent growth trends, leaderboard snapshot
written to benchmarks/leaderboard.json.
Roundtable Runner Pipeline¶
roundtable_runner.py:run_roundtable() — the execution skeleton:
- Config load — config.json: entry fees, team roster
- Briefing load — parse briefing-xxx.md, extract Steward-registered files
- DB open — create RT row, set status=open
- Apply boosts — consume purchased consumables; apply first-speaker slot if won
- Entry fees — deduct from spark_ledger per agent
- Research window (Round 0) — each worker gets 3 free Steward requests to ground openings; fired in parallel
- Spawn workers + Judge — subprocesses launched; wait for them to connect
- Post briefing — runner broadcasts the briefing to the chat
- Round loop (default 3 rounds):
  - Post "ROUND N: begin" with budget status
  - Speak phase — each worker posts once. If a worker includes [[request:]], they go to the back of the queue while their Steward task runs. If the entire queue is deferred, an intermission fires.
  - Rebuttal phase — reverse order; each worker posts a rebuttal
  - Judge gate — Judge decides CONVERGED or CONTINUE. If CONVERGED, break the round loop.
  - Floor phase — workers with remaining budget may contribute or PASS
  - Round summary — Haiku summarizer writes a cumulative debate state (not a per-round summary — a running ESTABLISHED / LIVE POSITIONS / OPEN QUESTIONS / SHIFTS structure)
  - Rotate speaking order — first speaker moves to the back
  - Health audit — detect dead workers / timeouts / majority-dead abort condition
- Close RT — mark status=closed, collect final transcript
- Build digest — structured JSON summary (contributions, final positions, convergence reason, budget usage)
- Judge scoring — judge_scorer.judge_score() runs the Judge LLM with the transcript
- Process GATE bonuses — scan Judge messages for GATE: agent — reason
- Byzantine variance check — flag scores that deviate from peer consensus
- Ventures extraction — parse <VENTURE> / <MOONSHOT> / <SCOUT> tags
- Save leaderboard — write to benchmarks/leaderboard.json
- Update analytics — per-agent trend computation
- Distillation — extract skill candidates from transcript
- Novel concepts append — DERIVE skills → novel_concepts.json
- Therapist debrief (unless --skip-post)
- Store cleanup — consume boosts, age requests
- Memory cleanup — age goals, add session summaries
- Skills sync — refresh skills_owned per agent
- Notification — write roundtable-server/latest-result.md, fire OS toast
Skip-post (--skip-post flag) short-circuits after distillation for faster iteration during development.
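The digest built at close might look roughly like this. Every field name below is an illustrative assumption derived from the description (contributions, final positions, convergence reason, budget usage); the real schema may differ:

```json
{
  "roundtable_id": 42,
  "topic": "…",
  "convergence_reason": "judge_converged_round_2",
  "contributions": { "elena": ["…"], "marcus": ["…"] },
  "final_positions": { "elena": "…", "marcus": "…" },
  "budget_usage": {
    "elena": { "floor_turns_used": 1, "steward_requests": 2 }
  }
}
```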
Scoring System¶
Each RT emits a per-agent score row. Judge grades four dimensions (0–3 scale, or 10 for a grand insight):
| Dimension | Signal |
|---|---|
| Novelty | Said something the group didn't know |
| Accuracy | Claims correct and defensible |
| Impact | Changed the group's direction or output |
| Challenge | Pushed back on weak consensus with evidence |
total = novelty + accuracy + impact + challenge (0–12 typical, or higher if a 10 lands).
Calibration target: average RT total is 4–6. A 10 is rare by design — gate-level contributions only.
The score_judge prompt includes calibration examples and explicit reminders that most contributions are 1s. Judge-side model is Sonnet with --effort max.
Scores are persisted to scores table and aggregated by analytics.py into per-agent growth curves. Those curves feed session-start context so each agent enters a new RT knowing how they've been trending.
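A growth curve of the kind analytics.py produces can be reduced to a trend number per agent. The metric below (recent-window mean minus earlier-window mean) is an assumed stand-in, not the project's actual computation:

```python
# Hypothetical sketch of a per-agent trend over recent RT totals.
def trend(totals: list[float], window: int = 3) -> float:
    """Positive = improving, negative = declining across recent RTs."""
    if len(totals) < 2 * window:
        return 0.0                      # not enough history to compare
    earlier = totals[-2 * window:-window]
    recent = totals[-window:]
    return sum(recent) / window - sum(earlier) / window

history = [3, 4, 5, 5, 6, 7]            # per-RT totals, oldest first
print(trend(history))  # 2.0: recent mean 6.0 vs earlier mean 4.0
```

Whatever the real formula is, feeding the result into session-start context is what lets an agent "know how they've been trending."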
Distillation Pipeline¶
After scoring completes, distiller.py runs an LLM extraction over the full transcript:
Input: full RT transcript (up to 50K chars, capped)
Model: Sonnet
Output: JSON array of 10–15 skill objects
Each skill has:
{
"title": "Short descriptive name",
"type": "CAPTURE | FIX | DERIVE",
"agent": "originator (or comma-separated for collab)",
"pattern": "specific technique, with file/line refs",
"when_to_apply": "concrete conditions for reuse",
"structural_category": "state-boundary | signal-integrity | ...",
"trigger_phase": "system-design | code-review | debugging | ...",
"primary_actor": "individual-contributor | reviewer | architect | ...",
"problem_nature": "state-lifecycle | calibration-metric | ...",
"agent_dynamic": "convergence | synthesis | reframing (DERIVE only)",
"tags": ["3-5 searchable keywords"],
"one_liner": "plain-English summary"
}
CAPTURE — observed reusable technique.
FIX — an anti-pattern correction.
DERIVE — new concept synthesized from multiple contributions. Requires agent_dynamic.
DERIVE skills are also classified by classify_concepts.py along five orthogonal axes and appended to novel_concepts.json with a content-hash dedup check.
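The content-hash dedup check could be implemented along these lines. Which fields define a concept's identity is an assumption here (title + pattern); the real classifier may hash differently:

```python
import hashlib
import json

# Sketch of a content-hash dedup for novel_concepts.json.
def concept_hash(skill: dict) -> str:
    key = json.dumps(
        {"title": skill["title"].strip().lower(),
         "pattern": skill["pattern"].strip()},
        sort_keys=True)
    return hashlib.sha256(key.encode()).hexdigest()

def append_if_new(concepts: list[dict], skill: dict) -> bool:
    h = concept_hash(skill)
    if any(c.get("content_hash") == h for c in concepts):
        return False                      # duplicate — skip
    concepts.append({**skill, "content_hash": h})
    return True

store: list[dict] = []
s = {"title": "Signal Integrity Gate",
     "pattern": "check variance before trusting scores"}
print(append_if_new(store, s))   # True — first time seen
print(append_if_new(store, s))   # False — deduped
```

Normalizing before hashing (strip/lowercase) keeps trivial re-phrasings of a title from slipping past the check.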
Admin later reviews the raw skill candidates in the digest and curates the best 3–5 into shared-skills/index.json for team-wide availability.
Store / Economy¶
store.py implements the spark economy. Key data sources:
- store/catalog.json — purchasable items (skills, boosts, slots)
- store/skill_templates.py — full methodology text for 8 foundational skills
- store/ledger.json — pending purchases and consumable state (runtime, gitignored)
- spark_ledger DB table — immutable transaction log (current balance = SUM of amounts per agent)
Flow for a typical purchase:
1. Agent announces PURCHASE: <item_id> in a post during an RT
2. Runner detects the tag after the round and calls store.attempt_purchase(agent, item_id)
3. store.attempt_purchase:
   - Checks balance against item cost
   - Writes a spark_ledger debit if affordable
   - Delivers skill content (appends to agent MEMORY or registers a consumable)
   - Records in store/ledger.json
4. Next RT: apply_boosts_for_rt reads the ledger for this agent and applies active consumables
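The purchase flow above reduces to a check-balance-then-debit sequence. This sketch uses an illustrative catalog shape and in-memory ledger; only the ordering of the steps comes from the description:

```python
# Hypothetical sketch of the purchase flow. CATALOG shape and the
# ledger-of-dicts representation are assumptions for illustration.
CATALOG = {"skill_root_cause": {"cost": 6, "kind": "skill"}}

def attempt_purchase(agent: str, item_id: str, ledger: list[dict]) -> bool:
    item = CATALOG.get(item_id)
    if item is None:
        return False                               # unknown item
    balance = sum(e["amount"] for e in ledger if e["agent"] == agent)
    if balance < item["cost"]:
        return False                               # can't afford it
    ledger.append({"agent": agent, "amount": -item["cost"],
                   "reason": f"purchase:{item_id}", "category": "purchase"})
    # delivery (append skill text to MEMORY, or register a consumable) goes here
    return True

ledger = [{"agent": "clare", "amount": 10,
           "reason": "scoring", "category": "scoring"}]
print(attempt_purchase("clare", "skill_root_cause", ledger))  # True; balance now 4
print(attempt_purchase("clare", "skill_root_cause", ledger))  # False; 4 < 6
```

Because the debit is just another immutable ledger row, a failed delivery can be compensated with a matching credit rather than an edit.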
Relegation logic (analytics.check_relegation): three consecutive net-negative RTs trigger a bench-or-delete choice — Admin decides.
Steward System¶
Full design in steward-design. Summary:
- Agents request data by including [[request: ...]] in a post
- Runner detects the tag and parses the request
- A deterministic path is tried first: JSON filters, regex lookups, value extraction (no LLM)
- If the deterministic path fails, an ephemeral claude -p subagent is spawned with --allowedTools Read,Grep,Glob
- The subagent operates only on files listed in the briefing's ## Steward-Registered Files section
- The result is injected back into chat as a runner message, tagged [Research result for <agent>]
- Budget: 3 requests per agent per RT, tracked in StewardBudget
- Research window (Round 0): 3 free concurrent requests per agent to ground opening statements
- Judge enforces citation — empirical claims without a Steward citation or inline math derivation are penalized
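The tag-extraction step above is essentially one regex pass over a post. The exact tag grammar is an assumption here (anything between [[request: and ]]):

```python
import re

# Sketch of [[request: ...]] extraction. The regex is an assumption about
# the tag grammar; the non-greedy match keeps multiple tags in one post
# from merging into a single capture.
REQUEST_TAG = re.compile(r"\[\[request:\s*(.+?)\s*\]\]", re.DOTALL)

def extract_requests(post: str) -> list[str]:
    return REQUEST_TAG.findall(post)

post = ("I suspect the cache is stale. "
        "[[request: grep 'ttl' in config/cache.yaml]] Also "
        "[[request: value of steward.timeout_seconds in config.json]]")
print(extract_requests(post))
```

Each extracted string would then be routed to the deterministic handlers first, falling through to the subagent only when none match.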
Agent State Files¶
Per agent (in agents/<name>/):
| File | Purpose | Gitignored? |
|---|---|---|
| CLAUDE.md | Operating rules, DB client location, workflow | Shipped |
| IDENTITY.md | Persona seed — who this agent is at base | Shipped |
| MEMORY.md | Accumulated context — what they've learned | Runtime (gitignored) |
| MEMORY.json | Structured state — goals, skills_owned, session pointers, active episodes | Runtime |
| behaviors.json | Therapist-proposed behavioral deltas (accepted and pending) | Runtime |
| metrics.json | Current sparks, rank, trait adjustments | Runtime |
| sessions/<rt_id>.md | Per-RT debrief summary from Therapist | Runtime |
| skills/<skill_id>.md | Delivered purchased skills | Runtime |
CLAUDE.md is the operating manual — it doesn't change as the agent evolves. IDENTITY.md is the persona seed and can be edited by the Therapist over time. Runtime files are where personality actually lives.
Configuration¶
config.json is the single source of tuning for roster, fees, thresholds, and model choice.
Structure (abridged):
{
"version": "...",
"team": { "workers": { "<name>": { "model": "...", "role": "..." } } },
"roundtable": { "max_rounds": 3, "gemini_refresh_round": 5, ... },
"competition": {
"entry_fees": { "haiku": 5, "flash": 5, "sonnet": 8, "opus": 15 },
"gate_bonus": { "enabled": true, "max_per_rt": 3, "sparks": 3 }
},
"self_determined_thresholds": { ... },
"gemini": { "model": "...", "max_tokens": ..., "temperature": ... },
"steward": {
"enabled": true,
"budget_per_agent": 3,
"timeout_seconds": 120,
"max_response_tokens": 2000,
"haiku_model": "claude-haiku-4-5-20251001",
"sonnet_model": "claude-sonnet-4-20250514"
}
}
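Resolving an agent's entry fee from this structure is a small lookup. The mapping of a worker's "model" string onto an entry_fees key is an assumed substring match; the real resolution logic may differ:

```python
import json

# Sketch: resolve an agent's entry fee from a config.json-shaped dict.
# The substring match from model name to fee class is an assumption.
CONFIG = json.loads("""
{
  "team": { "workers": { "naomi": { "model": "gemini-flash", "role": "worker" } } },
  "competition": { "entry_fees": { "haiku": 5, "flash": 5, "sonnet": 8, "opus": 15 } }
}
""")

def entry_fee(agent: str, config: dict) -> int:
    model = config["team"]["workers"][agent]["model"]
    for model_class, fee in config["competition"]["entry_fees"].items():
        if model_class in model:
            return fee
    raise KeyError(f"no fee class matches model {model!r}")

print(entry_fee("naomi", CONFIG))  # 5
```

Centralizing the fee table in config.json means rebalancing the economy never touches engine code.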
Extension Points¶
- Add an agent type — new entry in config.json:team.workers, new dir in agents/, wire in roundtable_runner._launch_worker
- Add a skill to the store — update store/catalog.json and add a template in store/skill_templates.py
- Change scoring dimensions — edit the judge_scorer prompt + a migration to add columns to the scores table
- Add a protocol — drop a markdown file in protocols/ and reference it from SKILL.md's protocol table
- Custom Steward handlers — add deterministic paths in steward_dispatch.try_deterministic before the subagent fallback fires
Testing¶
Integration test: tests/test_integration.py. Covers:
- DB roundtrip: open RT → speak → listen → close
- Score persistence: insert → read → aggregate
- Transcript index: build from real DB → verify format
- Recall: filter by agent / keyword / round
- Cumulative debate state: verify prior_state threading
- Therapist wiring: _apply_outcomes → verify memory/episodes/diary written
- Runner wiring: verify session bridge + goal aging + session summary calls
- Context loading: claude_agent + gemini_agent use render_memory
Running the suite requires no external API calls — tests use fixture data.
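The first covered case (DB roundtrip) might look like this as a self-contained test against an in-memory SQLite fixture. The condensed schema and helper-free style are assumptions; this is not the project's actual test code:

```python
import sqlite3

# Sketch of "DB roundtrip: open RT → speak → listen → close" against an
# in-memory fixture. Schema condensed from the migrations section.
def test_rt_roundtrip():
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE roundtables (id INTEGER PRIMARY KEY, topic TEXT, status TEXT);
    CREATE TABLE messages (id INTEGER PRIMARY KEY AUTOINCREMENT,
        roundtable_id INTEGER, agent_name TEXT, message TEXT);
    """)
    # open
    db.execute("INSERT INTO roundtables (topic, status) VALUES (?, 'open')",
               ("caching",))
    rt_id = db.execute("SELECT id FROM roundtables").fetchone()[0]
    # speak
    db.execute("INSERT INTO messages (roundtable_id, agent_name, message) "
               "VALUES (?, ?, ?)", (rt_id, "elena", "hello"))
    # listen
    heard = db.execute("SELECT agent_name, message FROM messages "
                       "WHERE roundtable_id=?", (rt_id,)).fetchall()
    assert heard == [("elena", "hello")]
    # close
    db.execute("UPDATE roundtables SET status='closed' WHERE id=?", (rt_id,))
    assert db.execute("SELECT status FROM roundtables WHERE id=?",
                      (rt_id,)).fetchone()[0] == "closed"

test_rt_roundtrip()
print("ok")
```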