Test Harness
Fixture-driven replay rig scoping with 7 open calls for Mark.

Fixture-Driven Replay Across Fourteen Classes
Mark, when I say "test harness" I do not mean unit tests on Xano functions. I mean a fixture-driven replay rig that proves the dispatch pipeline behaves correctly across the 14 classes, with reproducible inputs, deterministic-enough outputs, and a clear pass/fail story per fixture. This doc scopes what we build, why, and the open calls I need from you on Apr 29.
Why a harness, not a test suite
The Anything Engine is a chain of probabilistic steps:
- Classifier (LLM) → 14-class label
- Tool branch → Cypher / SQL → rows
- WHY synthesizer (LLM) → text
- Crayon stream → templates
- Zep ingest (async)
Each step has its own failure mode. Unit tests on the Xano functions catch syntax bugs but miss the interactions: classifier confidence drift, tool yield collapse when the graph rebalances, synthesizer regressions when we swap models, memory steering producing the wrong class on turn 2. We need a rig that replays real queries end-to-end and diffs the trace.
What the harness actually tests
| Dimension | What we assert | How |
|---|---|---|
| Route accuracy | Classifier returns the expected class for a fixture query | Exact-match on the `classification.class` field |
| Tool yield | Tool branch returns >= N rows, never 0 for non-empty fixtures | Count `events[]` entries of type `tpl` with `name=contact_card` |
| WHY tone | Generated paragraphs contain no banned phrases | Regex sweep against the Apr 21 banlist (see below) |
| Memory steering | Same query, two `thread_id`s with different histories, yields different classes | Diff `classification.class` across the two runs |
| Latency p95 | Dispatch round-trip under threshold | Histogram across N runs; surface p50 / p95 / p99 |
| Classification stability | Same fixture run 5x in a row stays on the same class | Mode + variance across runs |
| Schema integrity | Cards always have required fields (`name`, `why`, `source`) | JSON-schema validate each `tpl` event |
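To make the assertion targets concrete, here is a minimal sketch of the buffered trace shape and the schema-integrity check. The event layout is my assumption about what the buffered SSE stream reduces to, not a confirmed wire format; field names mirror the fixture shape below.

```ts
// Sketch of the buffered dispatch trace the assertions run against.
// ASSUMPTION: this event layout is illustrative, not the confirmed wire format.
interface TplEvent {
  type: "tpl";
  name: string;                  // e.g. "contact_card"
  data: Record<string, unknown>; // card payload
}

interface DispatchTrace {
  classification: { class: string; confidence: number };
  events: TplEvent[];
  latency_ms: number;
}

// Schema-integrity check: every contact_card carries the required fields.
function assertRequiredFields(trace: DispatchTrace, required: string[]): string[] {
  const failures: string[] = [];
  const cards = trace.events.filter((e) => e.name === "contact_card");
  for (const card of cards) {
    for (const field of required) {
      if (card.data[field] == null || card.data[field] === "") {
        failures.push(`contact_card missing "${field}"`);
      }
    }
  }
  return failures; // empty = pass
}
```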
What we are not testing yet:
- AlloyDB SQL — schema does not exist outside this doc
- Exact row identity — the FalkorDB graph mutates as enrichment runs; we assert shape, not contents
- LLM output token-for-token — we accept variance, we lint for tone
Fixtures
A fixture is a frozen (input, expected) pair. We want ~25-50 of them across the 14 classes, weighted toward `find_investors` (the laser-focus class) and the catch-all (#14).
Storage options on the table for Apr 29:
- Option A: `docs/fixtures/*.json` checked into this repo. Pros: version-controlled, diffable in PRs, runnable from CI. Cons: no UI, no production-time capture loop.
- Option B: Xano table `dispatch_fixtures` with columns mirroring the JSON shape. Pros: lets us capture live production runs as fixtures with a one-click button. Cons: drift between repo and DB.
- Option C (my preference): both. JSON in repo for the seeded set, Xano table for live capture, `scripts/sync-fixtures.ts` reconciles (see the sketch after this list).
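If we land on Option C, the reconciliation can stay dumb for a long time. A sketch of `scripts/sync-fixtures.ts`, assuming a Xano REST endpoint for the `dispatch_fixtures` table exists; the URL and upsert semantics below are hypothetical placeholders:

```ts
// scripts/sync-fixtures.ts -- sketch only.
// ASSUMPTION: XANO_FIXTURES_URL points at a hypothetical REST endpoint for the
// dispatch_fixtures table that upserts by fixture id; swap in the real one.
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";

async function syncFixtures(): Promise<void> {
  const dir = "docs/fixtures";
  const files = (await readdir(dir)).filter((f) => f.endsWith(".json"));
  for (const file of files) {
    const fixture = JSON.parse(await readFile(path.join(dir, file), "utf8"));
    // Repo JSON is the source of truth for the seeded set; push it up.
    await fetch(`${process.env.XANO_FIXTURES_URL}/${fixture.id}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(fixture),
    });
  }
}

syncFixtures();
```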
Fixture shape
```json
{
  "id": "find_investors_001",
  "class_expected": "find_investors",
  "min_results": 5,
  "input": {
    "query": "find me investors for a Series A medtech round, $5-10M check, US-based",
    "thread_id": "harness-find_investors_001",
    "user_id": "harness-bot",
    "seed_thread_messages": []
  },
  "asserts": {
    "classification.class": "find_investors",
    "classification.confidence_min": 0.7,
    "events.tpl.contact_card.count_min": 5,
    "events.tpl.contact_card.required_fields": ["name", "why", "source"],
    "tone.banned_phrases.must_be_absent": true,
    "latency_ms_p95_max": 12000
  }
}
```
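In the runner, that shape maps onto a small TypeScript interface. A sketch mirroring the JSON above, with the `asserts` keys kept as flat strings:

```ts
// Fixture as the runner loads it (sketch; mirrors the JSON shape above).
interface Fixture {
  id: string;
  class_expected: string;
  min_results?: number;
  input: {
    query: string;
    thread_id: string;
    user_id: string;
    seed_thread_messages: { role: "user" | "assistant"; content: string }[];
  };
  asserts: Record<string, unknown>; // flat keys, e.g. "classification.class"
}
```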
For memory tests, the fixture seeds the thread first via Zep's `POST /api/v2/threads/{id}/messages`, then runs dispatch:
```json
{
  "id": "memory_steering_001",
  "class_expected": "find_warm_intros",
  "input": {
    "query": "who else should I talk to",
    "thread_id": "harness-memory_steering_001",
    "user_id": "harness-bot",
    "seed_thread_messages": [
      {"role": "user", "content": "I just met with Caitlin Morse from BrainSpace about IP licensing"},
      {"role": "assistant", "content": "Logged. Caitlin runs IP partnerships at BrainSpace, focused on neurotech licensing deals."}
    ]
  },
  "asserts": {
    "classification.class": "find_warm_intros",
    "memory.context_used": true
  }
}
```
The same query without seed messages should classify as something else (likely `find_investors`, or a fall-through to the catch-all). That asymmetry is the test.
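Mechanically, a memory fixture seeds then dispatches. A sketch using the `Fixture` shape above; the Zep base URL, auth header, and request body framing are assumptions (only the endpoint path comes from this doc), and the dispatch endpoint is the one named under Replay strategy below:

```ts
// Sketch: seed the Zep thread, then run dispatch.
// ASSUMPTIONS: ZEP_BASE_URL / ZEP_API_KEY env vars and the { messages: [...] }
// body shape are placeholders; adjust to however we actually authenticate.
async function runMemoryFixture(fixture: Fixture): Promise<Response> {
  for (const msg of fixture.input.seed_thread_messages) {
    await fetch(
      `${process.env.ZEP_BASE_URL}/api/v2/threads/${fixture.input.thread_id}/messages`,
      {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.ZEP_API_KEY}`,
        },
        body: JSON.stringify({ messages: [msg] }),
      },
    );
  }
  return fetch(
    "https://xh2o-yths-38lt.n7c.xano.io/api:UgP1h6uR/anything-engine/dispatch",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(fixture.input),
    },
  );
}
```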
Replay strategy
A single Node script lives at `scripts/replay-fixtures.ts` in this repo. It:
- Reads `docs/fixtures/*.json`
- For each fixture: optionally seeds the Zep thread, calls `POST https://xh2o-yths-38lt.n7c.xano.io/api:UgP1h6uR/anything-engine/dispatch` (id 8399), and buffers the SSE stream (see the sketch after this list)
- Runs each assertion, collects pass/fail with a delta string for failures
- Emits two artifacts: `harness-results-<timestamp>.json` (full trace + diffs) and `harness-summary-<timestamp>.md` (one-page report: counts, p95 latency, class confusion matrix)
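Buffering the SSE stream is the only fiddly part. A minimal sketch assuming standard `data:` line framing with JSON payloads; adjust if dispatch frames events differently:

```ts
// Sketch: buffer an SSE response body into parsed events.
// ASSUMPTION: standard "data: {...}\n\n" framing with JSON payloads.
async function bufferSse(res: Response): Promise<unknown[]> {
  const events: unknown[] = [];
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buf = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    let sep: number;
    while ((sep = buf.indexOf("\n\n")) >= 0) {
      const frame = buf.slice(0, sep);
      buf = buf.slice(sep + 2);
      for (const line of frame.split("\n")) {
        if (line.startsWith("data:")) events.push(JSON.parse(line.slice(5).trim()));
      }
    }
  }
  return events;
}
```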
```bash
# usage
pnpm tsx scripts/replay-fixtures.ts --filter find_investors --concurrency 4
pnpm tsx scripts/replay-fixtures.ts --all --output ./out/
```
The runner is concurrency-bounded so we do not hammer OpenRouter and trigger 429s.
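A dependency-free worker pool is enough; the `--concurrency` flag above maps to `limit` in this sketch:

```ts
// Sketch: bounded worker pool -- at most `limit` fixtures (and their
// OpenRouter fan-out) in flight at any moment.
async function runPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const lane = async () => {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await worker(items[i]);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, lane));
  return results;
}
```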
Tone tests — banned-phrase regex
From the Apr 21 LSI dogfood review, baked-in banlist:
```ts
const BANNED = [
  /\bride shotgun\b/i,
  /\btee up\b/i,
  /\block (the|in)\b/i,
  /\bplaybook\b/i,
  /\bnine[- ]figure\b/i,
  /\bmulti[- ]hundred[- ]million\b/i,
  /\bbefore someone else\b/i,
  /\bworth a (\d+[- ]?min(ute)?|quick) call\b/i,
  /\blet me know if you'?re open\b/i,
];
```
The harness fails any fixture where a generated WHY paragraph matches any of these. We can extend the list as Mark catches new ones.
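Applied per fixture, the sweep is a few lines. A sketch; extracting `whyParagraphs` from the buffered trace is assumed to happen upstream:

```ts
// Sketch: fail the fixture if any WHY paragraph matches the banlist above.
function toneFailures(whyParagraphs: string[]): string[] {
  const failures: string[] = [];
  for (const para of whyParagraphs) {
    for (const re of BANNED) {
      if (re.test(para)) failures.push(`banned phrase ${re} in: "${para.slice(0, 80)}..."`);
    }
  }
  return failures; // empty = pass
}
```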
Memory tests — concrete
Three fixtures pin down the loop end-to-end:
- `memory_t0` — empty thread, query "find me investors". Expect `find_investors`.
- `memory_t1` — same thread after t0, query "who else should I talk to". Expect `find_warm_intros` (memory steered the classifier away from the literal restart).
- `memory_t2` — fresh thread, same query as t1. Expect `find_investors` or the catch-all (no context to steer).
Asymmetry between t1 and t2 with the same query string is the proof memory works. We have already seen this manually in the demo thread `demo-anything-engine`; the harness pins it.
CI hook (TBD with Mark)
Three options:
- Local only. Robert runs it before each Mark sync and screenshots the summary.
- GitHub Actions on push to main. Posts summary as a PR comment on every push.
- Xano cron. Nightly run, posts to a Slack channel.
Default proposal: GitHub Actions on push, plus a Xano cron at 8am for drift detection. Both write to the same `harness_results` Xano table for trend analysis.
Open questions for Apr 29
- Fixture storage — JSON in repo, Xano table, or both? My vote: both with a sync script.
- Run mode — Node script (this repo) or Xano background task? Node is faster to iterate, Xano is closer to production. My vote: Node first, port the runner to Xano once we trust it.
- CI gate — should a failing harness block deploys, or just warn? My vote: warn only until we have ~50 fixtures with known stable behavior; gate after.
- Fixture authoring — me alone, or do you want a `/capture` button in the sandbox UI that saves the current dispatch as a fixture?
- Regression definition — what counts as a "regression" when LLM outputs vary: exact-match, semantic similarity, or rubric-graded by another LLM?
- LSI dogfood corpus — should the four LSI queries (L1, L2, O1, O2) become fixtures? They are already a known-good benchmark.
- Cost cap — each full harness run hits OpenRouter ~50-100x. Cap at $X per run, or unlimited?
References
- Architecture: `architecture.md`
- Edge cases: `find-investors-edge-cases.md`
- Apr 21 tone rules: `MEMORY.md`, "LSI DOGFOOD" section
- Apr 28 sync (test harness scoping ask): `~/.claude/projects/.../memory/april-28-mark-sync.md`, section 14