Test Harness
Fixture-driven replay rig scoping with 7 open calls for Mark.

Fixture-Driven Replay Across Fourteen Classes
Mark, when I say "test harness" I do not mean unit tests on Xano functions. I mean a fixture-driven replay rig that proves the dispatch pipeline behaves correctly across the 14 classes, with reproducible inputs, deterministic-enough outputs, and a clear pass/fail story per fixture. This doc scopes what we build, why, and the open calls I need from you on Apr 29.
Why a harness, not a test suite
The Anything Engine is a chain of probabilistic steps:
- Classifier (LLM) → 14-class label
- Tool branch → Cypher / SQL → rows
- WHY synthesizer (LLM) → text
- Crayon stream → templates
- Zep ingest (async)
Each step has its own failure mode. Unit tests on the Xano functions catch syntax bugs but miss the interactions: classifier confidence drift, tool yield collapse when the graph rebalances, synthesizer regressions when we swap models, memory steering producing the wrong class on turn 2. We need a rig that replays real queries end-to-end and diffs the trace.
What the harness actually tests
| Dimension | What we assert | How |
|---|---|---|
| Route accuracy | Classifier returns the expected class for a fixture query | Exact-match on the `classification.class` field |
| Tool yield | Tool branch returns >= N rows, never 0 for non-empty fixtures | Count `events[]` entries of type `tpl` with `name=contact_card` |
| WHY tone | Generated paragraphs contain no banned phrases | Regex sweep against the Apr 21 banlist (see below) |
| Memory steering | Same query, two `thread_id`s with different histories, yields different classes | Diff `classification.class` across the two runs |
| Latency p95 | Dispatch round-trip under threshold | Histogram across N runs; surface p50 / p95 / p99 |
| Classification stability | Same fixture run 5x in a row stays on the same class | Mode + variance across runs |
| Schema integrity | Cards always have required fields (`name`, `why`, `source`) | JSON-schema validate each `tpl` event |
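To make the assertion targets concrete, here is a minimal sketch of the buffered trace shape and the schema-integrity check. The event layout is my assumption about what the buffered SSE stream reduces to, not a confirmed wire format; field names mirror the fixture shape below.

```ts
// Sketch of the buffered dispatch trace the assertions run against.
// ASSUMPTION: this event layout is illustrative, not the confirmed wire format.
interface TplEvent {
  type: "tpl";
  name: string;                  // e.g. "contact_card"
  data: Record<string, unknown>; // card payload
}

interface DispatchTrace {
  classification: { class: string; confidence: number };
  events: TplEvent[];
  latency_ms: number;
}

// Schema-integrity check: every contact_card carries the required fields.
function assertRequiredFields(trace: DispatchTrace, required: string[]): string[] {
  const failures: string[] = [];
  const cards = trace.events.filter((e) => e.name === "contact_card");
  for (const card of cards) {
    for (const field of required) {
      if (card.data[field] == null || card.data[field] === "") {
        failures.push(`contact_card missing "${field}"`);
      }
    }
  }
  return failures; // empty = pass
}
```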
What we are not testing yet:
- AlloyDB SQL — schema does not exist outside this doc
- Exact row identity — the FalkorDB graph mutates as enrichment runs; we assert shape, not contents
- LLM output token-for-token — we accept variance, we lint for tone
Fixtures
A fixture is a frozen (input, expected) pair. We want ~25-50 of them across the 14 classes, weighted toward `find_investors` (the laser-focus class) and the catch-all (#14).
Storage options on the table for Apr 29:
- Option A: `docs/fixtures/*.json` checked into this repo. Pros: version-controlled, diffable in PRs, runnable from CI. Cons: no UI, no production-time capture loop.
- Option B: Xano table `dispatch_fixtures` with columns mirroring the JSON shape. Pros: lets us capture live production runs as fixtures with a one-click button. Cons: drift between repo and DB.
- Option C (my preference): both. JSON in repo for the seeded set, Xano table for live capture, `scripts/sync-fixtures.ts` reconciles (see the sketch after this list).
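If we land on Option C, the reconciliation can stay dumb for a long time. A sketch of `scripts/sync-fixtures.ts`, assuming a Xano REST endpoint for the `dispatch_fixtures` table exists; the URL and upsert semantics below are hypothetical placeholders:

```ts
// scripts/sync-fixtures.ts -- sketch only.
// ASSUMPTION: XANO_FIXTURES_URL points at a hypothetical REST endpoint for the
// dispatch_fixtures table that upserts by fixture id; swap in the real one.
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";

async function syncFixtures(): Promise<void> {
  const dir = "docs/fixtures";
  const files = (await readdir(dir)).filter((f) => f.endsWith(".json"));
  for (const file of files) {
    const fixture = JSON.parse(await readFile(path.join(dir, file), "utf8"));
    // Repo JSON is the source of truth for the seeded set; push it up.
    await fetch(`${process.env.XANO_FIXTURES_URL}/${fixture.id}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(fixture),
    });
  }
}

syncFixtures();
```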
Fixture shape
```json
{
  "id": "find_investors_001",
  "class_expected": "find_investors",
  "min_results": 5,
  "input": {
    "query": "find me investors for a Series A medtech round, $5-10M check, US-based",
    "thread_id": "harness-find_investors_001",
    "user_id": "harness-bot",
    "seed_thread_messages": []
  },
  "asserts": {
    "classification.class": "find_investors",
    "classification.confidence_min": 0.7,
    "events.tpl.contact_card.count_min": 5,
    "events.tpl.contact_card.required_fields": ["name", "why", "source"],
    "tone.banned_phrases.must_be_absent": true,
    "latency_ms_p95_max": 12000
  }
}
```
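In the runner, that shape maps onto a small TypeScript interface. A sketch mirroring the JSON above, with the `asserts` keys kept as flat strings:

```ts
// Fixture as the runner loads it (sketch; mirrors the JSON shape above).
interface Fixture {
  id: string;
  class_expected: string;
  min_results?: number;
  input: {
    query: string;
    thread_id: string;
    user_id: string;
    seed_thread_messages: { role: "user" | "assistant"; content: string }[];
  };
  asserts: Record<string, unknown>; // flat keys, e.g. "classification.class"
}
```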
For memory tests, the fixture seeds the thread first via Zep's `POST /api/v2/threads/{id}/messages`, then runs dispatch:
```json
{
  "id": "memory_steering_001",
  "class_expected": "find_warm_intros",
  "input": {
    "query": "who else should I talk to",
    "thread_id": "harness-memory_steering_001",
    "user_id": "harness-bot",
    "seed_thread_messages": [
      {"role": "user", "content": "I just met with Caitlin Morse from BrainSpace about IP licensing"},
      {"role": "assistant", "content": "Logged. Caitlin runs IP partnerships at BrainSpace, focused on neurotech licensing deals."}
    ]
  },
  "asserts": {
    "classification.class": "find_warm_intros",
    "memory.context_used": true
  }
}
```
The same query without seed messages should classify as something else (likely `find_investors`, or a fall-through to the catch-all). That asymmetry is the test.
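Mechanically, a memory fixture seeds then dispatches. A sketch using the `Fixture` shape above; the Zep base URL, auth header, and request body framing are assumptions (only the endpoint path comes from this doc), and the dispatch endpoint is the one named under Replay strategy below:

```ts
// Sketch: seed the Zep thread, then run dispatch.
// ASSUMPTIONS: ZEP_BASE_URL / ZEP_API_KEY env vars and the { messages: [...] }
// body shape are placeholders; adjust to however we actually authenticate.
async function runMemoryFixture(fixture: Fixture): Promise<Response> {
  for (const msg of fixture.input.seed_thread_messages) {
    await fetch(
      `${process.env.ZEP_BASE_URL}/api/v2/threads/${fixture.input.thread_id}/messages`,
      {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${process.env.ZEP_API_KEY}`,
        },
        body: JSON.stringify({ messages: [msg] }),
      },
    );
  }
  return fetch(
    "https://xh2o-yths-38lt.n7c.xano.io/api:UgP1h6uR/anything-engine/dispatch",
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(fixture.input),
    },
  );
}
```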
Replay strategy
A single Node script lives at `scripts/replay-fixtures.ts` in this repo. It:
- Reads `docs/fixtures/*.json`
- For each fixture: optionally seeds the Zep thread, calls `POST https://xh2o-yths-38lt.n7c.xano.io/api:UgP1h6uR/anything-engine/dispatch` (id 8399), and buffers the SSE stream (see the sketch after this list)
- Runs each assertion, collects pass/fail with a delta string for failures
- Emits two artifacts: `harness-results-<timestamp>.json` (full trace + diffs) and `harness-summary-<timestamp>.md` (one-page report: counts, p95 latency, class confusion matrix)
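Buffering the SSE stream is the only fiddly part. A minimal sketch assuming standard `data:` line framing with JSON payloads; adjust if dispatch frames events differently:

```ts
// Sketch: buffer an SSE response body into parsed events.
// ASSUMPTION: standard "data: {...}\n\n" framing with JSON payloads.
async function bufferSse(res: Response): Promise<unknown[]> {
  const events: unknown[] = [];
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buf = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    let sep: number;
    while ((sep = buf.indexOf("\n\n")) >= 0) {
      const frame = buf.slice(0, sep);
      buf = buf.slice(sep + 2);
      for (const line of frame.split("\n")) {
        if (line.startsWith("data:")) events.push(JSON.parse(line.slice(5).trim()));
      }
    }
  }
  return events;
}
```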
```bash
# usage
pnpm tsx scripts/replay-fixtures.ts --filter find_investors --concurrency 4
pnpm tsx scripts/replay-fixtures.ts --all --output ./out/
```
The runner is concurrency-bounded so we do not hammer OpenRouter and trigger 429s.
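A dependency-free worker pool is enough; the `--concurrency` flag above maps to `limit` in this sketch:

```ts
// Sketch: bounded worker pool -- at most `limit` fixtures (and their
// OpenRouter fan-out) in flight at any moment.
async function runPool<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const lane = async () => {
    while (next < items.length) {
      const i = next++; // safe: JS is single-threaded between awaits
      results[i] = await worker(items[i]);
    }
  };
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, lane));
  return results;
}
```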
Tone tests — banned-phrase regex
From the Apr 21 LSI dogfood review, baked-in banlist:
```ts
const BANNED = [
  /\bride shotgun\b/i,
  /\btee up\b/i,
  /\block (the|in)\b/i,
  /\bplaybook\b/i,
  /\bnine[- ]figure\b/i,
  /\bmulti[- ]hundred[- ]million\b/i,
  /\bbefore someone else\b/i,
  /\bworth a (\d+[- ]?min(ute)?|quick) call\b/i,
  /\blet me know if you'?re open\b/i,
];
```
The harness fails any fixture where a generated WHY paragraph matches any of these. We can extend the list as Mark catches new ones.
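Applied per fixture, the sweep is a few lines. A sketch; extracting `whyParagraphs` from the buffered trace is assumed to happen upstream:

```ts
// Sketch: fail the fixture if any WHY paragraph matches the banlist above.
function toneFailures(whyParagraphs: string[]): string[] {
  const failures: string[] = [];
  for (const para of whyParagraphs) {
    for (const re of BANNED) {
      if (re.test(para)) failures.push(`banned phrase ${re} in: "${para.slice(0, 80)}..."`);
    }
  }
  return failures; // empty = pass
}
```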
Memory tests — concrete
Three fixtures pin down the loop end-to-end:
- `memory_t0` — empty thread, query "find me investors". Expect `find_investors`.
- `memory_t1` — same thread after t0, query "who else should I talk to". Expect `find_warm_intros` (memory steered the classifier away from the literal restart).
- `memory_t2` — fresh thread, same query as t1. Expect `find_investors` or the catch-all (no context to steer).
Asymmetry between t1 and t2 with the same query string is the proof memory works. We have already seen this manually in the demo thread `demo-anything-engine`; the harness pins it.
CI hook (TBD with Mark)
Three options:
- Local only. Robert runs it before each Mark sync and screenshots the summary.
- GitHub Actions on push to main. Posts summary as a PR comment on every push.
- Xano cron. Nightly run, posts to a Slack channel.
Default proposal: GitHub Actions on push, plus a Xano cron at 8am for drift detection. Both write to the same `harness_results` Xano table for trend analysis.
Open questions for Apr 29
- Fixture storage — JSON in repo, Xano table, or both? My vote: both with a sync script.
- Run mode — Node script (this repo) or Xano background task? Node is faster to iterate, Xano is closer to production. My vote: Node first, port the runner to Xano once we trust it.
- CI gate — should a failing harness block deploys, or just warn? My vote: warn only until we have ~50 fixtures with known stable behavior; gate after.
- Fixture authoring — me alone, or do you want a `/capture` button in the sandbox UI that saves the current dispatch as a fixture?
- Regression definition — what counts as a "regression" when LLM outputs vary: exact-match, semantic similarity, or rubric-graded by another LLM?
- LSI dogfood corpus — should the four LSI queries (L1, L2, O1, O2) become fixtures? They are already a known-good benchmark.
- Cost cap — each full harness run hits OpenRouter ~50-100x. Cap at $X per run, or unlimited?
References
- Architecture: `architecture.md`
- Edge cases: `find-investors-edge-cases.md`
- Apr 21 tone rules: `MEMORY.md`, "LSI DOGFOOD" section
- Apr 28 sync (test harness scoping ask): `~/.claude/projects/.../memory/april-28-mark-sync.md`, section 14