AlloyDB Schema
Robert's parallel schema draft and the 7 open questions for Mark.

Structured Tables, Vector-Threaded
Robert and Mark are drafting the schema in parallel. Bring this to Apr 29 sync, compare with Mark's draft, reconcile live.
Goal: support the find_investors hybrid query in
docs/architecture.mdstep 6 — single SQL call combining hard filters with 6-vector cosine search.
Design principles
- Denormalize for the read path. find_investors hits
investorsand onlyinvestors. No joins on the hot path. Joins live in pre-aggregation jobs. - 6 vector columns per investor, not one combined embedding. Mark's framework: each thesis dimension is its own concern. Re-embedding only one dimension when it changes is cheaper than re-embedding the whole row.
- Filters are typed, not text.
check_size_minisbigint, not'$2M-$10M'. The classifier extracts numbers, not strings. - Soft-deletable, never hard-deleted.
deleted_at timestamp NULL. Historical match audit needs the row. - Provenance columns on everything ML-derived.
thesis_summary_source∈ {deck, web, manual},thesis_summary_extracted_at,thesis_summary_model. So we can re-extract and audit.
Tables
users
Mirror of WorkOS users. Xano's existing user table is the master; this is a thin local mirror so AlloyDB can join runs.user_id without a round-trip.
CREATE TABLE users (
id bigserial PRIMARY KEY,
workos_user_id text UNIQUE NOT NULL,
email text,
display_name text,
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
companies
Investee companies. Used for portfolio enumeration and warm-path co-investor discovery.
CREATE TABLE companies (
id bigserial PRIMARY KEY,
uuid uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
name text NOT NULL,
domain text,
hq_country text,
hq_region text,
hq_city text,
founded_year int,
primary_sector text,
sectors text[],
status text, -- active | acquired | shutdown | ipo
description text,
description_embedding vector(1536),
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
deleted_at timestamptz
);
CREATE INDEX companies_sectors_gin ON companies USING gin (sectors);
CREATE INDEX companies_country ON companies (hq_country);
CREATE INDEX companies_description_scann ON companies
USING scann (description_embedding cosine)
WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED');
funding_rounds
The graph hop between investors and companies.
CREATE TABLE funding_rounds (
id bigserial PRIMARY KEY,
uuid uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
company_id bigint NOT NULL REFERENCES companies(id),
round_type text, -- pre-seed | seed | series_a | series_b | ...
announced_at date,
amount_usd bigint,
lead_investor_id bigint, -- nullable, references investors(id)
created_at timestamptz NOT NULL DEFAULT now(),
deleted_at timestamptz
);
CREATE INDEX funding_rounds_company ON funding_rounds (company_id);
CREATE INDEX funding_rounds_lead ON funding_rounds (lead_investor_id);
CREATE INDEX funding_rounds_announced ON funding_rounds (announced_at DESC);
investments (junction)
Many-to-many between investors and rounds. Captures co-investors.
CREATE TABLE investments (
id bigserial PRIMARY KEY,
funding_round_id bigint NOT NULL REFERENCES funding_rounds(id),
investor_id bigint NOT NULL REFERENCES investors(id),
is_lead boolean NOT NULL DEFAULT false,
amount_usd bigint,
created_at timestamptz NOT NULL DEFAULT now(),
UNIQUE (funding_round_id, investor_id)
);
CREATE INDEX investments_investor ON investments (investor_id);
CREATE INDEX investments_round ON investments (funding_round_id);
investors — the hot table
The find_investors target. 6 vector columns, hard filter columns, ranking signals.
CREATE TABLE investors (
id bigserial PRIMARY KEY,
uuid uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
-- Identity
name text NOT NULL, -- person OR firm name
kind text NOT NULL, -- 'person' | 'firm'
firm_id bigint, -- if kind='person', their firm
linkedin_url text,
twitter_handle text,
domain text, -- firm website
-- Hard filters
active boolean NOT NULL DEFAULT true,
check_size_min_usd bigint,
check_size_max_usd bigint,
preferred_stages text[], -- ['seed','series_a']
focus_sectors text[], -- ['medtech','healthtech']
geographies text[], -- ISO-3166: ['US','CA','MX']
is_lead_capable boolean,
-- Ranking signals
last_invested_at date,
total_investments int NOT NULL DEFAULT 0,
total_leads int NOT NULL DEFAULT 0,
authority_score real, -- 0..1, derived offline
-- 6 vectors. Each captures a dimension of thesis.
-- All embedded with text-embedding-3-small (1536 dims).
sector_embedding vector(1536), -- "what they invest in"
stage_embedding vector(1536), -- "where in the lifecycle"
check_embedding vector(1536), -- "size + structure"
geography_embedding vector(1536), -- "where geographically"
signal_embedding vector(1536), -- "what excites them" (vibe / mission)
founder_embedding vector(1536), -- "what founder profile they back"
-- Provenance for the embeddings
thesis_extracted_at timestamptz,
thesis_source text, -- 'web' | 'deck' | 'manual'
thesis_model text, -- 'opus-4' | 'sonnet-4-6' | ...
thesis_summary text, -- human-readable, shown in UI
thesis_summary_long text, -- raw scraped text it was extracted from
created_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now(),
deleted_at timestamptz
);
-- Hard-filter indexes
CREATE INDEX investors_active ON investors (active) WHERE deleted_at IS NULL;
CREATE INDEX investors_focus_sectors_gin ON investors USING gin (focus_sectors);
CREATE INDEX investors_geographies_gin ON investors USING gin (geographies);
CREATE INDEX investors_preferred_stages_gin ON investors USING gin (preferred_stages);
CREATE INDEX investors_check_min ON investors (check_size_min_usd);
CREATE INDEX investors_check_max ON investors (check_size_max_usd);
CREATE INDEX investors_last_invested ON investors (last_invested_at DESC NULLS LAST);
CREATE INDEX investors_authority ON investors (authority_score DESC NULLS LAST);
-- 6 ScaNN indexes — one per vector column
CREATE INDEX investors_sector_scann ON investors USING scann (sector_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_stage_scann ON investors USING scann (stage_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_check_scann ON investors USING scann (check_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_geo_scann ON investors USING scann (geography_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_signal_scann ON investors USING scann (signal_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_founder_scann ON investors USING scann (founder_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
runs
Every classifier dispatch and tool call. Audit + replay + eval.
CREATE TABLE runs (
id bigserial PRIMARY KEY,
uuid uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
user_id bigint NOT NULL REFERENCES users(id),
thread_id text, -- Zep thread id
-- Classifier output
user_query text NOT NULL,
class text, -- e.g., 'find_investors'
classifier_confidence real,
classifier_model text,
classifier_latency_ms int,
-- Tool inputs (flexible — JSON for evolving schemas)
tool_input jsonb, -- { thesis: {...}, filters: {...} }
-- Tool outputs
result_count int,
result_ids bigint[], -- referenced investor ids
total_latency_ms int,
cost_usd numeric(10,6),
-- Failure tracking
status text NOT NULL, -- 'success' | 'classifier_unsure' | 'no_results' | 'error'
error_message text,
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX runs_user ON runs (user_id, created_at DESC);
CREATE INDEX runs_class ON runs (class, created_at DESC);
CREATE INDEX runs_status ON runs (status, created_at DESC);
feedback
Per-result feedback so we can train ranking and prune bad recommendations.
CREATE TABLE feedback (
id bigserial PRIMARY KEY,
run_id bigint NOT NULL REFERENCES runs(id),
investor_id bigint NOT NULL REFERENCES investors(id),
signal text NOT NULL, -- 'thumbs_up' | 'thumbs_down' | 'meeting_booked' | 'check_received' | 'pass'
note text,
user_id bigint NOT NULL REFERENCES users(id),
created_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX feedback_run ON feedback (run_id);
CREATE INDEX feedback_investor ON feedback (investor_id);
The hot query (find_investors step 6)
Single SQL call, all filters + cosine + ranker, hits one ScaNN index (sector) plus optionally re-ranks with the other five in a CTE.
Single-vector variant (cheapest, MVP)
SELECT
id, name, kind, firm_id, linkedin_url, domain,
thesis_summary,
check_size_min_usd, check_size_max_usd,
preferred_stages, focus_sectors, geographies,
last_invested_at, authority_score,
sector_embedding <=> $1::vector AS sector_distance
FROM investors
WHERE deleted_at IS NULL
AND active = true
AND check_size_min_usd <= $2::bigint -- target check size
AND (check_size_max_usd IS NULL OR check_size_max_usd >= $2::bigint)
AND $3::text = ANY(focus_sectors) -- target sector
AND geographies && $4::text[] -- intersection with allowed geos
AND $5::text = ANY(preferred_stages) -- target stage
AND (last_invested_at IS NULL OR last_invested_at > now() - interval '24 months')
ORDER BY sector_embedding <=> $1::vector
LIMIT 50;
6-vector reranker variant (final UI ranking)
WITH candidates AS (
-- Hard filter + sector ScaNN to get ~50 candidates fast
SELECT id
FROM investors
WHERE deleted_at IS NULL AND active = true
AND check_size_min_usd <= $check_size
AND $sector = ANY(focus_sectors)
AND geographies && $geos
ORDER BY sector_embedding <=> $sector_query
LIMIT 50
)
SELECT
i.id, i.name, i.firm_id, i.thesis_summary,
i.last_invested_at, i.authority_score,
-- 6-dimensional cosine — uses the other 5 ScaNN indexes
(i.sector_embedding <=> $sector_query) AS d_sector,
(i.stage_embedding <=> $stage_query) AS d_stage,
(i.check_embedding <=> $check_query) AS d_check,
(i.geography_embedding <=> $geo_query) AS d_geo,
(i.signal_embedding <=> $signal_query) AS d_signal,
(i.founder_embedding <=> $founder_query) AS d_founder,
-- Weighted final score (weights tuned empirically)
(
0.30 * (i.sector_embedding <=> $sector_query)
+ 0.15 * (i.stage_embedding <=> $stage_query)
+ 0.15 * (i.check_embedding <=> $check_query)
+ 0.10 * (i.geography_embedding <=> $geo_query)
+ 0.20 * (i.signal_embedding <=> $signal_query)
+ 0.10 * (i.founder_embedding <=> $founder_query)
) AS final_score
FROM investors i
JOIN candidates c USING (id)
ORDER BY final_score
LIMIT 25;
Open questions for Mark
- Vector dimensions: I drafted 1536 (text-embedding-3-small). You said you have BigQuery thesis vectors already — what dim and what model? If different we need to re-embed or use a separate column.
is_lead_capable— derive offline fromtotal_leads > 0or store explicitly?geographiesas ISO-3166 strings — fine? Or do we need polygon support for "anywhere within 2hr flight from SF"?firmsas their own table vs.kind='firm'row ininvestors? I went flat for read simplicity. Your call.- Re-embedding cadence — when an investor publishes new content, do we re-embed all 6 vectors or just
signal_embedding? Provenance columns are there to support either. - Authority score — already computed somewhere, or new derived column?
- Co-investor walks —
investmentsjoined to itself. Do we want a materialized view ofco_investor_pairsfor instant warm-path drafts?
Migration path from FalkorDB / Xano
The current Robert Lab agent (8349) reads FalkorDB. Migration:
- Mark exposes AlloyDB connection from Xano (lambda or external request).
- Port endpoint 8349 logic into new Xano
Anything Enginegroup — switch the Cypher block to the SQL above. - Backfill: one-time job pulls FalkorDB Investor + Funding_Round + Company nodes, embeds 6 vectors, inserts into AlloyDB.
- Delta sync from BigQuery (Mark's "couple weeks" item from Apr 21) keeps
investorsfresh.
References
- AlloyDB ScaNN docs:
docs/alloydb-scann.md - Architecture:
docs/architecture.md - Thesis extraction prompt:
prompts/find_investors/thesis-extract.md - Apr 21 Mark sync (cohort-first vs vector-first): MEMORY.md "LSI DOGFOOD" section