DocsAlloyDB Schema
Data Plane · Schema

AlloyDB Schema

Robert's parallel schema draft and the 7 open questions for Mark.

AlloyDB Schema

Structured Tables, Vector-Threaded

Robert and Mark are drafting the schema in parallel. Bring this to Apr 29 sync, compare with Mark's draft, reconcile live.

Goal: support the find_investors hybrid query in docs/architecture.md step 6 — single SQL call combining hard filters with 6-vector cosine search.

Design principles

  1. Denormalize for the read path. find_investors hits investors and only investors. No joins on the hot path. Joins live in pre-aggregation jobs.
  2. 6 vector columns per investor, not one combined embedding. Mark's framework: each thesis dimension is its own concern. Re-embedding only one dimension when it changes is cheaper than re-embedding the whole row.
  3. Filters are typed, not text. check_size_min is bigint, not '$2M-$10M'. The classifier extracts numbers, not strings.
  4. Soft-deletable, never hard-deleted. deleted_at timestamp NULL. Historical match audit needs the row.
  5. Provenance columns on everything ML-derived. thesis_summary_source ∈ {deck, web, manual}, thesis_summary_extracted_at, thesis_summary_model. So we can re-extract and audit.

Tables

users

Mirror of WorkOS users. Xano's existing user table is the master; this is a thin local mirror so AlloyDB can join runs.user_id without a round-trip.

CREATE TABLE users (
  id              bigserial PRIMARY KEY,
  workos_user_id  text UNIQUE NOT NULL,
  email           text,
  display_name    text,
  created_at      timestamptz NOT NULL DEFAULT now(),
  updated_at      timestamptz NOT NULL DEFAULT now()
);

companies

Investee companies. Used for portfolio enumeration and warm-path co-investor discovery.

CREATE TABLE companies (
  id              bigserial PRIMARY KEY,
  uuid            uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
  name            text NOT NULL,
  domain          text,
  hq_country      text,
  hq_region       text,
  hq_city         text,
  founded_year    int,
  primary_sector  text,
  sectors         text[],
  status          text,             -- active | acquired | shutdown | ipo
  description     text,
  description_embedding  vector(1536),
  created_at      timestamptz NOT NULL DEFAULT now(),
  updated_at      timestamptz NOT NULL DEFAULT now(),
  deleted_at      timestamptz
);

CREATE INDEX companies_sectors_gin ON companies USING gin (sectors);
CREATE INDEX companies_country ON companies (hq_country);
CREATE INDEX companies_description_scann ON companies
  USING scann (description_embedding cosine)
  WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED');

funding_rounds

The graph hop between investors and companies.

CREATE TABLE funding_rounds (
  id            bigserial PRIMARY KEY,
  uuid          uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
  company_id    bigint NOT NULL REFERENCES companies(id),
  round_type    text,         -- pre-seed | seed | series_a | series_b | ...
  announced_at  date,
  amount_usd    bigint,
  lead_investor_id bigint,    -- nullable, references investors(id)
  created_at    timestamptz NOT NULL DEFAULT now(),
  deleted_at    timestamptz
);

CREATE INDEX funding_rounds_company ON funding_rounds (company_id);
CREATE INDEX funding_rounds_lead ON funding_rounds (lead_investor_id);
CREATE INDEX funding_rounds_announced ON funding_rounds (announced_at DESC);

investments (junction)

Many-to-many between investors and rounds. Captures co-investors.

CREATE TABLE investments (
  id               bigserial PRIMARY KEY,
  funding_round_id bigint NOT NULL REFERENCES funding_rounds(id),
  investor_id      bigint NOT NULL REFERENCES investors(id),
  is_lead          boolean NOT NULL DEFAULT false,
  amount_usd       bigint,
  created_at       timestamptz NOT NULL DEFAULT now(),
  UNIQUE (funding_round_id, investor_id)
);

CREATE INDEX investments_investor ON investments (investor_id);
CREATE INDEX investments_round ON investments (funding_round_id);

investors — the hot table

The find_investors target. 6 vector columns, hard filter columns, ranking signals.

CREATE TABLE investors (
  id                       bigserial PRIMARY KEY,
  uuid                     uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),

  -- Identity
  name                     text NOT NULL,           -- person OR firm name
  kind                     text NOT NULL,           -- 'person' | 'firm'
  firm_id                  bigint,                  -- if kind='person', their firm
  linkedin_url             text,
  twitter_handle           text,
  domain                   text,                    -- firm website

  -- Hard filters
  active                   boolean NOT NULL DEFAULT true,
  check_size_min_usd       bigint,
  check_size_max_usd       bigint,
  preferred_stages         text[],                  -- ['seed','series_a']
  focus_sectors            text[],                  -- ['medtech','healthtech']
  geographies              text[],                  -- ISO-3166: ['US','CA','MX']
  is_lead_capable          boolean,

  -- Ranking signals
  last_invested_at         date,
  total_investments        int NOT NULL DEFAULT 0,
  total_leads              int NOT NULL DEFAULT 0,
  authority_score          real,                    -- 0..1, derived offline

  -- 6 vectors. Each captures a dimension of thesis.
  -- All embedded with text-embedding-3-small (1536 dims).
  sector_embedding         vector(1536),            -- "what they invest in"
  stage_embedding          vector(1536),            -- "where in the lifecycle"
  check_embedding          vector(1536),            -- "size + structure"
  geography_embedding      vector(1536),            -- "where geographically"
  signal_embedding         vector(1536),            -- "what excites them" (vibe / mission)
  founder_embedding        vector(1536),            -- "what founder profile they back"

  -- Provenance for the embeddings
  thesis_extracted_at      timestamptz,
  thesis_source            text,                    -- 'web' | 'deck' | 'manual'
  thesis_model             text,                    -- 'opus-4' | 'sonnet-4-6' | ...
  thesis_summary           text,                    -- human-readable, shown in UI
  thesis_summary_long      text,                    -- raw scraped text it was extracted from

  created_at               timestamptz NOT NULL DEFAULT now(),
  updated_at               timestamptz NOT NULL DEFAULT now(),
  deleted_at               timestamptz
);

-- Hard-filter indexes
CREATE INDEX investors_active ON investors (active) WHERE deleted_at IS NULL;
CREATE INDEX investors_focus_sectors_gin ON investors USING gin (focus_sectors);
CREATE INDEX investors_geographies_gin ON investors USING gin (geographies);
CREATE INDEX investors_preferred_stages_gin ON investors USING gin (preferred_stages);
CREATE INDEX investors_check_min ON investors (check_size_min_usd);
CREATE INDEX investors_check_max ON investors (check_size_max_usd);
CREATE INDEX investors_last_invested ON investors (last_invested_at DESC NULLS LAST);
CREATE INDEX investors_authority ON investors (authority_score DESC NULLS LAST);

-- 6 ScaNN indexes — one per vector column
CREATE INDEX investors_sector_scann   ON investors USING scann (sector_embedding cosine)   WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_stage_scann    ON investors USING scann (stage_embedding cosine)    WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_check_scann    ON investors USING scann (check_embedding cosine)    WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_geo_scann      ON investors USING scann (geography_embedding cosine) WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_signal_scann   ON investors USING scann (signal_embedding cosine)   WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');
CREATE INDEX investors_founder_scann  ON investors USING scann (founder_embedding cosine)  WITH (MODE='AUTO', OPTIMIZATION='SEARCH_OPTIMIZED', auto_maintenance='ON');

runs

Every classifier dispatch and tool call. Audit + replay + eval.

CREATE TABLE runs (
  id                bigserial PRIMARY KEY,
  uuid              uuid UNIQUE NOT NULL DEFAULT gen_random_uuid(),
  user_id           bigint NOT NULL REFERENCES users(id),
  thread_id         text,                    -- Zep thread id

  -- Classifier output
  user_query        text NOT NULL,
  class             text,                    -- e.g., 'find_investors'
  classifier_confidence  real,
  classifier_model  text,
  classifier_latency_ms  int,

  -- Tool inputs (flexible — JSON for evolving schemas)
  tool_input        jsonb,                   -- { thesis: {...}, filters: {...} }

  -- Tool outputs
  result_count      int,
  result_ids        bigint[],                -- referenced investor ids
  total_latency_ms  int,
  cost_usd          numeric(10,6),

  -- Failure tracking
  status            text NOT NULL,           -- 'success' | 'classifier_unsure' | 'no_results' | 'error'
  error_message     text,

  created_at        timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX runs_user ON runs (user_id, created_at DESC);
CREATE INDEX runs_class ON runs (class, created_at DESC);
CREATE INDEX runs_status ON runs (status, created_at DESC);

feedback

Per-result feedback so we can train ranking and prune bad recommendations.

CREATE TABLE feedback (
  id           bigserial PRIMARY KEY,
  run_id       bigint NOT NULL REFERENCES runs(id),
  investor_id  bigint NOT NULL REFERENCES investors(id),
  signal       text NOT NULL,    -- 'thumbs_up' | 'thumbs_down' | 'meeting_booked' | 'check_received' | 'pass'
  note         text,
  user_id      bigint NOT NULL REFERENCES users(id),
  created_at   timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX feedback_run ON feedback (run_id);
CREATE INDEX feedback_investor ON feedback (investor_id);

The hot query (find_investors step 6)

Single SQL call, all filters + cosine + ranker, hits one ScaNN index (sector) plus optionally re-ranks with the other five in a CTE.

Single-vector variant (cheapest, MVP)

SELECT
  id, name, kind, firm_id, linkedin_url, domain,
  thesis_summary,
  check_size_min_usd, check_size_max_usd,
  preferred_stages, focus_sectors, geographies,
  last_invested_at, authority_score,
  sector_embedding <=> $1::vector AS sector_distance
FROM investors
WHERE deleted_at IS NULL
  AND active = true
  AND check_size_min_usd <= $2::bigint           -- target check size
  AND (check_size_max_usd IS NULL OR check_size_max_usd >= $2::bigint)
  AND $3::text = ANY(focus_sectors)              -- target sector
  AND geographies && $4::text[]                  -- intersection with allowed geos
  AND $5::text = ANY(preferred_stages)           -- target stage
  AND (last_invested_at IS NULL OR last_invested_at > now() - interval '24 months')
ORDER BY sector_embedding <=> $1::vector
LIMIT 50;

6-vector reranker variant (final UI ranking)

WITH candidates AS (
  -- Hard filter + sector ScaNN to get ~50 candidates fast
  SELECT id
  FROM investors
  WHERE deleted_at IS NULL AND active = true
    AND check_size_min_usd <= $check_size
    AND $sector = ANY(focus_sectors)
    AND geographies && $geos
  ORDER BY sector_embedding <=> $sector_query
  LIMIT 50
)
SELECT
  i.id, i.name, i.firm_id, i.thesis_summary,
  i.last_invested_at, i.authority_score,
  -- 6-dimensional cosine — uses the other 5 ScaNN indexes
  (i.sector_embedding    <=> $sector_query)    AS d_sector,
  (i.stage_embedding     <=> $stage_query)     AS d_stage,
  (i.check_embedding     <=> $check_query)     AS d_check,
  (i.geography_embedding <=> $geo_query)       AS d_geo,
  (i.signal_embedding    <=> $signal_query)    AS d_signal,
  (i.founder_embedding   <=> $founder_query)   AS d_founder,
  -- Weighted final score (weights tuned empirically)
  (
       0.30 * (i.sector_embedding    <=> $sector_query)
     + 0.15 * (i.stage_embedding     <=> $stage_query)
     + 0.15 * (i.check_embedding     <=> $check_query)
     + 0.10 * (i.geography_embedding <=> $geo_query)
     + 0.20 * (i.signal_embedding    <=> $signal_query)
     + 0.10 * (i.founder_embedding   <=> $founder_query)
  ) AS final_score
FROM investors i
JOIN candidates c USING (id)
ORDER BY final_score
LIMIT 25;

Open questions for Mark

  1. Vector dimensions: I drafted 1536 (text-embedding-3-small). You said you have BigQuery thesis vectors already — what dim and what model? If different we need to re-embed or use a separate column.
  2. is_lead_capable — derive offline from total_leads > 0 or store explicitly?
  3. geographies as ISO-3166 strings — fine? Or do we need polygon support for "anywhere within 2hr flight from SF"?
  4. firms as their own table vs. kind='firm' row in investors? I went flat for read simplicity. Your call.
  5. Re-embedding cadence — when an investor publishes new content, do we re-embed all 6 vectors or just signal_embedding? Provenance columns are there to support either.
  6. Authority score — already computed somewhere, or new derived column?
  7. Co-investor walksinvestments joined to itself. Do we want a materialized view of co_investor_pairs for instant warm-path drafts?

Migration path from FalkorDB / Xano

The current Robert Lab agent (8349) reads FalkorDB. Migration:

  1. Mark exposes AlloyDB connection from Xano (lambda or external request).
  2. Port endpoint 8349 logic into new Xano Anything Engine group — switch the Cypher block to the SQL above.
  3. Backfill: one-time job pulls FalkorDB Investor + Funding_Round + Company nodes, embeds 6 vectors, inserts into AlloyDB.
  4. Delta sync from BigQuery (Mark's "couple weeks" item from Apr 21) keeps investors fresh.

References

  • AlloyDB ScaNN docs: docs/alloydb-scann.md
  • Architecture: docs/architecture.md
  • Thesis extraction prompt: prompts/find_investors/thesis-extract.md
  • Apr 21 Mark sync (cohort-first vs vector-first): MEMORY.md "LSI DOGFOOD" section