Data Plane · ScaNN

AlloyDB ScaNN

Hybrid SQL + vector search in one call — ScaNN reference notes.

AlloyDB ScaNN

Hard Filters and Vector Search in One Call

Mark's "key lime pie" discovery (Apr 28). ScaNN combines hard SQL filters with semantic vector search in a single SQL call. No two-pass cohort-then-rank dance.

Apr 29 directive (Mark, 5:53 AM): going "all in" on ScaNN. ScaNN-vs-HNSW is settled — ScaNN wins for our matching/scoring milestone. This doc reflects the all-in stance; HNSW is not on the table. Mintlify doc has the full rationale.

Setup

CREATE EXTENSION IF NOT EXISTS alloydb_scann CASCADE;
-- Auto-installs the `vector` extension as well.

Index Creation

Auto-tuned (recommended default)

CREATE INDEX investor_thesis_scann ON investors
  USING scann (thesis_embedding cosine)
  WITH (MODE='AUTO',
        OPTIMIZATION='SEARCH_OPTIMIZED',
        auto_maintenance='ON');

Manual two-level tree (when you want control)

CREATE INDEX investor_thesis_scann ON investors
  USING scann (thesis_embedding cosine)
  WITH (mode='MANUAL',
        num_leaves=2000,        -- ~sqrt(N) for N rows
        quantizer='SQ8',        -- SQ8 / AH / FLAT
        auto_maintenance='ON');

Three-level tree (for multi-million row tables)

Add max_num_levels=2 to the WITH clause.

Distance functions: l2, dot_product, cosine. OpenAI embeddings are unit-normalized → cosine ≡ dot_product. Use cosine for clarity.

Distance Operators

Operator	Meaning
`<->`	L2 distance
`<=>`	Cosine distance
`<#>`	Inner product

Pure Vector Query

SELECT id, name, thesis_summary
FROM investors
ORDER BY thesis_embedding <=> $1::vector
LIMIT 50;

Hybrid Search (the killer pattern)

This is what Mark called the key lime pie. Hard filter + semantic vector ranking in one statement:

SELECT id, name, firm, thesis_summary,
       thesis_embedding <=> $query::vector AS distance
FROM investors
WHERE
  -- hard filters
  active = true
  AND check_size_min <= $target_check
  AND check_size_max >= $target_check
  AND $sector = ANY(focus_sectors)
  AND geography && $allowed_geos
  AND last_invested_at > NOW() - INTERVAL '18 months'
ORDER BY thesis_embedding <=> $query::vector
LIMIT 25;

Why this matters for find_investors: the 6 vector indexes Mark talked about (sector thesis, stage, check size, geography, signal text, founder fit) all sit on the same row. A single ScaNN-backed query ranks the post-filter cohort directly — no separate "pull cohort, then rank in app code" step.

Tuning Parameters

Parameter	Purpose	Notes
`num_leaves`	Tree partitions	~√N for N rows is a good starting point
`max_num_levels`	Tree depth	1 = 2-level, 2 = 3-level, 3 = 4-level (preview)
`quantizer`	Compression	SQ8 (default, fast), AH (smaller), FLAT (no compression)
`scann.num_leaves_to_search`	Query-time recall vs speed	Higher = better recall, slower

Preview flags:

SET scann.enable_preview_features = on;
SET scann.max_allowed_num_levels = 3;

Column Types

vector(1536) for OpenAI text-embedding-3-small
vector(3072) for OpenAI text-embedding-3-large
Cast real[] columns: CAST(col AS vector(1536))
Need ≥10k rows before auto-tuned ScaNN works

Why this beats FalkorDB for find_investors

Concern	FalkorDB (current)	AlloyDB ScaNN
Hybrid filter+vector	Two-pass: cohort cypher then in-app rank	Single SQL call
Multiple thesis dimensions	One vector per node	Multiple vector columns per row
Latency	8–10s end-to-end (LSI dogfood)	Target sub-second on indexed query
Operator surface	Cypher	Standard SQL — anyone can read it
Hosting	Self-managed	Google managed

What we still need from Mark

AlloyDB connection string for the sandbox
Schema for the investors / fundraising_thesis tables
Decision on which 6 vector columns make it onto the row (sector / stage / check / geo / signal / founder?)
BigQuery → AlloyDB sync cadence (Mark mentioned half-million people, hundreds of thousands of theses)

References

https://docs.cloud.google.com/alloydb/docs/ai/store-index-query-vectors
Apr 28 sync notes: ~/.claude/projects/.../memory/april-28-mark-sync.md