DocsAlloyDB ScaNN
Data Plane · ScaNN

AlloyDB ScaNN

Hybrid SQL + vector search in one call — ScaNN reference notes.

AlloyDB ScaNN

Hard Filters and Vector Search in One Call

Mark's "key lime pie" discovery (Apr 28). ScaNN combines hard SQL filters with semantic vector search in a single SQL call. No two-pass cohort-then-rank dance.

Apr 29 directive (Mark, 5:53 AM): going "all in" on ScaNN. ScaNN-vs-HNSW is settled — ScaNN wins for our matching/scoring milestone. This doc reflects the all-in stance; HNSW is not on the table. Mintlify doc has the full rationale.

Setup

CREATE EXTENSION IF NOT EXISTS alloydb_scann CASCADE;
-- Auto-installs the `vector` extension as well.

Index Creation

Auto-tuned (recommended default)

CREATE INDEX investor_thesis_scann ON investors
  USING scann (thesis_embedding cosine)
  WITH (MODE='AUTO',
        OPTIMIZATION='SEARCH_OPTIMIZED',
        auto_maintenance='ON');

Manual two-level tree (when you want control)

CREATE INDEX investor_thesis_scann ON investors
  USING scann (thesis_embedding cosine)
  WITH (mode='MANUAL',
        num_leaves=2000,        -- ~sqrt(N) for N rows
        quantizer='SQ8',        -- SQ8 / AH / FLAT
        auto_maintenance='ON');

Three-level tree (for multi-million row tables)

Add max_num_levels=2 to the WITH clause.

Distance functions: l2, dot_product, cosine. OpenAI embeddings are unit-normalized → cosinedot_product. Use cosine for clarity.

Distance Operators

OperatorMeaning
<->L2 distance
<=>Cosine distance
<#>Inner product

Pure Vector Query

SELECT id, name, thesis_summary
FROM investors
ORDER BY thesis_embedding <=> $1::vector
LIMIT 50;

Hybrid Search (the killer pattern)

This is what Mark called the key lime pie. Hard filter + semantic vector ranking in one statement:

SELECT id, name, firm, thesis_summary,
       thesis_embedding <=> $query::vector AS distance
FROM investors
WHERE
  -- hard filters
  active = true
  AND check_size_min <= $target_check
  AND check_size_max >= $target_check
  AND $sector = ANY(focus_sectors)
  AND geography && $allowed_geos
  AND last_invested_at > NOW() - INTERVAL '18 months'
ORDER BY thesis_embedding <=> $query::vector
LIMIT 25;

Why this matters for find_investors: the 6 vector indexes Mark talked about (sector thesis, stage, check size, geography, signal text, founder fit) all sit on the same row. A single ScaNN-backed query ranks the post-filter cohort directly — no separate "pull cohort, then rank in app code" step.

Tuning Parameters

ParameterPurposeNotes
num_leavesTree partitions~√N for N rows is a good starting point
max_num_levelsTree depth1 = 2-level, 2 = 3-level, 3 = 4-level (preview)
quantizerCompressionSQ8 (default, fast), AH (smaller), FLAT (no compression)
scann.num_leaves_to_searchQuery-time recall vs speedHigher = better recall, slower

Preview flags:

SET scann.enable_preview_features = on;
SET scann.max_allowed_num_levels = 3;

Column Types

  • vector(1536) for OpenAI text-embedding-3-small
  • vector(3072) for OpenAI text-embedding-3-large
  • Cast real[] columns: CAST(col AS vector(1536))
  • Need ≥10k rows before auto-tuned ScaNN works

Why this beats FalkorDB for find_investors

ConcernFalkorDB (current)AlloyDB ScaNN
Hybrid filter+vectorTwo-pass: cohort cypher then in-app rankSingle SQL call
Multiple thesis dimensionsOne vector per nodeMultiple vector columns per row
Latency8–10s end-to-end (LSI dogfood)Target sub-second on indexed query
Operator surfaceCypherStandard SQL — anyone can read it
HostingSelf-managedGoogle managed

What we still need from Mark

  • AlloyDB connection string for the sandbox
  • Schema for the investors / fundraising_thesis tables
  • Decision on which 6 vector columns make it onto the row (sector / stage / check / geo / signal / founder?)
  • BigQuery → AlloyDB sync cadence (Mark mentioned half-million people, hundreds of thousands of theses)

References