Lead Lookup System Design (case study)
By Eugene, Published on
System Description
A high-scale lead management system designed to store and retrieve contact information for 200M users. The system supports lookups by phone number, email address, and SSN to resolve associated user profiles. A typical request maps a single contact identifier to exactly one user.
Primary Use Cases
- Fast contact lookups (phone / email / SSN → user profile)
- Multiple contacts per user (avg: 1 phone, 1 email; ~0.001% have 2–3)
- Contact reassignment between users
- Full contact history tracking
Requirements
Functional Requirements
- Lookups: Find user by phone, email, or SSN
- Contact management:
- User can have multiple phones (avg 1, rarely 2–3)
- User can have multiple emails (avg 1, rarely 2–3)
- User has exactly one SSN (1:1 relationship)
- Uniqueness: Each phone and email belongs to only one user at a time
- History: Track all contact changes in a separate table (no impact on lookup performance)
- Operations: Add/remove contacts, reassign contacts, update lead data
Non-Functional Requirements
- Scale: 200M users
- Read load: ~200K lookups/day (~2–3 QPS)
- Write load: 10–30% of users updated once every 1–2 months
Solution 1: PostgreSQL
Migrations
CREATE TABLE leads (
user_id BIGSERIAL PRIMARY KEY,
lead_data JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE phone_index (
phone VARCHAR(20) PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES leads(user_id) ON DELETE CASCADE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT unique_phone UNIQUE (phone)
);
CREATE INDEX idx_phone_hash ON phone_index USING hash(phone);
CREATE INDEX idx_phone_user ON phone_index USING btree(user_id);
CREATE TABLE email_index (
email VARCHAR(255) PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES leads(user_id) ON DELETE CASCADE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT unique_email UNIQUE (email)
);
CREATE INDEX idx_email_hash ON email_index USING hash(email);
CREATE INDEX idx_email_user ON email_index USING btree(user_id);
CREATE FUNCTION hash_text(input TEXT) RETURNS BIGINT AS $$ BEGIN RETURN ('x' || substr(md5(input), 1, 16))::bit(64)::bigint; END; $$ LANGUAGE plpgsql IMMUTABLE;
CREATE TABLE ssn_index (
ssn_hash BIGINT PRIMARY KEY,
ssn_last4 CHAR(4) NOT NULL,
user_id BIGINT UNIQUE NOT NULL REFERENCES leads(user_id) ON DELETE CASCADE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_ssn_hash ON ssn_index USING hash(ssn_hash);
CREATE INDEX idx_ssn_last4 ON ssn_index USING btree(ssn_last4);
CREATE INDEX idx_ssn_user ON ssn_index USING btree(user_id);
CREATE TABLE contact_history (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES leads(user_id) ON DELETE CASCADE,
contact_type VARCHAR(20) NOT NULL CHECK (contact_type IN ('phone', 'email')),
contact_value VARCHAR(255) NOT NULL,
added_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
removed_at TIMESTAMPTZ
);
CREATE INDEX idx_contact_history_user ON contact_history(user_id, added_at DESC);
CREATE INDEX idx_contact_history_contact ON contact_history(contact_type, contact_value);
Query Examples (PostgreSQL)
SELECT l.*
FROM leads l
JOIN phone_index p ON l.user_id = p.user_id
WHERE p.phone = '12025551234';
SELECT l.*
FROM leads l
JOIN email_index e ON l.user_id = e.user_id
WHERE e.email = 'john.doe@example.com';
Solution 2: ClickHouse (ReplacingMergeTree)
Core Problem
ReplacingMergeTree is optimized for append-only analytical workloads, not transactional lookups with updates. Deduplication occurs during background merges, making lookup correctness timing-dependent. Queries may return multiple versions unless explicitly constrained.
Key Issues
- Eventual consistency: Multiple versions exist before merge
- Query overhead: FINAL or ORDER BY + LIMIT required (10–50× slower)
- No foreign keys: No ON DELETE CASCADE, orphaned rows possible
-- Reassignment example
INSERT INTO phone_index VALUES ('+1234567890', 123, now());
INSERT INTO phone_index VALUES ('+1234567890', 456, now());
SELECT *
FROM phone_index
WHERE phone = '+1234567890'; -- Returns both rows until merge
Why ClickHouse Still Has Advantages
- Excellent compression: 5–10× smaller than PostgreSQL
- Fast analytics: 10–100× faster aggregations
- Automatic deduplication: No manual version cleanup
- Append-only writes: 100K+ inserts/sec, no B-tree overhead
SELECT toYYYYMM(added_at) AS month, contact_type, count() AS additions, uniq(user_id) AS unique_users
FROM contact_history
GROUP BY month, contact_type
ORDER BY month DESC;
Solution 3: Hybrid (PostgreSQL + ClickHouse)
Architecture
- PostgreSQL (hot path): phone_index, email_index, ssn_index
- O(1) lookups
- ACID uniqueness guarantees
- ClickHouse (storage & analytics):
- leads table
- contact_history table
Query Flow
-- Step 1: lookup
SELECT user_id
FROM phone_index
WHERE phone = '+12025551234';
-- Step 2: fetch lead data
SELECT lead_data
FROM leads
WHERE user_id = 123
ORDER BY updated_at DESC
LIMIT 1;