Entity resolution · Databricks · AI adjudication

AI-assisted lead deduplication, orchestrated on the lakehouse.

Salesforce, HubSpot, and event leads land in Databricks as the same people, spelled a dozen different ways. Deterministic rules resolve the obvious duplicates cheaply; Grove brings an LLM to bear only on the ambiguous matches a rule engine can't settle — and writes nothing it can't defend. This page runs against live data.

Live counters (values not captured in this copy): raw leads, match recall, precision, resolved entities.

The problem

Three sources feed the same lakehouse table. A single buyer shows up as ravi.costa@meridiansystems.com in Salesforce, ravi.costa@icloud.com from an event scan, and again under Meridian Systems Corp vs meridian systems in HubSpot — sometimes with a nickname, a transposed letter, or no email at all. Exact-match SQL catches a fraction. The rest is judgment.
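To make the variation concrete, here is a minimal Python sketch of the kind of canonicalization that collapses some of these variants. The suffix list, the nickname map, and the function names are illustrative assumptions, not the pipeline's actual rules (the page says the real normalization runs in SQL).

```python
import re

# Hypothetical lookup tables, for illustration only.
COMPANY_SUFFIXES = {"corp", "corporation", "inc", "llc", "ltd", "co"}
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}


def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in COMPANY_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase; an empty string means 'no email'."""
    return email.strip().lower()


def normalize_first_name(name: str) -> str:
    """Map a known nickname to its formal form, else return it lowercased."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


same_company = (
    normalize_company("Meridian Systems Corp") == normalize_company("meridian systems")
)
same_email = (
    normalize_email("ravi.costa@meridiansystems.com")
    == normalize_email("ravi.costa@icloud.com")
)
```

In this sketch `same_company` is true and `same_email` is false: canonicalization reconciles the two company spellings, but it cannot tell that the work and personal addresses belong to one person. That remaining gap is what the later stages address.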

The pipeline

Six stages on Databricks. The deterministic layer does the cheap, high-confidence work; Grove is invoked only for the low-confidence band, then a graph pass resolves clusters. Grove also serves this page — every number below is read from Databricks through a Grove workflow.
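The routing idea in this paragraph can be sketched in a few lines of Python. The thresholds, the `ScoredPair` type, and the function names below are assumptions made for illustration; the page does not state the real cut-offs or how Grove is invoked.

```python
from dataclasses import dataclass

# Assumed thresholds; the real cut-offs are not given on this page.
AUTO_MATCH_AT = 0.90
AUTO_REJECT_BELOW = 0.30


@dataclass(frozen=True)
class ScoredPair:
    left_id: str
    right_id: str
    score: float  # rule-based similarity in [0, 1]


def band(pair: ScoredPair) -> str:
    """Assign a pair to one of the three bands by its rule score."""
    if pair.score >= AUTO_MATCH_AT:
        return "auto_match"
    if pair.score < AUTO_REJECT_BELOW:
        return "auto_reject"
    return "review"


def route(pairs):
    """Group pairs by band; only the 'review' bucket would go to the LLM."""
    buckets = {"auto_match": [], "auto_reject": [], "review": []}
    for pair in pairs:
        buckets[band(pair)].append(pair)
    return buckets
```

The point of the design is cost control: the model sees only the `review` bucket, so the number of LLM calls scales with the ambiguous pairs rather than with all candidates.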

1. Normalize. Email, phone, company and nickname canonicalization in SQL.
2. Block. Generate candidate pairs by shared key rather than 2.9M comparisons.
3. Score. Rule features bucket pairs: auto-match, auto-reject, or review.
4. Adjudicate (Grove). An LLM judges each review-band pair: same person or not, with a reason.
5. Resolve. Connected components over confirmed matches yield golden records.
6. Review. Humans confirm or flag merges; corrections feed the next run.
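Two of the stages above, Block and Resolve, are standard techniques that are easy to show in miniature. The following Python sketch uses made-up lead records, hypothetical key functions, and a union-find structure for the connected components; it is one plausible way to implement these steps, not the code behind this page.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(leads, key_fns):
    """Yield id pairs that share at least one non-empty blocking key."""
    seen = set()
    for key_fn in key_fns:
        blocks = defaultdict(list)
        for lead in leads:
            key = key_fn(lead)
            if key:  # leads with a missing key are not blocked on it
                blocks[key].append(lead["id"])
        for ids in blocks.values():
            for a, b in combinations(sorted(ids), 2):
                if (a, b) not in seen:
                    seen.add((a, b))
                    yield a, b


def connected_components(ids, matches):
    """Cluster ids with union-find; each cluster is one resolved entity."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for i in ids:
        clusters[find(i)].add(i)
    return list(clusters.values())


# Illustrative records only; ids and the second company are invented.
leads = [
    {"id": "sf-1", "email": "ravi.costa@meridiansystems.com", "company": "meridian systems"},
    {"id": "ev-1", "email": "ravi.costa@icloud.com", "company": "meridian systems"},
    {"id": "hs-1", "email": "", "company": "meridian systems"},
    {"id": "hs-2", "email": "ana@example.com", "company": "northwind"},
]
pairs = list(candidate_pairs(leads, [lambda l: l["email"], lambda l: l["company"]]))
```

With four leads there are six possible pairs, but blocking on email and company yields only the three pairs inside the shared company block. Once matches are confirmed, `connected_components` merges them transitively, so a match between `sf-1` and `ev-1` plus one between `ev-1` and `hs-1` puts all three in a single cluster.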

Try it: ingest an event batch

Six leads arrive from a conference scan. Run a dry-run ingest and watch Grove orchestrate the whole flow as one workflow: block the new leads against the existing lakehouse, auto-match the certain ones, and adjudicate the ambiguous ones with an LLM — live. It writes nothing, so run it as often as you like.
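The dry-run behaviour described here follows a common pattern: compute the full plan, then gate the write behind a flag. A minimal Python sketch, assuming pairs arrive already banded and using a stub in place of the LLM call; `IngestPlan`, `ingest`, and `stub_adjudicate` are hypothetical names, not Grove's API.

```python
from dataclasses import dataclass, field


@dataclass
class IngestPlan:
    auto_matches: list = field(default_factory=list)
    adjudicated: list = field(default_factory=list)  # (pair, verdict) tuples
    written: bool = False


def ingest(banded_pairs, adjudicate, write, dry_run=True):
    """Build the merge plan; call `write` only when dry_run is False."""
    plan = IngestPlan()
    for pair, band in banded_pairs:
        if band == "auto_match":
            plan.auto_matches.append(pair)
        elif band == "review":
            plan.adjudicated.append((pair, adjudicate(pair)))
        # "auto_reject" pairs are dropped without further work.
    if not dry_run:
        write(plan)
        plan.written = True
    return plan


def stub_adjudicate(pair):
    """Stand-in for the LLM step; no model is called in this sketch."""
    return {"same_person": True, "reason": "stub verdict"}


writes = []  # stands in for the lakehouse table
plan = ingest(
    [
        (("new-1", "sf-1"), "auto_match"),
        (("new-2", "hs-7"), "review"),
        (("new-3", "sf-9"), "auto_reject"),
    ],
    adjudicate=stub_adjudicate,
    write=writes.append,
)
```

Because `dry_run` defaults to true, the call above leaves `writes` empty while still returning the complete plan, which is what makes repeated runs safe.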

Incoming batch: 6 leads. Stages reported as the run progresses: Block vs lakehouse, Grove adjudicates, Resolve (values not captured in this copy).

Measured results

Scored against held-out ground truth. Deterministic rules alone are precise but miss most duplicates. Adding Grove on the review band roughly doubles recall while holding precision.

	Precision	Recall	F1
Rules only	—	—	—
Rules + Grove	—	—	—
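For reference, the three columns are the usual pairwise metrics: precision is the share of predicted duplicate pairs that are true duplicates, recall is the share of true duplicate pairs recovered, and F1 is their harmonic mean. A small Python sketch, assuming predictions and ground truth are both given as sets of unordered id pairs (the function name is hypothetical, and this is not the page's evaluation code):

```python
def pairwise_metrics(predicted, truth):
    """Return (precision, recall, f1) over unordered id pairs."""
    predicted = {frozenset(p) for p in predicted}  # (a, b) equals (b, a)
    truth = {frozenset(p) for p in truth}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Evaluating on pairs rather than clusters keeps the comparison simple: the rules-only row and the rules-plus-Grove row can be scored with the same function against the same held-out set.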

Chart: recall (share of true duplicate pairs recovered), Rules vs. Rules + Grove (values not captured in this copy).

Human-in-the-loop review

The queue Grove hands a human: each pair it adjudicated, with its confidence and reasoning. Confirm a merge or flag a mistake — flags are written back to the lakehouse and become hard rules on the next run.
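One way reviewer decisions could become hard rules is as pair-level overrides that take precedence over the rule score on the next run. The Python sketch below assumes flagged pairs act as "cannot link" constraints and confirmed pairs as "must link" constraints; that mapping and the function name are assumptions, since the page does not describe the mechanism.

```python
def band_with_overrides(pair, rule_band, must_link, cannot_link):
    """Let reviewer decisions override the rule-based band for a pair."""
    key = frozenset(pair)  # order-independent pair identity
    if key in cannot_link:
        return "auto_reject"  # a flagged merge is never proposed again
    if key in must_link:
        return "auto_match"  # a confirmed merge skips adjudication
    return rule_band


# Hypothetical decisions read back from the lakehouse.
must_link = {frozenset(("sf-1", "ev-1"))}
cannot_link = {frozenset(("hs-2", "ev-4"))}
```

Checking `cannot_link` first is a deliberate choice in this sketch: if a pair were ever both confirmed and flagged, the conservative outcome is to keep the records apart.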

Bring agentic steps to your own lakehouse.

Grove runs the non-deterministic work dbt can't — classification, extraction, entity resolution — beside your existing SQL, with RBAC, per-tenant scoping, and append-only audit. This case study is one workflow; the platform is the same one you'd deploy.
