Entity resolution · Databricks · AI adjudication

AI-assisted lead deduplication, orchestrated on the lakehouse.

Salesforce, HubSpot, and event leads land in Databricks as the same people, spelled a dozen different ways. Deterministic rules resolve the obvious duplicates cheaply; Grove brings an LLM to bear only on the ambiguous matches a rule engine can't settle — and writes nothing it can't defend. This page runs against live data.

Live metrics: raw leads · match recall · precision · resolved entities
Sources: Salesforce, HubSpot · Lakehouse: Databricks · Orchestration: Grove

The problem

Three sources feed the same lakehouse table. A single buyer shows up as ravi.costa@meridiansystems.com in Salesforce, as ravi.costa@icloud.com from an event scan, and again in HubSpot under a company name that varies between Meridian Systems Corp and meridian systems. Some records carry a nickname, a transposed letter, or no email at all. Exact-match SQL catches only a fraction of these duplicates. The rest is judgment.
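To make the gap concrete, here is a minimal sketch of the kind of canonicalization that closes part of it. The page states that normalization runs in SQL; Python is used here only for illustration, and the function names and the suffix list are assumptions, not the pipeline's actual rules.

```python
import re

# Hypothetical suffix list for illustration; a real pipeline would keep a longer one.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "llc", "ltd", "co"}


def normalize_email(email):
    """Trim and lowercase an email; return None when it is missing."""
    if not email:
        return None
    return email.strip().lower()


def normalize_company(name):
    """Lowercase, replace punctuation with spaces, and drop trailing legal suffixes."""
    if not name:
        return None
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens) or None


# The two company spellings from the example collapse to one key...
same_company = normalize_company("Meridian Systems Corp") == normalize_company("meridian systems")

# ...but the work and personal emails still differ after normalization.
same_email = normalize_email("ravi.costa@meridiansystems.com") == normalize_email("ravi.costa@icloud.com")
```

Canonicalization makes the company names agree, yet the two emails remain distinct. That remaining disagreement is the part that needs judgment rather than string cleanup.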

The pipeline

Six stages on Databricks. The deterministic layer does the cheap, high-confidence work; Grove is invoked only for the low-confidence band, then a graph pass resolves clusters. Grove also serves this page — every number below is read from Databricks through a Grove workflow.
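The split between the deterministic layer and the LLM can be sketched as a score with two cut-offs. Everything specific here is an assumption for illustration: the feature weights, the thresholds, and the field names are hypothetical, since the page does not publish the real rule set.

```python
# Hypothetical cut-offs; the real thresholds are not stated on this page.
AUTO_MATCH_AT = 0.90
AUTO_REJECT_BELOW = 0.30


def rule_score(a, b):
    """Toy rule score over already-normalized lead records (dicts)."""
    if a.get("email") and a.get("email") == b.get("email"):
        return 1.0  # identical email is treated as conclusive in this sketch
    score = 0.0
    if a.get("last_name") and a.get("last_name") == b.get("last_name"):
        score += 0.35
    if a.get("company") and a.get("company") == b.get("company"):
        score += 0.35
    return score


def route(score):
    """Bucket a pair: only the middle band would be sent on for adjudication."""
    if score >= AUTO_MATCH_AT:
        return "auto-match"
    if score < AUTO_REJECT_BELOW:
        return "auto-reject"
    return "review"


salesforce = {"email": "ravi.costa@meridiansystems.com", "last_name": "costa", "company": "meridian systems"}
event_scan = {"email": "ravi.costa@icloud.com", "last_name": "costa", "company": "meridian systems"}

band = route(rule_score(salesforce, event_scan))  # lands in the review band
```

The design point is cost: pairs the rules can settle never reach the model, so LLM calls scale with the size of the review band rather than with the number of candidate pairs.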

  1. Normalize: Email, phone, company, and nickname canonicalization in SQL.
  2. Block: Generate candidate pairs by shared key, not 2.9M comparisons.
  3. Score: Rule features bucket pairs: auto-match, auto-reject, or review.
  4. Adjudicate (Grove): An LLM judges each review-band pair: same person, with a reason.
  5. Resolve: Connected components over confirmed matches yield golden records.
  6. Review: Humans confirm or flag merges; corrections feed the next run.
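Two of the stages above, Block and Resolve, are standard techniques that can be sketched briefly. This is an illustrative version, not the pipeline's code: the blocking keys, the record shape, and the function names are assumptions, and union-find is one common way to compute connected components.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(leads, key_fns):
    """Return pairs of lead ids that share at least one blocking key."""
    blocks = defaultdict(set)
    for lead in leads:
        for name, key_fn in key_fns.items():
            key = key_fn(lead)
            if key:  # skip missing values so empty keys never form a block
                blocks[(name, key)].add(lead["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def resolve(lead_ids, matches):
    """Group lead ids into clusters: connected components via union-find."""
    parent = {i: i for i in lead_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = defaultdict(list)
    for i in lead_ids:
        clusters[find(i)].append(i)
    return sorted(sorted(c) for c in clusters.values())


def company_key(lead):
    return lead.get("company")


def email_local_key(lead):
    email = lead.get("email")
    return email.split("@", 1)[0] if email else None
```

Blocking compares only records that share a key, so the number of pairs grows with block sizes instead of with the square of the table. Resolving by connected components means that if A matches B and B matches C, all three land in one cluster even when A and C were never compared directly.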

Try it: ingest an event batch

Six leads arrive from a conference scan. Run a dry-run ingest and watch Grove orchestrate the whole flow as one workflow: block the new leads against the existing lakehouse, auto-match the certain ones, and adjudicate the ambiguous ones with an LLM — live. It writes nothing, so run it as often as you like.

Incoming batch (6 leads) · Block vs lakehouse · Grove adjudicates · Resolve

Measured results

Scored against held-out ground truth. Deterministic rules alone are precise but miss most duplicates. Adding Grove on the review band roughly doubles recall while holding precision.

Table: Precision, Recall, and F1 for Rules only and Rules + Grove (values are read live from Databricks).

Chart: Recall, the share of true duplicate pairs recovered, for Rules and + Grove.
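For readers who want the definitions behind these metrics, here is a minimal sketch of pairwise precision, recall, and F1 against a set of ground-truth duplicate pairs. The function name is an assumption, and the ids in the usage example are toy values, not results from this case study.

```python
def pair_metrics(predicted, truth):
    """Pairwise precision, recall, and F1; pairs are order-insensitive."""
    predicted = {tuple(sorted(p)) for p in predicted}
    truth = {tuple(sorted(p)) for p in truth}
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data only: four true duplicate pairs, of which the predictions find two.
toy_truth = {(1, 2), (2, 3), (4, 5), (6, 7)}
toy_predicted = {(2, 1), (4, 5)}
precision, recall, f1 = pair_metrics(toy_predicted, toy_truth)
```

In the toy case every predicted pair is correct, so precision is 1.0, while only half of the true pairs are found, so recall is 0.5. That is the shape of the trade-off described above: precise rules that miss duplicates.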

Human-in-the-loop review

The queue Grove hands a human: each pair it adjudicated, with its confidence and reasoning. Confirm a merge or flag a mistake — flags are written back to the lakehouse and become hard rules on the next run.
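One way to picture human verdicts becoming hard rules is as an override lookup that is consulted before any rule band or LLM verdict. This is a sketch under assumptions: the row fields (left_id, right_id, verdict) and the verdict values are hypothetical, and the page does not describe how Grove stores or applies them.

```python
def pair_key(a, b):
    """Order-insensitive key for a pair of lead ids."""
    return (a, b) if a <= b else (b, a)


def load_overrides(review_rows):
    """Build a lookup of human verdicts; a later row for the same pair wins."""
    overrides = {}
    for row in review_rows:
        overrides[pair_key(row["left_id"], row["right_id"])] = row["verdict"]
    return overrides


def band_with_overrides(a, b, rule_band, overrides):
    """A human verdict, when present, takes precedence over the computed band."""
    verdict = overrides.get(pair_key(a, b))
    if verdict == "same":
        return "auto-match"
    if verdict == "different":
        return "auto-reject"
    return rule_band


# Hypothetical review rows written back from the queue.
rows = [
    {"left_id": 7, "right_id": 3, "verdict": "different"},
    {"left_id": 1, "right_id": 2, "verdict": "same"},
]
overrides = load_overrides(rows)
```

Checking overrides first means a pair a reviewer has already settled is not sent back for adjudication on the next run.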


Bring agentic steps to your own lakehouse.

Grove runs the non-deterministic work dbt can't — classification, extraction, entity resolution — beside your existing SQL, with RBAC, per-tenant scoping, and append-only audit. This case study is one workflow; the platform is the same one you'd deploy.