Person Record Clustering with Fuzzy Matching

Person Record Clustering with Fuzzy Matching

Entity resolution system for clustering duplicate person records using fuzzy matching and graph algorithms.

Overview

An entity resolution system that identifies and clusters duplicate person records across large datasets. Handles real-world data variations including typos, date errors, and location changes.

Technical Stack

  • Fuzzy Matching: Metaphone encoding, Jaro-Winkler distance, Token Sort Ratio
  • Graph Algorithms: Connected components for transitive clustering
  • Data Processing: Pandas, NumPy
  • Python: Core implementation

Pipeline Steps

1. Preprocessing

  • Normalize names and standardize locations
  • Parse dates with multiple format support

2. Blocking Strategy

  • Group candidates by phonetic encoding + birth year (±1)
  • Reduces comparisons from O(n²) to O(n)

3. Fuzzy Matching

  • Name similarity (50% weight): Token Sort Ratio + Jaro-Winkler
  • Date similarity (35% weight): Graduated scoring
  • Location similarity (15% weight): Fuzzy string matching
  • Threshold: 0.75 for match classification

4. Graph Clustering

  • Build similarity graph from matched pairs
  • Find connected components for transitive closure

Performance

On 50,000 records:

  • Total runtime: ~70 seconds
  • Blocking reduces comparisons by 99%+ (1.25B → 250K pairs)

Project Details

Technologies
Python Machine Learning Data Engineering Entity Resolution Graph Algorithms