Show HN: GoldenMatch – 100M-row dedupe on Ray in 213s, no Spark, Arrow-native

Category: infrastructure

Tags: entity-resolution, data-quality, ray, arrow-native, deduplication

Score: 7.8/10 (Innovation: 7, Technical: 9, Documentation: 7, Utility: 8)

GoldenMatch is an entity resolution and data quality toolkit that scales from single CSV files to 100M+ rows on Ray clusters without Spark. Its headline benchmark is a 100M-row deduplication completed in 213 seconds using Arrow-native processing. Its polyglot design (Python, TypeScript, Rust extensions for PostgreSQL/DuckDB) and AI-native interfaces (MCP, A2A, REST) make it a compelling alternative to traditional Spark-based deduplication pipelines.

Target audience: data engineers, backend developers, DevOps engineers

Repository: https://github.com/benseverndev-oss/goldenmatch · Python · MIT · 74 stars
