Show HN: Let E-Commerce Products Build Their Own Taxonomy
Category: library
Tags: semantic-deduplication, data-cleaning, machine-learning
Score: 7.0/10 (Innovation: 6, Technical: 7, Documentation: 8, Utility: 7)
SemHash is a lightweight multimodal library for semantic deduplication, outlier filtering, and representative sample selection. It combines fast embeddings (Model2Vec) with efficient similarity search (Vicinity) to provide explainable, scalable deduplication for text, images, and other modalities. Its inspection tools and cross-dataset operations make it particularly useful for cleaning ML training data, for example by removing test-set records that also appear in the training set.
Target audience: data scientists, ML engineers, data engineers
Repository: https://mirakl.tech/let-products-build-their-own-taxonomy-4d1bdee450a7 · Python · MIT · 937 stars
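To make the embed-then-search idea in the description concrete, here is a minimal sketch of threshold-based semantic deduplication. It does not use SemHash's API: `embed`, `cosine`, and `deduplicate` are hypothetical helpers, the bag-of-words vector is a toy stand-in for a real embedding model such as Model2Vec, and the brute-force comparison stands in for an indexed similarity search such as Vicinity. The 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a lowercase bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(records, threshold=0.8):
    """Greedy deduplication: keep a record unless it is close to one already kept.

    Returns (kept, duplicates), where each duplicate entry is a tuple of
    (dropped record, kept record it matched, similarity) so that every
    removal can be inspected afterwards.
    """
    kept = []        # list of (record, vector) pairs
    duplicates = []  # list of (record, matched_record, similarity)
    for record in records:
        vector = embed(record)
        best_record, best_sim = None, 0.0
        # Brute-force scan; a real system would query a nearest-neighbor index.
        for kept_record, kept_vector in kept:
            sim = cosine(vector, kept_vector)
            if sim > best_sim:
                best_record, best_sim = kept_record, sim
        if best_record is not None and best_sim >= threshold:
            duplicates.append((record, best_record, best_sim))
        else:
            kept.append((record, vector))
    return [record for record, _ in kept], duplicates


if __name__ == "__main__":
    products = [
        "red cotton t-shirt size M",
        "Red cotton t-shirt size m",
        "stainless steel water bottle",
    ]
    kept, duplicates = deduplicate(products)
    print(kept)
    for dropped, matched, sim in duplicates:
        print(f"dropped {dropped!r} (matches {matched!r}, similarity {sim:.2f})")
```

Recording the matched record and its similarity alongside each dropped item is what makes this kind of deduplication inspectable: the threshold can be tuned by reviewing borderline pairs rather than trusting a single opaque cutoff.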