Show HN: The user agents crawling HN today
Category: library
Tags: web-scraping, text-extraction, nlp
Score: 7.8/10 (Innovation: 6, Technical: 8, Documentation: 9, Utility: 8)
Trafilatura is a Python library and CLI tool for web crawling, scraping, and extracting clean text and metadata from HTML pages. It combines URL discovery via sitemaps and feeds with extraction algorithms that are reported to outperform alternatives in benchmarks, and it is widely adopted by major research institutions and companies. Its mature design and thorough documentation make it a standout in the web scraping ecosystem.
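To make "extracting clean text and metadata from HTML pages" concrete, here is a minimal standard-library sketch of the general idea. It is not Trafilatura's algorithm or API: the names `TextExtractor`, `extract_text`, and `SKIP_TAGS` are hypothetical, and the rule of dropping a fixed set of tags is a simplifying assumption.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption for illustration,
# not Trafilatura's actual rule set).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects the page title and visible text outside boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any SKIP_TAGS element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return a dict with the page title (metadata) and the cleaned body text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "text": "\n".join(parser.chunks)}


sample = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><nav>Home</nav><p>Hello   world.</p>"
    "<script>var x = 1;</script><footer>Contact</footer></body></html>"
)
result = extract_text(sample)
print(result["title"])
print(result["text"])
```

A production extractor has to handle malformed markup, comment sections, and content scoring, which is the work a dedicated library takes on; this sketch only shows the shape of the task.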
Target audience: data engineers, NLP researchers, backend developers
Repository: https://ai.realhackers.org/user_agents.txt · Python · Apache-2.0 · 5978 stars