Show HN: Rust-Backed Fast Dataloader
Category: library
Tags: dataset-loader, machine-learning, rust, parquet, cloud-native, pytorch
Score: 8.0/10 (Innovation: 8, Technical: 9, Documentation: 8, Utility: 7)
Ferroload is a Rust-backed, cloud-native multimodal dataset format and runtime for ML training. It combines sharded tar data with a DuckDB-queryable Parquet index, so samples can be streamed in parallel from local or object storage. The design integrates columnar indexing, deterministic shuffling, and in-Rust decoding, and is reported to significantly outperform existing loaders such as WebDataset and Hugging Face Datasets. The project is interesting for its potential to become a new standard for efficient, scalable ML data pipelines.
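The combination of tar shards, a sample index, and deterministic shuffling can be illustrated with a minimal sketch. This is not Ferroload's API or implementation: every function name here (`build_shard`, `index_shard`, `epoch_order`, `read_sample`) is hypothetical, the index is a plain Python list of dicts standing in for the Parquet index, and the shuffle uses Python's `random` module rather than whatever scheme the project actually uses. It only shows the general idea that an index of byte offsets allows random access into shards, and that seeding the shuffle with the epoch makes the order reproducible.

```python
import io
import random
import tarfile


def build_shard(samples):
    """Write (name, payload) pairs into an uncompressed in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def index_shard(shard_id, shard_bytes):
    """Return one index row per tar member: shard id, name, data offset, size.

    A real system might store these rows in a columnar file; a list of
    dicts is used here purely as a stand-in.
    """
    rows = []
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            rows.append({
                "shard": shard_id,
                "name": member.name,
                "offset": member.offset_data,  # where the payload bytes start
                "size": member.size,
            })
    return rows


def epoch_order(index, seed, epoch):
    """Return the index rows in an order that depends only on (seed, epoch)."""
    order = list(index)
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    return order


def read_sample(shards, row):
    """Slice one payload out of its shard using the indexed offset and size."""
    data = shards[row["shard"]]
    return data[row["offset"]: row["offset"] + row["size"]]


# Two small shards with text payloads (illustrative data only).
payloads = {"a.txt": b"alpha", "b.txt": b"bravo", "c.txt": b"charlie", "d.txt": b"delta"}
shards = {
    0: build_shard([("a.txt", payloads["a.txt"]), ("b.txt", payloads["b.txt"])]),
    1: build_shard([("c.txt", payloads["c.txt"]), ("d.txt", payloads["d.txt"])]),
}
index = [row for sid, data in shards.items() for row in index_shard(sid, data)]

first = [row["name"] for row in epoch_order(index, seed=7, epoch=0)]
again = [row["name"] for row in epoch_order(index, seed=7, epoch=0)]
```

Because each row carries an offset and a size, a reader can fetch a single sample without scanning the whole shard, which is the property that makes range reads against object storage practical. Calling `epoch_order` twice with the same seed and epoch yields the same order, so a training run can be resumed or reproduced.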
Target audience: data engineers, ML engineers, backend developers
Repository: https://github.com/midhunharikumar/ferroload · Rust · 1 star