Show HN: FlashAttention-2 in CuTe, from Scratch

Category: library

Tags: gpu-kernels, transformer, flashattention, cuda, triton, deep-learning

Score: 7.8/10 (Innovation: 6, Technical: 9, Documentation: 8, Utility: 8)

This project provides high-performance Triton and CUDA kernels for transformer operations, including FlashAttention-2, RMSNorm, SwiGLU, and quantized matrix multiplications, with detailed benchmarks showing near-native performance. Its standout contribution is a production-style CuTe-based rewrite of FlashAttention-2 that achieves parity with Tri Dao's implementation, accompanied by thorough educational blog posts and explanatory demos.

Target audience: backend devs, data engineers, machine learning engineers

Repository: https://blog.echen.io/p/flashattention-2-in-cute-from-scratch/ · Python
