flash-attn Prebuilt Wheels
Install Flash Attention in Seconds - No Compilation Needed
Download prebuilt flash-attn wheels for Python 3.10-3.13, PyTorch 2.x, and CUDA 11.8-12.8. Skip 30+ minute compilation times. Works on Linux and Windows. Latest version: flash-attn 2.8.3.
Find Your Compatible Flash Attention Wheel
Select your platform, Flash Attention version, Python version, PyTorch version, and CUDA version below. We'll search our database of prebuilt wheels and show you the matching downloads with ready-to-use pip commands.
How to Install Flash Attention Without Compiling
Installing Flash Attention from source is notoriously difficult and time-consuming. With prebuilt wheels, you can skip the entire compilation process and get started in seconds.
- Select your configuration
Choose your operating system platform: Linux x86_64 for most servers and workstations, Linux ARM64 for ARM-based systems with NVIDIA GPUs such as Grace Hopper (GH200) or Jetson, or Windows AMD64 for Windows machines. Then select your Flash Attention version, Python version (3.10-3.13), PyTorch version (2.0+), and CUDA version (11.8-12.8).
- Find a compatible wheel
Our tool searches multiple community repositories, including mjun0812's prebuilt wheels and the official Dao-AILab releases. We match your exact configuration to find wheels that will work with your setup. If multiple wheels are found, we show all options.
- Install with one command
Copy the generated pip or uv install command and paste it into your terminal. The wheel will download and install directly without any compilation. Using uv instead of pip can make installation even faster. Within seconds, Flash Attention will be ready to accelerate your transformer models.
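The commands the wheel finder generates can be sketched as simple string assembly. The filename pattern below follows the common local-version convention for these wheels (version+cuXXXtorchY.Z, then the cpXXX Python tags and platform), but the exact pattern and the host URL are illustrative assumptions, not the layout of any specific repository:

```python
# Sketch: build a wheel filename and install commands from a configuration.
# The filename pattern and base URL are illustrative, not an exact repository layout.

def wheel_commands(flash_attn, cuda, torch, python, platform="linux_x86_64"):
    cu = "cu" + cuda.replace(".", "")    # e.g. "12.4" -> "cu124"
    cp = "cp" + python.replace(".", "")  # e.g. "3.11" -> "cp311"
    name = f"flash_attn-{flash_attn}+{cu}torch{torch}-{cp}-{cp}-{platform}.whl"
    url = f"https://example.com/wheels/{name}"  # placeholder host
    return {
        "pip": f"pip install {url}",
        "uv": f"uv pip install {url}",
    }

cmds = wheel_commands("2.8.3", "12.4", "2.5", "3.11")
print(cmds["pip"])
```

Either command installs the binary wheel directly, so no CUDA toolkit or compiler is needed on the machine.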
flash-attn Supported Versions and Platforms
Python Versions
Python 3.10-3.13 supported. Python 3.11 recommended for best wheel availability.
- • Python 3.10
- • Python 3.11 (recommended)
- • Python 3.12
- • Python 3.13 (v2.8.3+)
PyTorch Versions
PyTorch 2.0+ supported. PyTorch 2.4-2.5 have the best wheel coverage.
- • PyTorch 2.3-2.4
- • PyTorch 2.5 (recommended)
- • PyTorch 2.6-2.8
- • PyTorch 2.9 (v2.8.3+)
CUDA Versions
Check with nvidia-smi or nvcc --version.
- • CUDA 11.8
- • CUDA 12.1, 12.2, 12.3
- • CUDA 12.4 (recommended)
- • CUDA 12.6, 12.8
Windows wheels available for select configurations. See version history for full compatibility matrix.
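The version constraints above can be distilled into a small sanity check. This is a heuristic based only on the notes stated on this page (Python 3.13 needs flash-attn 2.8.3+ and PyTorch 2.6+, PyTorch 2.9 needs flash-attn 2.8.3+, CUDA 12.8 wheels start at 2.8.3), not an exhaustive compatibility matrix:

```python
# Heuristic compatibility check distilled from the version notes on this page.
# Not an exhaustive matrix; use the wheel finder for authoritative answers.

def check_config(python, torch, cuda, flash_attn="2.8.3"):
    """Return a list of problems for a (python, torch, cuda) combination."""
    problems = []
    py = tuple(int(x) for x in python.split("."))
    pt = tuple(int(x) for x in torch.split("."))
    fa = tuple(int(x) for x in flash_attn.split("."))

    if not ((3, 10) <= py <= (3, 13)):
        problems.append("Python 3.10-3.13 required")
    if py >= (3, 13) and (fa < (2, 8, 3) or pt < (2, 6)):
        problems.append("Python 3.13 needs flash-attn >= 2.8.3 and PyTorch >= 2.6")
    if pt < (2, 0):
        problems.append("PyTorch 2.0+ required")
    if pt >= (2, 9) and fa < (2, 8, 3):
        problems.append("PyTorch 2.9 needs flash-attn >= 2.8.3")
    if cuda == "12.8" and fa < (2, 8, 3):
        problems.append("CUDA 12.8 wheels start at flash-attn 2.8.3")
    return problems

print(check_config("3.11", "2.5", "12.4"))  # → []
```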
Frequently Asked Questions About Flash Attention Wheels
What is the latest version of flash-attn?
The latest version of flash-attn is 2.8.3. It supports Python 3.10-3.13, PyTorch 2.4-2.9, and CUDA 11.8-12.8. Use the wheel finder to download prebuilt wheels for the latest version.
Does flash-attn support Python 3.12 and 3.13?
Yes! Python 3.12 is fully supported with excellent wheel availability. Python 3.13 support was added in flash-attn 2.8.3 and requires PyTorch 2.6 or later. Python 3.11 has the widest wheel coverage.
Does flash-attn work on Windows?
Yes, prebuilt Windows wheels are available for many Python/PyTorch/CUDA combinations. Select "Windows AMD64" in the wheel finder. For more options, consider using WSL2 with Linux wheels.
How do I install flash-attn without compiling?
Use the wheel finder to select your Python, PyTorch, CUDA version, and platform. Copy the generated pip command (e.g., pip install https://...). The wheel installs in seconds without needing CUDA toolkit or compilers.
Why does pip install flash-attn fail?
Installing from PyPI requires compilation, which needs CUDA toolkit, C++ compilers, and 30+ minutes. It often fails due to version mismatches. Use prebuilt wheels instead to skip compilation entirely.
Which CUDA versions are supported by flash-attn?
Prebuilt wheels are available for CUDA 11.8, 12.1, 12.2, 12.3, 12.4, 12.6, and 12.8. CUDA 12.8 support was added in flash-attn 2.8.3. Check your CUDA version with nvidia-smi or nvcc --version.
Which PyTorch versions work with flash-attn?
flash-attn supports PyTorch 2.0 and later. PyTorch 2.4-2.5 have the best wheel availability. PyTorch 2.9 is supported with flash-attn 2.8.3+. Your PyTorch CUDA version must match the wheel.
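The "must match" rule above is just a string comparison between the wheel's CUDA tag (e.g. cu124 in the filename) and the CUDA version your PyTorch build reports via torch.version.cuda (e.g. "12.4"). A minimal sketch of that check, shown as standalone string logic:

```python
# Sketch: does a wheel's CUDA tag (e.g. "cu124") match the CUDA version
# PyTorch was built with (torch.version.cuda, e.g. "12.4")?

def wheel_cuda_matches(wheel_tag: str, torch_cuda: str) -> bool:
    return wheel_tag == "cu" + torch_cuda.replace(".", "")

print(wheel_cuda_matches("cu124", "12.4"))  # → True
print(wheel_cuda_matches("cu118", "12.1"))  # → False
```

In a live environment you would pass `torch.version.cuda` as the second argument.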
What is the fastest way to install flash-attn?
Use uv (fast Python package manager) with a prebuilt wheel: uv pip install [wheel-url]. This combines the speed of uv with prebuilt wheels for installation in seconds. The wheel finder generates both pip and uv commands.
Can I use flash-attn with Hugging Face Transformers?
Yes! Once installed, set attn_implementation="flash_attention_2" when loading models with from_pretrained (Transformers 4.34+). Note that model.to_bettertransformer() uses PyTorch's native scaled_dot_product_attention rather than the flash-attn package, so use the attn_implementation flag to get Flash Attention 2.
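Concretely, the opt-in is a keyword argument to from_pretrained. The model id below is illustrative; the snippet just shows the arguments without downloading anything:

```python
# Illustrative: the keyword arguments passed to from_pretrained to enable
# Flash Attention 2 (Transformers 4.34+). The model id is an example only.

load_kwargs = dict(
    attn_implementation="flash_attention_2",
    torch_dtype="bfloat16",  # FA2 requires fp16 or bf16
)
# model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", **load_kwargs)
print(load_kwargs["attn_implementation"])
```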
What should I do if no wheel matches my configuration?
Try: 1) Using Python 3.11 (best coverage), 2) Matching your PyTorch version exactly, 3) Checking a different flash-attn version. For Windows, WSL2 provides more Linux wheel options. As a last resort, compile from source.
What is Flash Attention?
Flash Attention is a groundbreaking fast and memory-efficient exact attention algorithm developed by Tri Dao at Stanford University. Published in 2022, Flash Attention revolutionized how transformer models handle the attention mechanism by optimizing GPU memory access patterns and reducing memory usage from quadratic to linear in sequence length.
The algorithm achieves significant speedups (2-4x faster) compared to standard attention implementations while using less GPU memory. This enables training and inference of transformer models with much longer context lengths. Flash Attention 2, released in 2023, brought additional improvements with even better parallelism and work partitioning strategies.
Flash Attention is now widely used in production machine learning systems and is integrated into popular frameworks like Hugging Face Transformers, PyTorch (as scaled_dot_product_attention), and various LLM inference engines. Major language models including Llama 2, Mistral, and many others leverage Flash Attention for efficient training and serving.
The main challenge with Flash Attention is installation: compiling from source requires the CUDA toolkit, compatible C++ compilers, and can take over 30 minutes. Build failures due to version mismatches are common. That's why prebuilt wheels are so valuable—they eliminate all these compilation headaches and let you start using Flash Attention immediately.
This tool aggregates prebuilt wheels from trusted community repositories including mjun0812/flash-attention-prebuild-wheels and the official Dao-AILab/flash-attention repository, making it easy to find the right wheel for your specific configuration.