Flash Attention Prebuilt Wheels
Download flash-attn for PyTorch & CUDA
Find and download prebuilt Flash Attention wheels for your specific Python, PyTorch, and CUDA configuration. Skip the lengthy compilation process and install flash-attn with pip or uv in seconds. Our tool searches multiple repositories to find compatible wheels for Linux and Windows platforms.
Find Your Compatible Flash Attention Wheel
Select your platform, Flash Attention version, Python version, PyTorch version, and CUDA version below. We'll search our database of prebuilt wheels and show you the matching downloads with ready-to-use pip commands.
How to Install Flash Attention Without Compiling
Installing Flash Attention from source is notoriously difficult and time-consuming. With prebuilt wheels, you can skip the entire compilation process and get started in seconds.
- Select your configuration
Choose your operating system platform: Linux x86_64 for most servers and workstations, Linux ARM64 for ARM-based systems with NVIDIA GPUs (such as AWS Graviton G5g instances or NVIDIA Grace servers), or Windows AMD64 for Windows machines. Then select your Flash Attention version, Python version (3.8-3.12), PyTorch version (2.0+), and CUDA version (11.8-12.6). If you're unsure of any of these values, the snippet after these steps prints them for your environment.
- Find a compatible wheel
Our tool searches multiple community repositories, including mjun0812's flash-attention-prebuild-wheels and the official Dao-AILab releases. We match your exact configuration to find wheels that will work with your setup, and if multiple wheels are found, we show all options.
- Install with one command
Copy the generated pip or uv install command and paste it into your terminal. The wheel will download and install directly without any compilation. Using uv instead of pip can make installation even faster. Within seconds, Flash Attention will be ready to accelerate your transformer models.
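The following is a minimal sketch of a quick environment check, assuming PyTorch is already installed and (for the smoke test) a CUDA GPU is available: it prints every value the selector asks for and, after installation, confirms that flash-attn imports and runs.

```python
import platform
import sys

import torch

# Values you need when selecting a wheel
print("Platform:       ", platform.system(), platform.machine())   # e.g. Linux x86_64
print("Python:         ", f"{sys.version_info.major}.{sys.version_info.minor}")
print("PyTorch:        ", torch.__version__)                       # e.g. 2.4.1+cu121
print("CUDA (PyTorch): ", torch.version.cuda)                      # CUDA version PyTorch was built with
print("CXX11 ABI:      ", torch.compiled_with_cxx11_abi())

# After installing a prebuilt wheel, confirm it imports and runs
try:
    import flash_attn
    from flash_attn import flash_attn_func

    print("flash-attn:     ", flash_attn.__version__)
    if torch.cuda.is_available():
        # Tiny smoke test: tensors of shape (batch, seqlen, num_heads, head_dim), fp16, on GPU
        q = k = v = torch.randn(1, 8, 2, 64, dtype=torch.float16, device="cuda")
        out = flash_attn_func(q, k, v)
        print("flash_attn_func OK, output shape:", tuple(out.shape))
except ImportError as e:
    print("flash-attn is not importable:", e)
```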
Supported Versions and Platforms
Python Versions
Flash Attention prebuilt wheels support Python 3.8 through Python 3.12. We recommend Python 3.10 or 3.11 for optimal compatibility with PyTorch and CUDA libraries.
- Python 3.8 (legacy support)
- Python 3.9
- Python 3.10 (recommended)
- Python 3.11 (recommended)
- Python 3.12 (latest)
CUDA Versions
Wheels are built for multiple CUDA toolkit versions. You can inspect your local CUDA installation with nvcc --version or nvidia-smi, but what a prebuilt wheel actually needs to match is the CUDA version your PyTorch build ships with (see the snippet below this list).
- CUDA 11.8
- CUDA 12.1
- CUDA 12.2
- CUDA 12.3
- CUDA 12.4
- CUDA 12.6
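Note that nvcc --version reports the system toolkit and nvidia-smi reports the highest CUDA version the installed driver supports; these can differ from what your PyTorch build uses. Assuming PyTorch is installed, the value to match the wheel's CUDA tag against is:

```python
import torch

# CUDA version bundled with your PyTorch build -- match the wheel's CUDA tag to this,
# not to the output of nvcc --version or nvidia-smi.
print(torch.__version__, torch.version.cuda)   # e.g. 2.4.1+cu121 12.1
```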
Frequently Asked Questions About Flash Attention Wheels
What are Flash Attention prebuilt wheels?
Flash Attention prebuilt wheels are pre-compiled Python packages (.whl files) that let you install Flash Attention without compiling from source. This saves significant time and avoids complex build dependencies such as the CUDA toolkit and a compatible C++ compiler.
How do I install flash-attn without compiling?
Use this tool to find a compatible prebuilt wheel for your Python version, PyTorch version, and CUDA version. Then install directly with pip using: pip install [wheel-url]. You can also use uv for faster installation: uv pip install [wheel-url].
Which CUDA versions are supported?
Prebuilt wheels are available for CUDA 11.8, 12.1, 12.2, 12.3, 12.4, and 12.6. The availability depends on the Flash Attention version and your platform (Linux x86_64, Linux ARM64, or Windows).
Why use prebuilt wheels instead of pip install flash-attn?
Installing flash-attn from PyPI compiles it from source, which can take 30+ minutes, requires the CUDA toolkit and a compatible C++ compiler, and often fails due to version mismatches. Prebuilt wheels install in seconds and work reliably.
What platforms are supported?
Prebuilt Flash Attention wheels are available for Linux x86_64 (most common for servers and workstations), Linux ARM64 (for ARM-based systems like AWS Graviton), and Windows AMD64.
What Python versions are compatible with Flash Attention?
Flash Attention prebuilt wheels are typically available for Python 3.8, 3.9, 3.10, 3.11, and 3.12. The exact versions depend on the Flash Attention release. We recommend using Python 3.10 or 3.11 for the best compatibility.
Which PyTorch versions work with Flash Attention?
Flash Attention supports PyTorch 2.0 and later versions. Prebuilt wheels are available for PyTorch 2.0, 2.1, 2.2, 2.3, 2.4, and 2.5. Make sure your PyTorch version matches the wheel you download.
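Wheels from the community repositories encode these constraints in the filename; for example, names on the official releases page look like flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl. As a rough, hypothetical helper that assumes this naming pattern, a check like the following compares a wheel name against your local environment:

```python
import re
import sys

import torch


def wheel_matches_env(wheel_name: str) -> bool:
    """Rough check that a flash-attn wheel name (official release naming
    pattern assumed) matches the local Python, PyTorch, and CUDA setup."""
    m = re.search(r"\+cu(\d+)torch(\d+\.\d+)(cxx11abi(?:TRUE|FALSE))?-cp(\d+)", wheel_name)
    if m is None:
        return False
    cu, torch_tag, abi_tag, cp = m.groups()

    local_cp = f"{sys.version_info.major}{sys.version_info.minor}"    # e.g. "311"
    local_torch = ".".join(torch.__version__.split(".")[:2])          # e.g. "2.4"
    local_cu = (torch.version.cuda or "").replace(".", "")            # e.g. "121"

    ok = cp == local_cp and torch_tag == local_torch
    # CUDA tags are shortened (cu123 vs 12.3); compare the major version only.
    ok = ok and cu[:2] == local_cu[:2]
    if abi_tag is not None:
        ok = ok and abi_tag.endswith(str(torch.compiled_with_cxx11_abi()).upper())
    return ok


# Hypothetical example wheel name following the official naming convention
print(wheel_matches_env(
    "flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"
))
```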
What is CXX11 ABI and which should I choose?
CXX11 ABI refers to the C++ application binary interface the wheel was compiled with, and it must match the ABI of your PyTorch build. Official PyTorch pip wheels were historically built with the pre-CXX11 ABI (FALSE), while more recent releases and many source builds use CXX11 ABI TRUE. If you hit import errors such as "undefined symbol" after installing, try the wheel built with the opposite ABI setting; the snippet below shows how to check which ABI your PyTorch reports.
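A minimal check (assuming PyTorch is installed) for which ABI variant to pick:

```python
import torch

# True  -> pick a cxx11abiTRUE wheel
# False -> pick a cxx11abiFALSE wheel
print(torch.compiled_with_cxx11_abi())
```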
Can I use Flash Attention with transformers library?
Yes! Once Flash Attention is installed, you can enable it in Hugging Face Transformers by passing attn_implementation="flash_attention_2" when loading a model with from_pretrained. (Note that model.to_bettertransformer() switches to PyTorch's scaled_dot_product_attention path rather than the flash-attn package, so the attn_implementation argument is the explicit way to use FlashAttention-2.)
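For example, a sketch of loading a causal LM with FlashAttention-2 enabled, assuming a CUDA GPU and access to the checkpoint (the model ID is a placeholder for any FlashAttention-2-compatible model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: any FA2-compatible checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16 or bf16
    attn_implementation="flash_attention_2",  # explicitly select the flash-attn kernels
).to("cuda")

inputs = tokenizer("Flash Attention speeds up long contexts because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```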
What should I do if no wheel matches my configuration?
If no prebuilt wheel matches your exact configuration, try: 1) Using a different Python version, 2) Upgrading or downgrading PyTorch, 3) Checking if a newer Flash Attention version has your configuration. As a last resort, you can compile from source.
What is Flash Attention?
Flash Attention is a fast, memory-efficient exact attention algorithm developed by Tri Dao and collaborators at Stanford University. Published in 2022, Flash Attention revolutionized how transformer models handle the attention mechanism by optimizing GPU memory access patterns and reducing attention memory usage from quadratic to linear in sequence length.
The algorithm achieves significant speedups (2-4x faster) compared to standard attention implementations while using less GPU memory. This enables training and inference of transformer models with much longer context lengths. Flash Attention 2, released in 2023, brought additional improvements with even better parallelism and work partitioning strategies.
Flash Attention is now widely used in production machine learning systems and is integrated into popular frameworks like Hugging Face Transformers, PyTorch (as a backend of scaled_dot_product_attention), and various LLM inference engines. Major language models including Llama 2, Mistral, and many others leverage Flash Attention for efficient training and serving.
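As a side note, PyTorch's built-in scaled_dot_product_attention ships its own FlashAttention-based kernel that works even without the flash-attn package. On PyTorch 2.3+ you can restrict SDPA to that backend roughly like this (a sketch that assumes a CUDA GPU, separate from the flash-attn wheel itself):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Tensors of shape (batch, heads, seqlen, head_dim), fp16, on the GPU
q = k = v = torch.randn(1, 2, 128, 64, dtype=torch.float16, device="cuda")

# Force the FlashAttention backend of PyTorch's built-in SDPA (PyTorch 2.3+ API)
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```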
The main challenge with Flash Attention is installation: compiling from source requires the CUDA toolkit, compatible C++ compilers, and can take over 30 minutes. Build failures due to version mismatches are common. That's why prebuilt wheels are so valuable—they eliminate all these compilation headaches and let you start using Flash Attention immediately.
This tool aggregates prebuilt wheels from trusted community repositories including mjun0812/flash-attention-prebuild-wheels and the official Dao-AILab/flash-attention repository, making it easy to find the right wheel for your specific configuration.