
How to Fine-Tune Qwen3: A Practical Guide with Code

Step-by-step tutorial for fine-tuning Qwen3 models, covering both dense and MoE variants, with complete code examples, thinking mode configuration, dataset preparation, and deployment tips.

Flash Attention Team · January 8, 2026 · 10 min read
Qwen3 · fine-tuning · Hugging Face · transformers · tutorial · LoRA · QLoRA

Qwen3 represents the latest generation of open-weight LLMs from Alibaba, featuring both dense and Mixture-of-Experts architectures with native reasoning capabilities. This guide provides complete, working code to fine-tune Qwen3 for your specific use case.

Qwen3 Model Overview

| Model | Parameters | Active Params | Context Length | Architecture |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 0.6B | 32K | Dense |
| Qwen3-1.7B | 1.7B | 1.7B | 32K | Dense |
| Qwen3-4B | 4B | 4B | 32K | Dense |
| Qwen3-8B | 8B | 8B | 128K | Dense |
| Qwen3-14B | 14B | 14B | 128K | Dense |
| Qwen3-32B | 32B | 32B | 128K | Dense |
| Qwen3-30B-A3B | 30B | 3B | 128K | MoE |
| Qwen3-235B-A22B | 235B | 22B | 128K | MoE |

All models support Flash Attention and are released under the Apache 2.0 license.

Prerequisites

Environment Setup

# Create environment
conda create -n qwen3-finetune python=3.11
conda activate qwen3-finetune

# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install training libraries (requires transformers>=4.51.0 for Qwen3)
pip install "transformers>=4.51.0" datasets accelerate peft trl bitsandbytes
pip install flash-attn --no-build-isolation

# For Weights & Biases logging (optional)
pip install wandb
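
Before moving on, a quick sanity check (a minimal sketch, not part of the official setup) confirms that PyTorch sees your GPU and that the required library versions are in place:

import torch
import transformers

# Verify CUDA and library versions before training
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # needs >= 4.51.0 for Qwen3

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing; fall back to attn_implementation='sdpa'")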

Hardware Requirements

| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Qwen3-0.6B | RTX 3080 10GB | RTX 3060 12GB | RTX 3060 12GB |
| Qwen3-4B | A100 40GB | RTX 4090 24GB | RTX 3080 10GB |
| Qwen3-8B | A100 80GB | RTX 4090 24GB | RTX 4090 24GB |
| Qwen3-14B | 2× A100 | A100 40GB | RTX 4090 24GB |
| Qwen3-32B | 4× A100 | A100 80GB | A100 40GB |
| Qwen3-30B-A3B (MoE) | 2× A100 | A100 40GB | RTX 4090 24GB |
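
These figures are rough guidelines. As a back-of-envelope check (an approximation that ignores activations, gradients, optimizer states, and framework overhead), weight memory alone scales with parameter count times bytes per parameter:

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight-only memory; ignores activations, gradients, and optimizer states."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

print(round(weight_memory_gb(8, 2.0), 1))   # Qwen3-8B in bf16:        ~14.9 GB
print(round(weight_memory_gb(8, 0.55), 1))  # ~4-bit NF4 + constants:  ~4.1 GB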

Understanding Qwen3's Thinking Mode

Qwen3 introduces a "thinking mode" in which the model emits explicit step-by-step reasoning before its final answer. Whether you want to preserve this behavior shapes how you format training data and which chat template options you use, so it is worth settling before fine-tuning.

Thinking Mode Enabled (Default)

When thinking is enabled, the model generates reasoning in <think>...</think> blocks:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "user", "content": "What is 15% of 80?"}
]

# Thinking enabled (default)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Default
)

Output includes reasoning:

<|im_start|>user
What is 15% of 80?<|im_end|>
<|im_start|>assistant
<think>
To find 15% of 80, I need to multiply 80 by 0.15...
</think>
15% of 80 is 12.<|im_end|>
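
If a downstream application only needs the final answer, the reasoning block can be stripped after decoding. A minimal sketch, assuming the tags appear literally in the decoded text:

import re

def strip_thinking(text: str) -> str:
    """Remove a <think>...</think> block from a decoded Qwen3 response."""
    return re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()

response = "<think>\nTo find 15% of 80, multiply 80 by 0.15...\n</think>\n15% of 80 is 12."
print(strip_thinking(response))  # -> "15% of 80 is 12."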

Thinking Mode Disabled

For tasks where reasoning isn't needed (faster inference):

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)

Soft Switch with /think and /no_think

Users can dynamically control thinking per-turn:

messages = [
    {"role": "user", "content": "/no_think What's the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "/think Now explain why it became the capital."}
]

Dataset Preparation

Chat Format for Qwen3

Qwen3 uses the ChatML format with <|im_start|> and <|im_end|> tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."}
]

formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)

Output format:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
Machine learning is a subset of AI...<|im_end|>

Preparing Your Dataset

from datasets import Dataset
import json

# Load your data
with open("training_data.json", "r") as f:
    data = json.load(f)

dataset = Dataset.from_list(data)

# Format function for Qwen3
def format_qwen3(example, enable_thinking=False):
    messages = [
        {"role": "system", "content": example.get("system", "You are a helpful assistant.")},
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["output"]}
    ]
    return {
        "text": tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            enable_thinking=enable_thinking
        )
    }

# For reasoning tasks (75% with thinking, 25% without - recommended by Qwen team)
def format_mixed_reasoning(example):
    import random
    enable_thinking = random.random() < 0.75
    return format_qwen3(example, enable_thinking=enable_thinking)

dataset = dataset.map(format_mixed_reasoning)

Reasoning Dataset Format

For maintaining Qwen3's reasoning capabilities, include thinking traces:

reasoning_example = {
    "instruction": "Solve: If a train travels 120 km in 2 hours, what is its speed?",
    "output": """<think>
To find speed, I use the formula: speed = distance / time
Distance = 120 km
Time = 2 hours
Speed = 120 / 2 = 60 km/h
</think>
The train's speed is 60 km/h."""
}
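
If your raw data keeps the chain of thought and the final answer in separate fields, a small helper can assemble the output string in this format. A sketch, where the reasoning and answer field names are hypothetical placeholders for whatever your dataset actually uses:

def build_reasoning_output(example: dict) -> dict:
    """Wrap the (hypothetical) 'reasoning' field in <think> tags and append the 'answer' field."""
    return {"output": f"<think>\n{example['reasoning']}\n</think>\n{example['answer']}"}

raw = {
    "instruction": "Solve: If a train travels 120 km in 2 hours, what is its speed?",
    "reasoning": "Speed = distance / time = 120 / 2 = 60 km/h",
    "answer": "The train's speed is 60 km/h.",
}
print(build_reasoning_output(raw)["output"])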

Fine-Tuning Qwen3-8B with QLoRA

Complete working example:

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset

# Model configuration
model_id = "Qwen/Qwen3-8B"
output_dir = "./qwen3-8b-finetuned"

# Quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Prepare for QLoRA
model = prepare_model_for_kbit_training(model)

# LoRA configuration for Qwen3
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Load dataset
dataset = load_dataset("your_dataset", split="train")

# Training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="paged_adamw_8bit",
    max_grad_norm=0.3,
    report_to="wandb",
)

# Initialize trainer
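# Note: on recent TRL releases, dataset_text_field, max_seq_length, and packing
# must be set on SFTConfig instead of being passed to SFTTrainer directly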
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=4096,
    packing=True,
)

# Train
trainer.train()

# Save LoRA weights
trainer.save_model(output_dir)

Fine-Tuning Qwen3-14B with LoRA (Full Precision)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset

model_id = "Qwen/Qwen3-14B"
output_dir = "./qwen3-14b-finetuned"

# Load model in BF16 (no quantization)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Dataset with a "text" column, formatted as in the dataset preparation section above
dataset = load_dataset("your_dataset", split="train")

# LoRA config
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    tokenizer=tokenizer,
    dataset_text_field="text",
    max_seq_length=8192,
    packing=True,
)

trainer.train()
trainer.save_model(output_dir)

Fine-Tuning Qwen3 MoE (30B-A3B)

The MoE model requires special handling:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig

model_id = "Qwen/Qwen3-30B-A3B"

# QLoRA is recommended for MoE models
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

# For MoE models, target attention and MLP layers
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

Important: DeepSpeed ZeRO-3 is incompatible with bitsandbytes 4-bit quantization, so it cannot be combined with QLoRA. Use ZeRO-2 for multi-GPU QLoRA training, as sketched below.
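
A minimal ZeRO-2 setup can be passed directly through TrainingArguments. This is a sketch under the assumption that you launch across multiple GPUs with accelerate or deepspeed; the values are illustrative, not tuned:

from transformers import TrainingArguments

# Minimal DeepSpeed ZeRO-2 config expressed as a Python dict (illustrative values)
ds_config = {
    "zero_optimization": {"stage": 2, "overlap_comm": True},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="./qwen3-30b-a3b-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    bf16=True,
    deepspeed=ds_config,  # ZeRO-2 alongside 4-bit quantization; ZeRO-3 would conflict
)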

Using Unsloth for 2x Faster Training

Unsloth provides optimized kernels for Qwen3 fine-tuning; the project reports roughly 2x faster training and up to 70% less VRAM compared with standard PEFT setups:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B",
    max_seq_length=4096,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=64,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

Inference with Fine-Tuned Model

Loading LoRA Adapter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./qwen3-8b-finetuned")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)

# Generate (with thinking disabled for faster inference)
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Disable for faster inference
)

inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Generation parameters (recommended by Qwen team)
output = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    do_sample=True,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Merging LoRA for Deployment

# Merge LoRA weights into base model
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./qwen3-8b-merged")
tokenizer.save_pretrained("./qwen3-8b-merged")
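
Once merged, the checkpoint loads like any Hugging Face model directory. A minimal serving sketch with vLLM, assuming vLLM is installed and supports your Qwen3 variant (sampling values follow the non-thinking recommendations in the next section):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

tokenizer = AutoTokenizer.from_pretrained("./qwen3-8b-merged")
llm = LLM(model="./qwen3-8b-merged", dtype="bfloat16")

# Apply the chat template manually, then generate
messages = [{"role": "user", "content": "Explain LoRA fine-tuning in one paragraph."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

sampling = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=512)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)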

Generation Parameters

Qwen team recommends different settings based on mode:

| Mode | Temperature | Top-P | Top-K | Min-P |
|---|---|---|---|---|
| Thinking enabled | 0.6 | 0.95 | 20 | 0 |
| Thinking disabled | 0.7 | 0.8 | 20 | 0 |

Important: Do NOT use greedy decoding (temperature=0) with Qwen3—it can cause endless repetitions and performance degradation.

Common Issues and Solutions

Out of Memory

# Solution 1: Reduce batch size and increase gradient accumulation
per_device_train_batch_size=1
gradient_accumulation_steps=16

# Solution 2: Use QLoRA instead of LoRA
load_in_4bit=True

# Solution 3: Reduce sequence length
max_seq_length=2048

# Solution 4: Enable gradient checkpointing
gradient_checkpointing=True

Model Not Following Instructions

# Ensure proper chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Verify the format includes correct special tokens
assert "<|im_start|>" in text
assert "<|im_end|>" in text

Preserving Reasoning Capabilities

# Mix reasoning and non-reasoning data (75/25 split recommended)
# Include thinking traces in your training data
thinking_example = """<think>
Step 1: ...
Step 2: ...
</think>
Final answer: ..."""

Evaluation

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./qwen3-8b-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

test_prompts = [
    "Explain quantum computing in simple terms.",
    "Write a Python function to sort a list.",
    "What are the benefits of exercise?",
]

for prompt in test_prompts:
    messages = [{"role": "user", "content": prompt}]
    output = pipe(
        messages,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.8,
        top_k=20,
    )
    print(f"Prompt: {prompt}")
    print(f"Response: {output[0]['generated_text'][-1]['content']}")
    print("-" * 50)

References

  1. Qwen Team. (2025). "Qwen3: Think Deeper, Act Faster." Qwen Blog

  2. Qwen Team. (2025). "Qwen3 GitHub Repository." GitHub

  3. Hugging Face. (2025). "The 4 Things Qwen-3's Chat Template Teaches Us." Hugging Face Blog

  4. Unsloth. (2025). "Qwen3 - How to Run & Fine-tune." Unsloth Documentation

  5. DataCamp. (2025). "Fine-Tuning Qwen3: A Step-by-Step Guide." DataCamp Tutorial
