Qwen3 represents the latest generation of open-weight LLMs from Alibaba, featuring both dense and Mixture-of-Experts architectures with native reasoning capabilities. This guide provides complete, working code to fine-tune Qwen3 for your specific use case.
Qwen3 Model Overview
| Model | Parameters | Active Params | Context Length | Architecture |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 0.6B | 32K | Dense |
| Qwen3-1.7B | 1.7B | 1.7B | 32K | Dense |
| Qwen3-4B | 4B | 4B | 32K | Dense |
| Qwen3-8B | 8B | 8B | 128K | Dense |
| Qwen3-14B | 14B | 14B | 128K | Dense |
| Qwen3-32B | 32B | 32B | 128K | Dense |
| Qwen3-30B-A3B | 30B | 3B | 128K | MoE |
| Qwen3-235B-A22B | 235B | 22B | 128K | MoE |
All models support FlashAttention 2 and are released under the Apache 2.0 license.
Prerequisites
Environment Setup
# Create environment
conda create -n qwen3-finetune python=3.11
conda activate qwen3-finetune
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install training libraries (transformers>=4.51.0 for Qwen3; trl>=0.12 for the SFTConfig API used below)
pip install "transformers>=4.51.0" datasets accelerate peft "trl>=0.12" bitsandbytes
pip install flash-attn --no-build-isolation
# For Weights & Biases logging (optional)
pip install wandb
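A quick sanity check after installation confirms the versions and GPU are usable (the printed values will vary with your setup):
# Verify the training environment
import torch
import transformers

print(f"transformers: {transformers.__version__}")  # should be >= 4.51.0 for Qwen3
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")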
Hardware Requirements
| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Qwen3-0.6B | RTX 3080 10GB | RTX 3060 12GB | RTX 3060 12GB |
| Qwen3-4B | A100 40GB | RTX 4090 24GB | RTX 3080 10GB |
| Qwen3-8B | A100 80GB | RTX 4090 24GB | RTX 4090 24GB |
| Qwen3-14B | 2× A100 | A100 40GB | RTX 4090 24GB |
| Qwen3-32B | 4× A100 | A100 80GB | A100 40GB |
| Qwen3-30B-A3B (MoE) | 2× A100 | A100 40GB | RTX 4090 24GB |
Understanding Qwen3's Thinking Mode
Qwen3 introduces a "thinking mode" in which the model reasons step by step inside <think> tags before answering. Understanding it matters for fine-tuning: the mode you format your data with, and the mix of data you train on, determines whether the model retains that reasoning ability.
Thinking Mode Enabled (Default)
When thinking is enabled, the model generates reasoning in <think>...</think> blocks:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [
{"role": "user", "content": "What is 15% of 80?"}
]
# Thinking enabled (default)
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Default
)
Output includes reasoning:
<|im_start|>user
What is 15% of 80?<|im_end|>
<|im_start|>assistant
<think>
To find 15% of 80, I need to multiply 80 by 0.15...
</think>
15% of 80 is 12.<|im_end|>
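In application code, the reasoning usually needs to be separated from the final answer. A minimal sketch that splits decoded output on the closing tag (the official Qwen3 examples do this at the token level using the id of </think>; plain string splitting is an equivalent shortcut for decoded text):
# Split a decoded Qwen3 response into its thinking trace and final answer.
# `response` is assumed to be the decoded assistant text.
def split_thinking(response: str) -> tuple[str, str]:
    if "</think>" in response:
        thinking, _, answer = response.partition("</think>")
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

thinking, answer = split_thinking("<think>\n0.15 * 80 = 12\n</think>\n15% of 80 is 12.")
print(answer)  # "15% of 80 is 12."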
Thinking Mode Disabled
For tasks where reasoning isn't needed (faster inference):
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
Soft Switch with /think and /no_think
Users can control thinking dynamically on a per-turn basis:
messages = [
{"role": "user", "content": "/no_think What's the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "/think Now explain why it became the capital."}
]
Dataset Preparation
Chat Format for Qwen3
Qwen3 uses the ChatML format with <|im_start|> and <|im_end|> tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is a subset of AI..."}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
Output format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
Machine learning is a subset of AI...<|im_end|>
Preparing Your Dataset
from datasets import Dataset
import json
# Load your data
with open("training_data.json", "r") as f:
data = json.load(f)
dataset = Dataset.from_list(data)
# Format function for Qwen3
def format_qwen3(example, enable_thinking=False):
messages = [
{"role": "system", "content": example.get("system", "You are a helpful assistant.")},
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]}
]
return {
"text": tokenizer.apply_chat_template(
messages,
tokenize=False,
enable_thinking=enable_thinking
)
}
# For reasoning tasks, mix thinking and non-thinking examples
# (the Qwen team recommends roughly 75% with thinking, 25% without)
import random

def format_mixed_reasoning(example):
    enable_thinking = random.random() < 0.75
    return format_qwen3(example, enable_thinking=enable_thinking)
dataset = dataset.map(format_mixed_reasoning)
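The snippet above assumes each record in training_data.json carries instruction, output, and an optional system field; those names are this guide's convention, not something Qwen requires. A minimal example of producing such a file:
import json

# Example records in the shape format_qwen3() expects (hypothetical field values)
sample_data = [
    {
        "system": "You are a helpful assistant.",
        "instruction": "What is machine learning?",
        "output": "Machine learning is a subset of AI...",
    },
    {
        # "system" is optional; format_qwen3() falls back to a default
        "instruction": "Summarize the water cycle in one sentence.",
        "output": "Water evaporates, condenses into clouds, and returns as precipitation.",
    },
]

with open("training_data.json", "w") as f:
    json.dump(sample_data, f, indent=2)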
Reasoning Dataset Format
To preserve Qwen3's reasoning capabilities, include thinking traces in the assistant outputs:
reasoning_example = {
"instruction": "Solve: If a train travels 120 km in 2 hours, what is its speed?",
"output": """<think>
To find speed, I use the formula: speed = distance / time
Distance = 120 km
Time = 2 hours
Speed = 120 / 2 = 60 km/h
</think>
The train's speed is 60 km/h."""
}
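Before training, it is worth checking that every reasoning example contains a well-formed trace; an unclosed <think> block or a missing final answer silently teaches the model a broken format. A small validation helper (assuming the instruction/output fields used above):
# Sanity-check reasoning examples for well-formed <think>...</think> blocks
def validate_thinking_trace(example):
    output = example["output"]
    has_open = "<think>" in output
    has_close = "</think>" in output
    if has_open != has_close:
        raise ValueError(f"Unbalanced think tags in: {example['instruction'][:50]}...")
    # The final answer should follow the closing tag, not sit inside it
    if has_close and not output.split("</think>", 1)[1].strip():
        raise ValueError(f"No final answer after </think> in: {example['instruction'][:50]}...")
    return True

validate_thinking_trace(reasoning_example)  # passes for the example above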
Fine-Tuning Qwen3-8B with QLoRA
Complete working example:
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Model configuration
model_id = "Qwen/Qwen3-8B"
output_dir = "./qwen3-8b-finetuned"
# Quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Prepare for QLoRA
model = prepare_model_for_kbit_training(model)
# LoRA configuration for Qwen3
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load dataset
dataset = load_dataset("your_dataset", split="train")
# Training arguments (SFTConfig is trl's extension of TrainingArguments)
training_args = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
weight_decay=0.01,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
optim="paged_adamw_8bit",
max_grad_norm=0.3,
report_to="wandb",
)
# Initialize trainer (recent trl versions take the tokenizer via `processing_class`;
# text field, sequence length, and packing come from SFTConfig above)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,
)
# Train
trainer.train()
# Save LoRA weights
trainer.save_model(output_dir)
Fine-Tuning Qwen3-14B with LoRA (Full Precision)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
model_id = "Qwen/Qwen3-14B"
output_dir = "./qwen3-14b-finetuned"
# Load model in BF16 (no quantization)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
lora_config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
training_args = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=1e-4,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True,
gradient_checkpointing=True,
optim="adamw_torch_fused",
)
# Load and format your dataset as in "Dataset Preparation"
dataset = load_dataset("your_dataset", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model(output_dir)
Fine-Tuning Qwen3 MoE (30B-A3B)
The MoE model requires special handling:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
model_id = "Qwen/Qwen3-30B-A3B"
# QLoRA is recommended for MoE models
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
# For MoE models, target attention and MLP layers
lora_config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
Important: DeepSpeed ZeRO-3 is incompatible with QLoRA. Use ZeRO-2 for multi-GPU QLoRA training.
Using Unsloth for 2x Faster Training
Unsloth provides optimized kernels for Qwen3 fine-tuning, advertising roughly 2x faster training and up to 70% less VRAM:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B",
max_seq_length=4096,
dtype=None, # Auto-detect
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=64,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=64,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
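From there, training proceeds with the usual trl SFTTrainer; a minimal sketch reusing the dataset and hyperparameters from the QLoRA example (Unsloth's own notebooks vary slightly across trl versions, so treat the exact kwargs as version-dependent):
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # prepared as in "Dataset Preparation"
    processing_class=tokenizer,
    args=SFTConfig(
        output_dir="./qwen3-8b-unsloth",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        bf16=True,
        dataset_text_field="text",
        max_seq_length=4096,
    ),
)
trainer.train()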
Inference with Fine-Tuned Model
Loading LoRA Adapter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3-8B",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./qwen3-8b-finetuned")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
# Generate (with thinking disabled for faster inference)
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Disable for faster inference
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Generation parameters (recommended by Qwen team)
output = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.8,
top_k=20,
do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Merging LoRA for Deployment
# Merge LoRA weights into base model
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./qwen3-8b-merged")
tokenizer.save_pretrained("./qwen3-8b-merged")
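The merged directory is a standard Hugging Face checkpoint, so it loads directly into serving stacks. A minimal sketch with vLLM, assuming it is installed (pip install vllm) and your version supports Qwen3:
from vllm import LLM, SamplingParams

llm = LLM(model="./qwen3-8b-merged")
# Non-thinking-mode settings from the table below
params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=512)

# For chat-style use, build the prompt with the tokenizer's chat template first
outputs = llm.generate(["Explain LoRA in one paragraph."], params)
print(outputs[0].outputs[0].text)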
Generation Parameters
Qwen team recommends different settings based on mode:
| Mode | Temperature | Top-P | Top-K | Min-P |
|---|---|---|---|---|
| Thinking enabled | 0.6 | 0.95 | 20 | 0 |
| Thinking disabled | 0.7 | 0.8 | 20 | 0 |
Important: Do NOT use greedy decoding (temperature=0) with Qwen3—it can cause endless repetitions and performance degradation.
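A small convenience wrapper (not part of any Qwen API) keeps the right settings tied to the mode:
# Recommended sampling parameters per mode, from the table above
def qwen3_sampling_params(thinking: bool) -> dict:
    if thinking:
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "do_sample": True}
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "do_sample": True}

# Reusing `model` and `inputs` from the inference example above
output = model.generate(**inputs, max_new_tokens=512, **qwen3_sampling_params(thinking=False))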
Common Issues and Solutions
Out of Memory
# Solution 1: Reduce batch size and increase gradient accumulation
per_device_train_batch_size=1
gradient_accumulation_steps=16
# Solution 2: Use QLoRA instead of LoRA
load_in_4bit=True
# Solution 3: Reduce sequence length
max_seq_length=2048
# Solution 4: Enable gradient checkpointing
gradient_checkpointing=True
Model Not Following Instructions
# Ensure proper chat template
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
# Verify the format includes correct special tokens
assert "<|im_start|>" in text
assert "<|im_end|>" in text
Preserving Reasoning Capabilities
# Mix reasoning and non-reasoning data (75/25 split recommended)
# Include thinking traces in your training data
thinking_example = """<think>
Step 1: ...
Step 2: ...
</think>
Final answer: ..."""
Evaluation
import torch
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="./qwen3-8b-merged",
torch_dtype=torch.bfloat16,
device_map="auto",
)
test_prompts = [
"Explain quantum computing in simple terms.",
"Write a Python function to sort a list.",
"What are the benefits of exercise?",
]
for prompt in test_prompts:
messages = [{"role": "user", "content": prompt}]
output = pipe(
messages,
max_new_tokens=256,
temperature=0.7,
top_p=0.8,
top_k=20,
)
print(f"Prompt: {prompt}")
print(f"Response: {output[0]['generated_text'][-1]['content']}")
print("-" * 50)
References
- Qwen Team. (2025). "Qwen3: Think Deeper, Act Faster." Qwen Blog.
- Qwen Team. (2025). "Qwen3 GitHub Repository." GitHub.
- Hugging Face. (2025). "The 4 Things Qwen-3's Chat Template Teaches Us." Hugging Face Blog.
- Unsloth. (2025). "Qwen3 - How to Run & Fine-tune." Unsloth Documentation.
- DataCamp. (2025). "Fine-Tuning Qwen3: A Step-by-Step Guide." DataCamp Tutorial.