Qwen3 represents the latest generation of open-weight LLMs from Alibaba, featuring both dense and Mixture-of-Experts architectures with native reasoning capabilities. This guide provides complete, working code to fine-tune Qwen3 for your specific use case.
Qwen3 Model Overview
| Model | Parameters | Active Params | Context Length | Architecture |
|---|---|---|---|---|
| Qwen3-0.6B | 0.6B | 0.6B | 32K | Dense |
| Qwen3-1.7B | 1.7B | 1.7B | 32K | Dense |
| Qwen3-4B | 4B | 4B | 32K | Dense |
| Qwen3-8B | 8B | 8B | 128K | Dense |
| Qwen3-14B | 14B | 14B | 128K | Dense |
| Qwen3-32B | 32B | 32B | 128K | Dense |
| Qwen3-30B-A3B | 30B | 3B | 128K | MoE |
| Qwen3-235B-A22B | 235B | 22B | 128K | MoE |
All models support FlashAttention 2 and are released under the Apache 2.0 license.
Prerequisites
Environment Setup
# Create environment
conda create -n qwen3-finetune python=3.11
conda activate qwen3-finetune
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install training libraries (transformers>=4.51.0 for Qwen3; trl>=0.12 for the SFTConfig API used below)
pip install "transformers>=4.51.0" datasets accelerate peft "trl>=0.12" bitsandbytes
pip install flash-attn --no-build-isolation
# For Weights & Biases logging (optional)
pip install wandb
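A quick sanity check after installation confirms the versions and GPU are usable (the printed values will vary with your setup):
# Verify the training environment
import torch
import transformers

print(f"transformers: {transformers.__version__}")  # should be >= 4.51.0 for Qwen3
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")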
Hardware Requirements
| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| Qwen3-0.6B | RTX 3080 10GB | RTX 3060 12GB | RTX 3060 12GB |
| Qwen3-4B | A100 40GB | RTX 4090 24GB | RTX 3080 10GB |
| Qwen3-8B | A100 80GB | RTX 4090 24GB | RTX 4090 24GB |
| Qwen3-14B | 2× A100 | A100 40GB | RTX 4090 24GB |
| Qwen3-32B | 4× A100 | A100 80GB | A100 40GB |
| Qwen3-30B-A3B (MoE) | 2× A100 | A100 40GB | RTX 4090 24GB |
Understanding Qwen3's Thinking Mode
Qwen3 introduces a "thinking mode" in which the model reasons step by step inside <think> tags before answering. Understanding it matters for fine-tuning: the mode you format your data with, and the mix of data you train on, determines whether the model retains that reasoning ability.
Thinking Mode Enabled (Default)
When thinking is enabled, the model generates reasoning in <think>...</think> blocks:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [
{"role": "user", "content": "What is 15% of 80?"}
]
# Thinking enabled (default)
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Default
)
Output includes reasoning:
<|im_start|>user
What is 15% of 80?<|im_end|>
<|im_start|>assistant
<think>
To find 15% of 80, I need to multiply 80 by 0.15...
</think>
15% of 80 is 12.<|im_end|>
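In application code, the reasoning usually needs to be separated from the final answer. A minimal sketch that splits decoded output on the closing tag (the official Qwen3 examples do this at the token level using the id of </think>; plain string splitting is an equivalent shortcut for decoded text):
# Split a decoded Qwen3 response into its thinking trace and final answer.
# `response` is assumed to be the decoded assistant text.
def split_thinking(response: str) -> tuple[str, str]:
    if "</think>" in response:
        thinking, _, answer = response.partition("</think>")
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

thinking, answer = split_thinking("<think>\n0.15 * 80 = 12\n</think>\n15% of 80 is 12.")
print(answer)  # "15% of 80 is 12."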
Thinking Mode Disabled
For tasks where reasoning isn't needed (faster inference):
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False
)
Soft Switch with /think and /no_think
Users can control thinking dynamically on a per-turn basis:
messages = [
{"role": "user", "content": "/no_think What's the capital of France?"},
{"role": "assistant", "content": "The capital of France is Paris."},
{"role": "user", "content": "/think Now explain why it became the capital."}
]
Dataset Preparation
Chat Format for Qwen3
Qwen3 uses the ChatML format with <|im_start|> and <|im_end|> tokens:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is a subset of AI..."}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)
Output format:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is machine learning?<|im_end|>
<|im_start|>assistant
Machine learning is a subset of AI...<|im_end|>
Preparing Your Dataset
from datasets import Dataset
import json
# Load your data
with open("training_data.json", "r") as f:
data = json.load(f)
dataset = Dataset.from_list(data)
# Format function for Qwen3
def format_qwen3(example, enable_thinking=False):
messages = [
{"role": "system", "content": example.get("system", "You are a helpful assistant.")},
{"role": "user", "content": example["instruction"]},
{"role": "assistant", "content": example["output"]}
]
return {
"text": tokenizer.apply_chat_template(
messages,
tokenize=False,
enable_thinking=enable_thinking
)
}
# For reasoning tasks, mix thinking and non-thinking examples
# (the Qwen team recommends roughly 75% with thinking, 25% without)
import random

def format_mixed_reasoning(example):
    enable_thinking = random.random() < 0.75
    return format_qwen3(example, enable_thinking=enable_thinking)
dataset = dataset.map(format_mixed_reasoning)
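The snippet above assumes each record in training_data.json carries instruction, output, and an optional system field; those names are this guide's convention, not something Qwen requires. A minimal example of producing such a file:
import json

# Example records in the shape format_qwen3() expects (hypothetical field values)
sample_data = [
    {
        "system": "You are a helpful assistant.",
        "instruction": "What is machine learning?",
        "output": "Machine learning is a subset of AI...",
    },
    {
        # "system" is optional; format_qwen3() falls back to a default
        "instruction": "Summarize the water cycle in one sentence.",
        "output": "Water evaporates, condenses into clouds, and returns as precipitation.",
    },
]

with open("training_data.json", "w") as f:
    json.dump(sample_data, f, indent=2)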
Reasoning Dataset Format
To preserve Qwen3's reasoning capabilities, include thinking traces in the assistant outputs:
reasoning_example = {
"instruction": "Solve: If a train travels 120 km in 2 hours, what is its speed?",
"output": """<think>
To find speed, I use the formula: speed = distance / time
Distance = 120 km
Time = 2 hours
Speed = 120 / 2 = 60 km/h
</think>
The train's speed is 60 km/h."""
}
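Before training, it is worth checking that every reasoning example contains a well-formed trace; an unclosed <think> block or a missing final answer silently teaches the model a broken format. A small validation helper (assuming the instruction/output fields used above):
# Sanity-check reasoning examples for well-formed <think>...</think> blocks
def validate_thinking_trace(example):
    output = example["output"]
    has_open = "<think>" in output
    has_close = "</think>" in output
    if has_open != has_close:
        raise ValueError(f"Unbalanced think tags in: {example['instruction'][:50]}...")
    # The final answer should follow the closing tag, not sit inside it
    if has_close and not output.split("</think>", 1)[1].strip():
        raise ValueError(f"No final answer after </think> in: {example['instruction'][:50]}...")
    return True

validate_thinking_trace(reasoning_example)  # passes for the example above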
Fine-Tuning Qwen3-8B with QLoRA
Complete working example:
import torch
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# Model configuration
model_id = "Qwen/Qwen3-8B"
output_dir = "./qwen3-8b-finetuned"
# Quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Prepare for QLoRA
model = prepare_model_for_kbit_training(model)
# LoRA configuration for Qwen3
lora_config = LoraConfig(
r=64,
lora_alpha=128,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Load dataset
dataset = load_dataset("your_dataset", split="train")
# Training arguments (SFTConfig is trl's extension of TrainingArguments)
training_args = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
weight_decay=0.01,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True,
gradient_checkpointing=True,
gradient_checkpointing_kwargs={"use_reentrant": False},
optim="paged_adamw_8bit",
max_grad_norm=0.3,
report_to="wandb",
)
# Initialize trainer (recent trl versions take the tokenizer via `processing_class`;
# text field, sequence length, and packing come from SFTConfig above)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,
)
# Train
trainer.train()
# Save LoRA weights
trainer.save_model(output_dir)
Fine-Tuning Qwen3-14B with LoRA (Full Precision)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
model_id = "Qwen/Qwen3-14B"
output_dir = "./qwen3-14b-finetuned"
# Load model in BF16 (no quantization)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
# LoRA config
lora_config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
training_args = SFTConfig(
output_dir=output_dir,
num_train_epochs=3,
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
learning_rate=1e-4,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=10,
save_strategy="epoch",
bf16=True,
gradient_checkpointing=True,
optim="adamw_torch_fused",
)
# Load and format your dataset as in "Dataset Preparation"
dataset = load_dataset("your_dataset", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
    processing_class=tokenizer,
)
trainer.train()
trainer.save_model(output_dir)
Fine-Tuning Qwen3 MoE (30B-A3B)
The MoE model requires special handling:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
model_id = "Qwen/Qwen3-30B-A3B"
# QLoRA is recommended for MoE models
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto",
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
# For MoE models, target attention and MLP layers
lora_config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
Important: DeepSpeed ZeRO-3 is incompatible with QLoRA. Use ZeRO-2 for multi-GPU QLoRA training.
Using Unsloth for 2x Faster Training
Unsloth provides optimized kernels for Qwen3 fine-tuning, advertising roughly 2x faster training and up to 70% less VRAM:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/Qwen3-8B",
max_seq_length=4096,
dtype=None, # Auto-detect
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=64,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=64,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
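From there, training proceeds with the usual trl SFTTrainer; a minimal sketch reusing the dataset and hyperparameters from the QLoRA example (Unsloth's own notebooks vary slightly across trl versions, so treat the exact kwargs as version-dependent):
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,  # prepared as in "Dataset Preparation"
    processing_class=tokenizer,
    args=SFTConfig(
        output_dir="./qwen3-8b-unsloth",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
        bf16=True,
        dataset_text_field="text",
        max_seq_length=4096,
    ),
)
trainer.train()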
Inference with Fine-Tuned Model
Loading LoRA Adapter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3-8B",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2",
trust_remote_code=True,
)
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "./qwen3-8b-finetuned")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B", trust_remote_code=True)
# Generate (with thinking disabled for faster inference)
messages = [{"role": "user", "content": "Your prompt here"}]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Disable for faster inference
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Generation parameters (recommended by Qwen team)
output = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
top_p=0.8,
top_k=20,
do_sample=True,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Merging LoRA for Deployment
# Merge LoRA weights into base model
merged_model = model.merge_and_unload()
# Save merged model
merged_model.save_pretrained("./qwen3-8b-merged")
tokenizer.save_pretrained("./qwen3-8b-merged")
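The merged directory is a standard Hugging Face checkpoint, so it loads directly into serving stacks. A minimal sketch with vLLM, assuming it is installed (pip install vllm) and your version supports Qwen3:
from vllm import LLM, SamplingParams

llm = LLM(model="./qwen3-8b-merged")
# Non-thinking-mode settings from the table below
params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=512)

# For chat-style use, build the prompt with the tokenizer's chat template first
outputs = llm.generate(["Explain LoRA in one paragraph."], params)
print(outputs[0].outputs[0].text)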
Generation Parameters
Qwen team recommends different settings based on mode:
| Mode | Temperature | Top-P | Top-K | Min-P |
|---|---|---|---|---|
| Thinking enabled | 0.6 | 0.95 | 20 | 0 |
| Thinking disabled | 0.7 | 0.8 | 20 | 0 |
Important: Do NOT use greedy decoding (temperature=0) with Qwen3—it can cause endless repetitions and performance degradation.
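A small convenience wrapper (not part of any Qwen API) keeps the right settings tied to the mode:
# Recommended sampling parameters per mode, from the table above
def qwen3_sampling_params(thinking: bool) -> dict:
    if thinking:
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "do_sample": True}
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "do_sample": True}

# Reusing `model` and `inputs` from the inference example above
output = model.generate(**inputs, max_new_tokens=512, **qwen3_sampling_params(thinking=False))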
Common Issues and Solutions
Out of Memory
# Solution 1: Reduce batch size and increase gradient accumulation
per_device_train_batch_size=1
gradient_accumulation_steps=16
# Solution 2: Use QLoRA instead of LoRA
load_in_4bit=True
# Solution 3: Reduce sequence length
max_seq_length=2048
# Solution 4: Enable gradient checkpointing
gradient_checkpointing=True
Model Not Following Instructions
# Ensure proper chat template
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
# Verify the format includes correct special tokens
assert "<|im_start|>" in text
assert "<|im_end|>" in text
Preserving Reasoning Capabilities
# Mix reasoning and non-reasoning data (75/25 split recommended)
# Include thinking traces in your training data
thinking_example = """<think>
Step 1: ...
Step 2: ...
</think>
Final answer: ..."""
Evaluation
import torch
from transformers import pipeline
pipe = pipeline(
"text-generation",
model="./qwen3-8b-merged",
torch_dtype=torch.bfloat16,
device_map="auto",
)
test_prompts = [
"Explain quantum computing in simple terms.",
"Write a Python function to sort a list.",
"What are the benefits of exercise?",
]
for prompt in test_prompts:
messages = [{"role": "user", "content": prompt}]
output = pipe(
messages,
max_new_tokens=256,
temperature=0.7,
top_p=0.8,
top_k=20,
)
print(f"Prompt: {prompt}")
print(f"Response: {output[0]['generated_text'][-1]['content']}")
print("-" * 50)
References
- Qwen Team. (2025). "Qwen3: Think Deeper, Act Faster." Qwen Blog.
- Qwen Team. (2025). "Qwen3 GitHub Repository." GitHub.
- Hugging Face. (2025). "The 4 Things Qwen-3's Chat Template Teaches Us." Hugging Face Blog.
- Unsloth. (2025). "Qwen3 - How to Run & Fine-tune." Unsloth Documentation.
- DataCamp. (2025). "Fine-Tuning Qwen3: A Step-by-Step Guide." DataCamp Tutorial.