QLoRA in Practice
A complete guide to fine-tuning large models with 4-bit quantization
Environment Setup
```bash
# Install dependencies
pip install transformers peft accelerate bitsandbytes
```

```text
# Recommended minimum versions
transformers>=4.33.0
peft>=0.5.0
bitsandbytes>=0.41.0
```
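4-bit loading with bitsandbytes requires a CUDA-capable GPU. A quick sanity check before going further (a minimal sketch; nothing here is QLoRA-specific):

```python
import torch
import transformers
import peft
import bitsandbytes

# Verify library versions and GPU availability before attempting 4-bit loading
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```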
Loading a 4-bit Quantized Model
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,          # double quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
```
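To confirm the quantization took effect, check the model's memory footprint; for a 7B model loaded in NF4 it should land in the neighborhood of 4 GB (the exact number varies):

```python
# Weights only, before any LoRA adapters or optimizer state are allocated
print(f"Base model footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```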
Configuring QLoRA
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for k-bit training
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=16,                  # rank
    lora_alpha=32,         # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
```
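PEFT can report how small the trainable portion actually is; with this configuration on a 7B model the trainable share is typically a fraction of a percent (exact figures depend on the base model):

```python
# Only the LoRA adapter weights are trainable; the 4-bit base weights stay frozen
model.print_trainable_parameters()
```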
Full Training Script
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./qlora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    bf16=True,                       # match bnb_4bit_compute_dtype; use fp16 on GPUs without bfloat16 support
    logging_steps=10,
    optim="paged_adamw_8bit",        # paged 8-bit optimizer
    save_strategy="epoch",
)

# dataset and data_collator are assumed to have been prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)

trainer.train()
```
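After training, only the LoRA adapter needs to be saved; it can later be re-attached to a freshly quantized base model for inference. A minimal sketch (the ./qlora-output/adapter path is illustrative):

```python
from peft import PeftModel

# Save only the adapter weights, not the full base model
trainer.model.save_pretrained("./qlora-output/adapter")

# Later: reload the 4-bit base model and attach the trained adapter
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./qlora-output/adapter")
```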
VRAM Requirements Comparison
| Model | Full fine-tuning | LoRA | QLoRA |
|---|---|---|---|
| 7B | 28GB | 16GB | 6GB |
| 13B | 52GB | 28GB | 10GB |
| 33B | 132GB | 68GB | 20GB |
| 65B | 260GB | 130GB | 40GB |
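A back-of-the-envelope check of where the QLoRA column comes from (rough arithmetic under stated assumptions, not a measurement):

```python
# Rough weight-memory estimate for a 7B-parameter model stored in 4-bit NF4
params = 7e9
weight_gb = params * 0.5 / 1e9   # 4 bits = 0.5 bytes per parameter -> ~3.5 GB
print(f"4-bit weights: ~{weight_gb:.1f} GB")
# The rest of the ~6 GB figure is taken up by LoRA adapters, paged optimizer
# state, and activations (which gradient checkpointing keeps small).
```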
Best Practices
Use NF4 quantization
NormalFloat4 preserves accuracy better than standard 4-bit quantization.
Double quantization
Quantizing the quantization constants a second time saves roughly 0.5 GB of VRAM.
Paged AdamW
Use the paged optimizer to avoid out-of-memory errors from optimizer-state memory spikes.
Gradient checkpointing
model.gradient_checkpointing_enable() (see the sketch below for how it combines with the k-bit preparation step)
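The checkpointing call interacts with the k-bit preparation step shown earlier: the base model is frozen, so the inputs to the checkpointed blocks must be made to require gradients for the LoRA layers to train. prepare_model_for_kbit_training handles this; a minimal sketch, assuming the use_gradient_checkpointing argument available in recent PEFT releases:

```python
# Recompute activations in the backward pass instead of storing them
# (saves several GB of VRAM at the cost of noticeably slower steps).
model.gradient_checkpointing_enable()

# Also sets up the input-require-grads hook so gradients flow back to the
# LoRA layers through the frozen, checkpointed 4-bit base model.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```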
----