Large Model Fine-Tuning: SFT and RLHF

  • By which parameters are updated, fine-tuning splits into parameter-efficient fine-tuning (PEFT) and full-parameter fine-tuning (FFT); by training strategy, it splits into supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
  • This post downloads a model and a dataset from the Hugging Face Hub and fine-tunes them locally; the goal is simply to get the pipeline running and get a feel for the process. Three runs follow in order: two SFT methods (LoRA and Prefix Tuning, both PEFT) and one RLHF method (with PEFT or FFT).
    Note: downloading from Hugging Face may require a proxy, and every Python library that appears in this post can be installed with pip.

🔥 Model Download

Pick a model on Hugging Face; here Qwen/Qwen2.5-3B-Instruct is used as the example.

  • As shown below, the model can be downloaded directly from the Files and versions tab of the model page; just download all the files into one folder.
    (Hugging Face model page)

  • You can also download everything at once from the command line. First install the CLI tool with pip install -U huggingface_hub, then log in to Hugging Face and download the model with the following commands:

huggingface-cli login
huggingface-cli download --resume-download Qwen/Qwen2.5-3B-Instruct --local-dir D://your_local_path
  • The first command logs you in: it asks for an access token from the Access Tokens page of your Hugging Face account. Once entered, the second command downloads the model to the specified location.
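  • If you would rather stay in Python, the huggingface_hub library's snapshot_download does the same job; a minimal sketch (the local path is a placeholder):
from huggingface_hub import snapshot_download

# download every file of the repo into local_dir (resumable)
snapshot_download(
    repo_id="Qwen/Qwen2.5-3B-Instruct",
    local_dir="D://your_local_path",
)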

🔥 Dataset Download

Pick a dataset on Hugging Face; here the Ruozhiba dataset LooksJuicy/ruozhiba is used as the example.

  • The raw files can be downloaded directly with Python's datasets library:
import os
from datasets import load_dataset

# configure your proxy
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'

dataset = load_dataset("LooksJuicy/ruozhiba", split='train')
dataset.save_to_disk("./datasets/ruozhiba")  # save to this directory
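  • To confirm what was downloaded without going back to the website, a quick local inspection also works (purely optional):
print(dataset)           # number of rows and the column names
print(dataset.features)  # field types: instruction and output
print(dataset[0])        # first sample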

(Dataset format preview)

  • This only downloads the raw files without any processing. The content can be previewed on the website as shown above; the dataset simply consists of an instruction field and an output field.
  • The code below does the preprocessing: it caches the raw dataset, converts each sample into chat messages usable for training (JSON format), and finally saves everything to a JSON file. No train/validation split is made this time; all of it is used for training.
  • The cache is not cleared automatically; it can be deleted manually from the .cache folder on the C drive.
from datasets import load_dataset
import os

# configure proxy
os.environ['HTTP_PROXY'] = 'http://127.0.0.1:7890'
os.environ['HTTPS_PROXY'] = 'http://127.0.0.1:7890'
# system prompt (literally "answer the question"); set it however you like
system_message = "回答问题"

# convert a sample into messages
def create_conversation(sample):
    return {
        "messages": [
            {"role": "system", "content": system_message},
            {"role": "user", "content": sample["instruction"]},
            {"role": "assistant", "content": sample["output"]}
        ]
    }

# load the dataset from the Hub
dataset = load_dataset("LooksJuicy/ruozhiba", split="train")

# convert the dataset into OpenAI-style messages
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)

print(dataset[345]["messages"])

# save to disk
dataset.to_json("train_dataset.json", orient="records")
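  • As a sanity check, the saved file can be loaded straight back with the same json loader used later for training:
from datasets import load_dataset

check = load_dataset("json", data_files="train_dataset.json", split="train")
print(check[0]["messages"])  # should show the system/user/assistant roles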

🔥 SFT (LoRA)

Training

  • With the dataset and model ready, fine-tuning can begin. First is LoRA, using the transformers and trl libraries.
  • The main flow is: load the dataset, load the model and its tokenizer, configure LoRA, configure the training arguments, and finally define the trainer and train. It is the same flow as classic deep learning, except the libraries wrap the routine training loop so it does not have to be written by hand.
  • The hyperparameters can be adjusted freely: to cut GPU memory, lower the batch size per_device_train_batch_size and the gradient accumulation gradient_accumulation_steps; to shorten total training time, reduce num_train_epochs (it can be below 1) or adjust the batch size. Feel free to experiment with the rest.
  • Note: the LoRA output is an incremental (adapter) model that depends on the base model, so the saved files are small; if a standalone model is needed, the adapter can be merged back into the base weights (a sketch of one way to do this follows the training code).
  • The code is as follows:
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
from peft import LoraConfig
from datasets import load_dataset
from transformers import TrainingArguments
import warnings
warnings.filterwarnings("ignore", message="`tokenizer` is deprecated")
warnings.filterwarnings("ignore", message="`use_cache=True` is incompatible")
warnings.filterwarnings("ignore", message="torch.utils.checkpoint: please pass in use_reentrant")
warnings.filterwarnings("ignore", message="Torch was not compiled with flash attention")

# device and environment setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the dataset
dataset = load_dataset("json", data_files="path/to/train_dataset.json", split="train")

# load the model and tokenizer
model_id = "path/to/qwen2.5-3b"  # local folder of the downloaded model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = 'right'

# LoRA configuration
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# training arguments
args = TrainingArguments(
    output_dir=os.path.join(os.getcwd(), "folder_to_save_the_model"),
    num_train_epochs=10,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
    logging_steps=200,
    save_strategy="steps",
    save_steps=1000,
    learning_rate=3e-4,
    fp16=True,  # enable fp16 mixed-precision training
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    push_to_hub=False,
    report_to="tensorboard",
)

# create the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
)

# start training
trainer.train()
trainer.save_model()
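  • As noted earlier, the checkpoint saved here is only the LoRA adapter. A minimal sketch of one way to merge it into a standalone model (paths are placeholders; the base model is reloaded in bf16 rather than 4-bit for the merge):
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/qwen2.5-3b", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "path/to/the/saved/adapter")  # load the adapter on top of the base
merged = merged.merge_and_unload()                                     # fold the LoRA weights into the base weights
merged.save_pretrained("path/to/merged-model")
AutoTokenizer.from_pretrained("path/to/qwen2.5-3b").save_pretrained("path/to/merged-model")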

Testing

  • After training, question answering can be tested with the script below; to make the model's output stable during testing, the temperature can be set very low.
from transformers import pipeline

# load the fine-tuned model
pipe = pipeline("text-generation", model="path/to/the/fine-tuned/model")
# input question ("Why is my bank card still frozen after being boiled in a pressure cooker all night?")
input_text = "为什么我的银行卡在高压锅里煮了一晚上,还是冻结状态?"

# generation parameters
output = pipe(
    input_text,
    max_length=1024,  # maximum generated length
    num_return_sequences=1,  # number of sequences to return
    temperature=0.01,
)
print(output)
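  • The pipeline above feeds raw text. Since the SFT data was chat messages, the adapter can also be loaded explicitly and queried through the chat template; a rough sketch with the same placeholder paths:
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("path/to/qwen2.5-3b", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "path/to/the/fine-tuned/model")
tokenizer = AutoTokenizer.from_pretrained("path/to/qwen2.5-3b")

messages = [
    {"role": "system", "content": "回答问题"},
    {"role": "user", "content": "为什么我的银行卡在高压锅里煮了一晚上,还是冻结状态?"},
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))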

Test Results

  • As you can see, the model is able to handle the Ruozhiba-style questions.

🔥 SFT (Prefix)

  • Prefix Tuning differs very little from LoRA in code and the results are similar, so only the code is shown.
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer
from peft import PrefixTuningConfig
from datasets import load_dataset, DatasetDict
from transformers import TrainingArguments
import warnings

warnings.filterwarnings("ignore", message="`tokenizer` is deprecated")
warnings.filterwarnings("ignore", message="`use_cache=True` is incompatible")
warnings.filterwarnings("ignore", message="torch.utils.checkpoint: please pass in use_reentrant")
warnings.filterwarnings("ignore", message="Torch was not compiled with flash attention")

# device and environment setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# load the dataset
dataset = load_dataset("json", data_files="./datasets/ruozhiba/train_dataset_1.json", split="train")
# split off a validation set
train_test_split = dataset.train_test_split(test_size=0.1, seed=42)  # 10% of the data for validation
# build a DatasetDict
dataset_dict = DatasetDict({
    "train": train_test_split["train"],
    "validation": train_test_split["test"]
})

# load the model and tokenizer
model_id = "./models/qwen2.5-3b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="right", truncation=True)

# Prefix Tuning configuration
prefix_config = PrefixTuningConfig(
    peft_type="PREFIX_TUNING",
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,      # length of the virtual prefix
    token_dim=2048,             # prefix embedding dimension, same as the base model's hidden size
    num_layers=36,              # number of hidden layers, same as the base model
    prefix_projection=True,     # whether to project the prefix through an encoder
    encoder_hidden_size=2048    # hidden size of the prefix encoder, same as the base model
)

# training arguments
args = TrainingArguments(
    output_dir=os.path.join(os.getcwd(), "results/qwen2.5-finetuned-prefix"),
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=False,
    optim="adamw_torch_fused",
    logging_steps=200,
    save_strategy="steps",
    save_steps=400,
    evaluation_strategy="steps",  # add an evaluation strategy
    eval_steps=200,  # evaluate every 200 steps
    learning_rate=1e-4,
    fp16=True,  # enable fp16 mixed-precision training for efficiency
    max_grad_norm=0.3,
    warmup_ratio=0.3,
    lr_scheduler_type="linear",
    push_to_hub=False,
    report_to="tensorboard",
    load_best_model_at_end=True,  # reload the best checkpoint at the end
    metric_for_best_model="loss",  # monitor the validation loss
    greater_is_better=False,  # lower loss is better
)

# create the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset_dict["train"],
    eval_dataset=dataset_dict["validation"],  # add the validation set
    peft_config=prefix_config,  # use the PrefixTuningConfig
    tokenizer=tokenizer,
)

# start training
trainer.train()
trainer.save_model()
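  • To see just how small the trainable part is under either PEFT method, peft can report the trainable-parameter count; a quick inspection sketch (the trainer already applies the config internally, this is only for looking):
from peft import get_peft_model

peft_model = get_peft_model(model, prefix_config)  # or the LoRA peft_config from the previous script
peft_model.print_trainable_parameters()            # prints trainable vs. total parameters and the percentage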

🔥 RLHF (GRPO)

Next is the reinforcement learning approach, using the GRPO algorithm.

The core idea of GRPO is to optimize the policy model with group-relative rewards instead of relying on a separate value network (critic model). Concretely, for each input question the model generates a group of candidate outputs, and the policy is updated from the relative performance of the outputs within that group. This removes the need for an independent evaluator, making training more efficient: lower compute cost, with good stability and controllability.
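A minimal sketch of the group-relative scoring idea (illustrative only, not the trl internals): each completion's reward is normalized against the mean and standard deviation of the rewards in its own group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (num_generations,), the scores of one prompt's group of completions."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# example: four completions of the same prompt, scored by reward functions
print(group_relative_advantages(torch.tensor([4.0, 0.5, 0.5, 0.0])))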

Preparation

  • A personal machine may not cope with RLHF fine-tuning of qwen2.5-3b, so Qwen/Qwen2.5-0.5B-Instruct can be downloaded in the same way as before. The dataset is openai/gsm8k, an English grade-school math dataset whose reference outputs end with the correct answer.
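  • To see the answer format that the reward code below relies on, a quick peek at one sample (the reference answer comes after a "####" marker at the end):
from datasets import load_dataset

sample = load_dataset("openai/gsm8k", "main", split="train")[0]
print(sample["question"])
print(sample["answer"])  # the solution text ends with a line like "#### <number>"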

Training

import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer
from typing import Union

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

# data and dataset processing
def extract_xml_answer(text: str) -> str:       # pull the answer part out of the LLM's reply
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> Union[str, None]:   # pull the reference answer out of the dataset
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:    # load the dataset and attach the prompt
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore
  • After the imports, the prompts are set up to steer the model into the <reasoning>/<answer> XML format, so that it outputs its derivation and answer in a way that makes the answer easy to extract for reward scoring.
  • Then come the dataset and output-processing functions: extract_xml_answer pulls the answer out of generated text; extract_hash_answer pulls the reference answer out of the dataset; get_gsm8k_questions loads the dataset and merges in the prompt. A quick check below illustrates their behavior.
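  • For example (values are illustrative):
completion = "<reasoning>\n17 x 6 = 102\n</reasoning>\n<answer>\n102\n</answer>\n"
print(extract_xml_answer(completion))                  # -> "102"
print(extract_hash_answer("He spent $102. #### 102"))  # -> "102"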
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
  • Next come the reward functions: correctness_reward_func checks whether the answer is correct, int_reward_func checks whether the answer is an integer, the strict and soft format functions check whether the output follows the required format, and xmlcount_reward_func gives partial credit for each XML tag that appears exactly once. A well-formed, correct completion can be sanity-checked as below.
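  • A quick sanity check of the rewards on the example completion from above (each reward function receives completions as a list of message lists; correctness_reward_func also prints a debug block):
completion = "<reasoning>\n17 x 6 = 102\n</reasoning>\n<answer>\n102\n</answer>\n"
completions = [[{"content": completion}]]

print(strict_format_reward_func(completions))  # [0.5]
print(int_reward_func(completions))            # [0.5]
print(xmlcount_reward_func(completions))       # [0.5]  (4 tags x 0.125 each)
print(correctness_reward_func([[{"content": "some question"}]], completions, ["102"]))  # [2.0]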
dataset = get_gsm8k_questions()
model_name = "./models/qwen2.5-3b"

output_dir="results/qwen2.5-finetuned_GRPO"
run_name="Qwen-2.5B-GRPO-gsm8k"

training_args = GRPOConfig(
    output_dir=output_dir,
    run_name=run_name,
    learning_rate=5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=False,
    vllm_gpu_memory_utilization=.3,
    vllm_device="cuda:0",
    report_to="none" #I'm disabling Wandb.
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func],
    args=training_args,
    train_dataset=dataset,
    #peft_config=peft_config
)
trainer.train()

trainer.save_model(output_dir)
  • Finally the main training flow, which is much like SFT: load the model and dataset, configure the fine-tuning arguments, define the trainer for the GRPO algorithm, and train.
  • So overall the code only adds the reward functions, plus a prompt that steers the model toward an easy-to-score output format; the core GRPO algorithm is already built into the library and only needs to be called.
  • peft_config here can be set to LoRA or another PEFT method just like in the SFT part; leaving it out means full fine-tuning. A possible configuration is sketched below.
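  • For example, a rough sketch of handing a LoRA config to GRPOTrainer (values are illustrative, not what was used in this run):
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[xmlcount_reward_func, soft_format_reward_func, strict_format_reward_func,
                  int_reward_func, correctness_reward_func],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # the trainer wraps the model with the LoRA adapter
)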

Results

  • Since a full training run was unlikely to finish here, training was stopped as soon as an improvement was visible.
-------------------- Question:
Ahmed and Emily are having a contest to see who can get the best grade in the class. There have been 9 assignments and Ahmed has a 91 in the class. Emily has a 92. The final assignment is worth the same amount as all the other assignments. Emily got a 90 on the final assignment. What is the minimum grade Ahmed needs to get to beat Emily if all grades are whole numbers? 
Answer:
100 
Response:
To determine the minimum grade Ahmed needs to get to beat Emily, we first need to calculate the total grade the class will have for all assignments. The class has 9 assignments, and Emily has already achieved a 92 in all assignments. Therefore, the total grade for Emily is already 92.

Since Emily's final assignment is worth the same amount as all the other assignments, Emily will score another 90 on that assignment. Let's denote Ahmed's score on the final assignment as \(x\). The total grade for Emily will then be:
\[ x + 90 = 92 + 90 = 182 \]
The total grade for Ahmed will be:
\[ 91 + x \]
To beat Emily, Ahmed needs to score more than 182. Therefore, we need to find the smallest integer \(x\) such that:
\[ 91 + x > 182 \]
Subtracting 91 from 
Extracted:
To determine the minimum grade Ahmed needs to get to beat Emily, …… (when the extraction function cannot find an answer it falls back to the whole response, omitted here)
  • At first, as you can see, the model barely follows the format, the responses exceed the length limit, and the answers are often wrong. This phase can last quite a while: with no positive feedback for a long time, the model has to stumble onto the right behavior by chance before it gets rewarded.
-------------------- Question:
A basketball team won 35 out of the 50 games that they have played. They still have 25 games left to play for the season. How many of the remaining games must they win so that their percentage of wins in the entire season is 64%? 
Answer:
13 
Response:
<reasoning>
To calculate the number of games the team must win to reach a 64% win rate across all games in the season, we can follow these steps:

1. Let's assume the total number of games in the season is \( x \). We know that the team has played \( x - 25 \) games, and they have won 35 out of those games. Therefore, the team has won \( \frac{35}{x - 25} \) games.

2. The team wants to achieve a 64% win rate across all 50 games. We can set up the following equation to represent this:
   \[
   \frac{35}{x - 25} \times 100 = 64 \times x
   \]

3. Solving this equation will give us the value of \( x \). However, since the specific value of \( x \) is not 
Extracted:
<reasoning
  • Then the model starts answering in the required format; it may still exceed the length limit or get the answer wrong, but once this behavior appears, formatted answers come easily.
-------------------- Question:
Dave bought 8 books about animals, 6 books about outer space, and 3 books about trains to keep him busy over the holidays. Each book cost $6. How much did Dave spend on the books? 
Answer:
102 
Response:
<reasoning>
Dave bought a total of \(8 + 6 + 3 = 17\) books. Each book cost $6, so the total cost is \(17 \times 6 = \$102\).

</reasoning>
<answer>102</answer> 
Extracted:
102
  • After a few hundred questions, the model answers in the required format reliably and gets most of the easier problems right (Answer is the dataset's reference answer; Extracted is the answer pulled from the model's reply). This shows RLHF works, even on a not-so-smart small model.

🔥 Summary

  • This post tried out SFT fine-tuning with LoRA and Prefix Tuning and RLHF fine-tuning with the GRPO algorithm, a first pass over the most mainstream fine-tuning approaches.
  • These recipes make it fairly easy to get started, but fine-tuning is still compute-hungry: even parameter-efficient methods are quite slow on a personal machine, results are not guaranteed, and like classic deep learning a lot of tuning is needed; reinforcement learning demands even more time and compute. So this is meant as a way to try things out, build experience, and deepen understanding.