bubbleee030/sft_project

SFT Judge Model Fine-tuning Data Pipeline

A complete data preprocessing and formatting pipeline for fine-tuning a Judge Model used in LLM Supervised Fine-Tuning (SFT) evaluation workflows. This project processes raw evaluation outputs from multiple generator/judge model combinations, cleans and deduplicates the data, and produces training datasets in ShareGPT format ready for LLaMA-Factory.


Background

In modern LLM development, a Judge Model (裁判模型) is used to automatically evaluate the quality of responses generated by other language models. Instead of relying on human annotation for every response, a fine-tuned judge model can:

  • Score responses across multiple dimensions (task relevance, format quality, etc.)
  • Scale evaluation to large datasets
  • Provide consistent, reproducible scoring

This project builds the training data pipeline for such a judge model. It targets Traditional and Simplified Chinese government document generation tasks across four mission types: public documents (公文), press releases (新聞稿), petitions (陳情書), and Q&A (問答).


Pipeline Architecture

eval_results/                         ← Raw LLM evaluation outputs
        │
        ▼
extract_data1.py / extract_data2.py   ← Step 1: Extract, Validate & Deduplicate
        │  · Parse .jsonl/.json files from multiple gen/judge model combos
        │  · Validate required fields and format tags
        │  · Near-duplicate removal via SimHash (sort by hash, O(n log n))
        │  · Language detection (Traditional vs. Simplified Chinese)
        │
        ▼
final_classified_output/              ← Classified by language × mission type
        │  traditional_od.json, traditional_press.json, ...
        │  simplified_od.json, simplified_press.json, ...
        │
        ▼
balance_data.py                       ← Step 2: Strategic Dataset Balancing
        │  · Down-sample over-represented mission types
        │  · Prioritize rare model combination samples
        │
        ▼
final_classified_output_new/          ← Balanced & classified output
        │
        ▼
format_data.py                        ← Step 3: Format for Fine-tuning
        │  · Build user/assistant ShareGPT conversation pairs
        │  · user:  [original task prompt] + [model response to score]
        │  · assistant: [ground-truth judge score + reasoning]
        │
        ▼
judge_training_data/
        judge_finetuning_dataset.jsonl  ← Training set for LLaMA-Factory SFT
        judge_inference_dataset.jsonl   ← Inference set for evaluation

PrepareInferenceData.py               ← Prepare inference data (ShareGPT format)
eval_local.py                         ← Local batch inference with HuggingFace models

Visualization Results

The pipeline generates statistical reports and charts showing data quality and distribution across model combinations.

[Charts: Overall Status Distribution; Format Error Analysis; GenModel × JudgeModel Heatmap (Valid Data); Mission Distribution by Model]

Repository Structure

sft_project/
├── extract_data1.py          # Data extraction, validation, dedup & classification (v1)
├── extract_data2.py          # Improved version with unified metadata tracking
├── balance_data.py           # Mission-stratified dataset balancing
├── format_data.py            # Format classified data into judge training set (ShareGPT)
├── PrepareInferenceData.py   # Format data for judge model inference
├── eval_local.py             # Local batch LLM inference (HuggingFace + PyTorch)
├── VisualizationData.py      # Data quality visualization & reporting
├── analyze_length.py         # Token length analysis for training cutoff selection
├── simplified_chars.txt      # Simplified Chinese character reference set
├── judge_training_data/
│   └── judge_inference_dataset.jsonl  # Sample inference dataset (ShareGPT format)
└── visualization_report/     # Generated charts and HTML reports

Note: Large output data files (final_classified_output/, final_classified_output_new/, judge_training_data/judge_finetuning_dataset.jsonl) are excluded from this repository due to size constraints. Run the pipeline scripts against your own eval_results/ data to regenerate them.


Key Technical Features

1. Near-Duplicate Detection with SimHash

extract_data1.py / extract_data2.py implement near-duplicate removal in O(n log n) time (dominated by the sort) by:

  • Computing 64-bit SimHash fingerprints for each full_output field
  • Sorting by hash value and comparing items in sorted order, which avoids O(n²) pairwise comparison
  • Flagging items with Hamming distance ≤ 2 as duplicates
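The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in extract_data1.py / extract_data2.py: it builds the fingerprint with hashlib instead of the simhash package so that it runs on the standard library alone, and the character 3-gram features, the MD5-based feature hash, and the function names (simhash64, hamming, dedup_sorted) are all assumptions made for the example.

```python
import hashlib
import re


def simhash64(text: str, ngram: int = 3) -> int:
    """Return a 64-bit SimHash of `text` (character n-grams are an assumed feature choice)."""
    cleaned = re.sub(r"\s+", "", text)
    features = [cleaned[i:i + ngram] for i in range(max(len(cleaned) - ngram + 1, 1))]
    weights = [0] * 64
    for feat in features:
        # Take the first 8 bytes of MD5 as a 64-bit feature hash (assumption).
        h = int.from_bytes(hashlib.md5(feat.encode("utf-8")).digest()[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedup_sorted(items, key="full_output", max_distance=2):
    """Drop items whose fingerprint is within `max_distance` bits of the last kept one."""
    hashed = sorted((simhash64(item[key]), idx) for idx, item in enumerate(items))
    kept_indices = []
    last_kept = None
    for fingerprint, idx in hashed:
        if last_kept is not None and hamming(fingerprint, last_kept) <= max_distance:
            continue  # near-duplicate of the previously kept item
        last_kept = fingerprint
        kept_indices.append(idx)
    # Restore the original input order for the surviving items.
    return [items[i] for i in sorted(kept_indices)]
```

Comparing only neighbors in sorted order is a heuristic: two fingerprints that differ in a high-order bit can land far apart after sorting, so some near-duplicates may be missed in exchange for the speed-up.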

2. Multi-dimensional Validation

Each data item is validated for:

  • Presence and non-emptiness of full_output and model_response
  • Absence of <think> reasoning traces (contamination filter)
  • Required evaluation tags: 【給分原因】, 【分數】, 題文匹配度(MatchPoint), 文本格式(TextFormat)
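A hedged sketch of these checks is shown below. The function name validate_item, the reason strings, and the decision to look for the tags and the <think> marker inside full_output are assumptions for illustration; the repository's scripts may check different fields or tag spellings.

```python
from __future__ import annotations

# Tag strings copied from the list above; exact spelling in the real data is an assumption.
REQUIRED_TAGS = ("【給分原因】", "【分數】", "題文匹配度(MatchPoint)", "文本格式(TextFormat)")


def validate_item(item: dict) -> tuple[bool, str]:
    """Return (is_valid, reason) for one evaluation record."""
    # 1. Both text fields must be present and non-empty.
    for field in ("full_output", "model_response"):
        value = item.get(field)
        if not isinstance(value, str) or not value.strip():
            return False, f"missing_or_empty:{field}"

    full_output = item["full_output"]

    # 2. Reject records that still contain a reasoning trace.
    if "<think>" in full_output:
        return False, "contains_think_trace"

    # 3. Every required evaluation tag must appear in the judge output.
    for tag in REQUIRED_TAGS:
        if tag not in full_output:
            return False, f"missing_tag:{tag}"

    return True, "ok"
```

Returning a reason string alongside the boolean makes it straightforward to count failure types later, for example when building a format-error chart.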

3. Language Classification

A character-level lookup against a Simplified Chinese character set (simplified_chars.txt) classifies each item as Traditional or Simplified Chinese, without relying on external NLP libraries.
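A character lookup of this kind could look roughly like the sketch below. The function names, the assumption that simplified_chars.txt is a plain UTF-8 file of characters, and the min_hits decision rule are all hypothetical; the repository may use a ratio or a different threshold.

```python
def load_simplified_chars(path: str) -> frozenset:
    """Read every non-whitespace character in the reference file into a set."""
    with open(path, encoding="utf-8") as fh:
        return frozenset(ch for ch in fh.read() if not ch.isspace())


def classify_language(text: str, simplified_chars, min_hits: int = 1) -> str:
    """Label text 'simplified' if it contains at least `min_hits` simplified-only characters."""
    hits = sum(1 for ch in text if ch in simplified_chars)
    return "simplified" if hits >= min_hits else "traditional"
```

For example, with a reference set containing 这, the text "这是一份公文" is labeled simplified, while "這是一份公文" is labeled traditional because none of its characters are in the set.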

4. Priority-based Dataset Balancing

balance_data.py balances mission categories by:

  • Targeting the count of the rarest mission type
  • Preferentially retaining samples from under-represented model combinations
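One way to express this strategy is sketched below. It is an illustration under stated assumptions rather than the logic of balance_data.py: the gen_model and judge_model field names, the global rarity ranking, and the seeded shuffle used to break ties are all hypothetical choices.

```python
import random
from collections import Counter


def combo_of(sample: dict):
    """Identify the generator/judge combination of a sample (field names are assumed)."""
    return (sample.get("gen_model"), sample.get("judge_model"))


def balance_by_mission(samples_by_mission: dict, seed: int = 0) -> dict:
    """Down-sample every mission type to the size of the rarest one."""
    rng = random.Random(seed)
    target = min((len(v) for v in samples_by_mission.values()), default=0)

    # Count how often each model combination appears across all missions.
    combo_counts = Counter(
        combo_of(s) for samples in samples_by_mission.values() for s in samples
    )

    balanced = {}
    for mission, samples in samples_by_mission.items():
        pool = list(samples)
        rng.shuffle(pool)  # random tie-breaking among equally rare combinations
        # Stable sort: samples from the rarest combinations come first.
        pool.sort(key=lambda s: combo_counts[combo_of(s)])
        balanced[mission] = pool[:target]
    return balanced
```

Sorting by combination frequency before truncating means that samples from rare combinations are only dropped once every more common combination in that mission has already been cut.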

5. ShareGPT Format Output

Both training and inference datasets use a ShareGPT-style messages layout with role/content keys, which LLaMA-Factory can load:

{
  "messages": [
    {"role": "user", "content": "### 原始任務指令:\n...\n### 待評分的回覆:\n..."},
    {"role": "assistant", "content": "【給分原因】...\n【分數】..."}
  ]
}
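A record of this shape could be assembled as in the sketch below. The helper names to_sharegpt and to_jsonl_line are hypothetical, and the user-turn template is copied from the example above; format_data.py may build the prompt differently.

```python
import json


def to_sharegpt(prompt: str, model_response: str, judge_output: str) -> dict:
    """Build one user/assistant pair in the messages layout shown above."""
    user_content = (
        f"### 原始任務指令:\n{prompt}\n"
        f"### 待評分的回覆:\n{model_response}"
    )
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": judge_output},
        ]
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line, keeping Chinese text readable."""
    return json.dumps(record, ensure_ascii=False)
```

Passing ensure_ascii=False keeps the Chinese tags as literal characters in the output file instead of \u escapes, which makes the dataset easier to inspect by hand.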

Data Format

Input (eval_results/)

Nested directories structured as eval_results/<genmodel>/<judgemodel>/<mission>.*:

  • .jsonl files: line-delimited JSON with fields qid, prompt, resp, model_response, full_output
  • .json files: wrapped in eval_result_from_first_iteration array
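Reading both layouts might look like the sketch below. The function names and the gen_model, judge_model, and mission keys attached to each record are assumptions for illustration, derived only from the directory layout described above.

```python
import json
from pathlib import Path


def load_eval_file(path) -> list:
    """Load records from one .jsonl or .json evaluation file."""
    path = Path(path)
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            # One JSON object per line; blank lines are skipped.
            return [json.loads(line) for line in fh if line.strip()]
    if path.suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Records are wrapped in the eval_result_from_first_iteration array.
        return data["eval_result_from_first_iteration"]
    raise ValueError(f"unsupported file type: {path}")


def iter_eval_results(root):
    """Yield records under <root>/<genmodel>/<judgemodel>/<mission>.* with path metadata."""
    for path in sorted(Path(root).glob("*/*/*")):
        if path.suffix not in (".jsonl", ".json"):
            continue
        gen_model, judge_model = path.parts[-3], path.parts[-2]
        for record in load_eval_file(path):
            yield {
                **record,
                "gen_model": gen_model,      # hypothetical metadata key
                "judge_model": judge_model,  # hypothetical metadata key
                "mission": path.stem,        # file name without extension
            }
```

Attaching the generator, judge, and mission names at load time keeps that provenance available for later steps such as balancing by model combination.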

Output

File                              Description
traditional_od.json               Traditional Chinese public document samples
traditional_press.json            Traditional Chinese press release samples
traditional_petition.json         Traditional Chinese petition samples
traditional_qa.json               Traditional Chinese Q&A samples
simplified_*.json                 Corresponding Simplified Chinese samples
judge_finetuning_dataset.jsonl    Final ShareGPT training set
judge_inference_dataset.jsonl     ShareGPT inference/evaluation set

Setup & Usage

Requirements

pip install pandas simhash transformers torch tqdm fire matplotlib seaborn

Run the Pipeline

Step 1 – Extract, validate, deduplicate & classify:

python extract_data2.py        # Recommended (v2 with full metadata tracking)
# or
python extract_data1.py        # v1 (separate valid/invalid outputs)

Step 2 – Balance the dataset:

python balance_data.py --input_dir final_classified_output --output_dir final_classified_output_new

Step 3 – Format for fine-tuning:

python format_data.py          # Produces judge_training_data/judge_finetuning_dataset.jsonl

Prepare inference data:

python PrepareInferenceData.py  # Produces judge_training_data/judge_inference_dataset.jsonl

Run local batch inference:

python eval_local.py \
  --model_path /path/to/judge_model \
  --data_path judge_training_data/judge_inference_dataset.jsonl \
  --output_path results.jsonl \
  --batch_size 8 \
  --max_new_tokens 1024

Analyze token lengths (set training cutoff):

HF_TOKEN=your_token python analyze_length.py

Visualize data quality:

python VisualizationData.py    # Outputs charts to visualization_report/

Technologies Used

Category              Tools
Data Processing       Python, pandas
Deduplication         SimHash
LLM Inference         HuggingFace Transformers, PyTorch
Training Framework    LLaMA-Factory (ShareGPT format)
Visualization         Matplotlib, Seaborn
Version Control       Git, Git LFS
