A complete data preprocessing and formatting pipeline for fine-tuning a Judge Model used in LLM Supervised Fine-Tuning (SFT) evaluation workflows. This project processes raw evaluation outputs from multiple generator/judge model combinations, cleans and deduplicates the data, and produces training datasets in ShareGPT format ready for LLaMA-Factory.
In modern LLM development, a Judge Model (裁判模型) is used to automatically evaluate the quality of responses generated by other language models. Instead of relying on human annotation for every response, a fine-tuned judge model can:
- Score responses across multiple dimensions (task relevance, format quality, etc.)
- Scale evaluation to large datasets
- Provide consistent, reproducible scoring
This project built the training data pipeline for such a judge model, targeting Traditional/Simplified Chinese government document generation tasks across four mission types: public documents (公文), press releases (新聞稿), petitions (陳情書), and Q&A (問答).
eval_results/ ← Raw LLM evaluation outputs
│
▼
extract_data1.py / extract_data2.py ← Step 1: Extract, Validate & Deduplicate
│ · Parse .jsonl/.json files from multiple gen/judge model combos
│ · Validate required fields and format tags
│ · Near-duplicate removal via SimHash (O(n log n) sort, then sorted comparison)
│ · Language detection (Traditional vs. Simplified Chinese)
│
▼
final_classified_output/ ← Classified by language × mission type
│ traditional_od.json, traditional_press.json, ...
│ simplified_od.json, simplified_press.json, ...
│
▼
balance_data.py ← Step 2: Strategic Dataset Balancing
│ · Down-sample over-represented mission types
│ · Prioritize rare model combination samples
│
▼
final_classified_output_new/ ← Balanced & classified output
│
▼
format_data.py ← Step 3: Format for Fine-tuning
│ · Build user/assistant ShareGPT conversation pairs
│ · user: [original task prompt] + [model response to score]
│ · assistant: [ground-truth judge score + reasoning]
│
▼
judge_training_data/
judge_finetuning_dataset.jsonl ← Training set for LLaMA-Factory SFT
judge_inference_dataset.jsonl ← Inference set for evaluation
PrepareInferenceData.py ← Prepare inference data (ShareGPT format)
eval_local.py ← Local batch inference with HuggingFace models
The pipeline generates statistical reports and charts showing data quality and distribution across model combinations.
Charts (images not reproduced here): Overall Status Distribution, Format Error Analysis, GenModel × JudgeModel Heatmap (Valid Data), Mission Distribution by Model.
sft_project/
├── extract_data1.py # Data extraction, validation, dedup & classification (v1)
├── extract_data2.py # Improved version with unified metadata tracking
├── balance_data.py # Mission-stratified dataset balancing
├── format_data.py # Format classified data into judge training set (ShareGPT)
├── PrepareInferenceData.py # Format data for judge model inference
├── eval_local.py # Local batch LLM inference (HuggingFace + PyTorch)
├── VisualizationData.py # Data quality visualization & reporting
├── analyze_length.py # Token length analysis for training cutoff selection
├── simplified_chars.txt # Simplified Chinese character reference set
├── judge_training_data/
│ └── judge_inference_dataset.jsonl # Sample inference dataset (ShareGPT format)
└── visualization_report/ # Generated charts and HTML reports
Note: Large output data files (`final_classified_output/`, `final_classified_output_new/`, `judge_training_data/judge_finetuning_dataset.jsonl`) are excluded from this repository due to size constraints. Run the pipeline scripts against your own `eval_results/` data to regenerate them.
extract_data1.py / extract_data2.py implement near-duplicate removal without O(n²) pairwise comparison by:
- Computing 64-bit SimHash fingerprints for each `full_output` field
- Sorting by hash value (an O(n log n) step) to avoid O(n²) pairwise comparison
- Flagging items with Hamming distance ≤ 2 as duplicates
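
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the fingerprint is built with `hashlib` over character trigrams instead of the `simhash` package, and the helper names (`simhash64`, `hamming`, `dedup_sorted`) are hypothetical. It also assumes that only neighbours in sorted order are compared, which keeps the scan linear but can miss near-duplicates whose fingerprints differ in high-order bits.

```python
import hashlib
import re


def simhash64(text: str, ngram: int = 3) -> int:
    """Compute a 64-bit SimHash over character n-grams (whitespace ignored)."""
    text = re.sub(r"\s+", "", text)
    features = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * 64
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dedup_sorted(items: list[dict], max_distance: int = 2) -> list[dict]:
    """Sort by fingerprint, then drop items close to the last kept fingerprint."""
    hashed = sorted((simhash64(it["full_output"]), idx) for idx, it in enumerate(items))
    kept_indices = []
    last_kept = None
    for fp, idx in hashed:
        if last_kept is not None and hamming(fp, last_kept) <= max_distance:
            continue  # treated as a near-duplicate of the previously kept item
        kept_indices.append(idx)
        last_kept = fp
    # Restore the original input order for the surviving items.
    return [items[i] for i in sorted(kept_indices)]
```

Sorting dominates the cost at O(n log n); the duplicate scan itself is a single pass over the sorted fingerprints.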
Each data item is validated for:
- Presence and non-emptiness of `full_output` and `model_response`
- Absence of `<think>` reasoning traces (contamination filter)
- Required evaluation tags: 【給分原因】, 【分數】, 題文匹配度(MatchPoint), 文本格式(TextFormat)
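
A validator along these lines might look like the sketch below. It assumes that both the `<think>` check and the tag checks apply to `full_output`, and that the tags appear in the data exactly as spelled above; the function name `validate_item` and the problem labels are hypothetical.

```python
# Tag strings copied from the list above; their exact spelling in the
# real data is an assumption.
REQUIRED_TAGS = ("【給分原因】", "【分數】", "題文匹配度(MatchPoint)", "文本格式(TextFormat)")


def validate_item(item: dict) -> list[str]:
    """Return a list of problems; an empty list means the item passed."""
    problems = []
    for field in ("full_output", "model_response"):
        value = item.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing_or_empty:{field}")

    judged_text = item.get("full_output")
    if not isinstance(judged_text, str):
        return problems  # nothing further to inspect

    # Assumption: reasoning traces are detected by the literal opening tag.
    if "<think>" in judged_text:
        problems.append("contains_think_trace")
    for tag in REQUIRED_TAGS:
        if tag not in judged_text:
            problems.append(f"missing_tag:{tag}")
    return problems
```

Returning the full list of problems, rather than a boolean, makes it straightforward to count each kind of format error later.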
A character-level lookup against a Simplified Chinese character set (simplified_chars.txt) classifies each item as Traditional or Simplified Chinese — without relying on external NLP libraries.
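
As a rough sketch of this lookup, assuming `simplified_chars.txt` simply contains the reference characters (whitespace ignored) and that a single hit is enough to label an item Simplified. The `min_hits` threshold and both function names are hypothetical; the project's actual rule may differ.

```python
from pathlib import Path


def load_simplified_chars(path: str = "simplified_chars.txt") -> set[str]:
    """Read the reference file into a set, ignoring whitespace (assumed layout)."""
    text = Path(path).read_text(encoding="utf-8")
    return {ch for ch in text if not ch.isspace()}


def classify_language(text: str, simplified_chars: set[str], min_hits: int = 1) -> str:
    """Label text 'simplified' if it contains at least min_hits reference characters."""
    hits = sum(1 for ch in text if ch in simplified_chars)
    return "simplified" if hits >= min_hits else "traditional"
```

For example, with a reference set containing 这, `classify_language("这是公文", chars)` yields `"simplified"`, while `classify_language("這是公文", chars)` yields `"traditional"`.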
balance_data.py balances mission categories by:
- Targeting the count of the rarest mission type
- Preferentially retaining samples from under-represented model combinations
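
One way to sketch this strategy is shown below. The field names `mission`, `gen_model` and `judge_model` are assumptions made for illustration, as is the tie-breaking by seeded shuffle; `balance_data.py` may weigh combinations differently.

```python
import random
from collections import Counter, defaultdict


def balance_by_mission(items: list[dict], seed: int = 0) -> list[dict]:
    """Down-sample every mission type to the size of the rarest one,
    keeping samples from rare generator/judge combinations first."""
    if not items:
        return []
    by_mission = defaultdict(list)
    for item in items:
        by_mission[item["mission"]].append(item)
    target = min(len(group) for group in by_mission.values())

    # How often each (generator, judge) pair occurs across the whole dataset.
    combo_counts = Counter((it["gen_model"], it["judge_model"]) for it in items)

    rng = random.Random(seed)
    balanced = []
    for group in by_mission.values():
        rng.shuffle(group)  # random order among equally rare samples
        # Stable sort: rarest combinations first, shuffle order kept within ties.
        group.sort(key=lambda it: combo_counts[(it["gen_model"], it["judge_model"])])
        balanced.extend(group[:target])
    return balanced
```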
Both training and inference datasets use the ShareGPT messages format expected by LLaMA-Factory:
{
"messages": [
{"role": "user", "content": "### 原始任務指令:\n...\n### 待評分的回覆:\n..."},
{"role": "assistant", "content": "【給分原因】...\n【分數】..."}
]
}

Nested directories structured as `eval_results/<genmodel>/<judgemodel>/<mission>.*`:
- `.jsonl` files: line-delimited JSON with fields `qid`, `prompt`, `resp`, `model_response`, `full_output`
- `.json` files: wrapped in an `eval_result_from_first_iteration` array
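
Reading this layout could be sketched as follows. The sketch assumes that each `.json` file is an object whose `eval_result_from_first_iteration` key holds the list of records, and it attaches `gen_model`, `judge_model` and `mission` keys derived from the path; those key names and both function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def load_eval_file(path: Path) -> list[dict]:
    """Load records from one .jsonl or .json evaluation file."""
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]
    with path.open(encoding="utf-8") as fh:
        payload = json.load(fh)
    # Assumed wrapper: {"eval_result_from_first_iteration": [...]}
    return payload.get("eval_result_from_first_iteration", [])


def iter_eval_results(root: Path) -> Iterator[dict]:
    """Walk <root>/<genmodel>/<judgemodel>/<mission>.* and yield tagged records."""
    for path in sorted(root.glob("*/*/*")):
        if path.suffix not in {".jsonl", ".json"}:
            continue
        gen_model, judge_model = path.parts[-3], path.parts[-2]
        for record in load_eval_file(path):
            yield {
                **record,
                "gen_model": gen_model,
                "judge_model": judge_model,
                "mission": path.stem,
            }
```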
| File | Description |
|---|---|
| `traditional_od.json` | Traditional Chinese public document samples |
| `traditional_press.json` | Traditional Chinese press release samples |
| `traditional_petition.json` | Traditional Chinese petition samples |
| `traditional_qa.json` | Traditional Chinese Q&A samples |
| `simplified_*.json` | Corresponding Simplified Chinese samples |
| `judge_finetuning_dataset.jsonl` | Final ShareGPT training set |
| `judge_inference_dataset.jsonl` | ShareGPT inference/evaluation set |
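
To illustrate how a classified sample might become one line of `judge_finetuning_dataset.jsonl`, here is a minimal sketch. The source does not say which input field holds the response being scored or which holds the judge's verdict, so the defaults (`resp` and `full_output`) are assumptions exposed as parameters; `to_sharegpt` and `write_jsonl` are hypothetical names.

```python
import json

# User-turn layout taken from the ShareGPT example in this README.
USER_TEMPLATE = "### 原始任務指令:\n{prompt}\n### 待評分的回覆:\n{response}"


def to_sharegpt(item: dict, response_field: str = "resp",
                judgment_field: str = "full_output") -> dict:
    """Build one user/assistant conversation pair from a classified sample."""
    user_content = USER_TEMPLATE.format(
        prompt=item["prompt"], response=item[response_field]
    )
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": item[judgment_field]},
        ]
    }


def write_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line, keeping Chinese text unescaped."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Passing `ensure_ascii=False` keeps the Chinese prompts readable in the output file instead of `\uXXXX` escapes.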
pip install pandas simhash transformers torch tqdm fire matplotlib seaborn

Step 1 – Extract, validate, deduplicate & classify:
python extract_data2.py # Recommended (v2 with full metadata tracking)
# or
python extract_data1.py # v1 (separate valid/invalid outputs)

Step 2 – Balance the dataset:
python balance_data.py --input_dir final_classified_output --output_dir final_classified_output_new

Step 3 – Format for fine-tuning:
python format_data.py # Produces judge_training_data/judge_finetuning_dataset.jsonl

Prepare inference data:
python PrepareInferenceData.py # Produces judge_training_data/judge_inference_dataset.jsonl

Run local batch inference:
python eval_local.py \
--model_path /path/to/judge_model \
--data_path judge_training_data/judge_inference_dataset.jsonl \
--output_path results.jsonl \
--batch_size 8 \
--max_new_tokens 1024

Analyze token lengths (set training cutoff):
HF_TOKEN=your_token python analyze_length.py

Visualize data quality:
python VisualizationData.py # Outputs charts to visualization_report/

| Category | Tools |
|---|---|
| Data Processing | Python, pandas |
| Deduplication | SimHash |
| LLM Inference | HuggingFace Transformers, PyTorch |
| Training Framework | LLaMA-Factory (ShareGPT format) |
| Visualization | Matplotlib, Seaborn |
| Version Control | Git, Git LFS |



