This project is a Python-based tool for generating fine-tuning datasets using the DeepSeek API. It creates single-turn Q&A pairs or multi-turn conversations in various styles, based on customizable topics and categories. It also includes a dataset validator to ensure quality before fine-tuning.
- Generate realistic Q&A examples with customizable styles
- Support for single or multi-turn conversations
- Style options: helpful, corporate, casual, technical, creative, educational
- Validate .jsonl datasets before fine-tuning
- Topic categories covering technology, business, education, lifestyle, and more
- Rate limiting support and environment configuration via .env
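As an illustration of the rate limiting mentioned above, here is a minimal sketch of a limiter that enforces a minimum delay between API calls. The class name MinIntervalLimiter and the 0.2 second interval are hypothetical and are not taken from generator.py, which may implement rate limiting differently.

```python
import time


class MinIntervalLimiter:
    """Keep successive calls at least `min_interval` seconds apart (hypothetical helper)."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


if __name__ == "__main__":
    limiter = MinIntervalLimiter(min_interval=0.2)  # assumed value, tune to your quota
    for i in range(3):
        limiter.wait()
        print(f"request {i} would be sent here")
```

Calling wait() before every request keeps the pacing logic in one place, so the interval can be tuned without touching the request code.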
- Python 3.7+
- Dependencies:

      pip install -r requirements.txt
- Clone the repository:

      git clone https://github.com/luisriverag/deepseek-api_dataset-generator.git
      cd deepseek-api_dataset-generator

- Create a .env file:

      python3 generator.py --create-env

- Edit .env and add your DeepSeek API key:

      DEEPSEEK_API_KEY=your_deepseek_api_key_here
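To show how a KEY=VALUE file like this can be consumed, the sketch below parses it with the standard library only. The function names parse_env_text, load_env_file, and get_api_key are hypothetical; the actual generator may rely on a dedicated library instead.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict:
    """Read and parse the file at `path`; return an empty dict if it is missing."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    return parse_env_text(env_path.read_text(encoding="utf-8"))


def get_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("DEEPSEEK_API_KEY") or load_env_file(path).get("DEEPSEEK_API_KEY", "")
    if not key or key == "your_deepseek_api_key_here":
        raise RuntimeError("Set DEEPSEEK_API_KEY in .env before generating data.")
    return key
```

Rejecting the placeholder value gives a clear error early instead of a failed API request later.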
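Run with the default settings:

    python3 generator.py

Run with custom options:

    python3 generator.py \
      --output my_dataset.jsonl \
      --count 100 \
      --style technical \
      --categories technology education \
      --conversation-turns 2

Validate an existing dataset without generating new data:

    python3 generator.py --validate-only --output my_dataset.jsonl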
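The flags used in these commands could be parsed along the following lines. This is a sketch built with argparse, not the parser in generator.py: the flag names come from the commands above, while the default values for --output, --count, --style, and --conversation-turns are assumptions.

```python
import argparse

STYLES = ["helpful", "corporate", "casual", "technical", "creative", "educational"]
CATEGORIES = ["technology", "business", "education", "lifestyle", "creative", "science", "all"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags shown in the usage commands."""
    parser = argparse.ArgumentParser(description="Generate or validate a fine-tuning dataset.")
    parser.add_argument("--output", default="dataset.jsonl")       # assumed default
    parser.add_argument("--count", type=int, default=10)           # assumed default
    parser.add_argument("--style", choices=STYLES, default="helpful")  # assumed default
    parser.add_argument("--categories", nargs="+", choices=CATEGORIES, default=["all"])
    parser.add_argument("--conversation-turns", type=int, default=1)   # assumed default
    parser.add_argument("--validate-only", action="store_true")
    parser.add_argument("--create-env", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Using choices lets argparse reject an unknown style or category with a usage message before any API call is made.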
Available topic categories:
- technology
- business
- education
- lifestyle
- creative
- science
- all (default)
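One way to interpret the category list, with all expanding to every named category, is sketched below. The function resolve_categories is hypothetical and only illustrates the selection logic; the generator's own handling may differ.

```python
ALL_CATEGORIES = ["technology", "business", "education", "lifestyle", "creative", "science"]


def resolve_categories(requested):
    """Expand 'all' (or an empty selection) and reject unknown category names."""
    if not requested or "all" in requested:
        return list(ALL_CATEGORIES)
    unknown = [name for name in requested if name not in ALL_CATEGORIES]
    if unknown:
        raise ValueError(f"unknown categories: {', '.join(unknown)}")
    # Remove duplicates while keeping the order the user gave.
    return list(dict.fromkeys(requested))


if __name__ == "__main__":
    print(resolve_categories(["technology", "education"]))
    print(resolve_categories(["all"]))
```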
Each line in the output .jsonl file is a single JSON object in the following format (shown here across several lines for readability):

    {
      "messages": [
        {"role": "user", "content": "What is quantum computing?"},
        {"role": "assistant", "content": "Quantum computing uses principles of quantum mechanics to perform computations..."}
      ]
    }
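For reference, records in this shape can be serialized and read back with the json module. The helper names make_record, to_jsonl_line, and parse_jsonl are illustrative only and are not part of the tool.

```python
import json


def make_record(question: str, answer: str) -> dict:
    """Build a single-turn record in the messages format described above."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single line; json.dumps escapes embedded newlines."""
    return json.dumps(record, ensure_ascii=False)


def parse_jsonl(text: str) -> list:
    """Parse JSONL text into a list of records, ignoring blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    record = make_record("What is quantum computing?", "It uses quantum mechanics to compute.")
    line = to_jsonl_line(record)
    print(line)
    print(parse_jsonl(line + "\n") == [record])
```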
The dataset validator covers:
- Proper JSON structure
- Valid roles: user, assistant, system
- Token estimation
- Warnings for long or empty content
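The checks listed above can be sketched for a single JSONL line as follows. This is not the validator shipped with the tool: the 4-characters-per-token heuristic and the MAX_CHARS threshold are assumptions chosen for illustration, and validate_line is a hypothetical name.

```python
import json

VALID_ROLES = {"user", "assistant", "system"}
MAX_CHARS = 8000  # hypothetical threshold for a "long content" warning


def estimate_tokens(text: str) -> int:
    """Rough estimate assuming about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def validate_line(line: str):
    """Return (errors, warnings, estimated_tokens) for one JSONL line."""
    errors, warnings, tokens = [], [], 0
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"], warnings, tokens
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"], warnings, tokens
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i}: not a JSON object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if not isinstance(role, str) or role not in VALID_ROLES:
            errors.append(f"message {i}: invalid role {role!r}")
        if not isinstance(content, str):
            errors.append(f"message {i}: content must be a string")
            continue
        if not content.strip():
            warnings.append(f"message {i}: empty content")
        elif len(content) > MAX_CHARS:
            warnings.append(f"message {i}: content longer than {MAX_CHARS} characters")
        tokens += estimate_tokens(content)
    return errors, warnings, tokens
```

Separating errors from warnings lets a caller fail on structural problems while still reporting long or empty messages for manual review.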
After generating a dataset:
- Review your dataset.
- Use it for fine-tuning a model that supports DeepSeek-style training.
- Monitor performance and adjust generation parameters as needed.
MIT License. See LICENSE for details.