DeepSeek API Dataset Generator

This project is a Python-based tool for generating fine-tuning datasets using the DeepSeek API. It creates single-turn Q&A pairs or multi-turn conversations in various styles, based on customizable topics and categories. It also includes a dataset validator to ensure quality before fine-tuning.

🚀 Features

Generate realistic Q&A examples with customizable styles
Support for single or multi-turn conversations
Style options: helpful, corporate, casual, technical, creative, educational
Validate .jsonl datasets before fine-tuning
Topic categories covering technology, business, education, lifestyle, and more
Rate limiting support and environment configuration via .env

🧰 Requirements

Python 3.7+
DeepSeek API Key
Dependencies:

bash

pip install -r requirements.txt

🔐 Setup

Clone the repository:

bash

git clone https://github.com/luisriverag/deepseek-api_dataset-generator.git
cd deepseek-api_dataset-generator

Create a .env file:

bash

python3 generator.py --create-env

Edit .env and add your DeepSeek API key:

env

DEEPSEEK_API_KEY=your_deepseek_api_key_here
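
To illustrate how a script might pick up this key, here is a minimal sketch using only the standard library. The helper names `load_env_file` and `get_api_key` are hypothetical and are not taken from generator.py, which may load its configuration differently.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("DEEPSEEK_API_KEY") or load_env_file(path).get("DEEPSEEK_API_KEY")
    # Assumption: the template placeholder counts as "not configured".
    if not key or key == "your_deepseek_api_key_here":
        raise RuntimeError("DEEPSEEK_API_KEY is not set; edit your .env file")
    return key
```

Failing early on the placeholder value avoids sending requests that the API would reject anyway.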

🛠️ Usage

Generate 50 examples (default):

python3 generator.py

Customize output:

python3 generator.py \
  --output my_dataset.jsonl \
  --count 100 \
  --style technical \
  --categories technology education \
  --conversation-turns 2

Validate an existing dataset:

python3 generator.py --validate-only --output my_dataset.jsonl

📚 Categories

Available topic categories:

technology
business
education
lifestyle
creative
science
all (default)

📄 Output Format

Each line in the output .jsonl file follows the format:

json

{ "messages": [ {"role": "user", "content": "What is quantum computing?"}, {"role": "assistant", "content": "Quantum computing uses principles of quantum mechanics to perform computations..."} ]}
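
As a rough illustration of consuming this format, the sketch below reads a .jsonl file one record at a time. The function names `read_jsonl` and `count_user_turns` are hypothetical. It assumes multi-turn conversations simply extend the same `messages` list with more user/assistant pairs, which the format above suggests but does not state.

```python
import json


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_user_turns(record):
    """Count the user messages in one record (assumed to equal its turn count)."""
    return sum(1 for message in record["messages"] if message["role"] == "user")
```

For example, `[count_user_turns(r) for r in read_jsonl("my_dataset.jsonl")]` would give the number of turns in each example.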

✅ Validation

The validator covers:

Proper JSON structure
Valid roles: user, assistant, system
Token estimation
Warnings for long/empty content
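
The checks listed above could be sketched roughly as follows. This is not the validator shipped in generator.py: the function names, the four-characters-per-token heuristic, and the `MAX_CHARS` threshold are all assumptions made for illustration.

```python
import json

VALID_ROLES = {"user", "assistant", "system"}
MAX_CHARS = 8000  # assumed threshold for a "long content" warning


def estimate_tokens(text):
    """Rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4) if text else 0


def validate_line(line):
    """Return (errors, warnings, token_estimate) for one line of a .jsonl file."""
    errors, warnings, tokens = [], [], 0
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"], warnings, tokens
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"], warnings, tokens
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            errors.append(f"message {index}: not an object")
            continue
        role = message.get("role")
        content = message.get("content")
        if not isinstance(role, str) or role not in VALID_ROLES:
            errors.append(f"message {index}: invalid role {role!r}")
        if not isinstance(content, str):
            errors.append(f"message {index}: content is not a string")
            continue
        if not content.strip():
            warnings.append(f"message {index}: empty content")
        elif len(content) > MAX_CHARS:
            warnings.append(f"message {index}: long content ({len(content)} chars)")
        tokens += estimate_tokens(content)
    return errors, warnings, tokens
```

Separating errors from warnings lets a caller reject malformed lines while only flagging long or empty messages for review.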

🤖 Next Steps

Review your dataset.
Use it to fine-tune a model whose training pipeline accepts this chat-style messages format.
Monitor performance and adjust generation parameters as needed.

📄 License

MIT License. See LICENSE for details.
