🔑 Sign In Required

Please sign in with your Hugging Face account to access the synthetic data generation service. Click the Sign in button above to continue.

This tool allows you to generate synthetic data from existing datasets, for all your fine-tuning/research/data augmentation needs!

DataForge is built on top of DataTrove, and our backend data generation script is open source and available on GitHub. DataForge is FREE for Hugging Face PRO users (up to 10,000 samples); free users can generate up to 100 samples.

All generated datasets will be publicly available under the synthetic-data-universe organization.

Step-by-Step Process:

  1. Choose Model: Select from 20+ models
  2. Load Dataset: Enter a HF dataset name
  3. Load Info: Click "Load Dataset Info"
  4. Configure: Set generation parameters
  5. Submit: Submit the job, then monitor progress in the Statistics tab

Requirements:

  • Input dataset must be public on HF Hub
  • Model must be publicly accessible
  • Free users: 100 samples max, PRO: 10K max
  • Token limit: 8,192 per sample
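
The limits above can be checked before a job is submitted. The following is a minimal sketch, not the service's actual validation code: the `GenerationRequest` shape, the tier names, and `validate_request` are hypothetical, and it assumes the 8,192 figure is a per-sample token budget.

```python
from dataclasses import dataclass

# Limits quoted in the requirements above.
MAX_SAMPLES = {"free": 100, "pro": 10_000}
TOKEN_LIMIT = 8_192


@dataclass
class GenerationRequest:
    """Hypothetical request shape; the service's real schema is not shown here."""
    tier: str         # "free" or "pro" (assumed tier names)
    num_samples: int  # number of samples to generate
    max_tokens: int   # token budget per sample


def validate_request(req: GenerationRequest) -> list[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    limit = MAX_SAMPLES.get(req.tier)
    if limit is None:
        problems.append(f"unknown tier: {req.tier!r}")
    elif not 1 <= req.num_samples <= limit:
        problems.append(
            f"num_samples must be between 1 and {limit} for {req.tier} users"
        )
    if not 1 <= req.max_tokens <= TOKEN_LIMIT:
        problems.append(f"max_tokens must be between 1 and {TOKEN_LIMIT}")
    return problems
```

For example, `validate_request(GenerationRequest("free", 101, 8192))` returns one problem because a free user asked for more than 100 samples.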

Popular Use Cases:

Conversational: Multi-turn dialogues

  • Models: Llama-3.2-3B, Mistral-7B
  • Temperature: 0.7-0.9

Code: Problem → Solution

  • Models: Qwen2.5-Coder, DeepSeek-Coder
  • Temperature: 0.1-0.3
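
The temperature ranges suggested for these use cases can be kept as presets. The following sketch is only an illustration: the dictionary keys and the `clamp_temperature` helper are hypothetical names, not part of DataForge.

```python
# Recommended temperature ranges from the use cases above.
TEMPERATURE_RANGES = {
    "conversational": (0.7, 0.9),
    "code": (0.1, 0.3),
}


def clamp_temperature(use_case: str, requested: float) -> float:
    """Pull a requested temperature into the recommended range for a use case."""
    low, high = TEMPERATURE_RANGES[use_case]
    return max(low, min(high, requested))
```

With these presets, a request for temperature 0.8 on a code task is pulled down to 0.3, while 0.8 on a conversational task is left unchanged.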

Example datasets to try:

simplescaling/s1K-1.1
HuggingFaceH4/ultrachat_200k
iamtarun/python_code_instructions_18k_alpaca
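
Each example above follows the `namespace/name` pattern. A small check like the following can catch typos before loading dataset info. It is a sketch under that assumption: `split_dataset_id` is a hypothetical helper, and it deliberately rejects names without a namespace, even though the Hub may accept some.

```python
def split_dataset_id(dataset_id: str) -> tuple[str, str]:
    """Split 'namespace/name' into its two parts, or raise ValueError."""
    parts = dataset_id.strip().split("/")
    # Require exactly two non-empty parts separated by a single slash.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'namespace/name', got {dataset_id!r}")
    return parts[0], parts[1]


EXAMPLE_DATASETS = [
    "simplescaling/s1K-1.1",
    "HuggingFaceH4/ultrachat_200k",
    "iamtarun/python_code_instructions_18k_alpaca",
]

for dataset_id in EXAMPLE_DATASETS:
    namespace, name = split_dataset_id(dataset_id)
    print(f"{namespace} -> {name}")
```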