Easy Dataset: Open-Source Framework for Synthesizing LLM Fine-Tuning Data

Easy Dataset is an open-source GUI-based framework for creating high-quality fine-tuning datasets from unstructured documents like PDFs, Markdown, and DOCX.

Fine-tuning large language models has become essential for organizations that need domain-specific AI performance, but the process has always been bottlenecked by one critical resource: high-quality training data. Creating instruction-tuning datasets manually is expensive, slow, and requires domain expertise that is often in short supply. Easy Dataset, an open-source framework by ConardLi, directly addresses this bottleneck by providing a GUI-based system for synthesizing fine-tuning datasets from unstructured documents.

The core idea is elegantly simple: take your existing documents – PDFs, Markdown files, DOCX documents – and use an LLM to generate diverse question-answer pairs from the content. Easy Dataset handles the entire pipeline, from document parsing and chunking through LLM-driven data synthesis, quality filtering, and export to standard fine-tuning formats.

What sets Easy Dataset apart from ad-hoc data generation scripts is its structured approach. The framework supports persona-driven prompt diversity, configurable difficulty levels, quality filtering through self-consistency checks, and a clean web UI that makes the entire process accessible to non-programmers.


How Does Easy Dataset Work?

The data synthesis pipeline proceeds through several stages, each configurable through the web interface.

```mermaid
graph TD
    A[Upload Documents\nPDF, MD, DOCX, TXT] --> B[Document Parser]
    B --> C[Chunking & Context\nPreservation]
    C --> D[Persona Selection\nConfigurable Personas]
    D --> E[LLM Data Synthesis\nQ&A Generation]
    E --> F[Quality Filtering\nSelf-Consistency & Heuristics]
    F --> G[Export\nJSONL, CSV, Parquet]
    G --> H[Fine-Tune\nYour LLM]
```

| Pipeline Stage | Purpose | Configuration Options |
| --- | --- | --- |
| Document Parsing | Extract text from source files | OCR toggle, language detection, table extraction |
| Chunking | Split documents into manageable sections | Chunk size, overlap, strategy (paragraph/section/semantic) |
| Persona Selection | Define AI personas for diverse outputs | Built-in personas or custom persona definitions |
| Data Synthesis | Generate Q&A pairs from chunks | Sample questions, output format, number of pairs |
| Quality Filtering | Remove low-quality or duplicate entries | Deduplication, heuristic rules, LLM-as-judge |
| Format Export | Output to fine-tuning formats | JSONL, CSV, Parquet, Hugging Face Hub |

What Document Formats Does Easy Dataset Support?

Easy Dataset supports a broad range of input formats, making it easy to work with existing knowledge bases.

| Format | File Extension | Parser Notes |
| --- | --- | --- |
| PDF | .pdf | Multi-column support, table extraction, OCR |
| Markdown | .md | Preserves headings, lists, code blocks |
| Word | .docx | Preserves formatting and embedded images |
| Plain Text | .txt | Simple text extraction |
| CSV/JSON | .csv, .json, .jsonl | Structured data support |
| HTML | .html, .htm | Web page content extraction |
| EPUB | .epub | E-book format support |
| LaTeX | .tex | Academic paper support |
| PowerPoint | .pptx | Slide content extraction |
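Selecting a parser for each input typically comes down to dispatching on the file extension. The sketch below illustrates that idea; the `PARSERS` mapping and `pick_parser` function are hypothetical names for illustration, not Easy Dataset's actual internals.

```python
# Illustrative extension-to-parser dispatch (names are hypothetical,
# not Easy Dataset's real API).
from pathlib import Path

PARSERS = {
    ".pdf": "pdf", ".md": "markdown", ".docx": "word", ".txt": "plain",
    ".csv": "structured", ".json": "structured", ".jsonl": "structured",
    ".html": "html", ".htm": "html", ".epub": "epub",
    ".tex": "latex", ".pptx": "powerpoint",
}

def pick_parser(path: str) -> str:
    """Map a file path to a parser family from the table above,
    falling back to plain-text extraction for unknown extensions."""
    return PARSERS.get(Path(path).suffix.lower(), "plain")

print(pick_parser("handbook.PDF"))   # pdf
print(pick_parser("notes.unknown"))  # plain
```

Lower-casing the suffix keeps the dispatch robust to files named `REPORT.PDF` or `Slides.PPTX`.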

The chunking engine pays careful attention to context preservation. When a chunk crosses a semantic boundary (like a section heading), it includes the heading context to maintain coherence in the generated Q&A pairs.
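The heading-carrying behavior described above can be sketched in a few lines. This is a minimal illustration of the idea, assuming Markdown-style `#` headings; the function name and parameters are not Easy Dataset's API.

```python
# Minimal sketch of heading-aware chunking: each emitted chunk is
# prefixed with the most recent section heading so the LLM sees its
# context. (Illustrative only, not Easy Dataset's implementation.)

def chunk_with_headings(lines, max_chars=120):
    chunks, buf, heading = [], [], ""
    for line in lines:
        if line.startswith("#"):
            heading = line.strip()   # remember the active section heading
            continue                 # heading is carried as context, not body
        buf.append(line)
        if sum(len(l) for l in buf) >= max_chars:
            chunks.append(heading + "\n" + "\n".join(buf))
            buf = []
    if buf:                          # flush the final partial chunk
        chunks.append(heading + "\n" + "\n".join(buf))
    return chunks

doc = [
    "# Retry Mechanism",
    "The client retries failed requests with exponential backoff.",
    "Jitter is added to avoid thundering-herd effects.",
]
chunks = chunk_with_headings(doc, max_chars=50)
print(len(chunks))  # 2 chunks, both carrying the heading
```

Even when the second sentence lands in a separate chunk, it still begins with `# Retry Mechanism`, so the generated Q&A pairs stay anchored to the right topic.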


How Do Persona-Driven Prompts Work?

The persona system is one of Easy Dataset’s most powerful features. Instead of generating all questions from the same perspective, you define multiple personas that each generate questions from their unique viewpoint.

| Persona | Perspective | Example Question Generated |
| --- | --- | --- |
| Beginner | Simplified, conceptual | “What is the main purpose of this system?” |
| Practitioner | Applied, practical | “How do I configure the retry mechanism?” |
| Expert | Advanced, analytical | “What are the trade-offs between these two architectures?” |
| Reviewer | Critical, comparative | “What potential edge cases are not addressed?” |

This diversity is critical for producing robust fine-tuning datasets. A model trained on single-perspective data tends to overfit to that style, while multi-persona data produces models that generalize better across different use cases.
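In practice, persona-driven generation amounts to prepending a persona description to the synthesis prompt for each chunk. The sketch below shows the pattern; the persona texts and the `build_prompt` template are illustrative assumptions, not Easy Dataset's built-in definitions.

```python
# Illustrative persona-driven prompt construction (persona wording and
# template are hypothetical, not Easy Dataset's shipped prompts).

PERSONAS = {
    "beginner": "You ask simple, conceptual questions a newcomer would ask.",
    "practitioner": "You ask applied, how-do-I questions about practical usage.",
    "expert": "You ask advanced questions about trade-offs and internals.",
}

def build_prompt(persona: str, chunk: str, n_questions: int = 3) -> str:
    """Assemble the synthesis prompt sent to the LLM for one chunk."""
    return (
        f"{PERSONAS[persona]}\n\n"
        f"Based only on the text below, write {n_questions} question-answer "
        f"pairs in JSON.\n\n---\n{chunk}"
    )

prompt = build_prompt("expert", "Easy Dataset chunks documents before synthesis.")
print(prompt)
```

Iterating the same chunk over all personas is what yields the multi-perspective coverage described above.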


What Export Formats Does Easy Dataset Support?

Once the dataset is synthesized and quality-filtered, Easy Dataset supports multiple export options.

| Export Format | Common Use Case | Structure |
| --- | --- | --- |
| JSONL (ShareGPT) | Chat model fine-tuning | conversations with roles and turns |
| JSONL (Alpaca) | Instruction tuning | instruction, input, output |
| JSONL (OpenAI) | OpenAI fine-tuning API | messages array format |
| CSV | Simple processing | question, answer, context columns |
| Parquet | Large-scale training | Columnar, compressed format |
| Hugging Face Hub | Direct publishing | Auto-upload to dataset repository |
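To make the two most common JSONL shapes concrete, the snippet below builds one Q&A pair in both the community ShareGPT and Alpaca conventions; Easy Dataset's actual exports may carry additional metadata fields.

```python
# Record shapes for ShareGPT-style and Alpaca-style JSONL exports
# (field names follow community conventions; extra metadata may vary).
import json

qa = {"question": "How do I configure the retry mechanism?",
      "answer": "Set the retry options in the client configuration."}

sharegpt = {"conversations": [
    {"from": "human", "value": qa["question"]},
    {"from": "gpt", "value": qa["answer"]},
]}

alpaca = {"instruction": qa["question"], "input": "", "output": qa["answer"]}

for record in (sharegpt, alpaca):
    print(json.dumps(record))  # one JSON object per line = JSONL
```

ShareGPT's turn list extends naturally to multi-turn conversations, while Alpaca's flat triple is simpler for single-turn instruction tuning.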

What Is the Quality Filtering Process?

Easy Dataset includes built-in quality assurance that runs after data synthesis. The filtering system uses both automated heuristics and LLM-based evaluation.

| Filter Type | Method | Catches |
| --- | --- | --- |
| Deduplication | Semantic similarity detection | Near-duplicate Q&A pairs |
| Length filter | Minimum and maximum length thresholds | Too-short or too-long responses |
| Self-consistency | LLM generates answer twice, compares | Hallucinated or inconsistent content |
| Relevance check | Cosine similarity between question and document chunk | Off-topic generations |
| Heuristic rules | Configurable pattern matching | Toxic content, PII, formatting issues |

The default pipeline typically filters out 5-15% of generated pairs, depending on the source document quality and the LLM used for synthesis.
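Two of these filters, the length threshold and deduplication, can be sketched without any model calls. Real semantic deduplication compares embeddings; this minimal illustration approximates it with normalized-text matching, and the function and threshold names are assumptions, not Easy Dataset's configuration.

```python
# Minimal sketch of a length filter plus near-duplicate removal.
# Real semantic dedup uses embedding similarity; here we approximate
# it with whitespace-normalized question text. (Illustrative only.)

def filter_pairs(pairs, min_len=20, max_len=2000):
    seen, kept = set(), []
    for q, a in pairs:
        if not (min_len <= len(a) <= max_len):
            continue                       # length filter
        key = " ".join(q.lower().split())  # crude near-duplicate key
        if key in seen:
            continue                       # deduplication
        seen.add(key)
        kept.append((q, a))
    return kept

pairs = [
    ("What is chunking?", "Chunking splits documents into sections."),
    ("What  is chunking?", "Chunking splits documents into sections."),  # dup
    ("What is OCR?", "Short."),                                          # too short
]
print(len(filter_pairs(pairs)))  # 1 pair survives
```

Swapping the normalized-text key for a cosine-similarity check over embeddings would also catch paraphrased duplicates, at the cost of an extra model pass.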


FAQ

What is Easy Dataset? Easy Dataset is an open-source GUI-based framework by ConardLi for creating high-quality fine-tuning datasets from unstructured documents. It processes PDFs, Markdown, DOCX, and other formats, using LLM-driven data synthesis with persona-driven prompts to generate diverse training examples. It supports multiple export formats and is designed for both instruction tuning and preference alignment.

What document formats does Easy Dataset support? Easy Dataset supports PDF, Markdown (.md), DOCX (.docx), TXT, CSV, JSON, JSONL, HTML, EPUB, LaTeX (.tex), and PowerPoint (.pptx). Documents are parsed into structured chunks that preserve context, formatting, and hierarchical relationships. The framework handles multi-column PDFs, tables, and embedded images through OCR integration.

How do persona-driven prompts work in Easy Dataset? Persona-driven prompts use configurable AI personas to generate diverse question-answer pairs from the same source material. For example, a ‘beginner’ persona may generate simple definition questions while an ‘expert’ persona generates complex analytical questions. This approach produces datasets with natural variability that significantly improves downstream model generalization.

What export formats does Easy Dataset support? Easy Dataset exports to the most common fine-tuning formats including JSONL (ShareGPT-style, Alpaca-style, OpenAI-style), CSV, Parquet, and Hugging Face Datasets format. It also supports direct export to Hugging Face Hub. Custom output templates can be defined through the plugin system.

What research paper is Easy Dataset based on? Easy Dataset is grounded in the paper ‘Large Language Models are Effective Dataset Generators’ which demonstrates that LLM-synthesized training data can match or exceed human-curated data for fine-tuning. The framework implements the paper’s key findings, including persona-driven diversity, difficulty calibration, and quality filtering through self-consistency checks and heuristic validation.

