Trafilatura: Open-Source Web Text Extraction for LLM Datasets and Research

Trafilatura is a Python tool for web text extraction and crawling with the highest F-Score among open-source extractors, used by HuggingFace, IBM, and Microsoft.


Extracting clean, structured text from web pages is a foundational task for LLM training datasets, research corpora, and content analysis pipelines. Trafilatura has become a leading choice for this task: a Python library that, in the benchmarks discussed below, achieves the highest F-Score among open-source text extraction tools while remaining lightweight, fast, and easy to integrate.

Developed by Adrien Barbaresi at the Berlin-Brandenburg Academy of Sciences and Humanities, Trafilatura goes beyond simple HTML-to-text conversion. It identifies the main content area of a webpage, strips away navigation, headers, footers, ads, and sidebars, and returns only the meaningful textual content. Its crawling capabilities allow it to recursively follow links within a domain, building comprehensive text corpora from entire websites.

The tool’s accuracy and reliability have earned it adoption by major organizations including HuggingFace, IBM, and Microsoft, as well as widespread use in academic NLP research. Its strong results on CleanEval and other evaluation benchmarks make it a common default for researchers who need dependable text extraction at scale.


How Does Trafilatura’s Accuracy Compare to Other Extractors?

Trafilatura’s performance advantage is documented in academic benchmarks that measure precision and recall across diverse web content types.

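| Tool | F-Score | Precision | Recall | Language Support |
|---|---|---|---|---|
| Trafilatura | 0.94 | 0.95 | 0.93 | 45+ languages |
| Newspaper3k | 0.82 | 0.84 | 0.80 | 20+ languages |
| readability | 0.79 | 0.81 | 0.77 | Mainly English |
| boilerpy3 | 0.76 | 0.78 | 0.74 | 10+ languages |
| jusText | 0.71 | 0.74 | 0.68 | 15+ languages |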

The gap is particularly pronounced on complex page layouts with heavy navigation, embedded media, and dynamic content – the kinds of pages that dominate the modern web. Trafilatura’s heuristic-based approach, combined with its ability to handle multiple content formats within a single page, gives it a consistent edge.
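To make the relationship between the three score columns concrete, here is a minimal sketch that recomputes the F-Score from the precision and recall values in the table above. It assumes the reported F-Score is the balanced F1 measure (the harmonic mean of precision and recall), which the article does not state explicitly; the function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Return the balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Score) copied from the comparison table.
reported = {
    "Trafilatura": (0.95, 0.93, 0.94),
    "Newspaper3k": (0.84, 0.80, 0.82),
    "readability": (0.81, 0.77, 0.79),
    "boilerpy3": (0.78, 0.74, 0.76),
    "jusText": (0.74, 0.68, 0.71),
}

for tool, (precision, recall, reported_f) in reported.items():
    computed = f_score(precision, recall)
    print(f"{tool:12s} computed={computed:.2f} reported={reported_f:.2f}")
```

Under that assumption, the recomputed values round to the reported F-Score for every row, so the table is at least internally consistent.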


What Output Formats Does Trafilatura Support?

Trafilatura’s versatility in output formats makes it suitable for a wide range of downstream applications.

graph LR
    A[HTML Input] --> B[Trafilatura]
    B --> C[Plain Text]
    B --> D[Markdown]
    B --> E[JSON]
    B --> F[XML]
    B --> G[CSV]
    C --> H[LLM Training Data]
    C --> I[Full-Text Search]
    D --> J[RAG Chunks]
    D --> K[Documentation]
    E --> L[Structured Analysis]
    F --> M[TEI Encoding]
    G --> N[Spreadsheet Import]

| Format | Best For | Example Use Case |
|---|---|---|
| Plain Text | LLM training corpora | Fine-tuning datasets |
| Markdown | RAG pipeline documents | Structured knowledge base |
| JSON | Programmatic analysis | Content metadata extraction |
| XML/TEI | Academic archiving | Digital humanities research |
| CSV | Bulk processing | Batch URL extraction |

Each format preserves the extracted text along with configurable metadata such as URL, title, author, publication date, and extraction timestamp.
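As a rough illustration of switching formats, the sketch below runs the extractor once per format on a small in-memory page. It assumes a recent Trafilatura release in which `extract()` accepts the `output_format` values listed in `FORMATS` and the `with_metadata` flag; older releases may not recognize `"markdown"`. The sample HTML and the helper name `extract_in_formats` are made up for this example, and on a page this short the extractor may return `None`.

```python
from typing import Dict, Optional

# Format names passed to extract(output_format=...); assumed to match a recent release.
FORMATS = ("txt", "markdown", "json", "xml", "csv")

# Made-up page used only for illustration.
SAMPLE_HTML = """<html><head><title>Sample page</title></head><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<article>
<h1>Sample page</h1>
<p>This made-up paragraph stands in for the main content of a real article.
It only needs to be long enough to look like body text to an extractor.</p>
<p>A second paragraph adds a little more text so the page is not trivially short.</p>
</article>
<footer>Footer links and copyright notice</footer>
</body></html>"""


def extract_in_formats(html: str) -> Dict[str, Optional[str]]:
    """Run the extractor once per output format and collect the results."""
    # Imported here so a missing installation only fails when the function is called.
    from trafilatura import extract

    return {
        fmt: extract(html, output_format=fmt, with_metadata=True)
        for fmt in FORMATS
    }


if __name__ == "__main__":
    try:
        results = extract_in_formats(SAMPLE_HTML)
    except ImportError:
        print("trafilatura is not installed; try: pip install trafilatura")
    else:
        for fmt, result in results.items():
            size = "nothing extracted" if result is None else f"{len(result)} characters"
            print(f"{fmt:9s} {size}")
```

In a real pipeline the HTML would come from a download or a local archive rather than a hard-coded string, and each result would be written to a file with a matching extension.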


How Do You Get Started with Trafilatura?

Installation and basic usage are remarkably simple, requiring only a single pip command and a few lines of Python.

| Task | Command / Code | Notes |
|---|---|---|
| Install | pip install trafilatura | Python 3.8+ required |
| Extract from URL | trafilatura -u "https://example.com" | CLI one-liner |
| Python basic | from trafilatura import fetch_url, extract | Core import |
| Python usage | content = extract(fetch_url(url)) | Returns plain text by default, or None if nothing is extracted |
| Batch processing | trafilatura -i urls.txt -o ./output | Reads URLs from a file and writes results to a directory |
| Crawl domain | trafilatura --sitemap "https://example.com/sitemap.xml" | Discovers pages through the sitemap rather than by following links |
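The two Python rows in the table can be combined into a small script. The sketch below is one possible way to do it, assuming that `fetch_url()` returns `None` when a download fails and that `extract()` returns `None` when no main content is found; the function name `fetch_and_extract` and the script name in the usage message are hypothetical.

```python
import sys
from typing import Optional


def fetch_and_extract(url: str) -> Optional[str]:
    """Download a page and return its main text, or None if either step fails."""
    # Imported here so a missing installation only fails when the function is called.
    from trafilatura import extract, fetch_url

    downloaded = fetch_url(url)  # assumed to be None when the download fails
    if downloaded is None:
        return None
    # Passing the URL lets the extractor record it alongside the text.
    return extract(downloaded, url=url)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python extract_page.py URL")
    else:
        text = fetch_and_extract(sys.argv[1])
        print(text if text is not None else "Nothing could be extracted.")
```

Checking for `None` at both steps avoids passing a failed download straight into the extractor, which the one-line form in the table does not guard against.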

The library also provides fine-grained control via options for output format selection, language detection, content exclusion rules, and extraction strategy configuration.
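A few of these options can be sketched as keyword arguments to `extract()`. The example assumes a recent release that accepts `output_format`, `include_comments`, `include_tables`, `favor_precision`, and `favor_recall`; the helper names `build_options` and `extract_with_options` and the chosen values are illustrative, not recommended settings.

```python
from typing import Any, Dict, Optional


def build_options(strict: bool = True) -> Dict[str, Any]:
    """Build keyword arguments for trafilatura.extract(); values are illustrative."""
    return {
        "output_format": "markdown",  # one of the supported output formats
        "include_comments": False,    # leave out user comment sections
        "include_tables": False,      # leave out text found in tables
        "favor_precision": strict,    # stricter filtering, less noise
        "favor_recall": not strict,   # looser filtering, more text
    }


def extract_with_options(html: str, strict: bool = True) -> Optional[str]:
    """Extract the main content of an HTML string using the options above."""
    # Imported here so a missing installation only fails when the function is called.
    from trafilatura import extract

    return extract(html, **build_options(strict))
```

Keeping the options in one dictionary makes it easier to apply the same settings across a whole batch of pages. Language filtering is left out here because, as far as this sketch assumes, it depends on an optional language-detection package being installed.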


FAQ

What is Trafilatura?

Trafilatura is a Python-based tool for web text extraction and crawling that identifies and extracts the main textual content from HTML pages while removing boilerplate, navigation, ads, and other non-essential elements. It achieves the highest F-Score among open-source text extraction tools in academic benchmarks.

What output formats does Trafilatura support?

Trafilatura supports multiple output formats including plain text, Markdown, JSON, XML, and CSV. This versatility makes it suitable for a wide range of downstream tasks from LLM dataset preparation to research corpus building and content analysis.

How accurate is Trafilatura compared to other extractors?

Trafilatura consistently achieves the highest F-Score in academic benchmarks comparing open-source text extraction tools. It outperforms alternatives like Newspaper3k, readability, boilerpy3, and jusText in precision and recall across diverse web content types.

How do you install Trafilatura?

Trafilatura can be installed via pip with a single command: pip install trafilatura. It requires Python 3.8 or higher and has minimal dependencies, making it easy to integrate into any Python environment.

Which organizations use Trafilatura?

Trafilatura is widely adopted across industry and research. Notable users include HuggingFace (for dataset creation), IBM (for content analysis), Microsoft (for research pipelines), and numerous academic institutions for web corpus building and NLP research.

