
Trafilatura: Open-Source Web Text Extraction for LLM Datasets and Research



Editorial Team May 04, 2026 4 min read

Extracting clean, structured text from web pages is a foundational task for LLM training datasets, research corpora, and content analysis pipelines. Trafilatura has become a widely used choice for this task: a Python library that ranks highest by F-Score among open-source text extraction tools in the benchmarks summarized below, while remaining lightweight, fast, and easy to integrate.

Developed by Adrien Barbaresi at the Berlin-Brandenburg Academy of Sciences and Humanities, Trafilatura goes beyond simple HTML-to-text conversion. It identifies the main content area of a webpage, strips away navigation, headers, footers, ads, and sidebars, and returns only the meaningful textual content. Its crawling capabilities allow it to recursively follow links within a domain, building comprehensive text corpora from entire websites.

The tool’s accuracy and reliability have earned it adoption by major organizations including HuggingFace, IBM, and Microsoft, as well as widespread use in academic NLP research. Its strong results on CleanEval and other evaluation benchmarks make it a common choice for researchers who need trustworthy text extraction at scale.

How Does Trafilatura’s Accuracy Compare to Other Extractors?

Trafilatura’s performance advantage is documented in academic benchmarks that measure precision and recall across diverse web content types.

Tool	F-Score	Precision	Recall	Language Support
Trafilatura	0.94	0.95	0.93	45+ languages
Newspaper3k	0.82	0.84	0.80	20+ languages
readability	0.79	0.81	0.77	Mainly English
boilerpy3	0.76	0.78	0.74	10+ languages
jusText	0.71	0.74	0.68	15+ languages

The gap is particularly pronounced on complex page layouts with heavy navigation, embedded media, and dynamic content – the kinds of pages that dominate the modern web. Trafilatura’s heuristic-based approach, combined with its ability to handle multiple content formats within a single page, gives it a consistent edge.
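The table does not say which F-Score variant it reports. As a quick sanity check, the sketch below assumes it is F1, the harmonic mean of precision and recall, and recomputes it from the precision and recall columns. The function name `f_score` is only illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the comparison table
reported = {
    "Trafilatura": (0.95, 0.93),
    "Newspaper3k": (0.84, 0.80),
    "readability": (0.81, 0.77),
    "boilerpy3": (0.78, 0.74),
    "jusText": (0.74, 0.68),
}

for tool, (precision, recall) in reported.items():
    print(f"{tool}: {f_score(precision, recall):.2f}")
```

Under that assumption, the recomputed values round to the F-Score column for every tool in the table.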

What Output Formats Does Trafilatura Support?

Trafilatura’s versatility in output formats makes it suitable for a wide range of downstream applications.

graph LR
    A[HTML Input] --> B[Trafilatura]
    B --> C[Plain Text]
    B --> D[Markdown]
    B --> E[JSON]
    B --> F[XML]
    B --> G[CSV]
    C --> H[LLM Training Data]
    C --> I[Full-Text Search]
    D --> J[RAG Chunks]
    D --> K[Documentation]
    E --> L[Structured Analysis]
    F --> M[TEI Encoding]
    G --> N[Spreadsheet Import]

Format	Best For	Example Use Case
Plain Text	LLM training corpora	Fine-tuning datasets
Markdown	RAG pipeline documents	Structured knowledge base
JSON	Programmatic analysis	Content metadata extraction
XML/TEI	Academic archiving	Digital humanities research
CSV	Bulk processing	Batch URL extraction

Each format preserves the extracted text along with configurable metadata such as URL, title, author, publication date, and extraction timestamp.
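In Python, the format is chosen through the `output_format` argument of `extract`, and `with_metadata` asks for metadata to be included. The sketch below is a minimal wrapper, not part of the library: the helper `extract_as` and the `FORMAT_NAMES` mapping are hypothetical, and the format identifiers are assumed to match a recent Trafilatura release, so check them against your installed version.

```python
from typing import Optional

# Assumed identifiers for `output_format` in a recent release.
FORMAT_NAMES = {
    "plain text": "txt",
    "markdown": "markdown",
    "json": "json",
    "xml": "xml",
    "xml/tei": "xmltei",
    "csv": "csv",
}


def extract_as(html: str, fmt: str, with_metadata: bool = True) -> Optional[str]:
    """Extract the main content of `html` in the requested format."""
    key = fmt.lower()
    if key not in FORMAT_NAMES:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Imported lazily so the format check works without trafilatura installed.
    from trafilatura import extract

    # Returns a string in the chosen format, or None if nothing is extracted.
    return extract(html, output_format=FORMAT_NAMES[key], with_metadata=with_metadata)
```

For example, `extract_as(html, "JSON")` would return a JSON string for a page that has already been downloaded.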

How Do You Get Started with Trafilatura?

Installation and basic usage require only a single pip command and a few lines of Python.

Task	Command / Code	Notes
Install	`pip install trafilatura`	Python 3.8+ required
Extract from URL	`trafilatura -u "https://example.com"`	CLI one-liner
Python basic	`from trafilatura import fetch_url, extract`	Core import
Python usage	`content = extract(fetch_url(url))`	Returns plain text by default, `None` if nothing is extracted
Batch processing	`trafilatura -i urls.txt -o ./output`	Input file with one URL per line
Process a sitemap	`trafilatura --sitemap "https://example.com/sitemap.xml"`	Gathers URLs from the sitemap; use `--crawl` to follow links instead
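The two Python rows combine into a short function. This is a minimal sketch: the wrapper name `extract_main_text` is hypothetical, and the only library calls are `fetch_url` and `extract` from the table. Both can return `None`, so the download result is checked before extraction.

```python
from typing import Optional


def extract_main_text(url: str) -> Optional[str]:
    """Download a page and return its main text, or None if a step fails."""
    # Imported lazily so this module loads even without trafilatura installed.
    from trafilatura import extract, fetch_url

    downloaded = fetch_url(url)  # None when the download fails
    if downloaded is None:
        return None
    return extract(downloaded)  # plain text by default; None if nothing is found
```

Calling `extract_main_text("https://example.com")` returns the extracted text, or `None` when the page cannot be fetched or has no usable main content.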

The library also provides fine-grained control via options for output format selection, language detection, content exclusion rules, and extraction strategy configuration.
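As a rough illustration of these options, the sketch below passes a few of `extract`'s keyword arguments for a precision-oriented run. The helper `extract_strict` and the `STRICT_OPTIONS` dictionary are hypothetical names. The `target_language` filter is assumed to depend on an optional language-identification package being installed, so treat that part as something to verify locally.

```python
from typing import Optional

# Conservative settings: drop comment sections, keep tables, prefer precision.
STRICT_OPTIONS = {
    "include_comments": False,
    "include_tables": True,
    "favor_precision": True,
}


def extract_strict(html: str, language: Optional[str] = None) -> Optional[str]:
    """Extract main text with conservative settings and an optional language filter."""
    # Imported lazily so this module loads even without trafilatura installed.
    from trafilatura import extract

    options = dict(STRICT_OPTIONS)
    if language is not None:
        # Two-letter language code, e.g. "en"; assumed to need the optional
        # language-identification dependency to take effect.
        options["target_language"] = language
    return extract(html, **options)
```

Swapping `favor_precision` for `favor_recall` shifts the trade-off the other way, keeping more text at the risk of more boilerplate.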

FAQ

What is Trafilatura? Trafilatura is a Python-based tool for web text extraction and crawling that identifies and extracts the main textual content from HTML pages while removing boilerplate, navigation, ads, and other non-essential elements. It achieves the highest F-Score among open-source text extraction tools in academic benchmarks.

What output formats does Trafilatura support? Trafilatura supports multiple output formats including plain text, Markdown, JSON, XML, and CSV. This versatility makes it suitable for a wide range of downstream tasks from LLM dataset preparation to research corpus building and content analysis.

How accurate is Trafilatura compared to other extractors? Trafilatura consistently achieves the highest F-Score in academic benchmarks comparing open-source text extraction tools. It outperforms alternatives like Newspaper3k, readability, boilerpy3, and jusText in precision and recall across diverse web content types.

How do you install Trafilatura? Trafilatura can be installed via pip with a single command: `pip install trafilatura`. It requires Python 3.8 or higher and has minimal dependencies, making it easy to integrate into any Python environment.

Which organizations use Trafilatura? Trafilatura is widely adopted across industry and research. Notable users include HuggingFace (for dataset creation), IBM (for content analysis), Microsoft (for research pipelines), and numerous academic institutions for web corpus building and NLP research.
