Name: Trafilatura — Web Content Extraction & Crawling
Author: ai-supply

Trafilatura — Web Content Extraction & Crawling

Trafilatura is a Python package and command-line tool for web content extraction: it crawls pages, strips boilerplate (navigation, ads, footers), and extracts the clean main text along with metadata. It is used for building training corpora, content monitoring, SEO analysis, and news aggregation pipelines.

Key features

Removes boilerplate with high precision, without the hand-written selectors a BeautifulSoup-based scraper needs
Output as plain text, Markdown, CSV, JSON, HTML, XML, or XML-TEI
Metadata extraction: title, author, date, language, tags
Sitemap and feed-aware crawler built in
CLI for batch crawling, Python API for integration

Quick start

pip install trafilatura
# Extract text from a URL
trafilatura -u https://example.com/article --output-format markdown

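import trafilatura

# fetch_url returns None when the download fails
downloaded = trafilatura.fetch_url("https://example.com/article")
if downloaded is not None:
    # with_metadata=True adds title, author, date, etc. to the output
    text = trafilatura.extract(downloaded, output_format="markdown",
                               with_metadata=True)
    print(text)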

npx ai-supply add trafilatura-web-content-extraction

Curated mirror of the open-source Trafilatura (Apache-2.0). Get it from the source.
