Trafilatura — Web Content Extraction & Crawling
Python tool to scrape, extract, and clean main text from web pages, with CSV/JSON/Markdown/XML output.
Installs: 95k
Rating: ★ 4.8
Reviews: 32
Trafilatura is a Python library and command-line tool for web content extraction. It crawls pages, strips boilerplate (navigation, ads, footers), and extracts clean main text with metadata. Typical uses include building training corpora, content monitoring, SEO analysis, and news aggregation pipelines.
Key features
- Removes boilerplate (navigation, ads, footers) to isolate the main content, which a general-purpose HTML parser such as BeautifulSoup does not do on its own
- Output as plain text, CSV, JSON, Markdown, or XML (including XML-TEI)
- Metadata extraction: title, author, date, language, tags
- Sitemap and feed-aware crawler built in
- CLI for batch crawling, Python API for integration
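The sitemap-aware discovery and the Python API listed above can be combined into a small batch job. The following is a minimal sketch, not the project's own recipe: `extract_pages` and `crawl_site` are hypothetical helper names, the `limit` parameter is an illustrative politeness cap, and the download and extraction functions are passed in as arguments so the loop can be exercised without network access. It assumes `sitemap_search` from `trafilatura.sitemaps` returns a list of page URLs for the given homepage.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def extract_pages(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    extract: Callable[..., Optional[str]],
    limit: int = 10,
) -> Iterator[Tuple[str, str]]:
    """Yield (url, text) pairs for at most `limit` attempted URLs.

    Failed downloads (fetch returns None) and empty extractions are skipped.
    """
    for count, url in enumerate(urls):
        if count >= limit:
            break
        html = fetch(url)
        if html is None:
            continue
        text = extract(html, url=url, output_format="markdown")
        if text:
            yield url, text


def crawl_site(homepage: str, limit: int = 10) -> List[Tuple[str, str]]:
    """Hypothetical wrapper: discover URLs via the sitemap, then extract each page."""
    # Imported here so extract_pages stays usable without the dependency installed.
    import trafilatura
    from trafilatura.sitemaps import sitemap_search

    urls = sitemap_search(homepage)
    return list(extract_pages(urls, trafilatura.fetch_url, trafilatura.extract, limit))
```

Calling `crawl_site("https://example.com")` would then return the Markdown text of up to ten pages found in that site's sitemap, assuming the site publishes one.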
Quick start
pip install trafilatura
# Extract text from a URL
trafilatura -u https://example.com/article --output-format markdown
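# Same extraction through the Python API
import trafilatura
# Download the page; fetch_url returns None if the download fails
downloaded = trafilatura.fetch_url("https://example.com/article")
if downloaded is not None:
    # Extract the main text as Markdown, with metadata included
    # (the keyword is with_metadata in recent releases)
    text = trafilatura.extract(downloaded, output_format="markdown",
                               with_metadata=True)
    print(text)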
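For pipelines that need structured fields and not just a block of text, the JSON output can be parsed back into a dictionary. This is a sketch under stated assumptions: `summarize_record` and `extract_record` are hypothetical helpers, and the key names `title`, `author`, `date`, and `text` are assumed to match the fields in Trafilatura's JSON output. Any key that is absent simply comes back as `None`.

```python
import json
from typing import Any, Dict, Optional

# Assumed field names in the JSON output; adjust if your version differs.
FIELDS = ("title", "author", "date", "text")


def summarize_record(json_output: Optional[str]) -> Dict[str, Any]:
    """Pick a few fields from a JSON string; return {} if there is no output."""
    if not json_output:
        return {}
    record = json.loads(json_output)
    return {name: record.get(name) for name in FIELDS}


def extract_record(url: str) -> Dict[str, Any]:
    """Hypothetical wrapper: download a page and return selected fields."""
    # Imported here so summarize_record works without the dependency installed.
    import trafilatura

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # download failed
        return {}
    return summarize_record(
        trafilatura.extract(downloaded, output_format="json", with_metadata=True)
    )
```

Returning an empty dictionary on failure keeps the caller's code simple: a batch job can filter out falsy results and does not need to handle exceptions for every unreachable page.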
npx ai-supply add trafilatura-web-content-extraction
Curated mirror of the open-source Trafilatura (Apache-2.0). Get it from the source.