Unstructured
Open-source document ingestion library — partition PDFs, HTML, DOCX, and 25+ formats into clean elements for RAG pipelines.
Unstructured is an open-source library for ingesting and preprocessing unstructured documents for use in LLM applications. It partitions documents into typed elements (Title, NarrativeText, Table, Image, etc.), extracts metadata, and cleans text, so that raw files are ready for chunking and embedding.
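To make the "typed elements, ready for chunking" idea concrete, here is a minimal sketch of grouping elements into title-delimited sections. It does not call the library: `FakeElement` and `group_by_title` are hypothetical names made up for illustration, and the only assumption about real elements is that they expose a `category` and convert to text with `str()`, as in the Quick Start.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for a partitioned element (not the library's class)."""
    category: str
    text: str

    def __str__(self) -> str:
        return self.text


def group_by_title(elements):
    """Start a new section at every Title element and collect the text that follows."""
    sections = []
    current = {"title": None, "body": []}
    for el in elements:
        if el.category == "Title":
            # Flush the previous section only if it holds anything.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"title": str(el), "body": []}
        else:
            current["body"].append(str(el))
    # Flush the last open section.
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return sections


sample = [
    FakeElement("Title", "Introduction"),
    FakeElement("NarrativeText", "Documents arrive in many formats."),
    FakeElement("Title", "Method"),
    FakeElement("NarrativeText", "We partition each file."),
    FakeElement("ListItem", "Then we embed the text."),
]
sections = group_by_title(sample)
for section in sections:
    print(section["title"], "->", len(section["body"]), "body element(s)")
```

Each resulting section could then be passed to an embedding step as one chunk; whether title boundaries are the right chunk size depends on the documents.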
Key Features
- 25+ file formats: PDF, HTML, DOCX, PPTX, XLSX, EML, MSG, MD, RST, TXT, images, and more
- Partition strategies: fast (rule-based), hi_res (layout detection with detectron2), ocr_only
- Element types: Title, NarrativeText, Table, ListItem, Image, Header, Footer, FigureCaption
- Table extraction: HTML table extraction from PDFs with the hi_res strategy
- LangChain / LlamaIndex connectors: UnstructuredLoader is a first-class integration in both
- Connectors: S3, GCS, Azure Blob, Confluence, Google Drive, SharePoint, Slack, and more
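One way to read the strategy list is as a decision rule. The sketch below is an assumed heuristic, not part of the library: `pick_strategy` and its two flags are hypothetical, and only the returned strings (`fast`, `hi_res`, `ocr_only`) come from the feature list above.

```python
def pick_strategy(needs_tables: bool, is_scanned: bool) -> str:
    """Return a strategy name for partitioning a PDF (illustrative heuristic only)."""
    if needs_tables:
        # The feature list ties table extraction to the hi_res strategy.
        return "hi_res"
    if is_scanned:
        # Assumption: a scanned file with no text layer needs OCR.
        return "ocr_only"
    # Rule-based default for files that already contain extractable text.
    return "fast"


for flags in [(False, False), (False, True), (True, False)]:
    print(flags, "->", pick_strategy(*flags))
```

The returned string would be passed as the `strategy` argument shown in the Quick Start.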
Quick Start
pip install "unstructured[pdf]"
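from unstructured.partition.pdf import partition_pdf

# Partition the PDF with the layout-detection strategy
elements = partition_pdf("research_paper.pdf", strategy="hi_res")

# Print the type and the first 80 characters of the first five elements
for el in elements[:5]:
    print(el.category, ":", str(el)[:80])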
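After partitioning, a quick sanity check is to count how many elements of each type came back. The sketch below relies only on the `category` attribute used above; `summarize_categories` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without the library installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_categories(elements):
    """Count elements per category; works on any objects with a .category attribute."""
    return Counter(el.category for el in elements)


# Stand-in data; in practice, pass the list returned by partition_pdf.
demo = [
    SimpleNamespace(category="Title"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="NarrativeText"),
    SimpleNamespace(category="Table"),
]
counts = summarize_categories(demo)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

A result with no Table elements for a document that visibly contains tables may be a hint to revisit the chosen strategy.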
Install via ai-supply
npx ai-supply add unstructured-document-ingestion
Curated mirror of the open-source Unstructured project (Apache-2.0). Install upstream from the repository.