MarkItDown
Microsoft's universal document-to-Markdown converter: PDF, DOCX, PPTX, XLSX, HTML, images, audio, and ZIP — all to clean Markdown.
MarkItDown is Microsoft's open-source utility for converting a wide range of file formats to clean Markdown text, so that documents can be ingested by LLMs and RAG pipelines. It handles PDFs, Word documents, PowerPoint presentations, Excel spreadsheets, HTML pages, images (with OCR/LLM description), audio files (via Whisper), and ZIP archives.
Key Features
- Universal input — PDF, DOCX, PPTX, XLSX, XLS, HTML, EPUB, MSG, CSV, JSON, XML, WAV, MP3, PNG, JPEG, ZIP
- LLM-enhanced — optionally use a vision model to describe images embedded in documents
- Audio transcription — integrates with Whisper for audio-to-text within document pipelines
- MCP server — official markitdown-mcp lets agents convert files via tool calls
- CLI + Python API — use from the command line or as a library in pipelines
- Structure preservation — tables, headings, lists, and code blocks are faithfully converted
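The LLM-enhanced feature can be sketched as follows. This is a minimal illustration, assuming the `llm_client` and `llm_model` constructor arguments described in the upstream README and an OpenAI-compatible client object. The helper name `describe_image` and the default model name `"gpt-4o"` are hypothetical choices for this example, not part of MarkItDown.

```python
def describe_image(path: str, llm_client, llm_model: str = "gpt-4o") -> str:
    """Return Markdown for an image, including an LLM-written description.

    `llm_client` is assumed to be an OpenAI-compatible client instance;
    `llm_model` is an assumed default and should match a vision-capable
    model available to that client.
    """
    # Imported lazily so this module loads even where markitdown is absent.
    from markitdown import MarkItDown

    md = MarkItDown(llm_client=llm_client, llm_model=llm_model)
    return md.convert(path).text_content


# Example call (requires the `openai` package and valid API credentials):
#   from openai import OpenAI
#   print(describe_image("diagram.png", OpenAI()))
```

Without an LLM client, image conversion still works but relies only on what the converter can extract on its own, so the client is worth passing only when descriptions are actually needed.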
Quick Start
pip install 'markitdown[all]'
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("report.pdf")
print(result.text_content[:500])
# CLI usage
markitdown presentation.pptx > output.md
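For pipeline use, the single-file call above extends naturally to a whole folder. The sketch below is one possible way to do that, not an official recipe: `convert_tree`, `markitdown_convert`, and `DEFAULT_SUFFIXES` are hypothetical names introduced here, and the suffix list is just a subset of the formats listed under Key Features. The converter is passed in as a function so the directory walk can be exercised without MarkItDown installed.

```python
from pathlib import Path
from typing import Callable, Iterable

# Assumed subset of supported formats; adjust to what your install handles.
DEFAULT_SUFFIXES = (".pdf", ".docx", ".pptx", ".xlsx", ".html")


def markitdown_convert(path: str) -> str:
    """Convert one file with MarkItDown and return its Markdown text."""
    # Lazy import keeps convert_tree usable with a different converter.
    from markitdown import MarkItDown

    # A new instance per file keeps the sketch simple; reuse one in real code.
    return MarkItDown().convert(path).text_content


def convert_tree(
    src_dir: str,
    out_dir: str,
    convert: Callable[[str], str] = markitdown_convert,
    suffixes: Iterable[str] = DEFAULT_SUFFIXES,
) -> list[Path]:
    """Convert every matching file under src_dir, mirroring paths in out_dir."""
    src, out = Path(src_dir), Path(out_dir)
    wanted = {s.lower() for s in suffixes}
    written = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        rel = path.relative_to(src)
        # Append ".md" to the full name so report.pdf and report.docx don't collide.
        target = out / rel.parent / (rel.name + ".md")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(str(path)), encoding="utf-8")
        written.append(target)
    return written
```

Calling `convert_tree("docs", "docs_md")` would write `docs_md/report.pdf.md` for `docs/report.pdf`, leaving the source files untouched.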
Install via ai-supply
npx ai-supply add markitdown-document-converter
Curated mirror of the open-source MarkItDown project (MIT). Install upstream from the repository.