Check out this clever Python tool that converts Word documents, Excel sheets, PowerPoint decks and PDFs into a plain text format.
MarkItDown is a Python library, released by Microsoft on GitHub, that converts many document types into Markdown, a nearly plain text format.
It’ll convert all these formats into text:
- PDF
- Word, docx only
- Excel, both xlsx and xls files.
- PowerPoint, pptx only
- Outlook messages
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- YouTube URLs
- EPUB
Some conversions need optional pip extras such as [pdf] or [audio-transcription]. Check the docs on GitHub.
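Once the package is installed, a conversion takes only a few lines of Python. The sketch below follows the usage shown in the project's README (`MarkItDown()`, `convert()` and `text_content`); the helper name `to_markdown` and the file name in the note after it are made up for illustration.

```python
def to_markdown(path, converter=None):
    """Convert one document to Markdown text.

    `converter` is any object with a convert(path) method whose result has a
    text_content attribute. If none is given, a MarkItDown instance is
    created, which needs the markitdown package to be installed.
    """
    if converter is None:
        # Imported here so the helper can be loaded without the package.
        from markitdown import MarkItDown
        converter = MarkItDown()
    result = converter.convert(str(path))
    return result.text_content
```

Calling `print(to_markdown("report.docx"))` should print the Markdown version of a Word file, with `report.docx` standing in for a real document.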
Markdown is a plain text format with minimal formatting that’s often used by developers and to feed data into AI systems. Any plain text editor or viewer can cope with Markdown files (Notepad, any browser etc). Wikipedia has some Markdown examples.
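To give a feel for how little markup is involved, here is a small sketch that writes a few spreadsheet-style rows as a pipe-style Markdown table (a widely supported extension to basic Markdown). The function name and sample data are invented for illustration, and MarkItDown's own output may differ in detail.

```python
def rows_to_markdown_table(header, rows):
    """Render a header and data rows as a pipe-style Markdown table."""
    def line(cells):
        return "| " + " | ".join(str(c) for c in cells) + " |"

    # The row of dashes separates the header from the data rows.
    lines = [line(header), line(["---"] * len(header))]
    lines.extend(line(row) for row in rows)
    return "\n".join(lines)

table = rows_to_markdown_table(["Item", "Qty"], [["Pens", 12], ["Paper", 3]])
print(table)
# | Item | Qty |
# | --- | --- |
# | Pens | 12 |
# | Paper | 3 |
```

The result is still readable as plain text, which is the main appeal of the format.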
Some MarkItDown conversions can be done locally but others send files to cloud services, either Azure Document Intelligence or external OCR/LLM calls.
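In code, the choice between the built-in converters and the cloud service comes down to how the converter is created. The sketch below assumes the `docintel_endpoint` argument described in the project's README; the environment variable name `DOCINTEL_ENDPOINT` and both function names are invented for this example, and Azure credentials plus the matching pip extra are probably needed as well.

```python
import os

def converter_options(env=None):
    """Return keyword arguments for MarkItDown based on the environment.

    DOCINTEL_ENDPOINT is a made-up variable name for this sketch. When it is
    unset or blank, no options are returned and the built-in converters are
    used.
    """
    env = os.environ if env is None else env
    endpoint = env.get("DOCINTEL_ENDPOINT", "").strip()
    if endpoint:
        # Files will be sent to Azure Document Intelligence for conversion.
        return {"docintel_endpoint": endpoint}
    return {}

def make_converter(env=None):
    """Create a MarkItDown converter (needs the markitdown package)."""
    from markitdown import MarkItDown
    return MarkItDown(**converter_options(env))
```

Keeping the option in one place makes it easy to see, before a batch run, whether documents will leave the machine.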
This is mostly a tech tool for converting content into a consistent machine-readable form (feeding Copilot or another AI system, building a search index, cleaning archives) but it's worth keeping in mind whenever you need to convert documents to text in bulk.
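For the bulk case, a short script can walk a folder and write a `.md` file next to each document. This is a rough sketch rather than a finished tool: the function name and the default extension list are assumptions, and the default converter relies on the `MarkItDown().convert()` call from the project's README.

```python
from pathlib import Path

def convert_folder(folder, extensions=(".docx", ".xlsx", ".pptx", ".pdf"),
                   convert_text=None):
    """Convert matching files under `folder`; return the .md paths written.

    `convert_text` maps a file path (str) to Markdown text. When omitted,
    MarkItDown is used, which needs the markitdown package installed.
    """
    if convert_text is None:
        from markitdown import MarkItDown
        md = MarkItDown()
        convert_text = lambda p: md.convert(p).text_content

    written = []
    # sorted() collects the file list before any output file is created.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            # Keep the original extension in the name (report.docx.md) so
            # report.docx and report.pdf do not overwrite each other.
            target = path.with_name(path.name + ".md")
            target.write_text(convert_text(str(path)), encoding="utf-8")
            written.append(target)
    return written
```

Running `convert_folder("archive")` on a folder named `archive` (a placeholder) would leave the originals untouched and add one Markdown file per matching document, including those in subfolders.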
For one-off conversions, Microsoft Office has File | Export options, including Plain Text.