
Convert Office docs to text with MarkItDown

Check out this clever Python tool that converts Word documents, Excel sheets, PowerPoint decks and PDFs into a plain text format.

MarkItDown is a Python library, released by Microsoft on GitHub, that converts many document types into a nearly plain text format called Markdown.

It’ll convert all these formats into text.

  • Word, docx only
  • Excel, both xlsx and xls files.
  • PowerPoint, pptx only
  • Outlook messages
  • PDF
  • Images (EXIF metadata and OCR)
  • Audio (EXIF metadata and speech transcription)
  • HTML
  • Text-based formats (CSV, JSON, XML)
  • ZIP files (iterates over contents)
  • YouTube URLs
  • EPUB

Some conversions need optional pip extras such as [pdf] or [audio-transcription]. Check the docs on GitHub.
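Here's a minimal sketch of what a single conversion could look like in Python, following the basic usage shown in the project's README. The helper names markdown_path and convert_to_markdown are our own, not part of the library, and any file name you pass in is up to you.

```python
# Install the package first, for example:  pip install "markitdown[all]"
from pathlib import Path


def markdown_path(source: str) -> Path:
    """Return the output path: same folder and name, with a .md extension."""
    return Path(source).with_suffix(".md")


def convert_to_markdown(source: str) -> Path:
    """Convert one document to Markdown and save it next to the original."""
    # Imported here so markdown_path() works even before the package is installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(source)  # result.text_content holds the Markdown
    target = markdown_path(source)
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

Calling convert_to_markdown("report.docx") (a placeholder file name) would write report.md into the same folder.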

Markdown is a plain text format with minimal formatting that’s often used by developers and to feed data into AI systems. Any plain text editor or viewer can cope with Markdown files (Notepad, any browser etc). Wikipedia has some Markdown examples.

Some MarkItDown conversions run locally, but others rely on cloud services such as Azure Document Intelligence or external OCR/LLM calls.
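To make the local versus cloud split concrete, here's a hedged sketch. It assumes the docintel_endpoint keyword described in the project's README at the time of writing, so check it against the current docs before relying on it. The helper names converter_options and convert are our own, and the endpoint is whatever your Azure resource provides.

```python
def converter_options(docintel_endpoint=None):
    """Build keyword arguments for MarkItDown; an empty dict means local-only."""
    options = {}
    if docintel_endpoint:
        # Assumption: keyword name as given in the MarkItDown README.
        options["docintel_endpoint"] = docintel_endpoint
    return options


def convert(source, docintel_endpoint=None):
    """Convert a file locally, or via Azure Document Intelligence if an endpoint is given."""
    from markitdown import MarkItDown  # lazy import: needs the package installed

    converter = MarkItDown(**converter_options(docintel_endpoint))
    return converter.convert(source).text_content
```

With no endpoint, nothing in this sketch asks for the cloud service. Passing one means the document is sent to Azure, which matters for confidential files.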

This is mostly a tech tool for converting content into a consistent, machine-readable form (feeding Copilot or another AI system, building a search index, cleaning archives), but it's worth keeping in mind whenever you need bulk conversion to text.
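For the bulk conversion case, a rough sketch might walk a folder and convert each supported file in turn. The function names, the choice of extensions and the output naming (report.docx becomes report.docx.md) are all our own assumptions, not something MarkItDown prescribes.

```python
from pathlib import Path

# Extensions picked from the supported formats listed earlier; adjust to taste.
OFFICE_SUFFIXES = {".docx", ".xlsx", ".xls", ".pptx", ".pdf"}


def find_documents(folder: str) -> list[Path]:
    """List convertible files under a folder, including subfolders, sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in OFFICE_SUFFIXES
    )


def convert_folder(folder: str) -> list[Path]:
    """Convert every matching file, writing a .md file beside each original."""
    from markitdown import MarkItDown  # lazy import: needs the package installed

    converter = MarkItDown()  # one instance reused for all files
    written = []
    for doc in find_documents(folder):
        try:
            result = converter.convert(str(doc))
        except Exception as err:  # keep going if one file fails
            print(f"Skipped {doc}: {err}")
            continue
        # Append .md to the full name so report.docx and report.pdf don't collide.
        target = doc.with_name(doc.name + ".md")
        target.write_text(result.text_content, encoding="utf-8")
        written.append(target)
    return written
```

Reporting a failed file and carrying on is a deliberate choice here: in a big archive, one corrupt document shouldn't stop the whole run.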

For one-off conversions, Microsoft Office has File | Export options, including Plain Text.

About this author

Office-Watch.com

Office Watch is the independent source of Microsoft Office news, tips and help since 1996. Don't miss our famous free newsletter.