RectoPDF
RectoPDF team

PDF to Word: when does it actually preserve formatting?

A frank look at PDF-to-Word conversion: what works, what doesn't, and why a 'perfect' converter is impossible in principle.

“PDF to Word with 100% accuracy” is the most-promised, least-delivered feature in the document-conversion industry. Every tool says it. None of them mean it. The reason is structural, not technical.

This article explains why, what to expect from honest converters (including ours), and the few cases where conversion is near-perfect.

The fundamental mismatch

A PDF describes how a page looks. A Word document describes how content is structured.

When you write in Word, you type a heading, hit Enter, type a paragraph. Word stores “Heading 1: ‘Quarterly Results’” as a semantic object. When that document is exported to PDF, all the semantics get flattened into “place the glyphs Q-u-a-r-t-e-r-l-y at x=72, y=720 in Calibri Bold at 18pt.” The structure is gone — only the visual rendering survives.

Converting back is reverse-engineering. A converter looks at “glyph Q-u-a-r-t-e-r-l-y at 18pt at the top of the page” and has to guess: was that a heading? An emphasized phrase? A title? It uses heuristics — font size, position, font weight, whitespace around it — and gets the answer right most of the time. But it’s a guess, not a recovery.
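
To make the guessing concrete, here is a minimal sketch of a font-size heuristic. It is an illustration only, not the converter's actual code: the `Line` type, the 1.2 ratio, and the 80-character limit are assumptions chosen for the example, and it ignores font weight and position.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # in points


def body_font_size(lines):
    """Return the most common font size, weighted by character count."""
    weights = Counter()
    for line in lines:
        weights[round(line.font_size, 1)] += len(line.text)
    return weights.most_common(1)[0][0]


def guess_role(line, body_size, ratio=1.2, max_chars=80):
    """Guess 'heading' or 'paragraph' from visual cues alone."""
    larger = line.font_size >= body_size * ratio
    short = len(line.text) <= max_chars
    return "heading" if larger and short else "paragraph"


lines = [
    Line("Quarterly Results", 18.0),
    Line("Revenue grew in every region this quarter.", 11.0),
    Line("Costs were flat compared with the prior year.", 11.0),
]
body = body_font_size(lines)
print(guess_role(lines[0], body))  # heading
print(guess_role(lines[1], body))  # paragraph
```

A large pull quote or a cover-page title would pass the same test, which is why the result is a guess rather than a recovery.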

What our converter actually does

PDF to Word runs eight phases on each PDF, entirely in your browser:

  1. Extract — walks every page’s content stream, building a list of glyph runs (text + font + position) and image XObjects. Handles ToUnicode CMaps, WinAnsi/MacRoman/Standard encodings, and Adobe Glyph List differences.
  2. Analyze — clusters glyph runs by baseline-y into lines, splits each line into words by gap detection, and detects strikethrough by spotting horizontal rules that cross a word at mid-x-height.
  3. Reading order — finds vertical gutters in the page, splits lines that cross columns, reads column-by-column.
  4. Semantics — groups lines into paragraphs by vertical gap and indent, detects headings by font-size statistics, finds bullet/number list markers.
  5. Shapes — extracts vector graphics (lines, rectangles, beziers).
  6. Tables — clusters horizontal rules, derives columns from rule segments, builds cell membership.
  7. Images — JPEG passthrough, FlateDecode → PNG, JP2 passthrough, CCITTFax wrapped in TIFF.
  8. DOCX emit — writes a real .docx with paragraphs, headings, lists, tables, and <w:drawing> image anchors.
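
As an illustration of phase 2, the baseline clustering and gap detection could look roughly like the sketch below. The `Run` type and both thresholds (2 points for baselines, 1.5 points for word gaps) are assumptions made for this example, not values taken from the converter.

```python
from dataclasses import dataclass


@dataclass
class Run:
    text: str
    x: float         # left edge, in points
    width: float     # advance width, in points
    baseline: float  # y of the baseline (PDF y grows upward)


def cluster_lines(runs, tolerance=2.0):
    """Group runs whose baseline is within `tolerance` of the line's first run."""
    lines = []
    # Sort from the top of the page down.
    for run in sorted(runs, key=lambda r: -r.baseline):
        if lines and abs(lines[-1][0].baseline - run.baseline) <= tolerance:
            lines[-1].append(run)
        else:
            lines.append([run])
    # Within each line, order runs left to right.
    return [sorted(line, key=lambda r: r.x) for line in lines]


def split_words(line, gap=1.5):
    """Merge adjacent runs into words; a horizontal gap above `gap` starts a new word."""
    words = []
    prev_end = None
    for run in line:
        if prev_end is not None and run.x - prev_end <= gap:
            words[-1] += run.text
        else:
            words.append(run.text)
        prev_end = run.x + run.width
    return words


runs = [
    Run("Quar", 72.0, 30.0, 720.0),
    Run("terly", 102.2, 28.0, 720.4),
    Run("Results", 136.0, 40.0, 720.0),
    Run("Revenue", 72.0, 40.0, 700.0),
]
for line in cluster_lines(runs):
    print(split_words(line))
# ['Quarterly', 'Results']
# ['Revenue']
```

Fixed thresholds are the weak point of a sketch like this: tightly tracked or widely justified text needs a gap limit scaled to the font size.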

When all eight phases agree, you get a clean Word document. When they don’t, you get visible artifacts.
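
Reading order is one place where phases can disagree. A rough sketch of gutter detection, assuming each text line has already been reduced to a horizontal extent `(x0, x1)`: count how many lines cover each x position and report wide stretches that almost no line crosses. The function name, the 12-point minimum width, and the 10% coverage limit are all hypothetical.

```python
def find_gutters(spans, page_width, min_width=12, max_cover=0.1):
    """Return (start, end) x-ranges inside the text area that few lines cross.

    spans: list of (x0, x1) horizontal extents, one per text line.
    """
    width = int(page_width)
    cover = [0] * width
    for x0, x1 in spans:
        for x in range(max(0, int(x0)), min(width, int(x1) + 1)):
            cover[x] += 1

    # A position counts as "empty" if at most this many lines cross it.
    limit = max_cover * len(spans)
    used = [x for x, c in enumerate(cover) if c > limit]
    if not used:
        return []

    # Only search between the outer edges of the text, so margins are ignored.
    left, right = used[0], used[-1]
    gutters, start = [], None
    for x in range(left, right + 1):
        if cover[x] <= limit:
            if start is None:
                start = x
        elif start is not None:
            if x - start >= min_width:
                gutters.append((start, x))
            start = None
    return gutters


# Ten lines per column plus one full-width title that crosses the gutter.
two_col = [(72, 290)] * 10 + [(322, 540)] * 10 + [(72, 540)]
print(find_gutters(two_col, 612))            # [(291, 322)]
print(find_gutters([(72, 540)] * 10, 612))   # []
```

The coverage limit is what lets a full-width title sit above two columns without hiding the gutter; that title line is then the kind of line that has to be handled separately because it crosses columns.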

What works well

  • Single-column reports with one font family, clear headings, and prose paragraphs.
  • Tables with visible borders.
  • Bulleted and numbered lists with consistent indentation.
  • Inline images. JPEG data is embedded verbatim; Flate-compressed images are re-encoded as PNG.
  • Bold, italic, strikethrough — detected per run.
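
The image handling from phase 7 amounts to a lookup on the PDF stream filter. The sketch below uses the standard PDF filter names; the table layout, the `plan_image` helper, and the choice to return `None` for anything else are assumptions made for illustration.

```python
# PDF stream filter -> (output format, how the bytes are handled)
IMAGE_HANDLING = {
    "DCTDecode": ("jpeg", "passthrough"),    # JPEG data is copied as-is
    "JPXDecode": ("jp2", "passthrough"),     # JPEG 2000 data is copied as-is
    "FlateDecode": ("png", "re-encode"),     # raw pixels are re-packed as PNG
    "CCITTFaxDecode": ("tiff", "wrap"),      # fax data gets a TIFF container
}


def plan_image(filter_name):
    """Return (format, action) for a filter, or None if it is not in the table."""
    return IMAGE_HANDLING.get(filter_name)


print(plan_image("DCTDecode"))  # ('jpeg', 'passthrough')
print(plan_image("LZWDecode"))  # None
```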

What partially works

  • Multi-column layouts — we detect 2 and 3 columns reliably. Mixed layouts (some pages 1-col, others 2-col) usually work.
  • Tables without borders — we fall back to text-x-coordinate clustering, which is correct for most cases but can misalign loose tables.
  • Mixed scripts — Latin, Greek, and Cyrillic work. Arabic, Chinese, Hebrew, and other RTL/CJK scripts aren’t supported yet.
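
For borderless tables, clustering on text x-coordinates can be sketched as a one-dimensional gap clustering over the left edges of text fragments. This is a simplified illustration under assumed names and an assumed 6-point tolerance, not the converter's implementation.

```python
def cluster_columns(x_starts, tolerance=6.0):
    """Cluster left edges into column positions; return the mean x of each cluster."""
    clusters = []
    for x in sorted(x_starts):
        # Extend the current cluster while consecutive edges stay close together.
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters]


def assign_column(x, columns):
    """Return the index of the column whose position is nearest to x."""
    return min(range(len(columns)), key=lambda i: abs(columns[i] - x))


columns = cluster_columns([72, 73, 71.5, 200, 201, 330, 329])
print(len(columns))                 # 3
print(assign_column(203, columns))  # 1
```

In a loose table, where cell text is centered or ragged, left edges drift between rows. Clusters then merge or split, and cells land in the wrong column, which is the misalignment described above.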

What doesn’t work

  • Scanned PDFs. A scan is just an image of text. We don’t run OCR (yet), so a scanned report converts into a DOCX with one big image per page and zero editable text. Use an OCR tool first.
  • Equations. Mathematical typesetting in PDF (whether from LaTeX, MathType, or Word’s equation editor) flattens to glyph runs of weirdly-positioned symbols. There’s no structural recovery path.
  • Floating layouts. Marketing PDFs, posters, and “designed” documents that use absolute positioning are visually fine but semantically incoherent. Don’t expect a clean Word version.
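
The scanned-PDF case is the easiest of the three to flag up front. A minimal sketch, assuming a per-page summary with a character count and the area of the largest image (the `PageSummary` type and the 90% coverage threshold are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class PageSummary:
    text_chars: int     # characters extracted from the content stream
    image_area: float   # area of the largest image, in square points
    page_area: float    # area of the page, in square points


def looks_scanned(page, min_cover=0.9):
    """True if the page has no extractable text and one image covers most of it."""
    return page.text_chars == 0 and page.image_area >= min_cover * page.page_area


letter = 612 * 792
print(looks_scanned(PageSummary(0, letter, letter)))           # True
print(looks_scanned(PageSummary(1800, 0.25 * letter, letter)))  # False
```

A scan that already carries a hidden OCR text layer has a nonzero character count, so this check would treat it as a text PDF.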

A useful test

Want to know if your specific PDF will convert well? Look at it in a viewer and ask:

  1. Can you copy-paste a paragraph and have it land as one line of clean text? If yes (most text-based PDFs), conversion will work well. If you get gibberish or weird character substitutions, the source PDF doesn’t have proper Unicode info and no converter can fix that.
  2. Are tables drawn with visible borders? If yes, table detection is more reliable.
  3. How many columns does the layout have? One or two gives the best results; three is doable; four or more is unreliable.
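
The copy-paste check in step 1 can be roughly automated. The sketch below measures what share of the pasted text consists of replacement characters, private-use code points, unassigned code points, or stray control characters, all of which tend to appear when a PDF lacks usable Unicode mappings. The function name and the choice of categories are assumptions, and the check cannot catch text that maps to valid but wrong letters.

```python
import unicodedata


def suspicious_ratio(text):
    """Share of characters that suggest a missing or broken Unicode mapping."""
    if not text:
        return 1.0  # no text at all is treated as the worst case
    bad = 0
    for ch in text:
        category = unicodedata.category(ch)
        if (
            ch == "\ufffd"                                # replacement character
            or category in ("Co", "Cn")                   # private use, unassigned
            or (category == "Cc" and ch not in "\n\r\t")  # stray control chars
        ):
            bad += 1
    return bad / len(text)


print(suspicious_ratio("Quarterly Results"))     # 0.0
print(suspicious_ratio("\uf0b7\uf0b7\ufffd\x02"))  # 1.0
```

A ratio near zero points to a healthy text layer; a high ratio suggests the gibberish case, where the fix has to happen at the source PDF.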

Going the other direction

If you’ve been asked to “send me a Word doc, not a PDF,” the simpler answer is to start with Word and use Word to PDF when you’re done. PDF-to-Word is a useful tool, but it’s an emergency tool — you reach for it because someone sent you a PDF and you need to edit it, not as part of a normal workflow.

Privacy note

Everything runs in your browser. The engine loads once (cached afterwards) and processes your PDF entirely in the tab. Your file is never uploaded — for legal documents and confidential reports, that matters more than which converter has 92% accuracy vs 91%.

Try the PDF to Word converter and see what it does on yours.