feat(extradocx): experimental DOCX → GFM Markdown AST converter by sripathikrishnan · Pull Request #62 · think41/extrasuite

sripathikrishnan · 2026-04-08T15:46:59Z

Adds a new experimental module that converts Microsoft Word .docx files
to a GFM-oriented AST as a proof of concept for bidirectional DOCX ↔
Markdown transformation.

Core design:

- ast_nodes.py: Pandoc-inspired AST (Block/Inline split) where every node
  carries an `xpath` attribute pointing back to the source element in
  word/document.xml. Text is always `TextRun` leaf nodes (never bare strings),
  preserving run-level formatting (bold, italic, underline, etc.).
- parser.py: Reads word/document.xml + support files (styles.xml, numbering.xml,
  rels) and produces the AST. Handles headings, paragraphs, bullet/ordered lists,
  tables, links, images, and inline formatting.
- serializers.py: Two serializers: to_json() (full-fidelity, XPath-preserving)
  and to_markdown() (GFM output).
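As a rough illustration of the design above, the sketch below shows how XPath-carrying nodes and the two serializers could fit together. It is not the code in this PR: the field names, the `Heading` and `Paragraph` classes, and the `run_to_markdown` helper are assumptions made for the example; only `TextRun`, `xpath`, `to_json()` and `to_markdown()` are names taken from the description.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class TextRun:
    """Inline leaf node: text plus run-level formatting (hypothetical fields)."""
    text: str
    xpath: str
    bold: bool = False
    italic: bool = False


@dataclass
class Heading:
    """Block node for a heading (hypothetical shape)."""
    level: int
    xpath: str
    children: list[TextRun] = field(default_factory=list)


@dataclass
class Paragraph:
    """Block node for a body paragraph (hypothetical shape)."""
    xpath: str
    children: list[TextRun] = field(default_factory=list)


def run_to_markdown(run: TextRun) -> str:
    """Wrap a run's text in GFM emphasis markers according to its flags."""
    if run.bold and run.italic:
        return f"***{run.text}***"
    if run.bold:
        return f"**{run.text}**"
    if run.italic:
        return f"*{run.text}*"
    return run.text


def to_markdown(blocks: list[Heading | Paragraph]) -> str:
    """Lossy serializer: emits Markdown and drops the xpath attributes."""
    parts = []
    for block in blocks:
        body = "".join(run_to_markdown(run) for run in block.children)
        if isinstance(block, Heading):
            parts.append("#" * block.level + " " + body)
        else:
            parts.append(body)
    return "\n\n".join(parts) + "\n"


def to_json(blocks: list[Heading | Paragraph]) -> str:
    """Full-fidelity serializer: keeps every field, including xpath."""
    payload = [{"type": type(block).__name__, **asdict(block)} for block in blocks]
    return json.dumps(payload, indent=2)


# A two-block document built by hand; a real parser would produce this.
doc = [
    Heading(
        level=1,
        xpath="/w:document/w:body/w:p[1]",
        children=[TextRun("Title", "/w:document/w:body/w:p[1]/w:r[1]")],
    ),
    Paragraph(
        xpath="/w:document/w:body/w:p[2]",
        children=[
            TextRun("Hello ", "/w:document/w:body/w:p[2]/w:r[1]"),
            TextRun("world", "/w:document/w:body/w:p[2]/w:r[2]", bold=True),
        ],
    ),
]

print(to_markdown(doc))
print(to_json(doc))
```

Keeping text in `TextRun` nodes rather than bare strings means each formatted span retains its own `xpath`, so an edit to the Markdown could in principle be mapped back to a specific `w:r` element.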

Fidelity verified against pandoc 3.1.3:

- 35/35 headings matched
- 26/26 bullet items, 28/28 ordered items identical
- 8/8 tables with matching content
- Markdown output matches pandoc's GFM reference conversion

Test coverage: 29 tests covering parser, XPath traceability, markdown
serializer, and JSON serializer.
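An XPath traceability check could look something like the sketch below. It is an assumption-laden illustration, not one of the 29 tests: `parse_paragraphs` and `resolve` are hypothetical helpers, and the XML is an inline stand-in for the word/document.xml member that a real .docx archive would contain. It uses only the standard library's `xml.etree.ElementTree`, which resolves relative paths with positional predicates, so the leading `/w:document/` step is stripped before lookup.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the conventional "w" prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Minimal stand-in for word/document.xml (normally read from the .docx zip).
DOCUMENT_XML = f"""<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>
    <w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Second</w:t></w:r><w:r><w:t> paragraph</w:t></w:r></w:p>
    <w:sectPr/>
  </w:body>
</w:document>"""


def paragraph_text(p):
    """Concatenate the text of every w:t descendant of a paragraph."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def parse_paragraphs(root):
    """Hypothetical parser step: one dict per w:p, tagged with its XPath."""
    body = root.find("w:body", NS)
    nodes = []
    for index, p in enumerate(body.findall("w:p", NS), start=1):
        nodes.append({
            "xpath": f"/w:document/w:body/w:p[{index}]",
            "text": paragraph_text(p),
        })
    return nodes


def resolve(root, xpath):
    """Map a stored XPath back to its element, relative to the root."""
    prefix = "/w:document/"
    if not xpath.startswith(prefix):
        raise ValueError(f"unexpected xpath: {xpath}")
    return root.find(xpath[len(prefix):], NS)


root = ET.fromstring(DOCUMENT_XML)
nodes = parse_paragraphs(root)

# Traceability: every node's xpath must lead back to the element it came from.
for node in nodes:
    element = resolve(root, node["xpath"])
    assert element is not None
    assert paragraph_text(element) == node["text"]
```

The index in `w:p[n]` counts only `w:p` siblings, which is why the sketch enumerates `findall("w:p")` rather than all children of `w:body`.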

https://claude.ai/code/session_01JsJ2Q6WeDjvkbrsr1meeuR


claude added 2 commits April 8, 2026 13:50

chore(extradocx): add additional test DOCX files from pandoc test suite

70b1e5c

sripathikrishnan wants to merge 2 commits into main from claude/docx-to-markdown-ast-8Ddud
