
feat(extradocx): experimental DOCX → GFM Markdown AST converter #62

Open: sripathikrishnan wants to merge 2 commits into main from claude/docx-to-markdown-ast-8Ddud

Conversation

@sripathikrishnan
Contributor

Adds a new experimental module that converts Microsoft Word .docx files
to a GFM-oriented AST as a proof of concept for bidirectional DOCX ↔
Markdown transformation.

Core design:

  • ast_nodes.py: Pandoc-inspired AST (Block/Inline split) where every node
    carries an xpath attribute pointing back to its source element in
    word/document.xml. Text is always stored in TextRun leaf nodes (never as
    bare strings), which preserves run-level formatting (bold, italic, underline, etc.).
  • parser.py: Reads word/document.xml + support files (styles.xml, numbering.xml,
    rels) and produces the AST. Handles headings, paragraphs, bullet/ordered lists,
    tables, links, images, and inline formatting.
  • serializers.py: Two serializers — to_json() (full-fidelity, XPath-preserving)
    and to_markdown() (GFM output).
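
To make the design above concrete, here is a minimal sketch of how an XPath-carrying AST and the two serializers could fit together. It is not the code in this PR: the field names, the `Heading` and `Paragraph` classes, the XPath string format, and the function signatures are all assumptions for illustration. Only the names `TextRun`, `xpath`, `to_json()`, and `to_markdown()` come from the description.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class TextRun:
    """Inline leaf node: text plus run-level formatting (fields are assumed)."""
    text: str
    xpath: str  # location of the source w:r element in word/document.xml
    bold: bool = False
    italic: bool = False


@dataclass
class Paragraph:
    """Block node (hypothetical shape) holding only TextRun children."""
    xpath: str
    children: list[TextRun] = field(default_factory=list)


@dataclass
class Heading:
    """Block node (hypothetical shape) with a heading level."""
    level: int
    xpath: str
    children: list[TextRun] = field(default_factory=list)


def run_to_markdown(run: TextRun) -> str:
    """Wrap the run text in GFM emphasis markers according to its flags."""
    if run.bold and run.italic:
        return f"***{run.text}***"
    if run.bold:
        return f"**{run.text}**"
    if run.italic:
        return f"*{run.text}*"
    return run.text


def to_markdown(blocks: list) -> str:
    """Lossy serializer: emits Markdown text and drops the xpath attributes."""
    parts = []
    for block in blocks:
        inline = "".join(run_to_markdown(r) for r in block.children)
        if isinstance(block, Heading):
            parts.append("#" * block.level + " " + inline)
        else:
            parts.append(inline)
    return "\n\n".join(parts) + "\n"


def to_json(blocks: list) -> str:
    """Full-fidelity serializer: keeps every field, including xpath."""
    payload = [{"type": type(b).__name__, **asdict(b)} for b in blocks]
    return json.dumps(payload, indent=2)


# Usage: a heading followed by a paragraph with one bold run.
doc = [
    Heading(1, "/w:document/w:body/w:p[1]",
            [TextRun("Title", "/w:document/w:body/w:p[1]/w:r[1]")]),
    Paragraph("/w:document/w:body/w:p[2]",
              [TextRun("Hello ", "/w:document/w:body/w:p[2]/w:r[1]"),
               TextRun("world", "/w:document/w:body/w:p[2]/w:r[2]", bold=True)]),
]
markdown_text = to_markdown(doc)
json_text = to_json(doc)
```

Keeping text in leaf nodes rather than bare strings means each formatted span has its own `xpath`, which is what a reverse (Markdown → DOCX) pass would need in order to locate the run to modify.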

Fidelity verified against pandoc 3.1.3:

  • 35/35 headings matched
  • 26/26 bullet items, 28/28 ordered items identical
  • 8/8 tables with matching content
  • Markdown output matches pandoc's GFM reference conversion

Test coverage: 29 tests covering parser, XPath traceability, markdown
serializer, and JSON serializer.
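
As an illustration of what an XPath traceability check could look like, the sketch below records a path for every run in a tiny hand-written document and then confirms that each path resolves back to the element it was taken from. It uses only the standard library and is not taken from the PR's test suite: `collect_runs`, `resolve`, the sample XML, and the relative `./w:body/...` path format are assumptions (ElementTree supports only a subset of XPath, so paths here are relative to the root element).

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the conventional "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hand-written stand-in for word/document.xml (not a real fixture).
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>First</w:t></w:r></w:p>
    <w:p><w:r><w:t>Second</w:t></w:r><w:r><w:t> run</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def run_text(run):
    """Concatenate the text of all w:t children of a w:r element."""
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def collect_runs(root):
    """Return (xpath, text) pairs for every w:r, in document order."""
    runs = []
    body = root.find("w:body", NS)
    for p_idx, para in enumerate(body.findall("w:p", NS), start=1):
        for r_idx, run in enumerate(para.findall("w:r", NS), start=1):
            xpath = f"./w:body/w:p[{p_idx}]/w:r[{r_idx}]"
            runs.append((xpath, run_text(run)))
    return runs


def resolve(root, xpath):
    """Look the recorded path up again in the source tree."""
    return root.find(xpath, NS)


# Usage: every recorded path must lead back to a run with the same text.
root = ET.fromstring(SAMPLE)
runs = collect_runs(root)
for xpath, text in runs:
    element = resolve(root, xpath)
    assert element is not None, f"dangling xpath: {xpath}"
    assert run_text(element) == text
```

A round-trip check of this kind fails as soon as the parser's indexing drifts from the source tree, for example if skipped elements are not counted when positions are assigned.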

https://claude.ai/code/session_01JsJ2Q6WeDjvkbrsr1meeuR

claude added 2 commits April 8, 2026 13:50
2 participants