feat(extradocx): experimental DOCX → GFM Markdown AST converter #62
Open
sripathikrishnan wants to merge 2 commits into
Adds a new experimental module that converts Microsoft Word .docx files to a GFM-oriented AST as a proof of concept for bidirectional DOCX ↔ Markdown transformation.

Core design:

- ast_nodes.py: Pandoc-inspired AST (Block/Inline split) where every node carries an `xpath` attribute pointing back to the source element in word/document.xml. Text is always `TextRun` leaf nodes (never bare strings), preserving run-level formatting (bold, italic, underline, etc.).
- parser.py: Reads word/document.xml + support files (styles.xml, numbering.xml, rels) and produces the AST. Handles headings, paragraphs, bullet/ordered lists, tables, links, images, and inline formatting.
- serializers.py: Two serializers: to_json() (full-fidelity, XPath-preserving) and to_markdown() (GFM output).

Fidelity verified against pandoc 3.1.3:

- 35/35 headings matched
- 26/26 bullet items, 28/28 ordered items identical
- 8/8 tables with matching content
- Markdown output matches pandoc's GFM reference conversion

Test coverage: 29 tests covering parser, XPath traceability, markdown serializer, and JSON serializer.

https://claude.ai/code/session_01JsJ2Q6WeDjvkbrsr1meeuR
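For reviewers who want a concrete picture of the design described above, here is a minimal sketch of the idea, not the code in this PR. The class fields, the `run_to_markdown` helper, and the example XPath strings are assumptions made for illustration; only the names `TextRun`, `xpath`, `to_json()`, and `to_markdown()` come from the description. It shows text held in `TextRun` leaves that carry an `xpath`, plus two toy serializers over the same tree.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class TextRun:
    """Inline leaf: one run of text plus its run-level formatting."""
    text: str
    xpath: str  # assumed form: path to the source w:r in word/document.xml
    bold: bool = False
    italic: bool = False


@dataclass
class Heading:
    """Block node (hypothetical shape) holding only TextRun children."""
    level: int
    xpath: str
    children: list[TextRun] = field(default_factory=list)


@dataclass
class Paragraph:
    """Block node (hypothetical shape) holding only TextRun children."""
    xpath: str
    children: list[TextRun] = field(default_factory=list)


def run_to_markdown(run: TextRun) -> str:
    """Wrap a run in GFM emphasis markers (hypothetical helper)."""
    marker = ("**" if run.bold else "") + ("*" if run.italic else "")
    return f"{marker}{run.text}{marker}"


def to_markdown(blocks: list) -> str:
    """Toy GFM serializer: drops xpath, keeps visible formatting."""
    parts = []
    for block in blocks:
        line = "".join(run_to_markdown(r) for r in block.children)
        if isinstance(block, Heading):
            line = "#" * block.level + " " + line
        parts.append(line)
    return "\n\n".join(parts) + "\n"


def to_json(blocks: list) -> str:
    """Toy full-fidelity serializer: every node keeps its xpath."""
    payload = [{"type": type(b).__name__, **asdict(b)} for b in blocks]
    return json.dumps(payload, indent=2)


# Illustrative document; the XPath strings are made up for this sketch.
body = "/w:document/w:body"
doc = [
    Heading(1, f"{body}/w:p[1]", [TextRun("Title", f"{body}/w:p[1]/w:r[1]")]),
    Paragraph(
        f"{body}/w:p[2]",
        [
            TextRun("Hello ", f"{body}/w:p[2]/w:r[1]"),
            TextRun("world", f"{body}/w:p[2]/w:r[2]", bold=True),
        ],
    ),
]

print(to_markdown(doc))
print(to_json(doc))
```

Keeping text in leaf nodes rather than bare strings means the JSON form can always map a piece of text back to a single source element, while the Markdown form is free to discard that information. The sketch ignores edge cases such as runs with leading or trailing whitespace inside emphasis markers.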