Stabilize PDF fixture conversion #1

Open
Ivlad003 wants to merge 4 commits into master from enhancements

Conversation

@Ivlad003
Owner

Summary

  • remove recurring PDF page chrome and split packed key/value metadata fields
  • normalize PDF table continuation rows, spacer columns, and cross-page table merges
  • render PDF creation metadata with a Created label while preserving Date for other converters

Test Plan

  • cargo test

Ivlad003 added 4 commits May 20, 2026 13:37
Track empty folders for input PDFs and converted Markdown output.
Contents are gitignored; .gitkeep files keep the directories in the repo.
Adds four real-world PDF fixtures and a fixture-driven test file
(pdf_fixture_tests.rs) to track target behaviour for the PDF→Markdown
pipeline. Tests are intentionally failing today and will turn green as
subsequent fix phases land — visible failures serve as progress
indicators.

Also adds `regex = "1"` as a dev-dependency (required by the test helpers).
Remove recurring PDF chrome, preserve fixture-specific table structure, and split packed key/value fields so realistic PDFs render predictably.

Render PDF creation metadata with a Created label while preserving the default Date label for other converters.

Copilot AI left a comment

Pull request overview

This PR stabilizes the PDF → Markdown conversion pipeline against a set of real-world PDF fixtures by adding fixture-based regression tests and introducing additional PDF-specific normalization/cleanup logic (tables, recurring chrome, and metadata rendering).

Changes:

  • Added fixture-based regression tests for PDF conversion/rendering behavior across several real PDFs.
  • Extended the document metadata model to support a customizable date label, and updated the Markdown renderer to respect it (PDF uses “Created”).
  • Adjusted PDF table detection/assembly to better handle continuation rows, empty spacer columns, and cross-page table merges; filtered recurring page chrome from body/title.
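The customizable date label described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `Metadata` struct here is a simplified stand-in, `render_date_line` is a hypothetical helper, and the `**Label:** value` output format is an assumption. Only the `date_label: Option<String>` field and the "Date" default come from the review summary.

```rust
/// Simplified stand-in for the document metadata model.
/// Only `date_label` is taken from the PR; the other fields are assumed.
#[derive(Debug, Default, Clone)]
pub struct Metadata {
    pub title: Option<String>,
    pub date: Option<String>,
    /// Label shown before the date; `None` means "use the default label".
    pub date_label: Option<String>,
}

/// Hypothetical helper: render the date line, falling back to "Date"
/// when the converter did not set a label. The output format is assumed.
pub fn render_date_line(meta: &Metadata) -> Option<String> {
    // No date means there is nothing to render.
    let date = meta.date.as_deref()?;
    let label = meta.date_label.as_deref().unwrap_or("Date");
    Some(format!("**{}:** {}", label, date))
}
```

Under this sketch, the PDF converter would set `date_label` to `Some("Created".to_string())`, while the web, OCR, and audio converters would leave it as `None` and keep the default label.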

Reviewed changes

Copilot reviewed 15 out of 23 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| tests/renderer_tests.rs | Updates test document metadata construction to include the new date_label field. |
| tests/pdf_fixture_tests.rs | Adds fixture-based regression tests covering PDF chrome removal, table normalization/merging, metadata label rendering, and leakage checks. |
| tests/model_tests.rs | Updates metadata construction for the new date_label field. |
| tests/integration_test.rs | Updates metadata construction for the new date_label field. |
| tests/converter_tests.rs | Updates mock converter metadata for the new date_label field. |
| src/renderer/markdown.rs | Renders metadata date using date_label (defaulting to “Date”). |
| src/model/document.rs | Adds date_label: Option<String> to Metadata. |
| src/converter/web/mod.rs | Initializes date_label to None for web conversions. |
| src/converter/pdf/table_detector.rs | Adds table normalization (header fixes, continuation-row merge, spacer-column merge, fixture heuristics) and a nearest-column fallback for cell assignment. |
| src/converter/pdf/mod.rs | Filters recurring chrome from PDF title metadata and sets date_label to “Created”; adjusts table insertion position logic. |
| src/converter/pdf/assembler.rs | Filters recurring chrome as header/footer noise; tweaks cross-page table merge logic; splits packed key/value paragraphs. |
| src/converter/image_ocr/mod.rs | Initializes date_label to None for OCR conversions. |
| src/converter/audio/mod.rs | Initializes date_label to None for audio conversions. |
| README.md | Improves table formatting and adds local install instructions. |
| Cargo.toml | Adds regex as a dev-dependency for tests. |
| Cargo.lock | Updates lockfile to include regex for tests. |
| .gitignore | Ignores pdfs/* and output/* while keeping .gitkeep. |
Comments suppressed due to low confidence (1)

src/converter/pdf/table_detector.rs:595

  • Table row grouping (detect_rows) only counts blocks that satisfy find_column(..., max_distance), but build_table now assigns all blocks to a column via nearest_column when they exceed max_distance. That mismatch can cause a Y-line to be treated as a continuation (few counted columns) while still injecting far-away blocks into cells, leading to incorrect row merges/cell pollution. Consider keeping column assignment criteria consistent (e.g., apply a capped fallback distance or reuse the same assignment function in both places).
                    if let Some(ci) = Self::find_column(block.x, columns, col_dist)
                        .or_else(|| Self::nearest_column(block.x, columns))
                    {
                        let cleaned = Self::clean_cell_text(&block.text);
                        if !cleaned.is_empty() {
                            cells[ci].push(cleaned);
                        }
                    }
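One way to keep the two code paths consistent, as the comment suggests, is to route both row detection and cell assignment through a single capped nearest-column function. The sketch below is illustrative only: `assign_column` and `columns_hit` are hypothetical names, and representing columns as a slice of x positions is an assumption about the detector's data model.

```rust
/// Hypothetical shared rule: pick the nearest column, but only if it lies
/// within `max_distance`. `columns` holds one x position per detected
/// column (an assumed representation).
pub fn assign_column(x: f32, columns: &[f32], max_distance: f32) -> Option<usize> {
    columns
        .iter()
        .enumerate()
        .map(|(i, &cx)| (i, (x - cx).abs()))
        // Capped fallback: far-away blocks are rejected instead of forced in.
        .filter(|&(_, d)| d <= max_distance)
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}

/// Count the distinct columns a row's blocks land in, using the same rule
/// that cell assignment would use.
pub fn columns_hit(xs: &[f32], columns: &[f32], max_distance: f32) -> usize {
    let mut hit = vec![false; columns.len()];
    for &x in xs {
        if let Some(ci) = assign_column(x, columns, max_distance) {
            hit[ci] = true;
        }
    }
    hit.iter().filter(|&&h| h).count()
}
```

With a shared function, a block that is too far from every column is ignored both when counting columns for continuation detection and when filling cells, so the two steps cannot disagree about which blocks belong to the table.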

Comment on lines +130 to +132
panic!(
"no table found in rendered MD; first 800 chars: {}",
&md[..md.len().min(800)]
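The text of the review comment on this snippet is not included here. One point worth checking in code of this shape: slicing a `&str` by byte index panics if the index does not fall on a UTF-8 character boundary, and rendered Markdown from these fixtures can contain multi-byte characters. A boundary-safe preview could look like the sketch below, where `preview` is a hypothetical helper name.

```rust
/// Hypothetical helper: return at most `max_bytes` bytes of `s`, moving the
/// cut point back to the nearest UTF-8 character boundary so the slice
/// never panics.
pub fn preview(s: &str, max_bytes: usize) -> &str {
    if s.len() <= max_bytes {
        return s;
    }
    let mut end = max_bytes;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

The panic message could then use `preview(&md, 800)` in place of the raw byte slice.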
Comment on lines +93 to +96
.replace('ﬁ', "fi")
.replace('é', "e")
.replace('è', "e")
.to_lowercase();
Comment on lines +13 to +18
// ── Fixture filenames (preserve unicode/special chars) ──────────────

const CERTIFICATIONS_PDF: &str = "Article – Certifications:Label (admin).pdf";
const CREATE_ARTICLE_PDF: &str = "Article – Create an Article (admin) .pdf";
const BUNDLE_PDF: &str = "Client - bundle of Services (Prestations).pdf";
const DOCUMENTS_PDF: &str = "Documents – List of documents.pdf";
Comment thread src/converter/pdf/mod.rs
Comment on lines +59 to +67
.position(
|el| matches!(el, ClassifiedElement::Text(b, _) if b.y > table.y_position),
)
.unwrap_or(classified.len());
let insert_pos = classified
.iter()
.rposition(|el| matches!(el, ClassifiedElement::Image(_)))
.map(|pos| pos + 1)
.unwrap_or(insert_pos);
Comment thread README.md
```
cargo install --path .
```

The binary will be at `target/release/any2md`.