Stabilize PDF fixture conversion #1
Open
Ivlad003 wants to merge 4 commits into
Track empty folders for input PDFs and converted Markdown output. Contents are gitignored; .gitkeep files keep the directories in the repo.
Adds four real-world PDF fixtures and a fixture-driven test file (pdf_fixture_tests.rs) to track target behaviour for the PDF→Markdown pipeline. Tests are intentionally failing today and will turn green as subsequent fix phases land — visible failures serve as progress indicators. Also adds `regex = "1"` as a dev-dependency (required by the test helpers).
Remove recurring PDF chrome, preserve fixture-specific table structure, and split packed key/value fields so realistic PDFs render predictably. Render PDF creation metadata with a Created label while preserving the default Date label for other converters.
Pull request overview
This PR stabilizes the PDF → Markdown conversion pipeline against a set of real-world PDF fixtures by adding fixture-based regression tests and introducing additional PDF-specific normalization/cleanup logic (tables, recurring chrome, and metadata rendering).
Changes:
- Added fixture-based regression tests for PDF conversion/rendering behavior across several real PDFs.
- Extended the document metadata model to support a customizable date label, and updated the Markdown renderer to respect it (PDF uses “Created”).
- Adjusted PDF table detection/assembly to better handle continuation rows, empty spacer columns, and cross-page table merges; filtered recurring page chrome from body/title.
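The label fallback described in the changes above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `Metadata` struct is cut down to three fields, and `render_date_line` and the bold `**Label:** value` output format are assumptions made for the example.

```rust
/// Simplified stand-in for the document metadata model (assumed shape).
#[derive(Debug, Default, Clone)]
pub struct Metadata {
    pub title: Option<String>,
    pub date: Option<String>,
    /// Label used when rendering `date`; `None` means "use the renderer default".
    pub date_label: Option<String>,
}

/// Hypothetical helper: render the date line, falling back to "Date"
/// when the converter did not set a label.
pub fn render_date_line(meta: &Metadata) -> Option<String> {
    let date = meta.date.as_deref()?;
    let label = meta.date_label.as_deref().unwrap_or("Date");
    Some(format!("**{label}:** {date}"))
}
```

Keeping the field optional means only the PDF converter has to opt in to "Created"; the web, OCR, and audio converters set `None` and keep the existing output.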
Reviewed changes
Copilot reviewed 15 out of 23 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/renderer_tests.rs | Updates test document metadata construction to include the new date_label field. |
| tests/pdf_fixture_tests.rs | Adds fixture-based regression tests covering PDF chrome removal, table normalization/merging, metadata label rendering, and leakage checks. |
| tests/model_tests.rs | Updates metadata construction for the new date_label field. |
| tests/integration_test.rs | Updates metadata construction for the new date_label field. |
| tests/converter_tests.rs | Updates mock converter metadata for the new date_label field. |
| src/renderer/markdown.rs | Renders metadata date using date_label (defaulting to “Date”). |
| src/model/document.rs | Adds date_label: Option<String> to Metadata. |
| src/converter/web/mod.rs | Initializes date_label to None for web conversions. |
| src/converter/pdf/table_detector.rs | Adds table normalization (header fixes, continuation-row merge, spacer-column merge, fixture heuristics) and a nearest-column fallback for cell assignment. |
| src/converter/pdf/mod.rs | Filters recurring chrome from PDF title metadata and sets date_label to “Created”; adjusts table insertion position logic. |
| src/converter/pdf/assembler.rs | Filters recurring chrome as header/footer noise; tweaks cross-page table merge logic; splits packed key/value paragraphs. |
| src/converter/image_ocr/mod.rs | Initializes date_label to None for OCR conversions. |
| src/converter/audio/mod.rs | Initializes date_label to None for audio conversions. |
| README.md | Improves table formatting and adds local install instructions. |
| Cargo.toml | Adds regex as a dev-dependency for tests. |
| Cargo.lock | Updates lockfile to include regex for tests. |
| .gitignore | Ignores pdfs/* and output/* while keeping .gitkeep. |
Comments suppressed due to low confidence (1)
src/converter/pdf/table_detector.rs:595
- Table row grouping (`detect_rows`) only counts blocks that satisfy `find_column(..., max_distance)`, but `build_table` now assigns all blocks to a column via `nearest_column` when they exceed `max_distance`. That mismatch can cause a Y-line to be treated as a continuation (few counted columns) while still injecting far-away blocks into cells, leading to incorrect row merges/cell pollution. Consider keeping column assignment criteria consistent (e.g., apply a capped fallback distance or reuse the same assignment function in both places).
if let Some(ci) = Self::find_column(block.x, columns, col_dist)
.or_else(|| Self::nearest_column(block.x, columns))
{
let cleaned = Self::clean_cell_text(&block.text);
if !cleaned.is_empty() {
cells[ci].push(cleaned);
}
}
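One way to read the "capped fallback / same assignment function" suggestion is sketched below. The names `assign_column` and `occupied_columns`, the `&[f32]` column representation, and the `cap` parameter are all assumptions for illustration; the real `find_column` and `nearest_column` signatures are not shown in this review.

```rust
use std::collections::BTreeSet;

/// Hypothetical shared rule: index of the column closest to `x`,
/// but only if it lies within `cap`. Blocks farther away get no column.
fn assign_column(x: f32, columns: &[f32], cap: f32) -> Option<usize> {
    columns
        .iter()
        .enumerate()
        .map(|(i, &cx)| (i, (cx - x).abs()))
        .filter(|&(_, dist)| dist <= cap)
        .min_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}

/// Row grouping counts distinct columns using the same rule that cell
/// building would use, so the two steps cannot disagree about a block.
fn occupied_columns(xs: &[f32], columns: &[f32], cap: f32) -> usize {
    xs.iter()
        .filter_map(|&x| assign_column(x, columns, cap))
        .collect::<BTreeSet<_>>()
        .len()
}
```

With a single rule, a block that is too far from every column is ignored both when counting columns for a row and when filling cells, which avoids the case where a line is classified as a continuation yet still contributes stray text.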
Comment on lines +130 to +132
panic!(
    "no table found in rendered MD; first 800 chars: {}",
    &md[..md.len().min(800)]
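A general Rust point about the quoted slice: indexing a `&str` by byte offset panics when the offset does not fall on a character boundary, which can happen at byte 800 if the rendered Markdown contains multi-byte characters. A minimal sketch of a boundary-safe preview follows; the helper name `preview` is hypothetical and counts characters rather than bytes.

```rust
/// Return at most `max_chars` characters of `s`, always cutting on a
/// character boundary so the slice cannot panic.
fn preview(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        // Byte offset of the first character that should be excluded.
        Some((byte_idx, _)) => &s[..byte_idx],
        // The string is shorter than the limit: return it whole.
        None => s,
    }
}
```

In the test helper this would replace the byte slice with something like `preview(&md, 800)`.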
Comment on lines +93 to +96
.replace('ﬁ', "fi")
.replace('é', "e")
.replace('è', "e")
.to_lowercase();
Comment on lines +13 to +18
// ── Fixture filenames (preserve unicode/special chars) ──────────────

const CERTIFICATIONS_PDF: &str = "Article – Certifications:Label (admin).pdf";
const CREATE_ARTICLE_PDF: &str = "Article – Create an Article (admin) .pdf";
const BUNDLE_PDF: &str = "Client - bundle of Services (Prestations).pdf";
const DOCUMENTS_PDF: &str = "Documents – List of documents.pdf";
Comment on lines +59 to +67
    .position(
        |el| matches!(el, ClassifiedElement::Text(b, _) if b.y > table.y_position),
    )
    .unwrap_or(classified.len());
let insert_pos = classified
    .iter()
    .rposition(|el| matches!(el, ClassifiedElement::Image(_)))
    .map(|pos| pos + 1)
    .unwrap_or(insert_pos);
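To make the behaviour of the quoted fragment easier to reason about, here is a self-contained sketch of the same two-step selection. The `Element` enum is a hypothetical, simplified stand-in for `ClassifiedElement`, and `table_insert_pos` is an illustrative name, not a function from the PR.

```rust
/// Hypothetical, simplified element type; the real `ClassifiedElement`
/// carries text blocks and image data.
enum Element {
    Text { y: f32 },
    Image,
}

/// Mirrors the quoted logic: start from the first text block whose `y`
/// exceeds the table's, then override with "just after the last image"
/// whenever any image exists.
fn table_insert_pos(elements: &[Element], table_y: f32) -> usize {
    let by_text = elements
        .iter()
        .position(|el| matches!(el, Element::Text { y } if *y > table_y))
        .unwrap_or(elements.len());
    elements
        .iter()
        .rposition(|el| matches!(el, Element::Image))
        .map(|pos| pos + 1)
        .unwrap_or(by_text)
}
```

Written this way, the sketch shows that the text-based position is only used when the element list contains no image at all; as soon as one image is present, the table lands after the last image regardless of its `y` position.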
cargo install --path .
```

The binary will be at `target/release/any2md`.