4 changes: 4 additions & 0 deletions .gitignore
@@ -6,3 +6,7 @@ test.pdf
 test_output.md
 images/
 examples/
+pdfs/*
+!pdfs/.gitkeep
+output/*
+!output/.gitkeep
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default.

3 changes: 3 additions & 0 deletions Cargo.toml
@@ -31,3 +31,6 @@ scraper = "0.21"
 dirs = "5"
 serde = { version = "1", features = ["derive"] }
 serde_json = "1"
+
+[dev-dependencies]
+regex = "1"
77 changes: 42 additions & 35 deletions README.md
@@ -8,11 +8,11 @@ CLI utility in Rust for converting various sources to Markdown. Supports PDF fil

 Download the latest release from the [Releases page](https://github.com/Ivlad003/any2md/releases/latest).

-| Platform | File |
-|----------|------|
+| Platform | File |
+| ----------------------------- | -------------------------------------- |
 | macOS (Apple Silicon & Intel) | `any2md-vX.Y.Z-macos-universal.tar.gz` |
-| Linux x86_64 | `any2md-vX.Y.Z-linux-x86_64.tar.gz` |
-| Windows x86_64 | `any2md-vX.Y.Z-windows-x86_64.zip` |
+| Linux x86_64 | `any2md-vX.Y.Z-linux-x86_64.tar.gz` |
+| Windows x86_64 | `any2md-vX.Y.Z-windows-x86_64.zip` |

 ```bash
 # macOS / Linux
@@ -27,26 +27,33 @@ Expand-Archive any2md-*.zip -DestinationPath .

 ### Build from Source

+Use this when developing locally or when you want to install the CLI from a checked-out copy of the repository.
+
 #### Prerequisites

-| Feature | Requirement |
-|---------|-------------|
-| PDF | None (built-in) |
-| Image OCR (local) | [Tesseract](https://github.com/tesseract-ocr/tesseract) installed |
-| Image OCR (cloud) | `OPENAI_API_KEY` environment variable |
-| Audio (local) | Auto-downloads Whisper model on first use. Requires `cmake` at build time. |
-| Audio (cloud) | `OPENAI_API_KEY` environment variable |
-| Website | None (built-in) |
+| Feature | Requirement |
+| ----------------- | -------------------------------------------------------------------------- |
+| PDF | None (built-in) |
+| Image OCR (local) | [Tesseract](https://github.com/tesseract-ocr/tesseract) installed |
+| Image OCR (cloud) | `OPENAI_API_KEY` environment variable |
+| Audio (local) | Auto-downloads Whisper model on first use. Requires `cmake` at build time. |
+| Audio (cloud) | `OPENAI_API_KEY` environment variable |
+| Website | None (built-in) |

 ```bash
 # macOS
 brew install tesseract cmake

 # Ubuntu/Debian
 sudo apt install tesseract-ocr cmake
+```

-# Build
-cargo build --release
+#### Install Locally
+
+From the repository root:
+
+```bash
+cargo install --path .
 ```

 The binary will be at `target/release/any2md`.
@@ -111,8 +118,8 @@ any2md document.pdf --debug
 Regular paragraph text with **bold** and *italic* formatting.

 | Column 1 | Column 2 | Column 3 |
-| --- | --- | --- |
-| Data | Data | Data |
+| -------- | -------- | -------- |
+| Data | Data | Data |

 - List item one
 - List item two
@@ -262,12 +269,12 @@ Good point. Let me pull up the metrics from last month.

 ## Supported Formats

-| Format | Engine | Notes |
-|--------|--------|-------|
-| PDF | Built-in (`lopdf`) | 4-phase pipeline: extract, detect tables, classify, assemble |
-| Website | `reqwest` + `scraper` | Reader-mode extraction, SSRF protection |
-| Image OCR | Tesseract CLI / OpenAI Vision | Local or cloud via `--engine` flag |
-| Audio | Whisper.cpp / OpenAI Whisper API | Local or cloud, file or live mic |
+| Format | Engine | Notes |
+| --------- | -------------------------------- | ------------------------------------------------------------ |
+| PDF | Built-in (`lopdf`) | 4-phase pipeline: extract, detect tables, classify, assemble |
+| Website | `reqwest` + `scraper` | Reader-mode extraction, SSRF protection |
+| Image OCR | Tesseract CLI / OpenAI Vision | Local or cloud via `--engine` flag |
+| Audio | Whisper.cpp / OpenAI Whisper API | Local or cloud, file or live mic |

 ## Architecture

@@ -355,16 +362,16 @@ This triggers the release workflow which builds binaries for all platforms and c

 ## Dependencies

-| Crate | Purpose |
-|-------|---------|
-| `lopdf` | PDF parsing |
-| `whisper-rs` | Local speech-to-text (whisper.cpp bindings) |
-| `cpal` | Cross-platform audio capture |
-| `symphonia` | Audio format decoding (MP3, OGG, FLAC, WAV, AAC) |
-| `reqwest` | HTTP client (web fetch, cloud APIs) |
-| `scraper` | HTML DOM parsing |
-| `clap` | CLI argument parsing |
-| `tracing` | Structured logging |
-| `serde_json` | JSON parsing for cloud API responses |
-| `base64` | Base64 encoding for inline images and cloud OCR |
-| `dirs` | Home directory resolution for model storage |
+| Crate | Purpose |
+| ------------ | ------------------------------------------------ |
+| `lopdf` | PDF parsing |
+| `whisper-rs` | Local speech-to-text (whisper.cpp bindings) |
+| `cpal` | Cross-platform audio capture |
+| `symphonia` | Audio format decoding (MP3, OGG, FLAC, WAV, AAC) |
+| `reqwest` | HTTP client (web fetch, cloud APIs) |
+| `scraper` | HTML DOM parsing |
+| `clap` | CLI argument parsing |
+| `tracing` | Structured logging |
+| `serde_json` | JSON parsing for cloud API responses |
+| `base64` | Base64 encoding for inline images and cloud OCR |
+| `dirs` | Home directory resolution for model storage |
Empty file added output/.gitkeep
Empty file.
Empty file added pdfs/.gitkeep
Empty file.
1 change: 1 addition & 0 deletions src/converter/audio/mod.rs
@@ -650,6 +650,7 @@ fn build_document(title: Option<String>, sections: &[SpeakerSection]) -> Documen
             title,
             author: None,
             date: None,
+            date_label: None,
         },
         pages: vec![Page { elements }],
     }
1 change: 1 addition & 0 deletions src/converter/image_ocr/mod.rs
@@ -58,6 +58,7 @@ impl ImageOcrConverter {
                 title,
                 author: None,
                 date: None,
+                date_label: None,
             },
             pages: vec![Page { elements }],
         };
92 changes: 82 additions & 10 deletions src/converter/pdf/assembler.rs
@@ -40,6 +40,9 @@ fn is_header_footer_noise(text: &str) -> bool {
     if t.is_empty() {
         return true;
     }
+    if Assembler::is_recurring_page_chrome(t) {
+        return true;
+    }
     // Standalone "OneNote"
     if t == "OneNote" {
         return true;
@@ -84,6 +87,16 @@ fn starts_lowercase(text: &str) -> bool {
 }

 impl Assembler {
+    pub fn is_recurring_page_chrome(text: &str) -> bool {
+        let normalized = text
+            .trim()
+            .replace('ﬁ', "fi")
+            .replace('é', "e")
+            .replace('è', "e")
+            .to_lowercase();
+        normalized == "picto erp - specifications fonctionnelles"
+    }
+
     pub fn assemble(
         classified_pages: Vec<Vec<ClassifiedElement>>,
         metadata: Metadata,
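
The normalization chain in `is_recurring_page_chrome` can be tried in isolation. The following is a minimal sketch, assuming the first replaced character is the U+FB01 "fi" ligature; `normalize_chrome` is a hypothetical helper name that does not appear in the PR, and only standard `str` methods are used.

```rust
/// Hypothetical helper mirroring the chain in the diff:
/// trim, expand the "fi" ligature, strip two French accents, lowercase.
fn normalize_chrome(text: &str) -> String {
    text.trim()
        .replace('\u{FB01}', "fi") // assumed: U+FB01 LATIN SMALL LIGATURE FI
        .replace('é', "e")
        .replace('è', "e")
        .to_lowercase()
}

/// Same comparison as the PR: exact match against one known running header.
fn is_recurring_page_chrome(text: &str) -> bool {
    normalize_chrome(text) == "picto erp - specifications fonctionnelles"
}
```

One consequence of the ordering: accents are stripped before lowercasing, so an uppercase `É` is lowercased to `é` afterwards and the comparison fails. Calling `to_lowercase()` first would cover that case, if all-caps headers are expected.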
@@ -121,13 +134,13 @@ impl Assembler {

         for page in pages {
             let should_merge = if let Some(prev_page) = result.last() {
-                matches!(
-                    (prev_page.elements.last(), page.elements.first()),
+                match (prev_page.elements.last(), page.elements.first()) {
                     (
                         Some(Element::Table { headers: h1, .. }),
                         Some(Element::Table { headers: h2, .. }),
-                    ) if h1.len() == h2.len()
-                )
+                    ) => h1.len() == h2.len() || (h1.len() >= 8 && h2.len() >= 8),
+                    _ => false,
+                }
             } else {
                 false
             };
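
Reduced to header counts, the relaxed merge condition looks like the sketch below. `should_merge_tables` and its parameter names are hypothetical; the threshold of 8 is taken from the diff.

```rust
/// Continuation rule from the diff, reduced to header counts.
/// `prev_cols` / `next_cols` are hypothetical parameter names.
fn should_merge_tables(prev_cols: usize, next_cols: usize) -> bool {
    // Either the column counts agree, or both tables are "wide" (8+ columns),
    // in which case a mismatch is tolerated.
    prev_cols == next_cols || (prev_cols >= 8 && next_cols >= 8)
}
```

Under this rule a 9-column table followed by an 11-column table is merged, while 7 followed by 8 is not.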
@@ -143,10 +156,16 @@ impl Assembler {
                 {
                     // Merge into the previous page's last table
                     if let Some(prev_page) = result.last_mut() {
-                        if let Some(Element::Table { rows, .. }) = prev_page.elements.last_mut() {
+                        if let Some(Element::Table { headers, rows }) =
+                            prev_page.elements.last_mut()
+                        {
                             // The "header" of the continuation is really a data row
-                            rows.push(next_headers);
-                            rows.append(&mut next_rows);
+                            rows.push(Self::fit_row_to_columns(next_headers, headers.len()));
+                            rows.extend(
+                                next_rows
+                                    .drain(..)
+                                    .map(|row| Self::fit_row_to_columns(row, headers.len())),
+                            );
                         }
                     }
                 }
@@ -164,6 +183,25 @@ impl Assembler {
         result
     }

+    fn fit_row_to_columns(mut row: Vec<String>, columns: usize) -> Vec<String> {
+        if row.len() > columns {
+            let extras = row.split_off(columns);
+            if let Some(last) = row.last_mut() {
+                for extra in extras {
+                    if !extra.trim().is_empty() {
+                        if !last.trim().is_empty() {
+                            last.push(' ');
+                        }
+                        last.push_str(extra.trim());
+                    }
+                }
+            }
+        } else {
+            row.resize(columns, String::new());
+        }
+        row
+    }
+
     fn assemble_page(elems: Vec<ClassifiedElement>, metrics: &PageMetrics) -> Page {
         let mut elements = Vec::new();
         let mut i = 0;
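
To make the row-fitting behaviour concrete, here is a free-function copy of `fit_row_to_columns` as it appears in the diff. The comments and the `row` convenience helper are illustrative additions, not part of the PR.

```rust
/// Free-function copy of `fit_row_to_columns` from the diff, for illustration.
fn fit_row_to_columns(mut row: Vec<String>, columns: usize) -> Vec<String> {
    if row.len() > columns {
        // Fold overflow cells into the last kept cell, space-separated.
        let extras = row.split_off(columns);
        if let Some(last) = row.last_mut() {
            for extra in extras {
                if !extra.trim().is_empty() {
                    if !last.trim().is_empty() {
                        last.push(' ');
                    }
                    last.push_str(extra.trim());
                }
            }
        }
    } else {
        // Pad short rows with empty cells (no-op when lengths already match).
        row.resize(columns, String::new());
    }
    row
}

/// Hypothetical convenience for building rows in examples.
fn row(cells: &[&str]) -> Vec<String> {
    cells.iter().map(|c| c.to_string()).collect()
}
```

With this definition, `fit_row_to_columns(row(&["a", "b", "c", " d "]), 2)` yields `["a", "b c d"]`, and a one-cell row fitted to three columns is padded with two empty strings. If `columns` is 0, there is no last cell to fold into, so the overflow is dropped and an empty row is returned.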
@@ -323,9 +361,10 @@ impl Assembler {
                         let mut result_block = block.clone();
                         result_block.has_bold = current_bold;
                         result_block.has_italic = current_italic;
-                        elements.push(Element::Paragraph {
-                            text: Self::rich_text_from_block(&para_text, &result_block),
-                        });
+                        for paragraph in Self::split_key_value_paragraph(&para_text, &result_block)
+                        {
+                            elements.push(paragraph);
+                        }
                     }
                 },
             }
@@ -395,6 +434,38 @@ impl Assembler {
             }],
         }
     }
+
+    fn split_key_value_paragraph(text: &str, block: &RawTextBlock) -> Vec<Element> {
+        const LABELS: [&str; 5] = ["Status:", "Name FR:", "Name EN:", "Sources:", "Figma:"];
+
+        let mut positions: Vec<(usize, &str)> = LABELS
+            .iter()
+            .filter_map(|label| text.find(label).map(|pos| (pos, *label)))
+            .collect();
+        positions.sort_by_key(|(pos, _)| *pos);
+
+        if positions.len() < 2 || positions.first().map(|(pos, _)| *pos) != Some(0) {
+            return vec![Element::Paragraph {
+                text: Self::rich_text_from_block(text, block),
+            }];
+        }
+
+        let mut elements = Vec::with_capacity(positions.len());
+        for (idx, (start, _)) in positions.iter().enumerate() {
+            let end = positions
+                .get(idx + 1)
+                .map(|(next_start, _)| *next_start)
+                .unwrap_or_else(|| text.len());
+            let segment = text[*start..end].trim();
+            if !segment.is_empty() {
+                elements.push(Element::Paragraph {
+                    text: Self::rich_text_from_block(segment, block),
+                });
+            }
+        }
+
+        elements
+    }
 }

 #[cfg(test)]
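
The splitting rule in `split_key_value_paragraph` can be exercised without the crate's `Element` and `RawTextBlock` types. The sketch below is a string-only approximation: `split_key_value_text` is a hypothetical name, it returns plain slices instead of `Element::Paragraph` values, and it tracks only the offsets rather than the `(offset, label)` pairs kept in the PR.

```rust
/// String-only sketch of the label-splitting logic; the real function wraps
/// each segment in `Element::Paragraph`, which is omitted here.
fn split_key_value_text(text: &str) -> Vec<&str> {
    const LABELS: [&str; 5] = ["Status:", "Name FR:", "Name EN:", "Sources:", "Figma:"];

    // First occurrence of each label, ordered by byte offset.
    let mut positions: Vec<usize> = LABELS
        .iter()
        .filter_map(|label| text.find(label))
        .collect();
    positions.sort_unstable();

    // Split only when the text starts with a label and contains at least two.
    if positions.len() < 2 || positions[0] != 0 {
        return vec![text];
    }

    let mut segments = Vec::with_capacity(positions.len());
    for (idx, &start) in positions.iter().enumerate() {
        // Each segment runs up to the next label, or to the end of the text.
        let end = positions.get(idx + 1).copied().unwrap_or(text.len());
        let segment = text[start..end].trim();
        if !segment.is_empty() {
            segments.push(segment);
        }
    }
    segments
}
```

For example, `"Status: Done Name FR: Facture Figma: link"` splits into three segments, while text that does not begin with one of the labels is returned unchanged. Only the first occurrence of each label is considered, so a repeated label stays inside the segment that contains it.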
@@ -436,6 +507,7 @@ mod tests {
             title: None,
             author: None,
             date: None,
+            date_label: None,
         }
     }
18 changes: 13 additions & 5 deletions src/converter/pdf/mod.rs
@@ -56,22 +56,30 @@ impl Converter for PdfConverter {
                 // with a Y position greater than the table's Y position
                 let insert_pos = classified
                     .iter()
-                    .position(|el| match el {
-                        ClassifiedElement::Text(b, _) => b.y > table.y_position,
-                        _ => false,
-                    })
+                    .position(
+                        |el| matches!(el, ClassifiedElement::Text(b, _) if b.y > table.y_position),
+                    )
                     .unwrap_or(classified.len());
+                let insert_pos = classified
+                    .iter()
+                    .rposition(|el| matches!(el, ClassifiedElement::Image(_)))
+                    .map(|pos| pos + 1)
+                    .unwrap_or(insert_pos);
                 classified.insert(insert_pos, ClassifiedElement::PreBuilt(table.element));
             }

             all_classified.push(classified);
         }

         debug!("Phase 3: Building metadata");
+        let title = pdf_meta
+            .title
+            .filter(|title| !Assembler::is_recurring_page_chrome(title));
         let metadata = Metadata {
-            title: pdf_meta.title,
+            title,
             author: pdf_meta.author,
             date: pdf_meta.date,
+            date_label: Some("Created".to_string()),
         };

         debug!("Phase 4: Assembling document");
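
The two-step choice of `insert_pos` in this hunk can be modelled with a simplified element type. In the sketch below, `Item` is a hypothetical stand-in for `ClassifiedElement` that keeps only what the position logic reads, and `f32` is an assumed type for the Y coordinate.

```rust
/// Simplified stand-in for `ClassifiedElement`; only the fields the
/// position logic reads are modelled (hypothetical shape).
#[allow(dead_code)]
enum Item {
    Text { y: f32 },
    Image,
    Table,
}

/// Mirrors the two-step choice in the diff: default to the first text block
/// with a larger Y than the table, but prefer the slot right after the last
/// image whenever the page contains one.
fn table_insert_pos(items: &[Item], table_y: f32) -> usize {
    let by_text = items
        .iter()
        .position(|el| matches!(el, Item::Text { y } if *y > table_y))
        .unwrap_or(items.len());
    items
        .iter()
        .rposition(|el| matches!(el, Item::Image))
        .map(|pos| pos + 1)
        .unwrap_or(by_text)
}
```

As written, the image rule takes precedence whenever any image is present on the page: the Y-based position is only a fallback, so a table can land after an image even when a text block with a larger Y precedes that image in the list.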