|
| 1 | +# Algorithmic Review: extradoc Diff/Reconcile System |
| 2 | + |
| 3 | +**Date**: 2026-04-11 |
| 4 | +**Scope**: `diffmerge/`, `reconcile_v3/`, `serde/` — the full pull→edit→diff→push pipeline |
| 5 | +**Lens**: "Lens correctness" — can we edit a view and apply it back without corrupting invisible features? |
| 6 | + |
| 7 | +--- |
| 8 | + |
| 9 | +## 1. System Architecture Overview |
| 10 | + |
| 11 | +The pipeline has three conceptually distinct layers: |
| 12 | + |
| 13 | +``` |
| 14 | +Pull (Google Docs API) |
| 15 | + ↓ |
| 16 | +Base Document (typed Pydantic models, lossless) |
| 17 | + ├── Serialize → Markdown (lossy projection) |
| 18 | + │ |
| 19 | + └─→ User edits Markdown on disk |
| 20 | + ├── Deserialize → Desired Document (Pydantic models) |
| 21 | + │ (3-way merge: ancestor = pristine parse, desired = edited parse, base = live) |
| 22 | + │ |
| 23 | + └─→ Reconcile: diff(base, desired) → batchUpdate requests |
| 24 | +``` |
| 25 | + |
| 26 | +**Key design principles:** |
| 27 | +- Element-level granularity: diffs operate on paragraphs/tables/rows, not characters |
| 28 | +- Stable-ID anchoring: tabs, headers, footnotes matched by API-assigned IDs |
| 29 | +- 3-way merge: the serde round-trip is never applied directly to `base`; instead `apply_ops(base, diff(ancestor, desired))` carries invisible features from `base` |
| 30 | +- Index arithmetic: the reconciler works in "base" coordinates and computes shifts for prior deletes/inserts within the same batch |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## 2. Diff Algorithm Analysis |
| 35 | + |
| 36 | +### 2.1 Content Alignment (`content_align.py`) |
| 37 | + |
| 38 | +**Algorithm**: Minimum-cost edit-distance DP (not Myers diff, not standard LCS) |
| 39 | + |
| 40 | +**Cost model**: |
| 41 | +- Delete a paragraph: `len(text) × 2.0 + inline_count × 50.0` |
| 42 | +- Insert a paragraph: same |
| 43 | +- Match two paragraphs: `(1.0 - similarity) × max_len` where similarity is word-level Jaccard |
| 44 | +- Terminal paragraphs: `INFINITE_PENALTY` (pre-matched, never deleted) |
| 45 | +- Tables: Fuzzy cell-text Jaccard; `MIN_TABLE_MATCH_SIMILARITY = 0.25` |
| 46 | + |
| 47 | +**Pre-match constraints** (applied before DP): |
| 48 | +1. **Terminal pre-match**: Segment-final elements are always matched; prevents deletion of trailing `\n` |
| 49 | +2. **Table-flank pinning** (`_pin_table_flanks`): Paragraphs immediately adjacent to tables are forced to match if a 1:1 opportunity exists — prevents structural orphaning |
| 50 | +3. **Positional fallback**: In 1:1 gaps (one base element, one desired element, same kind), they are promoted to a match regardless of similarity |
| 51 | + |
| 52 | +### 2.2 Table Diff (`table_diff.py`) |
| 53 | + |
| 54 | +**Row alignment**: Fuzzy LCS using **cell-text recall** (overlap / base_size), threshold = 0.5 |
| 55 | +**Column alignment**: LCS on per-column text hashes |
| 56 | + |
| 57 | +**Delete/insert ordering**: Row deletions emitted in **descending index order** to prevent index shifting during the API request sequence. |
| 58 | + |
| 59 | +### 2.3 Where Delete+Insert Pairs Arise |
| 60 | + |
| 61 | +| Trigger | Condition | Result | |
| 62 | +|---------|-----------|--------| |
| 63 | +| Paragraph similarity below threshold | word-Jaccard < 0.3 AND affix-ratio < 0.4 | Delete+Insert paragraph | |
| 64 | +| Table row recall below threshold | recall < 0.5 AND positional similarity < 0.5 | Delete+Insert row | |
| 65 | +| Unmatched table (similarity < 0.25) | Table text changes drastically | Delete+Insert whole table | |
| 66 | +| Narrowly-failed 1:1 pre-match | Both elements exist but similarity check fails | DP may delete+insert anyway | |
| 67 | + |
| 68 | +The threshold values are heuristics. With word-Jaccard = 0.3, a paragraph that changes 70% of its word tokens will be treated as a deletion, which may corrupt invisible features (comments, formatting on unchanged characters). |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## 3. Round-Trip Safety Analysis |
| 73 | + |
| 74 | +### 3.1 What Survives Untouched (Transparent Features) |
| 75 | + |
| 76 | +These survive `pull → edit → push` because the 3-way merge carries them from `base`: |
| 77 | + |
| 78 | +| Feature | Mechanism | |
| 79 | +|---------|-----------| |
| 80 | +| Named paragraph styles | Carried on matched paragraphs via `apply_ops` | |
| 81 | +| Text formatting (bold, italic, color) | Carried on matched text runs | |
| 82 | +| Comments / suggestions | Not touched; Google API handles separately | |
| 83 | +| Custom paragraph properties | Carried via `_carry_through_unmatched_raw` | |
| 84 | +| Footnote content (unedited) | Matched by footnoteId; carried through | |
| 85 | +| Header/footer structure | Matched by ID; carried through | |
| 86 | +| Table cells (unedited) | Unchanged cells carry through in table diff | |
| 87 | + |
| 88 | +### 3.2 What Can Be Corrupted (High-Risk Scenarios) |
| 89 | + |
| 90 | +#### Scenario A: Inline Passthrough Element in Edited Paragraph |
| 91 | + |
| 92 | +``` |
| 93 | +Base: "Hello <x-colbreak/> world" |
| 94 | +Edited: "Hi there <x-colbreak/> world" |
| 95 | +
|
| 96 | +Diff matches the paragraphs (similarity ≈ 0.6 → in-place update). |
| 97 | +The reconciler emits deleteContentRange + insertText for the text portion. |
| 98 | +But <x-colbreak/> is a named-range annotation with a base-coordinate index. |
| 99 | +After the text edit, the colbreak's base index is stale. |
| 100 | +``` |
| 101 | + |
| 102 | +**Status**: Bug #65, currently `xfail(strict=True)`. Confirmed broken. |
| 103 | + |
| 104 | +**Affects**: Column breaks, page breaks, equations, rich links — anything represented as a named-range marker in markdown, embedded in an edited paragraph. |
| 105 | + |
| 106 | +**Root cause**: The reconciler treats named-range markers as opaque passthrough with fixed base indices. It has no mechanism to adjust those indices when the surrounding paragraph is edited in-place. |
| 107 | + |
| 108 | +#### Scenario B: Delete+Insert Destroys Invisible Formatting |
| 109 | + |
| 110 | +``` |
| 111 | +Base: "The quick brown fox" (entire paragraph is italic via API-only style) |
| 112 | +Edited: "A slow red fox" (word-Jaccard ≈ 0.14 → Delete+Insert) |
| 113 | +
|
| 114 | +Result: New paragraph inserted, base paragraph deleted. |
| 115 | + Italic style from base is GONE. |
| 116 | +``` |
| 117 | + |
| 118 | +This is expected (by the system's design) but the threshold is a binary cliff: at 0.3 Jaccard similarity, a paragraph suddenly flips from "in-place update that preserves invisible styles" to "delete+insert that destroys them." |
| 119 | + |
| 120 | +**Affects**: Any document with invisible paragraph-level styles (background shading, border, custom spacing) on paragraphs that are substantially rewritten. |
| 121 | + |
| 122 | +#### Scenario C: Table Row Below Match Threshold |
| 123 | + |
| 124 | +``` |
| 125 | +Base table row: ["Product ID", "Description (en)", "Price USD"] |
| 126 | +Edited: ["Product ID", "Beschreibung", "Price USD"] |
| 127 | +
|
| 128 | +Recall = ({"Product ID", "Price USD"} ∩ {"Product ID", "Beschreibung", "Price USD"}) / 3 = 2/3 ≈ 0.67 → MATCH ✓ |
| 129 | +
|
| 130 | +But if a 4th cell with completely different text is added and the algorithm uses |
| 131 | +positional similarity: |
| 132 | + pos_sim = (1.0 + 0.0 + 1.0 + 0.0) / 4 = 0.5 → borderline |
| 133 | +``` |
| 134 | + |
| 135 | +Row deletions destroy any Google Docs formatting on the deleted row (merged cells, background color, custom borders). |
| 136 | + |
| 137 | +--- |
| 138 | + |
| 139 | +## 4. Core Algorithmic Gaps |
| 140 | + |
| 141 | +### Gap 1: Sub-Paragraph Diff — IMPLEMENTED |
| 142 | + |
| 143 | +~~When two paragraphs are matched, the reconciler emits a "replace entire paragraph text" operation.~~ |
| 144 | + |
| 145 | +`_diff_paragraph_runs` in `reconcile_v3/lower.py` already performs character-level diffing via `difflib.SequenceMatcher`. For matched paragraphs it emits fine-grained `deleteContentRange` + `insertText` pairs covering only changed runs, and emits `updateTextStyle` only for equal spans where the style changed. This gap is closed. |
| 146 | + |
| 147 | +### Gap 2: Hard Similarity Thresholds |
| 148 | + |
| 149 | +The match/no-match decision at Jaccard = 0.3 (paragraph) and recall = 0.5 (table row) is a cliff. A small change in text content can flip the outcome from an in-place update to a delete+insert — destroying invisible features. |
| 150 | + |
| 151 | +**Proposal A**: Increase the similarity threshold for paragraphs that contain inline passthrough elements (colbreak, equation markers). These paragraphs are "higher value" and should be matched even at lower similarity. |
| 152 | + |
| 153 | +**Proposal B**: Introduce a "forced match" mode: if the paragraph at the same structural position (same list nesting, same flanking table) has similarity > 0.1, always match it and emit a coarser in-place update rather than delete+insert. |
| 154 | + |
| 155 | +### Gap 3: Named-Range Index Tracking |
| 156 | + |
| 157 | +Named ranges (colbreaks, equations, codeblocks, callouts) have API-assigned `startIndex` / `endIndex` values from the base document. When a paragraph is edited in-place, these indices become stale. |
| 158 | + |
| 159 | +The reconciler must: |
| 160 | +1. Detect which named ranges overlap with paragraphs being updated in-place |
| 161 | +2. Compute the text delta for those paragraphs |
| 162 | +3. Adjust named-range indices by the delta, or re-create the named ranges at the correct position |
| 163 | + |
| 164 | +This is a significant architectural gap; it explains Bug #65. |
| 165 | + |
| 166 | +### Gap 4: Three-State Index Invariant Enforcement |
| 167 | + |
| 168 | +The desired document produced by `apply_ops` has three states for index fields: |
| 169 | +- **Concrete**: carried from base (safe to lower) |
| 170 | +- **None**: synthesized/mutated (must NOT be lowered as a raw index) |
| 171 | +- **Mixed**: propagates as None (invalid edit plan) |
| 172 | + |
| 173 | +There is no automated enforcement that all paths in `apply_ops` that emit new content set indices to `None`. A missed assignment here produces a silent corruption: the reconciler uses a stale base index as if it were valid. |
| 174 | + |
| 175 | +**Proposal**: Add an `__debug__`-guarded assertion in `lower.py` that all concrete indices consumed by the reconciler correspond to elements that are either (a) unchanged from base, or (b) have been adjusted by a verified shift computation. |
| 176 | + |
| 177 | +--- |
| 178 | + |
| 179 | +## 5. Recommendations |
| 180 | + |
| 181 | +### Immediate (Bug Fixes) |
| 182 | + |
| 183 | +1. **Bug #65 (colbreak/passthrough element index)**: In `reconcile_v3/lower.py`, when emitting in-place paragraph updates, scan the desired document's named ranges that overlap the paragraph range. For each named range with a base-coordinate index, compute the text delta and emit a `updateNamedRange` or recreate it at the adjusted position. |
| 184 | + |
| 185 | +2. **Bug #64 (table row match threshold)**: In `table_diff.py::_fuzzy_lcs_indices`, make `match_threshold` adaptive: `max(0.3, 0.5 - 0.03 * num_cells)`. For wide tables, lower the threshold slightly so that rows with many stable cells aren't split on account of one or two changed cells. |
| 186 | + |
| 187 | +### Short-term (Robustness) |
| 188 | + |
| 189 | +3. ~~**Sub-paragraph diff for matched paragraphs**~~: Already implemented via `_diff_paragraph_runs` in `reconcile_v3/lower.py`. |
| 190 | + |
| 191 | +4. **Inline passthrough boost**: Before the DP in `content_align.py`, detect paragraphs with inline passthrough elements (`_has_passthrough_inline(para)`). Boost their match score by multiplying the edit cost by 2.0 — this makes the DP prefer matching these paragraphs even at lower similarity. |
| 192 | + |
| 193 | +5. **Monotonicity assertion**: Add a post-`apply_ops` invariant check that base-coordinate indices in the desired document are monotonically increasing within each segment. This catches bugs in the 3-way merge before they reach the reconciler. |
| 194 | + |
| 195 | +### Medium-term (Architecture) |
| 196 | + |
| 197 | +6. **Named-range-aware reconciliation**: Redesign the named-range handling in `lower.py` to track all named ranges that are "anchored" to paragraph content. When emitting paragraph edits, compute the resulting index shifts and schedule named-range updates accordingly. |
| 198 | + |
| 199 | +7. **Parameterized similarity thresholds**: Expose `MIN_PARA_MATCH_SIMILARITY` and `MIN_TABLE_ROW_MATCH_SIMILARITY` as configuration options so that callers (e.g., the CLI) can tune them based on the document type (text-heavy vs. data-heavy). |
| 200 | + |
| 201 | +--- |
| 202 | + |
| 203 | +## 6. Specific Code Locations |
| 204 | + |
| 205 | +| File | Location | Issue | |
| 206 | +|------|----------|-------| |
| 207 | +| `table_diff.py` | `_fuzzy_lcs_indices` L193: `match_threshold = 0.5` | Hard-coded; bug #64 | |
| 208 | +| `content_align.py` | `MIN_PARA_MATCH_SIMILARITY = 0.3` | Binary cliff; should be boosted for passthrough-containing paragraphs | |
| 209 | +| `reconcile_v3/lower.py` | `_lower_story_content_update` | Does not adjust named-range indices on in-place paragraph edits (bug #65) | |
| 210 | +| `diffmerge/apply_ops.py` | `_carry_through_unmatched_raw` | Verify that None-index invariant is maintained on all synthetic elements | |
| 211 | +| `content_align.py` | `_pin_table_flanks` conflict resolution loop | No convergence bound; pathological graphs could loop | |
| 212 | + |
| 213 | +--- |
| 214 | + |
| 215 | +## 7. Testing Coverage Gaps |
| 216 | + |
| 217 | +| Area | Status | Notes | |
| 218 | +|------|--------|-------| |
| 219 | +| Delete+insert prevention | Partial | Bugs #64/#65 in xfail; no fuzzing | |
| 220 | +| Passthrough element round-trips | Poor | No tests for colbreak+paragraph edit, equation+text change | |
| 221 | +| Sub-paragraph formatting preservation | Absent | No test that italic on unchanged chars is preserved after edit | |
| 222 | +| Named-style propagation | Weak | Tests update but not preservation on matched edits | |
| 223 | +| Index monotonicity post-apply_ops | Absent | No explicit checker | |
| 224 | +| Wide table row matching | Absent | No test for tables with 6+ columns and mixed edits | |
| 225 | + |
| 226 | +--- |
| 227 | + |
| 228 | +## 8. Conclusion |
| 229 | + |
| 230 | +The architecture is sound: the 3-way merge is the right mechanism for preserving invisible features. The main correctness risks are: |
| 231 | + |
| 232 | +1. **Named-range index staleness** (Bug #65): A design gap, not a coding error. Requires architectural work in `lower.py`. |
| 233 | +2. **Hard similarity thresholds**: Produce unexpected delete+insert on moderate rewrites, silently destroying invisible formatting. Adaptive thresholds and passthrough-element boosts are the pragmatic fix. |
| 234 | +3. **No sub-paragraph diff**: Coarse in-place updates are safe but less accurate than character-level diffs; comments and per-character styles are at higher risk. |
| 235 | + |
| 236 | +The "lens guarantee" — edit the markdown representation, push back, unrepresented features survive — holds **only for paragraphs that are matched by the DP**. For paragraphs below the similarity threshold, the guarantee breaks unconditionally. For paragraphs above the threshold but containing inline passthrough elements, the guarantee breaks due to Bug #65. |
0 commit comments