Commit d4fbf71

extradoc: add diff/reconcile system algorithmic review doc
Technical analysis of the full pull→edit→diff→push pipeline covering:
- Diff algorithm (content alignment DP, table diff, similarity thresholds)
- Round-trip safety: what survives vs. what can be corrupted
- Known gaps: named-range index staleness (Bug #65), hard similarity thresholds, three-state index invariant enforcement
- Testing coverage gaps and recommendations

Gap 1 (sub-paragraph diff) is marked closed: _diff_paragraph_runs already implements character-level SequenceMatcher diffing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 280bee5 commit d4fbf71

1 file changed

Lines changed: 236 additions & 0 deletions

@@ -0,0 +1,236 @@
# Algorithmic Review: extradoc Diff/Reconcile System

**Date**: 2026-04-11
**Scope**: `diffmerge/`, `reconcile_v3/`, `serde/` (the full pull→edit→diff→push pipeline)
**Lens**: "Lens correctness": can we edit a view and apply it back without corrupting invisible features?

---

## 1. System Architecture Overview

The pipeline has three conceptually distinct layers:

```
Pull (Google Docs API)

Base Document (typed Pydantic models, lossless)
├── Serialize → Markdown (lossy projection)

└─→ User edits Markdown on disk
├── Deserialize → Desired Document (Pydantic models)
│ (3-way merge: ancestor = pristine parse, desired = edited parse, base = live)

└─→ Reconcile: diff(base, desired) → batchUpdate requests
```

**Key design principles:**
- Element-level alignment: the aligner matches paragraphs/tables/rows, not characters; character-level diffing happens only inside paragraphs that have already been matched (see Gap 1)
- Stable-ID anchoring: tabs, headers, footnotes matched by API-assigned IDs
- 3-way merge: the serde round-trip is never applied directly to `base`; instead `apply_ops(base, diff(ancestor, desired))` carries invisible features from `base`
- Index arithmetic: the reconciler works in "base" coordinates and computes shifts for prior deletes/inserts within the same batch

---

## 2. Diff Algorithm Analysis

### 2.1 Content Alignment (`content_align.py`)

**Algorithm**: Minimum-cost edit-distance DP (not Myers diff, not standard LCS)

**Cost model**:
- Delete a paragraph: `len(text) × 2.0 + inline_count × 50.0`
- Insert a paragraph: same
- Match two paragraphs: `(1.0 - similarity) × max_len` where similarity is word-level Jaccard
- Terminal paragraphs: `INFINITE_PENALTY` (pre-matched, never deleted)
- Tables: Fuzzy cell-text Jaccard; `MIN_TABLE_MATCH_SIMILARITY = 0.25`

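
The cost model above can be sketched in a few lines. This is a minimal illustration, not the code in `content_align.py`: the tokenizer (lowercased `\w+` words), the constant names, and the function names are assumptions made for the example; only the formulas come from the list above.

```python
import re

# Constants taken from the cost model above; the names are hypothetical.
CHAR_COST = 2.0
INLINE_COST = 50.0


def word_jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (tokenization is an assumption)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def delete_cost(text: str, inline_count: int = 0) -> float:
    """Cost of deleting (or, symmetrically, inserting) a paragraph."""
    return len(text) * CHAR_COST + inline_count * INLINE_COST


def match_cost(base: str, desired: str) -> float:
    """Cost of matching two paragraphs and updating one into the other."""
    return (1.0 - word_jaccard(base, desired)) * max(len(base), len(desired))
```

With the formulas as listed, rewriting "The quick brown fox" into "A slow red fox" gives a match cost of about 16.3 against 66.0 for delete plus insert. That suggests delete+insert outcomes for such pairs come from the similarity thresholds rather than from the DP cost comparison, although this reading is inferred from the numbers in this document, not from the code.
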
**Pre-match constraints** (applied before DP):
1. **Terminal pre-match**: Segment-final elements are always matched; prevents deletion of trailing `\n`
2. **Table-flank pinning** (`_pin_table_flanks`): Paragraphs immediately adjacent to tables are forced to match if a 1:1 opportunity exists; this prevents structural orphaning
3. **Positional fallback**: In 1:1 gaps (one base element, one desired element, same kind), the two elements are promoted to a match regardless of similarity

### 2.2 Table Diff (`table_diff.py`)

**Row alignment**: Fuzzy LCS using **cell-text recall** (overlap / base_size), threshold = 0.5
**Column alignment**: LCS on per-column text hashes

**Delete/insert ordering**: Row deletions are emitted in **descending index order**, so that an earlier deletion in the API request sequence never shifts the index of a row that is still waiting to be deleted.

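
To see why the ordering matters, here is a small simulation. A plain Python list stands in for the table rows and `del` stands in for one row-delete request; both are illustrative assumptions, not the reconciler's code.

```python
def delete_rows_sequentially(rows, base_indices, descending=True):
    """Apply one delete per base index, in the given order.

    Each delete shifts every later row up by one, so indices computed
    against the base table are only safe when applied highest-first.
    """
    result = list(rows)
    for idx in sorted(base_indices, reverse=descending):
        del result[idx]
    return result


rows = ["r0", "r1", "r2", "r3"]
print(delete_rows_sequentially(rows, [1, 2]))                    # removes r1 and r2
print(delete_rows_sequentially(rows, [1, 2], descending=False))  # removes r1 and r3
```

In ascending order the second request hits the wrong row, because deleting `r1` has already moved `r3` into index 2.
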
### 2.3 Where Delete+Insert Pairs Arise

| Trigger | Condition | Result |
|---------|-----------|--------|
| Paragraph similarity below threshold | word-Jaccard < 0.3 AND affix-ratio < 0.4 | Delete+Insert paragraph |
| Table row recall below threshold | recall < 0.5 AND positional similarity < 0.5 | Delete+Insert row |
| Unmatched table (similarity < 0.25) | Table text changes drastically | Delete+Insert whole table |
| Narrowly-failed 1:1 pre-match | Both elements exist but similarity check fails | DP may delete+insert anyway |

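The threshold values are heuristics. Because Jaccard divides by the union of both word sets, the 0.3 cutoff is crossed well before 70% of the words change: replacing roughly 54% of a paragraph's distinct words with new ones is already enough (assuming the number of distinct words stays the same). Unless the affix-ratio check rescues it, such a paragraph is handled as delete+insert, which may destroy invisible features attached to the unchanged characters (comment anchors, formatting).
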
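
A rough way to relate the word-Jaccard cutoff to "how much of the paragraph changed", under the simplifying assumption that an edit replaces a fraction $p$ of a paragraph's $n$ distinct words with words that do not occur in the base (so both versions $A$ and $B$ have $n$ distinct words). Real edits also add, drop, and repeat words, so this is an estimate only:

```latex
\[
J = \frac{|A \cap B|}{|A \cup B|}
  = \frac{(1-p)\,n}{(1+p)\,n}
  = \frac{1-p}{1+p},
\qquad
J < 0.3 \iff 1 - p < 0.3\,(1 + p) \iff p > \frac{0.7}{1.3} \approx 0.54
\]
```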
---

## 3. Round-Trip Safety Analysis

### 3.1 What Survives Untouched (Transparent Features)

These survive `pull → edit → push` because the 3-way merge carries them from `base`:

| Feature | Mechanism |
|---------|-----------|
| Named paragraph styles | Carried on matched paragraphs via `apply_ops` |
| Text formatting (bold, italic, color) | Carried on matched text runs |
| Comments / suggestions | Not touched; Google API handles separately |
| Custom paragraph properties | Carried via `_carry_through_unmatched_raw` |
| Footnote content (unedited) | Matched by footnoteId; carried through |
| Header/footer structure | Matched by ID; carried through |
| Table cells (unedited) | Unchanged cells carry through in table diff |

### 3.2 What Can Be Corrupted (High-Risk Scenarios)

#### Scenario A: Inline Passthrough Element in Edited Paragraph

```
Base: "Hello <x-colbreak/> world"
Edited: "Hi there <x-colbreak/> world"

Diff matches the paragraphs (similarity ≈ 0.6 → in-place update).
The reconciler emits deleteContentRange + insertText for the text portion.
But <x-colbreak/> is a named-range annotation with a base-coordinate index.
After the text edit, the colbreak's base index is stale.
```

**Status**: Bug #65, currently `xfail(strict=True)`. Confirmed broken.

**Affects**: Column breaks, page breaks, equations, rich links: anything represented as a named-range marker in markdown, embedded in an edited paragraph.

**Root cause**: The reconciler treats named-range markers as opaque passthrough with fixed base indices. It has no mechanism to adjust those indices when the surrounding paragraph is edited in-place.

#### Scenario B: Delete+Insert Destroys Invisible Formatting

```
Base: "The quick brown fox" (entire paragraph is italic via API-only style)
Edited: "A slow red fox" (word-Jaccard ≈ 0.14 → Delete+Insert)

Result: New paragraph inserted, base paragraph deleted.
Italic style from base is GONE.
```

This is expected (by the system's design), but the threshold is a binary cliff: at 0.3 Jaccard similarity, a paragraph suddenly flips from "in-place update that preserves invisible styles" to "delete+insert that destroys them."

**Affects**: Any document with invisible paragraph-level styles (background shading, border, custom spacing) on paragraphs that are substantially rewritten.

#### Scenario C: Table Row Below Match Threshold

```
Base table row: ["Product ID", "Description (en)", "Price USD"]
Edited: ["Product ID", "Beschreibung", "Price USD"]

Recall = |base ∩ edited| / |base| = |{"Product ID", "Price USD"}| / 3 = 2/3 ≈ 0.67 → MATCH ✓

But if a 4th cell with completely different text is added and the algorithm uses
positional similarity:
pos_sim = (1.0 + 0.0 + 1.0 + 0.0) / 4 = 0.5 → borderline
```

Row deletions destroy any Google Docs formatting on the deleted row (merged cells, background color, custom borders).

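
The two row metrics from Scenario C can be reproduced with a short sketch. The function names are hypothetical, and exact string equality stands in for whatever per-cell similarity `table_diff.py` actually uses; only the recall definition (overlap / base_size) and the positional average come from this document.

```python
def row_recall(base_cells, desired_cells):
    """Fraction of distinct base cell texts that still appear in the edited row."""
    base = set(base_cells)
    if not base:
        return 1.0
    return len(base & set(desired_cells)) / len(base)


def positional_similarity(base_cells, desired_cells):
    """Share of column positions whose text is identical (exact match assumed)."""
    width = max(len(base_cells), len(desired_cells))
    if width == 0:
        return 1.0
    same = sum(1 for b, d in zip(base_cells, desired_cells) if b == d)
    return same / width


base = ["Product ID", "Description (en)", "Price USD"]
edited = ["Product ID", "Beschreibung", "Price USD"]

print(row_recall(base, edited))                         # 2 of 3 base cells survive
print(positional_similarity(base, edited + ["Notes"]))  # 2 equal positions out of 4
```
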
---

## 4. Core Algorithmic Gaps

### Gap 1: Sub-Paragraph Diff (IMPLEMENTED)

~~When two paragraphs are matched, the reconciler emits a "replace entire paragraph text" operation.~~

`_diff_paragraph_runs` in `reconcile_v3/lower.py` already performs character-level diffing via `difflib.SequenceMatcher`. For matched paragraphs it emits fine-grained `deleteContentRange` + `insertText` pairs covering only changed runs, and emits `updateTextStyle` only for equal spans where the style changed. This gap is closed.

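
For readers unfamiliar with the technique, the following sketch shows how `difflib.SequenceMatcher` opcodes can be turned into delete/insert operations in base coordinates. It is not the body of `_diff_paragraph_runs`: the tuple format, the function names, and the back-to-front ordering are assumptions chosen to keep the example self-checking, and style handling is left out.

```python
from difflib import SequenceMatcher


def diff_text_ops(base, desired, start_index=0):
    """Return ('delete', start, end) and ('insert', index, text) tuples.

    Indices are base-document coordinates. Ops are listed back to front,
    so applying them in order never shifts a later op's index.
    """
    ops = []
    matcher = SequenceMatcher(a=base, b=desired, autojunk=False)
    for tag, i1, i2, j1, j2 in reversed(matcher.get_opcodes()):
        if tag in ("replace", "delete"):
            ops.append(("delete", start_index + i1, start_index + i2))
        if tag in ("replace", "insert"):
            ops.append(("insert", start_index + i1, desired[j1:j2]))
    return ops


def replay_ops(base, ops, start_index=0):
    """Apply the ops to a plain string, to check that they reproduce `desired`."""
    text = base
    for op in ops:
        if op[0] == "delete":
            _, start, end = op
            text = text[: start - start_index] + text[end - start_index:]
        else:
            _, at, inserted = op
            text = text[: at - start_index] + inserted + text[at - start_index:]
    return text
```

Replaying the ops for "Hello world" → "Hi there world" yields the edited text; `equal` opcodes produce no operation at all, which is what lets unchanged characters keep their formatting.
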
### Gap 2: Hard Similarity Thresholds

The match/no-match decision at Jaccard = 0.3 (paragraph) and recall = 0.5 (table row) is a cliff. A small change in text content can flip the outcome from an in-place update to a delete+insert, destroying invisible features.

**Proposal A**: Lower the similarity threshold for paragraphs that contain inline passthrough elements (colbreak, equation markers). These paragraphs are "higher value" and should be matched even at lower similarity.

**Proposal B**: Introduce a "forced match" mode: if the paragraph at the same structural position (same list nesting, same flanking table) has similarity > 0.1, always match it and emit a coarser in-place update rather than delete+insert.

### Gap 3: Named-Range Index Tracking

Named ranges (colbreaks, equations, codeblocks, callouts) have API-assigned `startIndex` / `endIndex` values from the base document. When a paragraph is edited in-place, these indices become stale.

The reconciler must:
1. Detect which named ranges overlap with paragraphs being updated in-place
2. Compute the text delta for those paragraphs
3. Adjust named-range indices by the delta, or re-create the named ranges at the correct position

This is a significant architectural gap; it explains Bug #65.

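
The three steps above could look roughly like this. Everything here is hypothetical: the `NamedRange` dataclass, the `(base_index, deleted_len, inserted_len)` edit tuples, and the policy of flagging any range that an edit touches for re-creation are assumptions for the sketch, not the data model in `lower.py`.

```python
from dataclasses import dataclass


@dataclass
class NamedRange:  # hypothetical stand-in for a named-range record
    name: str
    start: int  # base-coordinate start index (inclusive)
    end: int    # base-coordinate end index (exclusive)


def adjust_named_ranges(ranges, edits):
    """Split ranges into (shifted, needs_recreate).

    `edits` holds (base_index, deleted_len, inserted_len) tuples in base
    coordinates. A range touched by an edit is flagged for re-creation;
    any other range is shifted by the net delta of the edits before it.
    """
    shifted, needs_recreate = [], []
    for r in ranges:
        touched = any(
            at < r.end and at + deleted > r.start
            for at, deleted, _ in edits
        )
        if touched:
            needs_recreate.append(r)
            continue
        delta = sum(
            inserted - deleted
            for at, deleted, inserted in edits
            if at + deleted <= r.start
        )
        shifted.append(NamedRange(r.name, r.start + delta, r.end + delta))
    return shifted, needs_recreate
```

For example, with a marker at `[6, 7)` and an edit that replaces the 5 characters at index 0 with 8 characters, the marker moves to `[9, 10)`; an edit that deletes the marker's own character puts it in the re-create list instead.
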
### Gap 4: Three-State Index Invariant Enforcement

The desired document produced by `apply_ops` has three states for index fields:
- **Concrete**: carried from base (safe to lower)
- **None**: synthesized/mutated (must NOT be lowered as a raw index)
- **Mixed**: propagates as None (invalid edit plan)

There is no automated enforcement that all paths in `apply_ops` that emit new content set indices to `None`. A missed assignment here produces a silent corruption: the reconciler uses a stale base index as if it were valid.

**Proposal**: Add an `__debug__`-guarded assertion in `lower.py` that all concrete indices consumed by the reconciler correspond to elements that are either (a) unchanged from base, or (b) have been adjusted by a verified shift computation.

---

## 5. Recommendations

### Immediate (Bug Fixes)

1. **Bug #65 (colbreak/passthrough element index)**: In `reconcile_v3/lower.py`, when emitting in-place paragraph updates, scan the desired document's named ranges that overlap the paragraph range. For each named range with a base-coordinate index, compute the text delta and recreate the range at the adjusted position. The Docs API offers `createNamedRange` and `deleteNamedRange` requests but no request that moves an existing named range, so "update" here means delete and recreate.

2. **Bug #64 (table row match threshold)**: In `table_diff.py::_fuzzy_lcs_indices`, make `match_threshold` adaptive: `max(0.3, 0.5 - 0.03 * num_cells)`. For wide tables, lower the threshold slightly so that rows with many stable cells aren't split on account of one or two changed cells.

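
The adaptive formula proposed for Bug #64 is easy to tabulate. The function name below is hypothetical; the expression is the one quoted above.

```python
def adaptive_match_threshold(num_cells: int) -> float:
    """Row-match threshold that relaxes as the row gets wider, floored at 0.3."""
    return max(0.3, 0.5 - 0.03 * num_cells)


for cells in (1, 3, 5, 7, 10):
    print(cells, round(adaptive_match_threshold(cells), 2))
```

Two properties are worth noting before adopting it: the floor of 0.3 is reached from 7 cells onward, and even a one-cell row ends up below the current 0.5 (at 0.47), so the formula relaxes matching for every table, not only wide ones.
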
### Short-term (Robustness)

3. ~~**Sub-paragraph diff for matched paragraphs**~~: Already implemented via `_diff_paragraph_runs` in `reconcile_v3/lower.py`.

4. **Inline passthrough boost**: Before the DP in `content_align.py`, detect paragraphs with inline passthrough elements (`_has_passthrough_inline(para)`). Multiply their delete and insert costs by 2.0, leaving the match cost unchanged; this makes the DP prefer matching these paragraphs even at lower similarity.

5. **Monotonicity assertion**: Add a post-`apply_ops` invariant check that base-coordinate indices in the desired document are monotonically increasing within each segment. This catches bugs in the 3-way merge before they reach the reconciler.

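
A monotonicity check of the kind described in item 5 might look like the sketch below. It assumes that each segment can be flattened to a list of start indices in which synthesized elements carry `None` (the three-state convention from Gap 4), and that "monotonically increasing" means strictly increasing; the function name is hypothetical.

```python
from typing import Optional, Sequence


def check_monotonic_indices(indices: Sequence[Optional[int]]) -> None:
    """Raise AssertionError if concrete indices are not strictly increasing.

    None entries (synthesized elements without a base coordinate) are skipped.
    """
    last = None
    for pos, idx in enumerate(indices):
        if idx is None:
            continue
        if last is not None and idx <= last:
            raise AssertionError(
                f"index {idx} at position {pos} does not exceed previous index {last}"
            )
        last = idx
```

Called once per segment right after `apply_ops`, a check like this would reject `[1, 7, None, 4]` while accepting `[1, None, 5, 9]`.
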
### Medium-term (Architecture)

6. **Named-range-aware reconciliation**: Redesign the named-range handling in `lower.py` to track all named ranges that are "anchored" to paragraph content. When emitting paragraph edits, compute the resulting index shifts and schedule named-range updates accordingly.

7. **Parameterized similarity thresholds**: Expose `MIN_PARA_MATCH_SIMILARITY` and `MIN_TABLE_ROW_MATCH_SIMILARITY` as configuration options so that callers (e.g., the CLI) can tune them based on the document type (text-heavy vs. data-heavy).

---

## 6. Specific Code Locations

| File | Location | Issue |
|------|----------|-------|
| `table_diff.py` | `_fuzzy_lcs_indices` L193: `match_threshold = 0.5` | Hard-coded; bug #64 |
| `content_align.py` | `MIN_PARA_MATCH_SIMILARITY = 0.3` | Binary cliff; should be relaxed (lowered) for passthrough-containing paragraphs |
| `reconcile_v3/lower.py` | `_lower_story_content_update` | Does not adjust named-range indices on in-place paragraph edits (bug #65) |
| `diffmerge/apply_ops.py` | `_carry_through_unmatched_raw` | Verify that None-index invariant is maintained on all synthetic elements |
| `content_align.py` | `_pin_table_flanks` conflict resolution loop | No convergence bound; pathological graphs could loop indefinitely |

---

## 7. Testing Coverage Gaps

| Area | Status | Notes |
|------|--------|-------|
| Delete+insert prevention | Partial | Bugs #64/#65 are marked xfail; no fuzzing |
| Passthrough element round-trips | Poor | No tests for colbreak+paragraph edit, equation+text change |
| Sub-paragraph formatting preservation | Absent | No test that italic on unchanged chars is preserved after edit |
| Named-style propagation | Weak | Tests cover style updates but not preservation on matched edits |
| Index monotonicity post-apply_ops | Absent | No explicit checker |
| Wide table row matching | Absent | No test for tables with 6+ columns and mixed edits |

---

## 8. Conclusion

The architecture is sound: the 3-way merge is the right mechanism for preserving invisible features. The main correctness risks are:

1. **Named-range index staleness** (Bug #65): A design gap, not a coding error. Requires architectural work in `lower.py`.
2. **Hard similarity thresholds**: Produce unexpected delete+insert on moderate rewrites, silently destroying invisible formatting. Adaptive thresholds and passthrough-element boosts are the pragmatic fix.
3. **Untested sub-paragraph diff**: `_diff_paragraph_runs` already performs character-level diffing (Gap 1 is closed), but no test verifies that per-character styles on unchanged characters survive an edit.

The "lens guarantee" (edit the markdown representation, push back, unrepresented features survive) holds **only for paragraphs that are matched**, whether by a pre-match rule or by the DP. For paragraphs that fall below the similarity threshold and are not rescued by a pre-match rule, the guarantee breaks unconditionally. For matched paragraphs that contain inline passthrough elements, the guarantee breaks due to Bug #65.
