Commit 2edb8bb
feat: add source fulltext extraction and chat citation references (teng-lin#15)
* feat: add source fulltext extraction and chat citation references
Add two major features:
1. Source fulltext extraction (`get_fulltext()` API and `source fulltext` CLI)
- Returns raw indexed text content from sources
- Includes metadata: title, source_type, url, char_count
- Validates responses and raises SourceNotFoundError if not found
- Warns on empty content
2. Chat citation references with cited text extraction
- ChatReference dataclass with source_id, cited_text, start_char, end_char
- AskResult.references contains parsed citations from answers
- `ask --json` includes full reference data
- Defensive error handling with logging for API structure changes
- Recursion depth limit for deeply nested structures
Also includes:
- Comprehensive unit, integration, and e2e tests
- Documentation updates for CLI and Python API
- SKILL.md updates for LLM agent usage
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
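The reference data described above can be pictured with a small sketch. Only the names `ChatReference`, `AskResult.references`, and the four reference fields come from this commit message; the field types, the defaults, the `answer` field, and the `to_json` helper are assumptions made for illustration, not the project's actual definitions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ChatReference:
    # Field names follow the commit message; the types are assumed.
    source_id: str
    cited_text: Optional[str] = None
    start_char: Optional[int] = None
    end_char: Optional[int] = None


@dataclass
class AskResult:
    # `answer` is a hypothetical field name; `references` is from the commit message.
    answer: str
    references: List[ChatReference] = field(default_factory=list)


def to_json(result: AskResult) -> str:
    """Serialize a result, nested references included, roughly as `ask --json` might."""
    # asdict() recurses into nested dataclasses and lists of dataclasses.
    return json.dumps(asdict(result), indent=2)
```

Serializing with `dataclasses.asdict()` keeps the JSON output in step with the dataclass fields, so a newly added field appears in the output without extra serialization code.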
* fix: address PR code review feedback
- Remove incorrect source_id deduplication in citation parsing (critical fix)
- Refactor duplicated JSON parsing logic into process_chunk() helper
- Add max_depth=10 to _extract_uuid_from_nested() for safety
- Use dataclasses.asdict() for JSON output in CLI commands
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
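A depth limit like the `max_depth=10` mentioned above might work roughly as follows. This is a minimal sketch, not the project's `_extract_uuid_from_nested()`: the function name, the UUID pattern, and the exact depth accounting are assumptions.

```python
import re
from typing import Any, Optional

# Assumed pattern for a canonical hyphenated UUID string.
_UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def extract_uuid_sketch(data: Any, max_depth: int = 10) -> Optional[str]:
    """Return the first UUID-like string found in nested lists, or None."""
    if max_depth < 0:
        # Depth budget exhausted: stop instead of recursing further.
        return None
    if isinstance(data, str):
        return data if _UUID_RE.match(data) else None
    if isinstance(data, (list, tuple)):
        for item in data:
            found = extract_uuid_sketch(item, max_depth - 1)
            if found is not None:
                return found
    return None
```

The limit bounds recursion on unexpectedly deep responses, at the cost of missing an ID nested deeper than the budget allows.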
* refactor: simplify code and add source_status_to_str helper
- Remove redundant assertion in _chat.py (already type-narrowed)
- Remove redundant effective_conv_id reassignment in cli/chat.py
- Add source_status_to_str() helper as single source of truth
- Replace 22 lines of duplicated status logic in cli/source.py
- Export helper in types.py for public API use
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix: add explicit dict[int, str] type annotation for mypy
The _SOURCE_STATUS_MAP uses SourceStatus enum values as keys (which are ints),
but mypy inferred the type as dict[SourceStatus, str]. Adding explicit type
annotation dict[int, str] fixes the arg-type error when calling .get() with
int | SourceStatus.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
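The annotation fix can be illustrated with a small stand-in. The enum members, their values, and the `"unknown"` fallback are assumptions; only `_SOURCE_STATUS_MAP`, `SourceStatus`, `source_status_to_str()`, and the `dict[int, str]` annotation come from this commit message. The sketch assumes Python 3.9+ and an `IntEnum`-style enum.

```python
from enum import IntEnum
from typing import Union


class SourceStatus(IntEnum):
    # Hypothetical members and values, for illustration only.
    PROCESSING = 1
    READY = 2
    ERROR = 3


# Without the explicit annotation, mypy infers dict[SourceStatus, str]
# and rejects .get() calls that pass a plain int.
_SOURCE_STATUS_MAP: dict[int, str] = {
    SourceStatus.PROCESSING: "processing",
    SourceStatus.READY: "ready",
    SourceStatus.ERROR: "error",
}


def source_status_to_str(status: Union[int, SourceStatus]) -> str:
    """Map a status code or enum member to a display string."""
    return _SOURCE_STATUS_MAP.get(status, "unknown")
```

Because `IntEnum` members compare and hash like their integer values, a lookup with a raw `int` and a lookup with the enum member reach the same entry at runtime; the annotation only changes what the type checker accepts.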
* docs: add mypy to pre-commit checks in CLAUDE.md
Include mypy type checking in required pre-commit workflow to catch
type errors before CI. Updated both the step-by-step commands and
the one-liner.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: update CHANGELOG with new features
Add entries for source fulltext extraction, chat citation references,
source status helper, and mypy pre-commit requirement.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: document citation behavior and fulltext lookup
- SKILL.md: Add explanation that cited_text is often a snippet/header
- SKILL.md: Add code example for finding full context via text search
- python-api.md: Document that start_char/end_char reference chunked index
- python-api.md: Add example code for retrieving full citation context
The char positions from NotebookLM's API reference its internal chunked
index, not the raw fulltext. To get full context, search for the
cited_text within the source fulltext rather than using positions.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* docs: improve citation context examples with edge case handling
Address code review feedback:
- Handle short/missing cited_text with min() guard
- Add proper spacing around operators (pos - 100)
- Add else branch for "not found" case in python-api.md
- Add caching tip for multiple citations from same source
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
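The documented approach (search for `cited_text` in the fulltext, handle a miss, and cache fulltext per source) could look roughly like this. It is a sketch under assumptions: `make_context_lookup` and `fetch_fulltext` are hypothetical names, and `fetch_fulltext` stands in for whatever call returns a source's text.

```python
from typing import Callable, Dict, Optional


def make_context_lookup(
    fetch_fulltext: Callable[[str], str], padding: int = 100
) -> Callable[[str, Optional[str]], Optional[str]]:
    """Build a lookup that fetches each source's fulltext at most once."""
    cache: Dict[str, str] = {}

    def lookup(source_id: str, cited_text: Optional[str]) -> Optional[str]:
        if not cited_text:
            # Missing or empty citation text: nothing to search for.
            return None
        if source_id not in cache:
            cache[source_id] = fetch_fulltext(source_id)
        content = cache[source_id]
        pos = content.find(cited_text)
        if pos == -1:
            # The "not found" case: the snippet may differ from the raw text.
            return None
        start = max(0, pos - padding)
        end = min(len(content), pos + len(cited_text) + padding)
        return content[start:end]

    return lookup
```

Caching matters when one answer cites the same source several times, since each fulltext fetch is presumably a network round trip.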
* feat: add find_citation_context() method to SourceFulltext
Adds a helper method for locating citations in source fulltext using
substring search. Returns all matches with surrounding context.
- Uses a 40-char prefix of the cited text for search (a pragmatic heuristic)
- Returns list of (context, position) tuples for all matches
- Documents limitations clearly (best-effort, may have false positives)
- Updates docs to use the new method instead of manual search
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix: correct context window in find_citation_context
- Use search_text length (not cited_text) for context window end
- Skip past match to avoid overlapping results
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
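Taken together, the helper and its fix can be sketched as a standalone function. This is not the actual `SourceFulltext.find_citation_context()` method: the signature, the `context_chars` parameter, and its default are assumptions, while the 40-char prefix, the `(context, position)` tuples, the window based on the search text length, and the skip past each match follow the commit message. The sketch assumes Python 3.9+.

```python
def find_citation_context_sketch(
    content: str, cited_text: str, context_chars: int = 200
) -> list[tuple[str, int]]:
    """Return (context, position) for every non-overlapping match."""
    if not content or not cited_text:
        return []
    # Search on a prefix only, since long citations may not match verbatim.
    search_text = cited_text[:40]
    matches: list[tuple[str, int]] = []
    pos = content.find(search_text)
    while pos != -1:
        start = max(0, pos - context_chars)
        # Window end is based on search_text, the string actually matched.
        end = min(len(content), pos + len(search_text) + context_chars)
        matches.append((content[start:end], pos))
        # Resume after this match so results never overlap.
        pos = content.find(search_text, pos + len(search_text))
    return matches
```

Returning every match leaves disambiguation to the caller, which fits the best-effort framing: a short prefix can legitimately occur more than once in a source.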
* test: add unit tests for find_citation_context
Covers:
- Single match with context
- Multiple non-overlapping matches
- No match found
- Empty cited_text/content
- Long citations (>40 chars) truncation
- Edge cases: match at start/end of content
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent eefac0f commit 2edb8bb
File tree
17 files changed, +1919 -37 lines changed
- docs
- src/notebooklm
- cli
- data
- rpc
- tests
- e2e
- integration
- unit