Commit 2edb8bb

teng-lin and claude authored
feat: add source fulltext extraction and chat citation references (teng-lin#15)
* feat: add source fulltext extraction and chat citation references

  Add two major features:

  1. Source fulltext extraction (`get_fulltext()` API and `source fulltext` CLI)
     - Returns raw indexed text content from sources
     - Includes metadata: title, source_type, url, char_count
     - Validates responses and raises SourceNotFoundError if not found
     - Warns on empty content

  2. Chat citation references with cited text extraction
     - ChatReference dataclass with source_id, cited_text, start_char, end_char
     - AskResult.references contains parsed citations from answers
     - `ask --json` includes full reference data
     - Defensive error handling with logging for API structure changes
     - Recursion depth limit for deeply nested structures

  Also includes:
  - Comprehensive unit, integration, and e2e tests
  - Documentation updates for CLI and Python API
  - SKILL.md updates for LLM agent usage

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: address PR code review feedback

  - Remove incorrect source_id deduplication in citation parsing (critical fix)
  - Refactor duplicated JSON parsing logic into process_chunk() helper
  - Add max_depth=10 to _extract_uuid_from_nested() for safety
  - Use dataclasses.asdict() for JSON output in CLI commands

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* refactor: simplify code and add source_status_to_str helper

  - Remove redundant assertion in _chat.py (already type-narrowed)
  - Remove redundant effective_conv_id reassignment in cli/chat.py
  - Add source_status_to_str() helper as single source of truth
  - Replace 22 lines of duplicated status logic in cli/source.py
  - Export helper in types.py for public API use

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: add explicit dict[int, str] type annotation for mypy

  The _SOURCE_STATUS_MAP uses SourceStatus enum values as keys (which are
  ints), but mypy inferred the type as dict[SourceStatus, str]. Adding
  explicit type annotation dict[int, str] fixes the arg-type error when
  calling .get() with int | SourceStatus.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: add mypy to pre-commit checks in CLAUDE.md

  Include mypy type checking in required pre-commit workflow to catch type
  errors before CI. Updated both the step-by-step commands and the one-liner.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: update CHANGELOG with new features

  Add entries for source fulltext extraction, chat citation references,
  source status helper, and mypy pre-commit requirement.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: document citation behavior and fulltext lookup

  - SKILL.md: Add explanation that cited_text is often a snippet/header
  - SKILL.md: Add code example for finding full context via text search
  - python-api.md: Document that start_char/end_char reference chunked index
  - python-api.md: Add example code for retrieving full citation context

  The char positions from NotebookLM's API reference its internal chunked
  index, not the raw fulltext. To get full context, search for the
  cited_text within the source fulltext rather than using positions.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* docs: improve citation context examples with edge case handling

  Address code review feedback:
  - Handle short/missing cited_text with min() guard
  - Add proper spacing around operators (pos - 100)
  - Add else branch for "not found" case in python-api.md
  - Add caching tip for multiple citations from same source

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* feat: add find_citation_context() method to SourceFulltext

  Adds a helper method for locating citations in source fulltext using
  substring search. Returns all matches with surrounding context.

  - Uses 40-char prefix for search (industry-reasonable heuristic)
  - Returns list of (context, position) tuples for all matches
  - Documents limitations clearly (best-effort, may have false positives)
  - Updates docs to use the new method instead of manual search

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix: correct context window in find_citation_context

  - Use search_text length (not cited_text) for context window end
  - Skip past match to avoid overlapping results

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* test: add unit tests for find_citation_context

  Covers:
  - Single match with context
  - Multiple non-overlapping matches
  - No match found
  - Empty cited_text/content
  - Long citations (>40 chars) truncation
  - Edge cases: match at start/end of content

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
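
The search behavior the commit message describes for `find_citation_context()` (40-character prefix, context window sized from the search text, skipping past each match) might look roughly like the sketch below. This is not the committed implementation: the class `FulltextSketch` is a hypothetical stand-in that models only the `content` field, and the handling of empty input is an assumption based on the listed test cases.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FulltextSketch:
    """Hypothetical stand-in for SourceFulltext; only `content` is modeled."""

    content: str

    def find_citation_context(
        self, cited_text: str, context_chars: int = 200
    ) -> list[tuple[str, int]]:
        # Assumption: empty citation or empty content yields no matches.
        if not cited_text or not self.content:
            return []

        # Search on a 40-char prefix, as the commit message describes.
        search_text = cited_text[:40]
        matches: list[tuple[str, int]] = []

        pos = self.content.find(search_text)
        while pos != -1:
            # The window end uses the search text length, not the full citation.
            start = max(0, pos - context_chars)
            end = min(len(self.content), pos + len(search_text) + context_chars)
            matches.append((self.content[start:end], pos))
            # Resume after this match so results never overlap.
            pos = self.content.find(search_text, pos + len(search_text))

        return matches


doc = FulltextSketch(content="alpha beta gamma. alpha beta delta.")
for context, position in doc.find_citation_context("alpha beta", context_chars=3):
    print(position, repr(context))
```

Because only a prefix is matched, two different passages that share their first 40 characters would both be returned, which is consistent with the "best-effort, may have false positives" caveat in the commit message.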
1 parent eefac0f commit 2edb8bb

17 files changed: +1919 -37 lines changed

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -7,6 +7,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Added
+- **Source fulltext extraction** - Retrieve the complete indexed text content of any source
+  - New `client.sources.get_fulltext(notebook_id, source_id)` Python API
+  - New `source fulltext <source_id>` CLI command with `--json` and `-o` output options
+  - Returns `SourceFulltext` dataclass with content, title, URL, and character count
+- **Chat citation references** - Get detailed source references for chat answers
+  - `AskResult.references` field contains list of `ChatReference` objects
+  - Each reference includes `source_id`, `cited_text`, `start_char`, `end_char`, `chunk_id`
+  - Use `notebooklm ask "question" --json` to see references in CLI output
+- **Source status helper** - New `source_status_to_str()` function for consistent status display
+
+### Changed
+- **Pre-commit checks** - Added mypy type checking to required pre-commit workflow
+
 ## [0.1.4] - 2026-01-11

 ### Added
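
The `source_status_to_str()` entry above, together with the commit message's note about annotating `_SOURCE_STATUS_MAP` as `dict[int, str]`, can be illustrated with a minimal sketch. The enum members, their integer values, the status strings, and the `"unknown"` fallback are all assumptions made for illustration; the source does not show them.

```python
from __future__ import annotations

from enum import IntEnum


class SourceStatus(IntEnum):
    """Hypothetical members and values; the real enum is not shown in this commit view."""

    PROCESSING = 1
    READY = 2
    ERROR = 3


# Without the explicit annotation, mypy would infer dict[SourceStatus, str]
# and reject a lookup with a plain int (the arg-type error the commit mentions).
_SOURCE_STATUS_MAP: dict[int, str] = {
    SourceStatus.PROCESSING: "processing",
    SourceStatus.READY: "ready",
    SourceStatus.ERROR: "error",
}


def source_status_to_str(status: int | SourceStatus) -> str:
    """Map a status code to a display string; the fallback value is an assumption."""
    return _SOURCE_STATUS_MAP.get(status, "unknown")


print(source_status_to_str(2))
print(source_status_to_str(SourceStatus.ERROR))
```

Since `IntEnum` members hash and compare equal to their integer values, a raw `int` from an API response and an enum member resolve to the same dictionary entry.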

CLAUDE.md

Lines changed: 4 additions & 1 deletion
@@ -45,13 +45,16 @@ ruff format src/ tests/
 # Check for linting issues
 ruff check src/ tests/

+# Type checking with mypy
+mypy src/notebooklm --ignore-missing-imports
+
 # Run tests
 pytest
 ```

 Or use this one-liner:
 ```bash
-ruff format src/ tests/ && ruff check src/ tests/ && pytest
+ruff format src/ tests/ && ruff check src/ tests/ && mypy src/notebooklm --ignore-missing-imports && pytest
 ```

 ## Architecture

docs/cli-reference.md

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,7 @@ See [Configuration](configuration.md) for details on environment variables and C
 |---------|-------------|---------|
 | `ask <question>` | Ask a question | `notebooklm ask "What is this about?"` |
 | `ask -s <id>` | Ask using specific sources | `notebooklm ask "Summarize" -s src1 -s src2` |
+| `ask --json` | Get answer with source references | `notebooklm ask "Explain X" --json` |
 | `configure` | Set persona/mode | `notebooklm configure --mode learning-guide` |
 | `history` | View/clear history | `notebooklm history --clear` |

@@ -73,6 +74,8 @@ See [Configuration](configuration.md) for details on environment variables and C
 | `add-drive <id> <title>` | Drive file ID | - | `source add-drive abc123 "Doc"` |
 | `add-research <query>` | Search query | `--mode [fast|deep]`, `--from [web|drive]`, `--import-all`, `--no-wait` | `source add-research "AI" --mode deep --no-wait` |
 | `get <id>` | Source ID | - | `source get src123` |
+| `fulltext <id>` | Source ID | `--json`, `-o FILE` | `source fulltext src123 -o content.txt` |
+| `guide <id>` | Source ID | `--json` | `source guide src123` |
 | `rename <id> <title>` | Source ID, new title | - | `source rename src123 "New Name"` |
 | `refresh <id>` | Source ID | - | `source refresh src123` |
 | `delete <id>` | Source ID | - | `source delete src123` |
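
The commit message says the CLI uses `dataclasses.asdict()` for `--json` output. A minimal sketch of how that could serialize an answer with its references follows. The `ChatReference` fields are copied from the docs/python-api.md changes in this commit, while the trimmed `AskResultSketch` class and all sample values are hypothetical; the real CLI output shape may differ.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class ChatReference:
    """Field names follow the python-api.md diff in this commit."""

    source_id: str
    citation_number: int | None
    cited_text: str | None
    start_char: int | None
    end_char: int | None
    chunk_id: str | None


@dataclass
class AskResultSketch:
    """Hypothetical, trimmed-down stand-in for AskResult."""

    answer: str
    conversation_id: str
    references: list[ChatReference]


result = AskResultSketch(
    answer="The report covers two themes [1].",
    conversation_id="conv-1",  # made-up sample value
    references=[
        ChatReference(
            source_id="src-1",  # made-up sample value
            citation_number=1,
            cited_text="Two themes",
            start_char=None,
            end_char=None,
            chunk_id=None,
        )
    ],
)

# asdict() recurses into nested dataclasses and lists, so the result
# is plain dicts and lists that json.dumps can handle directly.
json_text = json.dumps(asdict(result), indent=2)
print(json_text)
```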

docs/python-api.md

Lines changed: 68 additions & 3 deletions
@@ -209,6 +209,8 @@ await client.notebooks.share(nb.id, settings={"public": True})
 |--------|------------|---------|-------------|
 | `list(notebook_id)` | `notebook_id: str` | `list[Source]` | List sources |
 | `get(notebook_id, source_id)` | `str, str` | `Source` | Get source details |
+| `get_fulltext(notebook_id, source_id)` | `str, str` | `SourceFulltext` | Get full indexed text content |
+| `get_guide(notebook_id, source_id)` | `str, str` | `dict` | Get AI-generated summary and keywords |
 | `add_url(notebook_id, url)` | `str, str` | `Source` | Add URL source |
 | `add_youtube(notebook_id, url)` | `str, str` | `Source` | Add YouTube video |
 | `add_text(notebook_id, title, content)` | `str, str, str` | `Source` | Add text content |
@@ -233,6 +235,15 @@ for src in sources:

 await client.sources.rename(nb_id, src.id, "Better Title")
 await client.sources.refresh(nb_id, src.id) # Re-fetch URL content
+
+# Get full indexed content (what NotebookLM uses for answers)
+fulltext = await client.sources.get_fulltext(nb_id, src.id)
+print(f"Content ({fulltext.char_count} chars): {fulltext.content[:500]}...")
+
+# Get AI-generated summary and keywords
+guide = await client.sources.get_guide(nb_id, src.id)
+print(f"Summary: {guide['summary']}")
+print(f"Keywords: {guide['keywords']}")
 ```

 ---
@@ -444,6 +455,10 @@ async def ask(
 result = await client.chat.ask(nb_id, "What are the main themes?")
 print(result.answer)

+# Access source references (cited in answer as [1], [2], etc.)
+for ref in result.references:
+    print(f"Citation {ref.citation_number}: Source {ref.source_id}")
+
 # Ask using only specific sources
 result = await client.chat.ask(
     nb_id,
@@ -620,9 +635,59 @@ class Artifact:
 ```python
 @dataclass
 class AskResult:
-    answer: str
-    conversation_id: str
-    sources_used: list[str]
+    answer: str  # The answer text with inline citations [1], [2], etc.
+    conversation_id: str  # ID for follow-up questions
+    turn_number: int  # Turn number in conversation
+    is_follow_up: bool  # Whether this was a follow-up question
+    references: list[ChatReference]  # Source references cited in the answer
+    raw_response: str  # First 1000 chars of raw API response
+
+@dataclass
+class ChatReference:
+    source_id: str  # UUID of the source
+    citation_number: int | None  # Citation number in answer (1, 2, etc.)
+    cited_text: str | None  # Actual text passage being cited
+    start_char: int | None  # Start position in source content
+    end_char: int | None  # End position in source content
+    chunk_id: str | None  # Internal chunk ID (for debugging)
+```
+
+**Important:** The `cited_text` field often contains only a snippet or section header, not the full quoted passage. The `start_char`/`end_char` positions reference NotebookLM's internal chunked index, which does not directly correspond to positions in the raw fulltext returned by `get_fulltext()`.
+
+Use `SourceFulltext.find_citation_context()` to locate citations in the fulltext:
+
+```python
+fulltext = await client.sources.get_fulltext(notebook_id, ref.source_id)
+matches = fulltext.find_citation_context(ref.cited_text)  # Returns list[(context, position)]
+
+if matches:
+    context, pos = matches[0]  # First match
+    if len(matches) > 1:
+        print(f"Warning: {len(matches)} matches found, using first")
+else:
+    context = None  # Not found - may occur if source was modified
+```
+
+**Tip:** Cache `fulltext` when processing multiple citations from the same source to avoid repeated API calls.
+
+### SourceFulltext
+
+```python
+@dataclass
+class SourceFulltext:
+    source_id: str  # UUID of the source
+    title: str  # Source title
+    content: str  # Full indexed text content
+    source_type: int | None  # Source type code
+    url: str | None  # Original URL (if applicable)
+    char_count: int  # Character count
+
+    def find_citation_context(
+        self,
+        cited_text: str,
+        context_chars: int = 200,
+    ) -> list[tuple[str, int]]:
+        """Search for citation text, return list of (context, position) tuples."""
 ```

 ---
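
The caching tip added to docs/python-api.md in this commit could be applied along the following lines. This is a sketch under stated assumptions: `StubSources` is a hypothetical stand-in for `client.sources` that returns a plain string instead of a `SourceFulltext`, and the helper name `fetch_fulltexts_once` is invented for illustration.

```python
from __future__ import annotations

import asyncio


class StubSources:
    """Hypothetical stand-in for client.sources; counts calls instead of hitting an API."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_fulltext(self, notebook_id: str, source_id: str) -> str:
        self.calls += 1
        return f"fulltext of {source_id}"


async def fetch_fulltexts_once(
    sources: StubSources, notebook_id: str, source_ids: list[str]
) -> dict[str, str]:
    """Fetch each distinct source at most once, keyed by source_id."""
    cache: dict[str, str] = {}
    for source_id in source_ids:
        if source_id not in cache:
            cache[source_id] = await sources.get_fulltext(notebook_id, source_id)
    return cache


stub = StubSources()
# Three citations, two of which point at the same source.
cached = asyncio.run(fetch_fulltexts_once(stub, "nb-1", ["src-a", "src-b", "src-a"]))
print(stub.calls, sorted(cached))
```

With a real client, the cached values would be `SourceFulltext` objects, and each reference's `cited_text` could then be looked up against the cached entry for its `source_id`.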

src/notebooklm/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -37,6 +37,7 @@
     AudioLength,
     ChatGoal,
     ChatMode,
+    ChatReference,
     ChatResponseLength,
     ConversationTurn,
     DriveMimeType,
@@ -56,6 +57,7 @@
     Source,
     # Exceptions
     SourceError,
+    SourceFulltext,
     SourceNotFoundError,
     SourceProcessingError,
     SourceStatus,
@@ -79,11 +81,13 @@
     "NotebookDescription",
     "SuggestedTopic",
     "Source",
+    "SourceFulltext",
     "Artifact",
     "GenerationStatus",
     "ReportSuggestion",
     "Note",
     "ConversationTurn",
+    "ChatReference",
     "AskResult",
     "ChatMode",
     # Exceptions
