improve: Improved citation system#834
Conversation
There was a problem hiding this comment.
Pull request overview
This pull request enhances the citation system in DeerFlow by migrating from dataclasses to Pydantic models, implementing multi-format citation parsing, and optimizing URL lookups with an O(1) cache.
Changes:
- Migrated
CitationMetadataandCitationfrom dataclasses to Pydantic BaseModel for improved validation and serialization - Added multi-format citation extraction supporting Markdown, numbered, footnote, and HTML link formats
- Implemented O(1) URL lookup optimization in
CitationCollectorusing a reverse index cache (_url_to_index) - Enhanced title extraction with a 5-tier priority system (HTML title, Markdown headings, JSON/YAML fields, first substantial line)
- Improved domain extraction with urllib and regex fallback strategy
- Comprehensive test coverage for all new features (467 lines for models, 425 for formatter, 254 for extractor, 292 for collector)
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| src/citations/models.py | Migrates CitationMetadata and Citation from dataclass to Pydantic BaseModel with Field definitions and model_validate |
| src/citations/formatter.py | Adds multi-format citation parsing with dedicated extraction functions for different citation styles and deduplication logic |
| src/citations/extractor.py | Implements intelligent title extraction with 5-tier priority system and enhanced domain extraction with fallback |
| src/citations/collector.py | Adds O(1) URL lookup cache (_url_to_index) for performance optimization, updated in all mutation methods |
| tests/unit/citations/test_models.py | Comprehensive tests for Pydantic model behavior, validation, serialization, and round-trip conversions |
| tests/unit/citations/test_formatter.py | Tests for all citation extraction formats, deduplication, and section parsing |
| tests/unit/citations/test_extractor.py | Tests for domain extraction edge cases and title extraction priority system |
| tests/unit/citations/test_collector.py | Tests for cache initialization, updates, consistency, and performance with large collections |
Comments suppressed due to low confidence (1)
src/citations/models.py:61
- Using
__init__override in Pydantic models is not recommended and can cause issues with validation and default values. The domain extraction logic should be moved to a@model_validator(mode='after')method instead. This is the standard Pydantic pattern for post-initialization logic, as seen in other parts of the codebase (e.g.,src/server/mcp_request.py:56).
The current implementation may bypass Pydantic's validation and could have unexpected behavior with field defaults and validation.
def __init__(self, **data):
"""Initialize and extract domain from URL if not provided."""
super().__init__(**data)
if not self.domain and self.url:
try:
parsed = urlparse(self.url)
self.domain = parsed.netloc
except Exception:
# If URL parsing fails for any reason, leave `domain` as None.
# This is a non-critical convenience field and failures here
# should not prevent citation metadata creation.
pass
| def test_metadata_pydantic_validation(self): | ||
| """Test that Pydantic validates required fields.""" | ||
| # URL and title are required | ||
| with pytest.raises(Exception): # ValidationError | ||
| CitationMetadata() # Missing required fields | ||
|
|
||
| with pytest.raises(Exception): | ||
| CitationMetadata(url="https://example.com") # Missing title |
There was a problem hiding this comment.
The test catches generic Exception instead of the specific ValidationError from Pydantic. This is overly broad and could hide unexpected errors. Import ValidationError from pydantic and use it instead:
from pydantic import ValidationError
with pytest.raises(ValidationError):
CitationMetadata() # Missing required fields| def test_citation_pydantic_validation(self): | ||
| """Test that Pydantic validates required fields.""" | ||
| # Number and metadata are required | ||
| with pytest.raises(Exception): # ValidationError | ||
| Citation() # Missing required fields | ||
|
|
||
| metadata = CitationMetadata( | ||
| url="https://example.com", | ||
| title="Example", | ||
| ) | ||
| with pytest.raises(Exception): | ||
| Citation(metadata=metadata) # Missing number |
There was a problem hiding this comment.
The test catches generic Exception instead of the specific ValidationError from Pydantic. This is overly broad and could hide unexpected errors. Import ValidationError from pydantic and use it instead.
| def test_extract_domain_empty_url(self): | ||
| """Test handling of empty URL.""" | ||
| assert _extract_domain("") == "" | ||
| assert _extract_domain(None) == "" or _extract_domain(None) is None |
There was a problem hiding this comment.
The test checks if _extract_domain(None) returns "" or None, but the function signature specifies url: str, which means it shouldn't accept None as a parameter. The test should either:
- Remove the
Nonecase if the function is never called withNone, or - Update the function signature to
url: Optional[str]ifNoneis a valid input.
Based on line 335 in the implementation (if not url:), it appears the function is designed to handle falsy values, so the type annotation should be Optional[str].
| assert _extract_domain(None) == "" or _extract_domain(None) is None |
| def _extract_domain(url: str) -> str: | ||
| """Extract domain from URL.""" | ||
| """ | ||
| Extract domain from URL using urllib with regex fallback. | ||
|
|
||
| Handles: | ||
| - Standard URLs: https://www.example.com/path | ||
| - Short URLs: example.com | ||
| - Invalid URLs: graceful fallback | ||
|
|
||
| Args: | ||
| url: The URL string to extract domain from | ||
|
|
||
| Returns: | ||
| The domain netloc (including port if present), or empty string if extraction fails | ||
| """ | ||
| if not url: | ||
| return "" |
There was a problem hiding this comment.
The function handles None input (line 335: if not url:), but the type signature declares url: str which doesn't allow None. This is a type annotation mismatch. The parameter should be typed as url: Optional[str] to accurately reflect that it accepts None values.
src/citations/extractor.py
Outdated
| def extract_title_from_content(content: str, max_length: int = 200) -> str: | ||
| """ | ||
| Intelligent title extraction supporting multiple formats. | ||
|
|
||
| Priority: | ||
| 1. HTML <title> tag | ||
| 2. Markdown h1 (# Title) | ||
| 3. Markdown h2-h6 (## Title, etc.) | ||
| 4. JSON/YAML title field | ||
| 5. First substantial non-empty line | ||
| 6. "Untitled" as fallback | ||
|
|
||
| Args: | ||
| content: The content to extract title from | ||
| max_length: Maximum title length (default: 200) | ||
|
|
||
| Returns: | ||
| Extracted title or "Untitled" | ||
| """ | ||
| if not content: | ||
| return "Untitled" |
There was a problem hiding this comment.
The function handles None input (line 228: if not content:), but the type signature declares content: str which doesn't allow None. The test at line 180 expects extract_title_from_content(None) to return "Untitled". The parameter should be typed as content: Optional[str] to match the actual implementation and test expectations.
| import pytest | ||
|
|
There was a problem hiding this comment.
Import of 'pytest' is not used.
| import pytest |
| import pytest | ||
|
|
||
| from src.citations.collector import CitationCollector | ||
| from src.citations.models import CitationMetadata |
There was a problem hiding this comment.
Import of 'CitationMetadata' is not used.
| from src.citations.models import CitationMetadata |
| import pytest | ||
|
|
There was a problem hiding this comment.
Import of 'pytest' is not used.
| import pytest |
| import pytest | ||
|
|
There was a problem hiding this comment.
Import of 'pytest' is not used.
| import pytest |
tests/unit/citations/test_models.py
Outdated
| """ | ||
|
|
||
| import json | ||
| from datetime import datetime |
There was a problem hiding this comment.
Import of 'datetime' is not used.
| from datetime import datetime |
* improve: Improved citation system * fix --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Pull request overview
This pull request enhances the citation system in DeerFlow by migrating from dataclasses to Pydantic models, implementing multi-format citation parsing, and optimizing URL lookups with an O(1) cache.
Changes: