Skip to content

improve: Improved citation system#834

Merged
WillemJiang merged 4 commits intobytedance:mainfrom
xunliu:citation
Jan 25, 2026
Merged

improve: Improved citation system#834
WillemJiang merged 4 commits intobytedance:mainfrom
xunliu:citation

Conversation

@xunliu
Copy link
Contributor

@xunliu xunliu commented Jan 24, 2026

Pull request overview

This pull request enhances the citation system in DeerFlow by migrating from dataclasses to Pydantic models, implementing multi-format citation parsing, and optimizing URL lookups with an O(1) cache.

Changes:

  • Migrated CitationMetadata and Citation from dataclasses to Pydantic BaseModel for improved validation and serialization
  • Added multi-format citation extraction supporting Markdown, numbered, footnote, and HTML link formats
  • Implemented O(1) URL lookup optimization in CitationCollector using a reverse index cache (_url_to_index)
  • Enhanced title extraction with a 5-tier priority system (HTML title, Markdown headings, JSON/YAML fields, first substantial line)
  • Improved domain extraction with urllib and regex fallback strategy
  • Comprehensive test coverage for all new features (467 lines for models, 425 for formatter, 254 for extractor, 292 for collector)

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request enhances the citation system in DeerFlow by migrating from dataclasses to Pydantic models, implementing multi-format citation parsing, and optimizing URL lookups with an O(1) cache.

Changes:

  • Migrated CitationMetadata and Citation from dataclasses to Pydantic BaseModel for improved validation and serialization
  • Added multi-format citation extraction supporting Markdown, numbered, footnote, and HTML link formats
  • Implemented O(1) URL lookup optimization in CitationCollector using a reverse index cache (_url_to_index)
  • Enhanced title extraction with a 5-tier priority system (HTML title, Markdown headings, JSON/YAML fields, first substantial line)
  • Improved domain extraction with urllib and regex fallback strategy
  • Comprehensive test coverage for all new features (467 lines for models, 425 for formatter, 254 for extractor, 292 for collector)

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
src/citations/models.py Migrates CitationMetadata and Citation from dataclass to Pydantic BaseModel with Field definitions and model_validate
src/citations/formatter.py Adds multi-format citation parsing with dedicated extraction functions for different citation styles and deduplication logic
src/citations/extractor.py Implements intelligent title extraction with 5-tier priority system and enhanced domain extraction with fallback
src/citations/collector.py Adds O(1) URL lookup cache (_url_to_index) for performance optimization, updated in all mutation methods
tests/unit/citations/test_models.py Comprehensive tests for Pydantic model behavior, validation, serialization, and round-trip conversions
tests/unit/citations/test_formatter.py Tests for all citation extraction formats, deduplication, and section parsing
tests/unit/citations/test_extractor.py Tests for domain extraction edge cases and title extraction priority system
tests/unit/citations/test_collector.py Tests for cache initialization, updates, consistency, and performance with large collections
Comments suppressed due to low confidence (1)

src/citations/models.py:61

  • Using __init__ override in Pydantic models is not recommended and can cause issues with validation and default values. The domain extraction logic should be moved to a @model_validator(mode='after') method instead. This is the standard Pydantic pattern for post-initialization logic, as seen in other parts of the codebase (e.g., src/server/mcp_request.py:56).

The current implementation may bypass Pydantic's validation and could have unexpected behavior with field defaults and validation.

    def __init__(self, **data):
        """Initialize and extract domain from URL if not provided."""
        super().__init__(**data)
        if not self.domain and self.url:
            try:
                parsed = urlparse(self.url)
                self.domain = parsed.netloc
            except Exception:
                # If URL parsing fails for any reason, leave `domain` as None.
                # This is a non-critical convenience field and failures here
                # should not prevent citation metadata creation.
                pass

Comment on lines +156 to +163
def test_metadata_pydantic_validation(self):
"""Test that Pydantic validates required fields."""
# URL and title are required
with pytest.raises(Exception): # ValidationError
CitationMetadata() # Missing required fields

with pytest.raises(Exception):
CitationMetadata(url="https://example.com") # Missing title
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test catches generic Exception instead of the specific ValidationError from Pydantic. This is overly broad and could hide unexpected errors. Import ValidationError from pydantic and use it instead:

from pydantic import ValidationError

with pytest.raises(ValidationError):
    CitationMetadata()  # Missing required fields

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +379 to +390
def test_citation_pydantic_validation(self):
"""Test that Pydantic validates required fields."""
# Number and metadata are required
with pytest.raises(Exception): # ValidationError
Citation() # Missing required fields

metadata = CitationMetadata(
url="https://example.com",
title="Example",
)
with pytest.raises(Exception):
Citation(metadata=metadata) # Missing number
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test catches generic Exception instead of the specific ValidationError from Pydantic. This is overly broad and could hide unexpected errors. Import ValidationError from pydantic and use it instead.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

def test_extract_domain_empty_url(self):
"""Test handling of empty URL."""
assert _extract_domain("") == ""
assert _extract_domain(None) == "" or _extract_domain(None) is None
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test checks if _extract_domain(None) returns "" or None, but the function signature specifies url: str, which means it shouldn't accept None as a parameter. The test should either:

  1. Remove the None case if the function is never called with None, or
  2. Update the function signature to url: Optional[str] if None is a valid input.

Based on line 335 in the implementation (if not url:), it appears the function is designed to handle falsy values, so the type annotation should be Optional[str].

Suggested change
assert _extract_domain(None) == "" or _extract_domain(None) is None

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 251 to +336
def _extract_domain(url: str) -> str:
"""Extract domain from URL."""
"""
Extract domain from URL using urllib with regex fallback.

Handles:
- Standard URLs: https://www.example.com/path
- Short URLs: example.com
- Invalid URLs: graceful fallback

Args:
url: The URL string to extract domain from

Returns:
The domain netloc (including port if present), or empty string if extraction fails
"""
if not url:
return ""
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function handles None input (line 335: if not url:), but the type signature declares url: str which doesn't allow None. This is a type annotation mismatch. The parameter should be typed as url: Optional[str] to accurately reflect that it accepts None values.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +209 to +229
def extract_title_from_content(content: str, max_length: int = 200) -> str:
"""
Intelligent title extraction supporting multiple formats.

Priority:
1. HTML <title> tag
2. Markdown h1 (# Title)
3. Markdown h2-h6 (## Title, etc.)
4. JSON/YAML title field
5. First substantial non-empty line
6. "Untitled" as fallback

Args:
content: The content to extract title from
max_length: Maximum title length (default: 200)

Returns:
Extracted title or "Untitled"
"""
if not content:
return "Untitled"
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function handles None input (line 228: if not content:), but the type signature declares content: str which doesn't allow None. The test at line 180 expects extract_title_from_content(None) to return "Untitled". The parameter should be typed as content: Optional[str] to match the actual implementation and test expectations.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +10 to +11
import pytest

Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'pytest' is not used.

Suggested change
import pytest

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

import pytest

from src.citations.collector import CitationCollector
from src.citations.models import CitationMetadata
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'CitationMetadata' is not used.

Suggested change
from src.citations.models import CitationMetadata

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +10 to +11
import pytest

Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'pytest' is not used.

Suggested change
import pytest

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines +10 to +11
import pytest

Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'pytest' is not used.

Suggested change
import pytest

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"""

import json
from datetime import datetime
Copy link

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'datetime' is not used.

Suggested change
from datetime import datetime

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@WillemJiang WillemJiang merged commit 9a34e32 into bytedance:main Jan 25, 2026
4 checks passed
maoruibin pushed a commit to maoruibin/deer-flow that referenced this pull request Mar 2, 2026
* improve: Improved citation system

* fix

---------

Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants