improve: Improved citation system by xunliu · Pull Request #834 · bytedance/deer-flow

xunliu · 2026-01-24T13:26:47Z

Pull request overview

This pull request enhances the citation system in DeerFlow by migrating from dataclasses to Pydantic models, implementing multi-format citation parsing, and optimizing URL lookups with an O(1) cache.

Changes:

Migrated CitationMetadata and Citation from dataclasses to Pydantic BaseModel for improved validation and serialization
Added multi-format citation extraction supporting Markdown, numbered, footnote, and HTML link formats
Implemented O(1) URL lookup optimization in CitationCollector using a reverse index cache (_url_to_index)
Enhanced title extraction with a 5-tier priority system (HTML title, Markdown headings, JSON/YAML fields, first substantial line)
Improved domain extraction with urllib and regex fallback strategy
Comprehensive test coverage for all new features (467 lines for models, 425 for formatter, 254 for extractor, 292 for collector)

Copilot

Pull request overview

This pull request enhances the citation system in DeerFlow by migrating from dataclasses to Pydantic models, implementing multi-format citation parsing, and optimizing URL lookups with an O(1) cache.

Changes:

Migrated CitationMetadata and Citation from dataclasses to Pydantic BaseModel for improved validation and serialization
Added multi-format citation extraction supporting Markdown, numbered, footnote, and HTML link formats
Implemented O(1) URL lookup optimization in CitationCollector using a reverse index cache (_url_to_index)
Enhanced title extraction with a 5-tier priority system (HTML title, Markdown headings, JSON/YAML fields, first substantial line)
Improved domain extraction with urllib and regex fallback strategy
Comprehensive test coverage for all new features (467 lines for models, 425 for formatter, 254 for extractor, 292 for collector)

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Show a summary per file

File	Description
src/citations/models.py	Migrates CitationMetadata and Citation from dataclass to Pydantic BaseModel with Field definitions and model_validate
src/citations/formatter.py	Adds multi-format citation parsing with dedicated extraction functions for different citation styles and deduplication logic
src/citations/extractor.py	Implements intelligent title extraction with 5-tier priority system and enhanced domain extraction with fallback
src/citations/collector.py	Adds O(1) URL lookup cache (_url_to_index) for performance optimization, updated in all mutation methods
tests/unit/citations/test_models.py	Comprehensive tests for Pydantic model behavior, validation, serialization, and round-trip conversions
tests/unit/citations/test_formatter.py	Tests for all citation extraction formats, deduplication, and section parsing
tests/unit/citations/test_extractor.py	Tests for domain extraction edge cases and title extraction priority system
tests/unit/citations/test_collector.py	Tests for cache initialization, updates, consistency, and performance with large collections

Comments suppressed due to low confidence (1)

src/citations/models.py:61

Using __init__ override in Pydantic models is not recommended and can cause issues with validation and default values. The domain extraction logic should be moved to a @model_validator(mode='after') method instead. This is the standard Pydantic pattern for post-initialization logic, as seen in other parts of the codebase (e.g., src/server/mcp_request.py:56).

The current implementation may bypass Pydantic's validation and could have unexpected behavior with field defaults and validation.

    def __init__(self, **data):
        """Initialize and extract domain from URL if not provided."""
        super().__init__(**data)
        if not self.domain and self.url:
            try:
                parsed = urlparse(self.url)
                self.domain = parsed.netloc
            except Exception:
                # If URL parsing fails for any reason, leave `domain` as None.
                # This is a non-critical convenience field and failures here
                # should not prevent citation metadata creation.
                pass

Copilot · 2026-01-24T14:01:51Z

tests/unit/citations/test_models.py

+    def test_metadata_pydantic_validation(self):
+        """Test that Pydantic validates required fields."""
+        # URL and title are required
+        with pytest.raises(Exception):  # ValidationError
+            CitationMetadata()  # Missing required fields
+
+        with pytest.raises(Exception):
+            CitationMetadata(url="https://example.com")  # Missing title


The test catches generic Exception instead of the specific ValidationError from Pydantic. This is overly broad and could hide unexpected errors. Import ValidationError from pydantic and use it instead:

from pydantic import ValidationError with pytest.raises(ValidationError): CitationMetadata() # Missing required fields

Copilot · 2026-01-24T14:01:51Z

tests/unit/citations/test_models.py

+    def test_citation_pydantic_validation(self):
+        """Test that Pydantic validates required fields."""
+        # Number and metadata are required
+        with pytest.raises(Exception):  # ValidationError
+            Citation()  # Missing required fields
+
+        metadata = CitationMetadata(
+            url="https://example.com",
+            title="Example",
+        )
+        with pytest.raises(Exception):
+            Citation(metadata=metadata)  # Missing number


The test catches generic Exception instead of the specific ValidationError from Pydantic. This is overly broad and could hide unexpected errors. Import ValidationError from pydantic and use it instead.

Copilot · 2026-01-24T14:01:51Z

tests/unit/citations/test_extractor.py

+    def test_extract_domain_empty_url(self):
+        """Test handling of empty URL."""
+        assert _extract_domain("") == ""
+        assert _extract_domain(None) == "" or _extract_domain(None) is None


The test checks if _extract_domain(None) returns "" or None, but the function signature specifies url: str, which means it shouldn't accept None as a parameter. The test should either:

Remove the None case if the function is never called with None, or

Update the function signature to url: Optional[str] if None is a valid input.

Based on line 335 in the implementation (if not url:), it appears the function is designed to handle falsy values, so the type annotation should be Optional[str].

Suggested change

assert _extract_domain(None) == "" or _extract_domain(None) is None

Copilot · 2026-01-24T14:01:52Z

src/citations/extractor.py

 def _extract_domain(url: str) -> str:
-    """Extract domain from URL."""
+    """
+    Extract domain from URL using urllib with regex fallback.
+
+    Handles:
+    - Standard URLs: https://www.example.com/path
+    - Short URLs: example.com
+    - Invalid URLs: graceful fallback
+
+    Args:
+        url: The URL string to extract domain from
+
+    Returns:
+        The domain netloc (including port if present), or empty string if extraction fails
+    """
+    if not url:
+        return ""


The function handles None input (line 335: if not url:), but the type signature declares url: str which doesn't allow None. This is a type annotation mismatch. The parameter should be typed as url: Optional[str] to accurately reflect that it accepts None values.

Copilot · 2026-01-24T14:01:52Z

src/citations/extractor.py

+def extract_title_from_content(content: str, max_length: int = 200) -> str:
+    """
+    Intelligent title extraction supporting multiple formats.
+
+    Priority:
+    1. HTML <title> tag
+    2. Markdown h1 (# Title)
+    3. Markdown h2-h6 (## Title, etc.)
+    4. JSON/YAML title field
+    5. First substantial non-empty line
+    6. "Untitled" as fallback
+
+    Args:
+        content: The content to extract title from
+        max_length: Maximum title length (default: 200)
+
+    Returns:
+        Extracted title or "Untitled"
+    """
+    if not content:
+        return "Untitled"


The function handles None input (line 228: if not content:), but the type signature declares content: str which doesn't allow None. The test at line 180 expects extract_title_from_content(None) to return "Untitled". The parameter should be typed as content: Optional[str] to match the actual implementation and test expectations.

Copilot · 2026-01-24T14:01:52Z

tests/unit/citations/test_collector.py

+import pytest
+


Import of 'pytest' is not used.

Suggested change

import pytest

Copilot · 2026-01-24T14:01:53Z

tests/unit/citations/test_collector.py

+import pytest
+
+from src.citations.collector import CitationCollector
+from src.citations.models import CitationMetadata


Import of 'CitationMetadata' is not used.

Suggested change

from src.citations.models import CitationMetadata

Copilot · 2026-01-24T14:01:53Z

tests/unit/citations/test_extractor.py

+import pytest
+


Import of 'pytest' is not used.

Suggested change

import pytest

Copilot · 2026-01-24T14:01:53Z

tests/unit/citations/test_formatter.py

+import pytest
+


Import of 'pytest' is not used.

Suggested change

import pytest

Copilot · 2026-01-24T14:01:53Z

tests/unit/citations/test_models.py

+"""
+
+import json
+from datetime import datetime


Import of 'datetime' is not used.

Suggested change

from datetime import datetime

* improve: Improved citation system * fix --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>

improve: Improved citation system

aa6ef8e

WillemJiang requested a review from Copilot January 24, 2026 13:56

Copilot started reviewing on behalf of WillemJiang January 24, 2026 13:57 View session

Copilot AI reviewed Jan 24, 2026

View reviewed changes

xunliu and others added 3 commits January 24, 2026 22:27

fix

fcc9a3d

Merge branch 'main' into citation

349b7b6

Merge branch 'main' into citation

8f0491f

WillemJiang approved these changes Jan 25, 2026

View reviewed changes

WillemJiang merged commit 9a34e32 into bytedance:main Jan 25, 2026
4 checks passed

maoruibin pushed a commit to maoruibin/deer-flow that referenced this pull request Mar 2, 2026

chore : Improved citation system (bytedance#834)

8890c56

* improve: Improved citation system * fix --------- Co-authored-by: Willem Jiang <willem.jiang@gmail.com>

Conversation

xunliu commented Jan 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull request overview

Changes:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 24, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

xunliu commented Jan 24, 2026 •

edited

Loading