@codeflash-ai codeflash-ai bot commented Dec 20, 2025

📄 22% (0.22x) speedup for element_to_md in unstructured/staging/base.py

⏱️ Runtime : 65.0 microseconds → 53.3 microseconds (best of 79 runs)
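
For reference, the headline percentage appears to be the time saved divided by the optimized runtime. This is an inference from the figures reported on this page, not a documented Codeflash formula, and `speedup_percent` is a hypothetical helper name. A minimal sketch under that assumption:

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup in percent, assuming (original - optimized) / optimized."""
    return (original - optimized) / optimized * 100.0


# Headline runtimes from this report, in microseconds.
print(round(speedup_percent(65.0, 53.3)))  # → 22
```

Under the same assumption, a slower optimized run yields a negative percentage, which matches how regressions are displayed in the tables further down.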

📝 Explanation and details

The optimization replaces Python's match-case pattern matching with traditional isinstance checks and direct attribute access, achieving a roughly 22% speedup (65.0 μs to 53.3 μs), primarily through cheaper type dispatch and fewer attribute lookups.

Key Optimizations:

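  1. Faster Type Checking: isinstance(element, Title) is faster than a class pattern with destructuring (case Title(text=text):), which performs the same instance check and then also looks up and binds the named attributes. The quoted line-profiler totals (80,000ns for the original match statement vs. 305,000ns across the optimized isinstance checks) do not show a speedup on their own and are probably not a like-for-like comparison; the end-to-end runtimes above are the better evidence.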

  2. Reduced Attribute Access: For Image elements, the optimization pre-fetches metadata attributes once (image_base64 = getattr(metadata, "image_base64", None)) rather than accessing them repeatedly in each pattern match condition. This eliminates redundant attribute lookups.

  3. Simplified Control Flow: The linear if-elif structure allows for early returns and avoids the overhead of Python's pattern matching dispatch mechanism, which involves more internal bookkeeping.

Performance Impact by Element Type:

  • Title elements: 21.7% faster in the basic case (1.17μs → 958ns) and 33.3% faster with empty text; one generated test (test_title_to_md) ran 24.2% slower (1.83μs → 2.42μs)
  • Image elements: 15-59% faster depending on metadata and flags; these benefit most from reduced attribute access
  • Table elements: 16-44% faster; moderate improvement from isinstance vs. pattern matching
  • Generic elements: 33-37% faster on the fall-through path

Hot Path Impact: Since element_to_md is called within elements_to_md for batch processing (as shown in function_references), the saving accumulates linearly with the number of elements converted. At roughly 22% per call, which is a few hundred nanoseconds per element in the tests below, the gain is most visible when converting thousands of elements in a document processing workflow.

The optimization is particularly effective for Image-heavy documents where the metadata attribute caching provides the largest gains, while maintaining identical behavior and output across all test cases.
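
One way to check the batch-level effect locally is a small timing harness built on the standard library's `timeit`. The sketch below is an assumption-laden illustration: `batch_to_md` is a hypothetical stand-in for a batch converter (the real `elements_to_md` may join or post-process output differently), and `best_of` is an invented helper that takes the minimum over several repeats, similar in spirit to the "best of N runs" figure reported above.

```python
import timeit
from typing import Callable, Iterable, Sequence


def batch_to_md(convert: Callable[[object], str], elements: Iterable[object]) -> str:
    """Hypothetical batch converter: join the per-element output with newlines."""
    return "\n".join(convert(el) for el in elements)


def best_of(
    convert: Callable[[object], str],
    elements: Sequence[object],
    repeat: int = 5,
    number: int = 100,
) -> float:
    """Best time per batch in seconds, as the minimum over `repeat` timing runs."""
    timings = timeit.repeat(
        lambda: batch_to_md(convert, elements),
        repeat=repeat,
        number=number,
    )
    return min(timings) / number


if __name__ == "__main__":
    # Placeholder converter and data; swap in the two variants being compared.
    sample = list(range(1000))
    print(f"{best_of(str, sample) * 1e6:.1f} microseconds per batch")
```

Taking the minimum rather than the mean reduces the influence of scheduler noise, which matters when the per-call difference is only a few hundred nanoseconds.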

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 43 Passed |
| 🌀 Generated Regression Tests | 36 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 5 Passed |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| staging/test_base.py::test_element_to_md_conversion | 14.0μs | 11.5μs | 22.1%✅ |
| staging/test_base.py::test_element_to_md_with_none_mime_type | 3.54μs | 3.58μs | -1.17%⚠️ |

🌀 Generated Regression Tests and Runtime
from unstructured.staging.base import element_to_md


# Minimal stubs for the required classes and fields, since we do not have the actual implementations.
class DummyMetadata:
    def __init__(
        self,
        text_as_html=None,
        image_base64=None,
        image_mime_type=None,
        image_url=None,
    ):
        self.text_as_html = text_as_html
        self.image_base64 = image_base64
        self.image_mime_type = image_mime_type
        self.image_url = image_url


class Element:
    def __init__(self, text="", metadata=None):
        self.text = text
        self.metadata = metadata or DummyMetadata()


class Title(Element):
    pass


class Table(Element):
    pass


class Image(Element):
    pass


# unit tests

# --- Basic Test Cases ---


def test_title_to_md_basic():
    # Test that a Title element is converted to a markdown heading
    t = Title(text="My Title")
    codeflash_output = element_to_md(t)  # 1.17μs -> 958ns (21.7% faster)


def test_table_to_md_with_html():
    # Test that a Table with text_as_html returns the HTML
    html = "<table><tr><td>1</td></tr></table>"
    tbl = Table(text="Table text", metadata=DummyMetadata(text_as_html=html))
    codeflash_output = element_to_md(tbl)  # 1.08μs -> 875ns (23.8% faster)


def test_image_to_md_base64_no_mime():
    # Test that an Image with base64 and no mime type returns data:image/*;base64
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="abc123"),
    )
    codeflash_output = element_to_md(img)  # 1.12μs -> 708ns (58.9% faster)


def test_image_to_md_base64_with_mime():
    # Test that an Image with base64 and mime type returns correct data URI
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="xyz789", image_mime_type="image/png"),
    )
    codeflash_output = element_to_md(img)  # 959ns -> 708ns (35.5% faster)


def test_image_to_md_with_url():
    # Test that an Image with a URL returns correct markdown
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_url="http://example.com/image.png"),
    )
    codeflash_output = element_to_md(img)  # 917ns -> 750ns (22.3% faster)


def test_fallback_to_text():
    # Test that a generic Element falls back to its text
    el = Element(text="Just text")
    codeflash_output = element_to_md(el)  # 1.08μs -> 791ns (36.9% faster)


# --- Edge Test Cases ---


def test_title_empty_text():
    # Title with empty string
    t = Title(text="")
    codeflash_output = element_to_md(t)  # 1.00μs -> 750ns (33.3% faster)


def test_table_with_empty_html():
    # Table with empty text_as_html
    tbl = Table(text="Table", metadata=DummyMetadata(text_as_html=""))
    codeflash_output = element_to_md(tbl)  # 1.00μs -> 750ns (33.3% faster)


def test_image_with_base64_and_exclude_flag():
    # Image with base64, but exclude_binary_image_data=True, should fallback to text
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="abc123"),
    )
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 1.08μs -> 833ns (30.0% faster)


def test_image_with_base64_and_mime_and_exclude_flag():
    # Image with base64 and mime, but exclude_binary_image_data=True, should fallback to text
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_base64="abc123", image_mime_type="image/png"),
    )
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 958ns -> 833ns (15.0% faster)


def test_image_with_url_and_base64():
    # Image with both image_url and image_base64, should prefer base64 if exclude_binary_image_data=False
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_url="http://example.com/image.png", image_base64="abc123"),
    )
    # Should use base64, since that's the first match
    codeflash_output = element_to_md(img)  # 958ns -> 750ns (27.7% faster)


def test_image_with_url_and_base64_exclude():
    # Image with both image_url and image_base64, but exclude_binary_image_data=True, should use URL
    img = Image(
        text="Alt Text",
        metadata=DummyMetadata(image_url="http://example.com/image.png", image_base64="abc123"),
    )
    # Should use image_url since base64 is excluded
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 1.00μs -> 833ns (20.0% faster)


def test_image_with_nothing():
    # Image with neither base64 nor url, should fallback to text
    img = Image(text="Alt Text", metadata=DummyMetadata())
    codeflash_output = element_to_md(img)  # 958ns -> 708ns (35.3% faster)


def test_table_with_no_html():
    # Table with no text_as_html, should fallback to text
    tbl = Table(text="Table text", metadata=DummyMetadata())
    codeflash_output = element_to_md(tbl)  # 958ns -> 750ns (27.7% faster)


def test_element_with_none_text():
    # Element with text=None should not fail
    el = Element(text=None)
    codeflash_output = element_to_md(el)  # 1.00μs -> 750ns (33.3% faster)


def test_image_with_mime_type_none_and_base64_none():
    # Image with both image_mime_type and image_base64 None, should fallback to text
    img = Image(text="Alt Text", metadata=DummyMetadata(image_mime_type=None, image_base64=None))
    codeflash_output = element_to_md(img)  # 1.00μs -> 708ns (41.2% faster)


def test_image_with_all_fields_none():
    # Image with all metadata fields None, should fallback to text
    img = Image(text="Alt Text", metadata=DummyMetadata())
    codeflash_output = element_to_md(img)  # 958ns -> 667ns (43.6% faster)


# --- Large Scale Test Cases ---
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# imports
from unstructured.staging.base import element_to_md


# Minimal stubs for the element classes and their metadata to allow testing
@dataclass
class Metadata:
    text_as_html: Optional[str] = None
    image_base64: Optional[str] = None
    image_mime_type: Optional[str] = None
    image_url: Optional[str] = None


@dataclass
class Element:
    text: str
    metadata: Optional[Metadata] = None


@dataclass
class Title(Element):
    pass


@dataclass
class Table(Element):
    pass


@dataclass
class Image(Element):
    pass


# unit tests

# -------------------- BASIC TEST CASES --------------------


def test_title_to_md():
    # Test that a Title element is converted to a markdown header
    title = Title(text="My Title")
    codeflash_output = element_to_md(title)  # 1.83μs -> 2.42μs (24.2% slower)


def test_table_to_md_with_html():
    # Test that a Table with text_as_html returns the HTML string
    table = Table(
        text="Table text", metadata=Metadata(text_as_html="<table><tr><td>1</td></tr></table>")
    )
    codeflash_output = element_to_md(table)  # 1.21μs -> 958ns (26.1% faster)


def test_image_to_md_with_base64_and_mime():
    # Test that an Image with base64 and mime type returns correct markdown
    img = Image(
        text="An image", metadata=Metadata(image_base64="abc123", image_mime_type="image/png")
    )
    codeflash_output = element_to_md(img)  # 1.08μs -> 875ns (23.8% faster)


def test_image_to_md_with_base64_no_mime():
    # Test that an Image with base64 and no mime type uses image/*
    img = Image(text="No mime", metadata=Metadata(image_base64="zzz999"))
    codeflash_output = element_to_md(img)  # 959ns -> 750ns (27.9% faster)


def test_image_to_md_with_url():
    # Test that an Image with a URL returns correct markdown
    img = Image(text="Remote image", metadata=Metadata(image_url="http://example.com/img.png"))
    codeflash_output = element_to_md(img)  # 917ns -> 750ns (22.3% faster)


def test_other_element_returns_text():
    # Test that a generic Element returns its text
    el = Element(text="plain text")
    codeflash_output = element_to_md(el)  # 1.00μs -> 750ns (33.3% faster)


# -------------------- EDGE TEST CASES --------------------


def test_title_empty_text():
    # Test Title with empty string
    title = Title(text="")
    codeflash_output = element_to_md(title)  # 1.00μs -> 750ns (33.3% faster)


def test_table_with_none_metadata():
    # Table with metadata=None should fallback to .text
    table = Table(text="Fallback text", metadata=None)
    codeflash_output = element_to_md(table)  # 1.08μs -> 750ns (44.4% faster)


def test_table_with_html_empty_string():
    # Table with empty string as text_as_html
    table = Table(text="Table text", metadata=Metadata(text_as_html=""))
    codeflash_output = element_to_md(table)  # 1.00μs -> 708ns (41.2% faster)


def test_image_with_base64_and_exclude_flag():
    # Image with base64, but exclude_binary_image_data=True, should fallback to .text
    img = Image(
        text="Should not show image",
        metadata=Metadata(image_base64="abc123", image_mime_type="image/png"),
    )
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 1.12μs -> 875ns (28.6% faster)


def test_image_with_base64_and_url():
    # If both base64 and url are present, base64 takes precedence unless exclude_binary_image_data=True
    img = Image(
        text="Both present",
        metadata=Metadata(
            image_base64="abc123",
            image_mime_type="image/png",
            image_url="http://example.com/img.png",
        ),
    )
    # Should use base64
    codeflash_output = element_to_md(img)  # 917ns -> 667ns (37.5% faster)
    # If exclude_binary_image_data, should use URL
    codeflash_output = element_to_md(
        img, exclude_binary_image_data=True
    )  # 750ns -> 542ns (38.4% faster)


def test_image_with_only_text():
    # Image with no metadata should fallback to .text
    img = Image(text="Just text", metadata=None)
    codeflash_output = element_to_md(img)  # 958ns -> 667ns (43.6% faster)


def test_image_with_url_and_base64_none():
    # Image with url and base64=None should use url
    img = Image(text="URL only", metadata=Metadata(image_url="http://example.com/img.png"))
    codeflash_output = element_to_md(img)  # 916ns -> 667ns (37.3% faster)


def test_table_with_html_and_text():
    # Table with both text_as_html and text; should return html
    table = Table(text="Should not use this", metadata=Metadata(text_as_html="<table>...</table>"))
    codeflash_output = element_to_md(table)  # 1.00μs -> 750ns (33.3% faster)


def test_element_with_non_string_text():
    # Element with non-string text (should coerce to string if possible)
    el = Element(text=12345)
    codeflash_output = element_to_md(el)  # 1.00μs -> 750ns (33.3% faster)


def test_image_with_all_metadata_none():
    # Image with all metadata fields None
    img = Image(text="All None", metadata=Metadata())
    codeflash_output = element_to_md(img)  # 958ns -> 708ns (35.3% faster)


# -------------------- LARGE SCALE TEST CASES --------------------


def test_table_with_long_html():
    # Table with a very long HTML string
    html = "<table>" + "".join(f"<tr><td>{i}</td></tr>" for i in range(500)) + "</table>"
    table = Table(text="Long table", metadata=Metadata(text_as_html=html))
    codeflash_output = element_to_md(table)  # 1.50μs -> 1.29μs (16.1% faster)


def test_image_with_large_base64():
    # Image with a large base64 string (simulate size, not actual image data)
    base64_str = "a" * 1000
    img = Image(
        text="Large base64", metadata=Metadata(image_base64=base64_str, image_mime_type="image/png")
    )
    codeflash_output = element_to_md(img)  # 1.17μs -> 875ns (33.3% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from pathlib import Path

from unstructured.documents.elements import (
    DataSourceMetadata,
    Element,
    ElementMetadata,
    Image,
    Table,
    Title,
)
from unstructured.staging.base import element_to_md


def test_element_to_md():
    element_to_md(
        Image(
            "",
            element_id="",
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin=None,
            embeddings=[],
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_2():
    element_to_md(
        Table(
            "",
            element_id=None,
            coordinates=None,
            coordinate_system=None,
            metadata=ElementMetadata(
                attached_to_filename="",
                bcc_recipient=[],
                category_depth=None,
                cc_recipient=None,
                coordinates=None,
                data_source=DataSourceMetadata(
                    url="",
                    version=None,
                    record_locator=None,
                    date_created=None,
                    date_modified=None,
                    date_processed=None,
                    permissions_data=None,
                ),
                detection_class_prob=None,
                emphasized_text_contents=None,
                emphasized_text_tags=[],
                file_directory=None,
                filename=Path(),
                filetype=None,
                header_footer_type=None,
                image_base64=None,
                image_mime_type="",
                image_url=None,
                image_path=None,
                is_continuation=False,
                languages=[],
                last_modified="",
                link_start_indexes=None,
                link_texts=None,
                link_urls=[],
                links=None,
                email_message_id="",
                orig_elements=[],
                page_name="",
                page_number=0,
                parent_id=None,
                sent_from=[],
                sent_to=None,
                signature=None,
                subject="",
                table_as_cells={},
                text_as_html="",
                url=None,
            ),
            detection_origin="",
            embeddings=None,
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_3():
    element_to_md(
        Image(
            "",
            element_id="",
            coordinates=None,
            coordinate_system=None,
            metadata=ElementMetadata(
                attached_to_filename=None,
                bcc_recipient=[],
                category_depth=None,
                cc_recipient=None,
                coordinates=None,
                data_source=None,
                detection_class_prob=None,
                emphasized_text_contents=None,
                emphasized_text_tags=None,
                file_directory="\x00",
                filename="\x00",
                filetype=None,
                header_footer_type="",
                image_base64="",
                image_mime_type=None,
                image_url="",
                image_path=None,
                is_continuation=False,
                languages=[""],
                last_modified="",
                link_start_indexes=None,
                link_texts=None,
                link_urls=[],
                links=None,
                email_message_id=None,
                orig_elements=None,
                page_name=None,
                page_number=0,
                parent_id=None,
                sent_from=None,
                sent_to=[],
                signature=None,
                subject=None,
                table_as_cells={},
                text_as_html=None,
                url="",
            ),
            detection_origin=None,
            embeddings=[float("nan")],
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_4():
    element_to_md(
        Title(
            "",
            element_id=None,
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin="",
            embeddings=[],
        ),
        exclude_binary_image_data=False,
    )


def test_element_to_md_5():
    element_to_md(
        Element(
            element_id="",
            coordinates=None,
            coordinate_system=None,
            metadata=None,
            detection_origin="",
        ),
        exclude_binary_image_data=False,
    )
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
| --- | --- | --- | --- |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md | 4.17μs | 3.54μs | 17.7%✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_2 | 958ns | 958ns | 0.000%✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_3 | 2.79μs | 2.75μs | 1.53%✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_4 | 542ns | 333ns | 62.8%✅ |
| codeflash_concolic_e8goshnj/tmpyfmgqm9s/test_concolic_coverage.py::test_element_to_md_5 | 1.38μs | 1.04μs | 32.0%✅ |

To edit these changes, `git checkout codeflash/optimize-element_to_md-mje47tqi` and push.

@codeflash-ai codeflash-ai bot requested a review from aseembits93 December 20, 2025 09:48
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash (Optimization PR opened by Codeflash AI) and 🎯 Quality: High (Optimization Quality according to Codeflash) labels Dec 20, 2025
Collaborator

@aseembits93 aseembits93 left a comment

borderline
