page.cluster_drawings extracts a lot of small clusters once upgraded to 1.26 #4599

@klauswong

Description of the bug

I have a function that extracts the clustered drawings from a PDF.
This function takes much longer after upgrading to 1.26.0 (and 1.26.3).

Here is a simplified version of the function that isolates the problem:

    def _get_clustered_drawings(
        self, page: fitz.Page
    ) -> List[ImageType]:
        for clip in page.cluster_drawings():
            print(clip)

How to reproduce the bug

This is a simple PDF exported from Perplexity (with images):
what is the tallest mountain on earth.pdf

With pymupdf<1.26.0, I get 3 clustered drawings and the speed feels 'normal'.

Rect(75.75, 222.75, 79.5, 226.5)
Rect(75.75, 243.75, 79.5, 247.5)
Rect(75.75, 296.25, 79.5, 300.0)

With version >= 1.26.0, I get this long list of clusters, and the call takes significantly longer.
The problem gets worse for longer PDFs with more images.

Rect(68.42168426513672, 104.17657470703125, 114.84117889404297, 118.68190002441406)
Rect(140.42724609375, 104.17657470703125, 170.42828369140625, 118.72410583496094)
Rect(239.57530212402344, 103.84222412109375, 326.4324951171875, 118.72410583496094)
Rect(332.87890625, 107.958251953125, 354.9324951171875, 118.72410583496094)
Rect(361.37890625, 104.17657470703125, 409.683837890625, 118.72410583496094)
Rect(120.88815307617188, 103.84222412109375, 135.1599884033203, 118.734619140625)
Rect(175.67724609375, 104.17657470703125, 233.34117126464844, 118.734619140625)
...
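
To make the difference easier to compare across versions, a small script along these lines could be run once under each installed version. This is only a sketch: the helper names `count_and_time` and `report` are made up for illustration, and the PDF path is whatever file you want to test. It only calls `page.cluster_drawings()` with its default arguments, as in the function above.

```python
import time
from importlib.metadata import version
from typing import Callable, Iterable, Tuple


def count_and_time(cluster_fn: Callable[[], Iterable]) -> Tuple[int, float]:
    """Call cluster_fn once and return (number of clusters, elapsed seconds)."""
    start = time.perf_counter()
    clusters = list(cluster_fn())
    elapsed = time.perf_counter() - start
    return len(clusters), elapsed


def report(pdf_path: str) -> None:
    """Print cluster count and timing for every page (hypothetical helper)."""
    import fitz  # imported lazily so count_and_time works without PyMuPDF

    print("PyMuPDF version:", version("PyMuPDF"))
    with fitz.open(pdf_path) as doc:
        for page in doc:
            n, secs = count_and_time(page.cluster_drawings)
            print(f"page {page.number}: {n} clusters in {secs:.3f}s")
```

Calling `report("what is the tallest mountain on earth.pdf")` under both versions should show whether the cluster count and the runtime change together.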

PyMuPDF version

1.26.3

Operating system

MacOS

Python version

3.12
