Do not create ZIM if seed URL is not HTML #442

@rgaudin

In this zimit run, the seed URL is not HTML: the response is a 200 with content-type: application/x-directory and no payload. warc2zim detected this but continued processing anyway, eventually failing on a missing attribute.

We should either harden that processing or exit immediately if the seed is not HTML.
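
For discussion, here is a minimal sketch of the "exit immediately" option. It assumes the seed's Content-Type header value is available as a string. The function names `is_html_content_type` and `check_seed_or_exit`, the accepted MIME types, and the exit code are hypothetical and do not come from the zimit or warc2zim code base.

```python
import sys
from typing import Optional

# Assumption: these are the MIME types we would accept for a seed page.
HTML_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def is_html_content_type(content_type: Optional[str]) -> bool:
    """Return True if a Content-Type header value denotes an HTML document."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in HTML_MIME_TYPES


def check_seed_or_exit(content_type: Optional[str]) -> None:
    """Stop the run early when the seed URL did not return HTML."""
    if not is_html_content_type(content_type):
        print(
            f"Seed URL is not HTML (mime type: {content_type}), aborting",
            file=sys.stderr,
        )
        # Hypothetical exit code; any non-zero value would signal failure.
        sys.exit(2)
```

With the response from this run, `check_seed_or_exit("application/x-directory")` would terminate the process before any ZIM creation is attempted.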

[zimit::2025-02-14 22:25:15,583] INFO:Calling warc2zim with these args: ['--favicon=https://pixijs.com/images/logo.svg', '--name=pixijs.download_2b58a03e', '--zim-file=pixijs.download_2b58a03e.zim', '--publisher=openZIM', '--scraper-suffix', 'zimit 2.1.8', '--output', '/output', '--url', 'https://pixijs.download/release/docs/', '--description', 'PixiJS API Documentation', '-v', '--progress-file', '/output/warc2zim.json', '/output/.tmp3qz2ow61/collections/crawl-20250214222511797/archive']
[warc2zim::2025-02-14 22:25:15,589] DEBUG:Attempting to confirm output is writable in directory /output
[warc2zim::2025-02-14 22:25:15,590] DEBUG:Output is writable. Temporary file used for test: /output/tmpn43ov84l
[warc2zim::2025-02-14 22:25:15,591] DEBUG:Confirming ZIM file can be created using name: pixijs.download_2b58a03e.zim
[warc2zim::2025-02-14 22:25:15,592] DEBUG:1 WARC files found
[warc2zim::2025-02-14 22:25:15,598] WARNING:Main page is not an HTML Page, mime type is: application/x-directory - Skipping Favicon and Language detection
[warc2zim::2025-02-14 22:25:15,599] INFO:Expecting 1 ZIM entries to files
[warc2zim::2025-02-14 22:25:15,599] DEBUG:Preparing 0 redirections
[warc2zim::2025-02-14 22:25:15,599] DEBUG:0 redirections will be ignored
[warc2zim::2025-02-14 22:25:15,599] INFO:Expecting 1 ZIM entries including redirects
[warc2zim::2025-02-14 22:25:15,599] WARNING:No valid ZIM language, fallbacking to `eng`.
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:SIGINT/SIGTERM received, stopping zimit
[zimit::2025-02-14 22:25:15,621] INFO:
[zimit::2025-02-14 22:25:15,621] INFO:

Traceback (most recent call last):
  File "/usr/bin/zimit", line 8, in <module>
    sys.exit(zimit.zimit())
             ~~~~~~~~~~~^^
  File "/app/zimit/lib/python3.13/site-packages/zimit/zimit.py", line 688, in zimit
    sys.exit(run(sys.argv[1:]))
             ~~~^^^^^^^^^^^^^^
  File "/app/zimit/lib/python3.13/site-packages/zimit/zimit.py", line 609, in run
    return warc2zim(warc2zim_args)
  File "/app/zimit/lib/python3.13/site-packages/warc2zim/main.py", line 168, in main
    return converter.run()
           ~~~~~~~~~~~~~^^
  File "/app/zimit/lib/python3.13/site-packages/warc2zim/converter.py", line 307, in run
    self.retrieve_illustration()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/app/zimit/lib/python3.13/site-packages/warc2zim/converter.py", line 838, in retrieve_illustration
    favicon_url.value, self.favicon_contents[favicon_url]
                       ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Converter' object has no attribute 'favicon_contents'
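
One plausible reading of this traceback (an assumption, not verified against the warc2zim source) is that `favicon_contents` is only assigned on the code path that runs favicon detection, which was skipped here. The "harden" option could then look like the sketch below. `ConverterSketch` and its methods are a hypothetical stand-in, not the real `Converter` class.

```python
from typing import Dict, Optional


class ConverterSketch:
    """Hypothetical stand-in for illustration, not the real warc2zim Converter."""

    def __init__(self) -> None:
        # Always define the attribute, even if favicon detection is skipped later.
        self.favicon_contents: Dict[str, bytes] = {}

    def find_icon(self, main_page_mime: str, candidates: Dict[str, bytes]) -> None:
        """Record favicon candidates only when the main page is HTML."""
        if main_page_mime != "text/html":
            return  # detection skipped, favicon_contents stays empty
        self.favicon_contents.update(candidates)

    def retrieve_illustration(self, favicon_url: str) -> Optional[bytes]:
        """Return the favicon bytes, or None when nothing was collected."""
        return self.favicon_contents.get(favicon_url)
```

In this sketch a non-HTML main page leads to `retrieve_illustration` returning `None` instead of raising, so the caller can fall back to a default illustration or stop with a clear error message.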

Labels: bug
