Enhance ZIM progress computation #440

@wsdookadr

I was looking at the update_stats method because I use --progress-file to estimate how long it will take to build a ZIM:

def update_stats(self):
    """write progress as JSON to self.stats_filename if requested"""
    if not self.stats_filename:
        return
    self.written_records += 1
    with open(self.stats_filename, "w") as fh:
        json.dump(
            {"written": self.written_records, "total": self.total_records}, fh
        )

However, when I look at the actual progress file, I see:

user@dcrawl-1:~$ tail -f progress.txt
{"written": 911841, "total": 911841}tail: progress.txt: file truncated
{"written": 911842, "total": 911842}tail: progress.txt: file truncated
{"written": 911843, "total": 911843}tail: progress.txt: file truncated
{"written": 911844, "total": 911844}tail: progress.txt: file truncated
[..]

The "written" value is always equal to the "total" value. I expected "total" to be a grand total of the WARC records across all input WARC files, and "written" to be the number of records written to the ZIM so far.
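To show why this matters for my use case, here is a rough sketch of how I would like to derive a remaining-time estimate from the two values. The helper name estimate_remaining_seconds and the simple linear extrapolation are my own assumptions, not anything warc2zim provides.

```python
def estimate_remaining_seconds(written, total, elapsed_seconds):
    """Linear ETA from progress counters (hypothetical helper, not part of warc2zim).

    Returns None when no estimate is possible yet.
    """
    if written <= 0 or elapsed_seconds <= 0:
        return None
    remaining = max(total - written, 0)
    # Assume the remaining records take as long, on average, as the ones done so far.
    return remaining * elapsed_seconds / written


# With a fixed total, the estimate is meaningful.
print(estimate_remaining_seconds(250, 1000, 60.0))

# With written == total (what the progress file reports today), it is always zero.
print(estimate_remaining_seconds(911841, 911841, 60.0))
```

With the current output, where both values move together, any estimate of this kind collapses to zero for the whole run.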

When I looked at the code, it seems that total_records is incremented every time a new WARC record is written, and so is written_records:

self.total_records += 1

self.written_records += 1

Shouldn't total_records be computed in gather_information_from_warc (the first pass) and stay constant throughout add_items_for_warc_record? I'm looking for feedback on the above. Thanks!
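A minimal sketch of the behaviour I have in mind, assuming a two-pass design. ProgressTracker, count_records and record_written are hypothetical names used only for illustration; they are not the actual warc2zim classes or methods, and the real first pass may need to count differently.

```python
import json


class ProgressTracker:
    """Hypothetical stand-in for the converter's progress bookkeeping."""

    def __init__(self, stats_filename=None):
        self.stats_filename = stats_filename
        self.total_records = 0
        self.written_records = 0

    def count_records(self, records):
        """First pass: fix the grand total before anything is written."""
        self.total_records = sum(1 for _ in records)

    def record_written(self):
        """Second pass: only the written counter moves."""
        self.written_records += 1
        self.update_stats()

    def update_stats(self):
        """Write progress as JSON to self.stats_filename if requested."""
        if not self.stats_filename:
            return
        with open(self.stats_filename, "w") as fh:
            json.dump(
                {"written": self.written_records, "total": self.total_records}, fh
            )


# Usage: the total is set once, then stays constant while records are written.
tracker = ProgressTracker()
tracker.count_records(["record-1", "record-2", "record-3"])
tracker.record_written()
print(tracker.written_records, tracker.total_records)
```

The point is only that total_records is assigned once in the first pass and never touched in the second, so the ratio of "written" to "total" reflects real progress.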

Versions used:

  • warc2zim 2.2.1

Labels: bug (Something isn't working)
