I was looking at the `update_stats` method because I use `--progress-file` to estimate how long it will take to build a ZIM:

    def update_stats(self):
        """write progress as JSON to self.stats_filename if requested"""
        if not self.stats_filename:
            return
        self.written_records += 1
        with open(self.stats_filename, "w") as fh:
            json.dump(
                {"written": self.written_records, "total": self.total_records}, fh
            )

But when I look at the actual progress file, I see:

    user@dcrawl-1:~$ tail -f progress.txt
    {"written": 911841, "total": 911841}tail: progress.txt: file truncated
    {"written": 911842, "total": 911842}tail: progress.txt: file truncated
    {"written": 911843, "total": 911843}tail: progress.txt: file truncated
    {"written": 911844, "total": 911844}tail: progress.txt: file truncated
    [..]

The "written" key is always equal to the "total" key. I expected "total" to be a grand total of the WARC records (across all input WARC files) and "written" to be the number of records written to the ZIM so far.
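To show why this matters for my use case, here is a minimal sketch of how I would like to turn one progress snapshot into a time estimate. The helper `estimate_remaining_seconds` and its `elapsed_seconds` parameter are hypothetical names of mine, not part of warc2zim, and the estimate assumes a roughly constant write rate.

```python
import json
from typing import Optional


def estimate_remaining_seconds(progress_json: str, elapsed_seconds: float) -> Optional[float]:
    """Estimate the remaining time from one progress snapshot.

    Hypothetical helper for illustration only; it is not part of warc2zim.
    """
    progress = json.loads(progress_json)
    written, total = progress["written"], progress["total"]
    if written <= 0 or total <= 0 or elapsed_seconds <= 0:
        return None  # nothing to extrapolate from yet
    rate = written / elapsed_seconds  # records written per second so far
    return (total - written) / rate
```

With the snapshots shown above, "written" equals "total", so this always yields zero remaining seconds, no matter how far along the conversion actually is.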
When I looked at the code, it seems that `total_records` is updated every time a new WARC record is written, and so is `written_records`:

    self.written_records += 1

Shouldn't `total_records` be computed in `gather_information_from_warc` (the first pass) and stay constant throughout `add_items_for_warc_record`? I'm looking for feedback on the above. Thanks!
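To make the suggestion concrete, here is a rough sketch of the behaviour I have in mind. The `ProgressTracker` class and its `first_pass` / `second_pass` methods are hypothetical stand-ins for the converter's two passes, not the actual warc2zim code; only the attribute names and the JSON shape are taken from `update_stats`.

```python
import json


class ProgressTracker:
    """Hypothetical stand-in for the converter's progress bookkeeping."""

    def __init__(self, stats_filename=None):
        self.stats_filename = stats_filename
        self.total_records = 0
        self.written_records = 0

    def first_pass(self, records):
        # Count every record up front; total_records is not modified afterwards.
        self.total_records = sum(1 for _ in records)

    def second_pass(self, records):
        for _ in records:
            # ... the record would be added to the ZIM here ...
            self.written_records += 1
            self.update_stats()

    def update_stats(self):
        """Write progress as JSON to self.stats_filename if requested."""
        if not self.stats_filename:
            return
        with open(self.stats_filename, "w") as fh:
            json.dump(
                {"written": self.written_records, "total": self.total_records}, fh
            )
```

In this sketch the increment of `written_records` lives in the second pass rather than in `update_stats`, which is only a choice I made for the example. The point is that "total" is fixed after the first pass, so "written" only reaches it when the last record is processed.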
Code referenced above (warc2zim/src/warc2zim/converter.py at 62d3fe5): `update_stats` at lines 240 to 248, the `total_records` update at line 991, and the `written_records` update at line 244.
Versions used: