Multiframe ZSTD file: how to jump to and stream the second file? #4569

@leanhdung1994

TL;DR: This is not a bug report; it is a question about how to take advantage of a ZSTD feature.

I compressed two ndjson files into a single multi-frame ZST file, with each ndjson file stored as its own frame. I have the following metadata, meta_data (a list), for the ZST file:

import zstandard as zstd
from pathlib import Path

input_file  = r"E:\Personal projects\tmp\test.zst"
input_file  = Path(input_file)

meta_data = [{'name'                : 'chunk_0.ndjson',
              'uncompressed_size'   : 2147473321,
              'compressed_offset'   : 0,
              'uncompressed_offset' : 0,
              'compressed_size'     : 175631248},
             {'name'                : 'chunk_1.ndjson',
              'uncompressed_size'   : 2147473321,
              'compressed_offset'   : 175631248,
              'uncompressed_offset' : 2147473321,
              'compressed_size'     : 175631248}]

In Python, how can I use meta_data to seek to chunk_1.ndjson, start decompressing there, and stream it line by line? That way, there is no need to

  • decompress chunk_0.ndjson, or
  • load the whole compressed chunk_1.ndjson into memory.

Thank you for your help.
