[Feature Request]: Support LZMA compression in python I/O SDKs #25316

@wrossmorrow

Description

What would you like to happen?

LZMA compression is part of the Python standard library (the `lzma` module), but it is not one of the compression types supported by the beam.io.ReadFromText and beam.io.WriteToText PTransforms. openwebtext, for example, uses this compression. I think this may be a pretty simple change. For example, I hacked up a naive "shim" here for use in Dataflow with a custom container by just overwriting apache_beam/io/filesystem.py in site-packages. It's working (a) locally with both decompression and compression (though the output filenames are malformed: the part schema follows the compression extension) and (b) in a DataflowRunner job reading a GCS dump of all the openwebtext .xz archives. (Without this I've been having a hell of a time getting any horizontal scaling while reading openwebtext.) It may be this simple, but I haven't run any Beam tests against these minor changes. I will probably do a bit more research into that myself.
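To make the idea concrete, here is a minimal standard-library sketch of the kind of incremental (chunked) LZMA handling that a compressed-file wrapper would need. It is not the shim mentioned above and does not touch Beam at all; the helper names `iter_xz_lines` and `compress_lines` are hypothetical, and the sketch assumes a single-stream .xz input containing UTF-8 text.

```python
import io
import lzma


def iter_xz_lines(raw, chunk_size=64 * 1024):
    """Yield text lines from an .xz byte stream, reading it in fixed-size chunks.

    Hypothetical helper for illustration; assumes one xz stream of UTF-8 text.
    """
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    pending = b""  # bytes decompressed so far that do not yet end in a newline
    while not decompressor.eof:
        chunk = raw.read(chunk_size)
        if not chunk:
            break  # input ended before the xz stream did; stop quietly
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line.decode("utf-8")
    if pending:
        yield pending.decode("utf-8")  # final line without a trailing newline


def compress_lines(lines):
    """Compress an iterable of strings into a single xz stream, one per line."""
    compressor = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
    body = b"".join(
        compressor.compress((line + "\n").encode("utf-8")) for line in lines
    )
    return body + compressor.flush()  # flush() finalizes the stream


sample = ["first document", "second document", "third document"]
blob = compress_lines(sample)
# A tiny chunk size forces several decompress() calls, as with a large remote file.
roundtrip = list(iter_xz_lines(io.BytesIO(blob), chunk_size=16))
print(roundtrip == sample)
```

The streaming `LZMACompressor` / `LZMADecompressor` objects are used here rather than `lzma.open` because they work on arbitrary byte chunks, which is the shape an I/O layer reading from GCS would likely need. Archives made of several concatenated xz streams would need extra handling that this sketch leaves out.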

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner

Metadata

Assignees

No one assigned

    Labels

    P2, done & done (issue has been reviewed after it was closed for verification, followups, etc.), new feature, python

