Description
What would you like to happen?
LZMA compression is in the Python standard library (the `lzma` module), but it is not one of the compression types supported by the `beam.io.ReadFromText` and `beam.io.WriteToText` PTransforms. openwebtext, for example, uses this compression. I think this may be a pretty simple change. For example, I hacked up a naive "shim" here for use in Dataflow with a custom container by just overwriting apache_beam/io/filesystem.py in the site-packages. It's working (a) locally, for both decompression and compression (though the output filenames are malformed: the part suffix follows the compression extension), and (b) in a DataflowRunner job reading a GCS dump of all the openwebtext .xz archives. (Without this I've been having a hell of a time getting any horizontal scaling while reading openwebtext.) It may be this simple, but I haven't run any Beam tests against these minor changes. I will probably do a bit more research into that myself.
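To illustrate why this looks like a small change, here is a minimal stdlib-only sketch (not the shim itself, and not Beam code). It assumes a file wrapper that feeds byte chunks to a compressor object with `compress()`/`flush()` and to a decompressor object with `decompress()`. The helper names `compress_chunks` and `decompress_chunks` are hypothetical and exist only for this example.

```python
import lzma


def compress_chunks(chunks):
    """Compress an iterable of byte chunks incrementally into one .xz stream."""
    compressor = lzma.LZMACompressor()  # default container format is .xz
    out = [compressor.compress(chunk) for chunk in chunks]
    out.append(compressor.flush())  # flush() emits the remaining data and footer
    return b"".join(out)


def decompress_chunks(data, read_size=16):
    """Decompress an .xz byte string by feeding it in small pieces."""
    decompressor = lzma.LZMADecompressor()
    out = []
    for start in range(0, len(data), read_size):
        if decompressor.eof:
            break  # end of stream reached; feeding more would raise EOFError
        out.append(decompressor.decompress(data[start:start + read_size]))
    return b"".join(out)


lines = [b"first line\n", b"second line\n"]
blob = compress_chunks(lines)
restored = decompress_chunks(blob)
```

The `lzma` objects have the same incremental interface as the `bz2` compressor and decompressor objects in the standard library, which is what makes me think the change is small. Whether Beam's file wrapper needs anything beyond that (seeking, splitting, filename handling) is something I have not verified.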
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner