[BEAM-11460] Support reading Parquet files with unknown schema #13554
anantdamle wants to merge 6 commits into apache:master from anantdamle:master
@danielxjd @lgajowy @jbonofre Can you review the feature for reading Parquet files with an unknown schema?
…arseFiles<T>` implementation for supporting files with unknown schema.
iemejia
left a comment
Minor comments; looks pretty good. Thanks for this contribution, @anantdamle.
sdks/java/io/parquet/src/main/java/org/apache/beam/sdk/io/parquet/ParquetIO.java
1. Fix Javadoc example by using consistent words
2. Other indentation and space fixes
@anantdamle Thank you for your contribution! Please make your changes in a feature branch, not in your master.
iemejia
left a comment
Looks nice now. I think I am going to merge this eagerly once the last fix is done @aromanenko-dev. Don't hesitate to raise any extra comments afterwards on anything we can still improve.
* ReadFromMongoDB/WriteToMongoDB will mask password in display_data (Python) ([BEAM-11444](https://issues.apache.org/jira/browse/BEAM-11444).)
* Support for X source added (Java/Python) ([BEAM-X](https://issues.apache.org/jira/browse/BEAM-X)).
* There is a new transform `ReadAllFromBigQuery` that can receive multiple requests to read data from BigQuery at pipeline runtime. See [PR 13170](https://github.com/apache/beam/pull/13170), and [BEAM-9650](https://issues.apache.org/jira/browse/BEAM-9650).
* ParquetIO can now read files with an unknown schema. See [PR-13554](https://github.com/apache/beam/pull/13554) and ([BEAM-11460](https://issues.apache.org/jira/browse/BEAM-11460))
Can you move this up to the 2.28.0 section? The 2.27.0 was already merged (sorry I missed this in the previous check).
As the branch needs to change, I created a new PR: #13616.
Data engineers sometimes do not know the schema of a Parquet file when writing a pipeline, or different files may carry different schemas. Reading Parquet files with ParquetIO requires providing an Avro (equivalent) schema, and it is often not possible to know the schema of the Parquet files in advance.
On the other hand, AvroIO supports reading files with an unknown schema by providing a parse function:
`#parseGenericRecords(SerializableFunction<GenericRecord,T>)`
Supporting this functionality in ParquetIO is simple and requires minimal changes to the ParquetIO surface.
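The parse-function idea can be illustrated without any Beam dependency. The following is only a minimal sketch under stated assumptions: `ParseFnSketch`, `parseAll`, and `describe` are hypothetical names, a `Map<String, Object>` stands in for Avro's `GenericRecord`, and `java.util.function.Function` stands in for `SerializableFunction`. It is not the ParquetIO or AvroIO implementation; it only shows how a caller-supplied function can turn records of differing, unknown schemas into a single output type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

public class ParseFnSketch {

    // Hypothetical stand-in for a reader: applies the caller's parse function
    // to every record, so the reader itself never needs a schema up front.
    public static <T> List<T> parseAll(List<Map<String, Object>> records,
                                       Function<Map<String, Object>, T> parseFn) {
        List<T> parsed = new ArrayList<>();
        for (Map<String, Object> record : records) {
            parsed.add(parseFn.apply(record));
        }
        return parsed;
    }

    // Example parse function: discovers field names at runtime and renders
    // each record as "field=value" pairs in field order.
    public static String describe(Map<String, Object> record) {
        StringJoiner joiner = new StringJoiner(",");
        for (Map.Entry<String, Object> field : record.entrySet()) {
            joiner.add(field.getKey() + "=" + field.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Two records with different "schemas" (assumed sample data).
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("id", 1);
        user.put("name", "a");

        Map<String, Object> item = new LinkedHashMap<>();
        item.put("sku", "X");
        item.put("price", 9.5);

        List<Map<String, Object>> records = new ArrayList<>();
        records.add(user);
        records.add(item);

        // Prints "id=1,name=a" then "sku=X,price=9.5".
        parseAll(records, ParseFnSketch::describe).forEach(System.out::println);
    }
}
```

In a real pipeline the parse function would receive a `GenericRecord` and could inspect the schema the record carries, which is what makes the approach workable when files with different schemas are read together.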