[core] fix parquet can not read empty row with first column is array.#4711
Stephen0421 wants to merge 2 commits into
xccui
left a comment
Thanks for your effort @Stephen0421! I tried the fix locally by reading some data with highly nested schemas. Some tests passed and some failed with the following exception. I'll share the table files with you for debugging.
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 2 out of bounds for length 2
at org.apache.paimon.data.columnar.heap.AbstractHeapVector.isNullAt(AbstractHeapVector.java:111)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:144)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readArray(NestedColumnReader.java:243)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:106)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readRow(NestedColumnReader.java:124)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readData(NestedColumnReader.java:102)
at org.apache.paimon.format.parquet.reader.NestedColumnReader.readToVector(NestedColumnReader.java:90)
at org.apache.paimon.format.parquet.ParquetReaderFactory$ParquetReader.nextBatch(ParquetReaderFactory.java:406)
ParquetSchemaConverter.convertToParquetMessageType(
        "paimon-parquet", NESTED_ARRAY_MAP_TYPE);
String[] candidates = new String[] {"snappy", "zstd", "gzip"};
String compress = candidates[new Random().nextInt(3)];
I know this was intended to balance coverage and test running time, but using random in test cases is usually not ideal. Let's use a parameterized test here for these three formats.
Sorry for the late reply. This follows the previous PR; I will change it to a parameterized test.
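For illustration, here is a minimal sketch of the deterministic alternative: every codec from the snippet is exercised on every run instead of one picked at random. The class name `CompressionCoverage`, the method `runAll`, and the `roundTrip` predicate are hypothetical stand-ins, not code from this PR; a real test would write and read back a Parquet file per codec. If the test class uses JUnit 5, the same idea would normally be expressed with `@ParameterizedTest` and `@ValueSource(strings = {"snappy", "zstd", "gzip"})`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class CompressionCoverage {
    // Codec names copied from the test snippet under review.
    static final List<String> CODECS = Arrays.asList("snappy", "zstd", "gzip");

    // Runs the given check once per codec, in a fixed order, and fails
    // with the codec name so the failing case is immediately reproducible.
    static List<String> runAll(Predicate<String> roundTrip) {
        List<String> checked = new ArrayList<>();
        for (String codec : CODECS) {
            if (!roundTrip.test(codec)) {
                throw new AssertionError("round trip failed for codec: " + codec);
            }
            checked.add(codec);
        }
        return checked;
    }

    public static void main(String[] args) {
        // Placeholder check (hypothetical); stands in for a write/read round trip.
        List<String> checked = runAll(codec -> !codec.isEmpty());
        System.out.println("checked codecs: " + checked);
    }
}
```

With this shape a failure always names the codec that triggered it, which a randomly chosen codec cannot guarantee across reruns.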
        HeapBytesVector phbv = new HeapBytesVector(total, isNull);
        return new ParquetDecimalVector(phbv, total);
    }
    default:
I wonder if all the existing types are covered.
Yes. Except for INT32 and INT64, the other primitive types should deserialize as HeapBytesVector.
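The rule stated in the reply can be sketched as a switch with a catch-all branch. This is only an illustration under assumptions: `DecimalBackingSketch`, `PhysicalType`, and `Backing` are hypothetical names, the enum lists only a few Parquet physical types, and the `INT_VECTOR` and `LONG_VECTOR` labels for INT32 and INT64 are assumed, since the thread only states what the remaining types map to.

```java
public class DecimalBackingSketch {
    // A few Parquet physical type names; hypothetical local enum, not a library type.
    enum PhysicalType { INT32, INT64, BINARY, FIXED_LEN_BYTE_ARRAY }

    // Hypothetical labels for the kind of heap vector that would back the values.
    enum Backing { INT_VECTOR, LONG_VECTOR, BYTES_VECTOR }

    static Backing backingFor(PhysicalType type) {
        switch (type) {
            case INT32:
                return Backing.INT_VECTOR;   // assumed label
            case INT64:
                return Backing.LONG_VECTOR;  // assumed label
            default:
                // Per the reply: everything else is read as a bytes vector.
                return Backing.BYTES_VECTOR;
        }
    }

    public static void main(String[] args) {
        for (PhysicalType type : PhysicalType.values()) {
            System.out.println(type + " -> " + backingFor(type));
        }
    }
}
```

The reviewer's coverage question comes down to whether every type that reaches the `default` branch really can be read as bytes.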
private boolean isFirstRow = true;

private boolean cutLevel = false;
Could you add a comment for this boolean?
Force-pushed from fa4cd3a to e0554db.
…is the same as other child vector.
Many thanks @Stephen0421, but we still found some corner cases, so we decided to revert the old changes. See: #4745
Closing this one now.
Purpose
Linked issue: close #4710
Tests
Added a unit test case and tested locally.
API and Format
Parquet
Documentation
No