Description
Version: Parquet.Net latest source (v3.7.7)
Runtime Version: .NET Core 3.0, I presume (I am running the Parquet.Test tests in VS)
OS: Windows
Expected behavior
ParquetReader.ReadEntireRowGroup should complete successfully on my test file, which was written by Apache Spark 2.4.5. Sorry, I cannot share this file.
Python's pandas read_parquet reads it successfully.
Actual behavior
Message:
System.IO.EndOfStreamException : Unable to read beyond the end of the stream.
Stack Trace:
BinaryReader.InternalRead(Int32 numBytes)
BinaryReader.ReadInt32()
RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid(BinaryReader reader, Int32 bitWidth, Int32 length, Int32[] dest, Int32 offset, Int32 pageSize) line 37
DataColumnReader.ReadPlainDictionary(BinaryReader reader, Int32 maxReadCount, Int32[] dest, Int32 offset) line 279
DataColumnReader.ReadColumn(BinaryReader reader, Encoding encoding, Int64 totalValues, Int32 maxReadCount, ColumnRawData cd) line 253
DataColumnReader.ReadDataPage(PageHeader ph, ColumnRawData cd, Int64 maxValues) line 216
DataColumnReader.Read() line 88
ParquetRowGroupReader.ReadColumn(DataField field) line 64
ParquetReader.ReadEntireRowGroup(Int32 rowGroupIndex) line 141
ParquetReaderTest.Reads_Exception() line 24
Here is Parquet.Thrift.PageHeader.ToString() for the last page of this column - formatted for readability:
PageHeader(, Type: DATA_PAGE,
Uncompressed_page_size: 8, Compressed_page_size: 28,
Crc: -680454176,
Data_page_header: DataPageHeader(, Num_values: 125, Encoding: PLAIN_DICTIONARY,
Definition_level_encoding: RLE, Repetition_level_encoding: BIT_PACKED,
Statistics: Statistics(Null_count: 125)
)
)
Call graph:
Parquet.File.DataColumnReader
\-> ReadDataPage
    \-> ReadPageData - ungzips the page, produces 8 bytes
    \-> ReadLevels - consumes 7 bytes.
    \-> ReadColumn
        \-> ReadPlainDictionary - consumes the last byte.
            \-> GetRemainingLength - returns 0, as expected.
            \-> RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid <boom>
That method interprets "length == 0" to mean "length is unknown, find it in the stream":
parquet-dotnet/src/Parquet/File/Values/RunLengthBitPackingHybridValuesReader.cs, lines 35 to 37 in 60e4545:
public static int ReadRleBitpackedHybrid(BinaryReader reader, int bitWidth, int length, int[] dest, int offset, int pageSize)
{
    if (length == 0) length = reader.ReadInt32();
But here the length really is 0: the page data is already fully consumed, so trying to read an Int32 from the exhausted stream causes the boom (the EndOfStreamException above).
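For anyone who wants to see the failure without the file, here is a minimal sketch that isolates the sentinel check. It is not the library's code: the class name is made up, and the empty MemoryStream is only an assumption standing in for the page data after ReadLevels and ReadPlainDictionary have consumed all 8 bytes.

```csharp
using System;
using System.IO;

class EmptyStreamRepro
{
    static void Main()
    {
        // Assumption: an empty stream stands in for the fully consumed page data.
        using (var reader = new BinaryReader(new MemoryStream(new byte[0])))
        {
            int length = 0; // what GetRemainingLength returned in the failing case

            try
            {
                // Same sentinel check as in ReadRleBitpackedHybrid:
                // 0 is taken to mean "the length is stored in the stream".
                if (length == 0) length = reader.ReadInt32();
                Console.WriteLine("read length " + length);
            }
            catch (EndOfStreamException e)
            {
                // No bytes are left, so ReadInt32 throws.
                Console.WriteLine(e.GetType().Name + ": " + e.Message);
            }
        }
    }
}
```

The point is that a remaining length of 0 and "length unknown" are indistinguishable to the callee, so the decision has to be made by the caller or signalled some other way.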
If I just have ReadPlainDictionary skip the ReadRleBitpackedHybrid call in this case, then all is well: the file's remaining columns are read successfully. Is this a sensible solution?
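Roughly what I mean by skipping the call, as a self-contained sketch rather than a patch. ReadIndexes and DecodeStub are hypothetical stand-ins for ReadPlainDictionary and ReadRleBitpackedHybrid; only the `length == 0` check is taken from the real source, and the remaining-length computation is my assumption about what GetRemainingLength does. Since the page header reports Num_values: 125 and Null_count: 125, I assume the page simply carries no dictionary indexes after the bit-width byte.

```csharp
using System;
using System.IO;

static class GuardSketch
{
    // Hypothetical stand-in for ReadRleBitpackedHybrid; only the sentinel
    // check is reproduced, the actual decoding is omitted.
    static int DecodeStub(BinaryReader reader, int length)
    {
        if (length == 0) length = reader.ReadInt32();
        return length;
    }

    // Hypothetical stand-in for ReadPlainDictionary.
    static int ReadIndexes(BinaryReader reader)
    {
        int bitWidth = reader.ReadByte(); // consumes the last byte of the page

        // Assumed equivalent of GetRemainingLength: bytes left in the stream.
        int remaining = (int)(reader.BaseStream.Length - reader.BaseStream.Position);

        // The proposed guard: nothing left to decode, so do not call the decoder.
        if (remaining == 0) return 0;

        return DecodeStub(reader, remaining);
    }

    static void Main()
    {
        // One byte only (the bit width), as in the failing page.
        using (var reader = new BinaryReader(new MemoryStream(new byte[] { 0 })))
        {
            Console.WriteLine(ReadIndexes(reader)); // prints 0, no exception
        }
    }
}
```

An alternative would be to keep the caller unchanged and give ReadRleBitpackedHybrid an explicit "length unknown" signal instead of overloading 0, but that touches every call site, so the guard in the caller looks like the smaller change.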