EndOfStreamException - can't read empty data page #88

@ishepherd

Version: Parquet.Net latest source (v3.7.7)

Runtime Version: .NET Core 3.0, I presume. I am running the Parquet.Test tests in Visual Studio.

OS: Windows

Expected behavior

ParquetReader.ReadEntireRowGroup should complete successfully on my test file, which was written by Apache Spark 2.4.5. Sorry, I cannot share this file.

Python's pandas read_parquet reads it successfully.

Actual behavior

Message: 
    System.IO.EndOfStreamException : Unable to read beyond the end of the stream.
  Stack Trace: 
    BinaryReader.InternalRead(Int32 numBytes)
    BinaryReader.ReadInt32()
    RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid(BinaryReader reader, Int32 bitWidth, Int32 length, Int32[] dest, Int32 offset, Int32 pageSize) line 37
    DataColumnReader.ReadPlainDictionary(BinaryReader reader, Int32 maxReadCount, Int32[] dest, Int32 offset) line 279
    DataColumnReader.ReadColumn(BinaryReader reader, Encoding encoding, Int64 totalValues, Int32 maxReadCount, ColumnRawData cd) line 253
    DataColumnReader.ReadDataPage(PageHeader ph, ColumnRawData cd, Int64 maxValues) line 216
    DataColumnReader.Read() line 88
    ParquetRowGroupReader.ReadColumn(DataField field) line 64
    ParquetReader.ReadEntireRowGroup(Int32 rowGroupIndex) line 141
    ParquetReaderTest.Reads_Exception() line 24

Here is Parquet.Thrift.PageHeader.ToString() for the last page of this column - formatted for readability:

PageHeader(, Type: DATA_PAGE,
    Uncompressed_page_size: 8, Compressed_page_size: 28,
    Crc: -680454176,
    Data_page_header: DataPageHeader(, Num_values: 125, Encoding: PLAIN_DICTIONARY,
        Definition_level_encoding: RLE, Repetition_level_encoding: BIT_PACKED,
        Statistics: Statistics(Null_count: 125)
    )
)

Call graph:

Parquet.File.DataColumnReader
\-> ReadDataPage
    \-> ReadPageData                - ungzips the page, produces 8 bytes
    \-> ReadLevels                  - consumes 7 bytes.
    \-> ReadColumn
        \-> ReadPlainDictionary     - consumes the last byte.
            \-> GetRemainingLength  - returns 0, as expected.
            \-> RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid    <boom>

That method interprets "length == 0" to mean "length is unknown, find it in the stream":

public static int ReadRleBitpackedHybrid(BinaryReader reader, int bitWidth, int length, int[] dest, int offset, int pageSize)
{
    if (length == 0) length = reader.ReadInt32();

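But here the length really is 0: the page header reports Num_values: 125 and Null_count: 125, so every value is null and no dictionary indexes follow the levels. Trying to read the length from the already exhausted stream causes the boom.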

If I have ReadPlainDictionary skip the ReadRleBitpackedHybrid call when the remaining length is 0, then all is well: the file's remaining columns are read successfully. Is this a sensible solution?
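
A minimal, self-contained sketch of that guard, for illustration only. It is not the Parquet.Net source: `ReadLengthPrefix` and `ReadIndexesGuarded` are hypothetical names that only mimic the "length == 0 means read it from the stream" branch quoted above and the proposed early return. It assumes C# 8 or later for the `using` declarations.

```csharp
using System;
using System.IO;

public static class EmptyPageSketch
{
    // Hypothetical stand-in for the quoted branch: a length of 0 is treated
    // as "unknown", so a 4-byte length prefix is read from the stream.
    public static int ReadLengthPrefix(BinaryReader reader, int length)
    {
        if (length == 0) length = reader.ReadInt32();
        return length;
    }

    // Hypothetical guard: if nothing remains on the page, there are no
    // encoded values to decode, so return before touching the stream.
    public static int ReadIndexesGuarded(BinaryReader reader, int remainingLength)
    {
        if (remainingLength == 0) return 0;
        return ReadLengthPrefix(reader, remainingLength);
    }

    public static void Main()
    {
        // An empty stream models the page after the levels and the
        // bit-width byte have already been consumed.
        using var stream = new MemoryStream(Array.Empty<byte>());
        using var reader = new BinaryReader(stream);

        try
        {
            ReadLengthPrefix(reader, 0);
        }
        catch (EndOfStreamException e)
        {
            Console.WriteLine("unguarded: " + e.GetType().Name);
        }

        Console.WriteLine("guarded: " + ReadIndexesGuarded(reader, 0));
    }
}
```

The unguarded call throws the same EndOfStreamException as in the stack trace above, while the guarded call returns 0 without reading. Whether the check belongs in the caller or inside ReadRleBitpackedHybrid is a design question for the maintainers; putting it in the caller leaves the existing meaning of "length == 0" untouched for other call sites.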
