EndOfStreamException - can't read empty data page

**Version:** Parquet.Net latest source (v3.7.7)

**Runtime Version:** .NET Core 3.0, I presume. I am running the Parquet.Test tests in VS.

**OS:** Windows

#### Expected behavior

`ParquetReader.ReadEntireRowGroup` should complete successfully on my test file, which was written by Apache Spark 2.4.5. Sorry, I cannot share this file.

Python's pandas read_parquet reads it successfully.

#### Actual behavior

```
Message: 
    System.IO.EndOfStreamException : Unable to read beyond the end of the stream.
  Stack Trace: 
    BinaryReader.InternalRead(Int32 numBytes)
    BinaryReader.ReadInt32()
    RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid(BinaryReader reader, Int32 bitWidth, Int32 length, Int32[] dest, Int32 offset, Int32 pageSize) line 37
    DataColumnReader.ReadPlainDictionary(BinaryReader reader, Int32 maxReadCount, Int32[] dest, Int32 offset) line 279
    DataColumnReader.ReadColumn(BinaryReader reader, Encoding encoding, Int64 totalValues, Int32 maxReadCount, ColumnRawData cd) line 253
    DataColumnReader.ReadDataPage(PageHeader ph, ColumnRawData cd, Int64 maxValues) line 216
    DataColumnReader.Read() line 88
    ParquetRowGroupReader.ReadColumn(DataField field) line 64
    ParquetReader.ReadEntireRowGroup(Int32 rowGroupIndex) line 141
    ParquetReaderTest.Reads_Exception() line 24
```

Here is Parquet.Thrift.PageHeader.ToString() for the *last page* of this column - formatted for readability:

```
PageHeader(, Type: DATA_PAGE,
    Uncompressed_page_size: 8, Compressed_page_size: 28,
    Crc: -680454176,
    Data_page_header: DataPageHeader(, Num_values: 125, Encoding: PLAIN_DICTIONARY,
        Definition_level_encoding: RLE, Repetition_level_encoding: BIT_PACKED,
        Statistics: Statistics(Null_count: 125)
    )
)
```

Call graph:
```
Parquet.File.DataColumnReader
\-> ReadDataPage
    \-> ReadPageData                - ungzips the page, produces 8 bytes
    \-> ReadLevels                  - consumes 7 bytes.
    \-> ReadColumn
        \-> ReadPlainDictionary     - consumes the last byte.
            \-> GetRemainingLength  - returns 0, as expected.
            \-> RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid    <boom>
```

That method interprets "length == 0" to mean "length is unknown, find it in the stream":

https://github.com/aloneguid/parquet-dotnet/blob/60e454520eae7f7945bea471b0e9cb888c09cae9/src/Parquet/File/Values/RunLengthBitPackingHybridValuesReader.cs#L35-L37

But the length really is 0 and trying to read from the empty stream causes the boom.
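The failure mode can be reproduced in isolation. The sketch below is not Parquet.Net code; `EmptyStreamDemo` is a hypothetical name, and it only shows that `BinaryReader.ReadInt32()` on a stream with no bytes left throws the same exception as in the stack trace above.

```csharp
using System;
using System.IO;

public static class EmptyStreamDemo
{
    // Returns true when reading an Int32 from an empty stream
    // throws EndOfStreamException, as seen in the stack trace.
    public static bool ReadInt32ThrowsOnEmptyStream()
    {
        using (var stream = new MemoryStream(new byte[0]))
        using (var reader = new BinaryReader(stream))
        {
            try
            {
                reader.ReadInt32();   // needs 4 bytes, none are available
                return false;
            }
            catch (EndOfStreamException)
            {
                return true;
            }
        }
    }
}
```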

If I just have ReadPlainDictionary skip the ReadRleBitpackedHybrid call when the remaining length is 0, then all is well: the file's remaining columns are read successfully. **Is this a sensible solution?**
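A minimal sketch of the guard I have in mind, written as a standalone helper rather than a patch. Everything here is an assumption for illustration: `EmptyPageGuardSketch`, `ReadDictionaryIndexes` and the `Decoder` delegate are hypothetical names, the delegate only stands in for `ReadRleBitpackedHybrid`, and the remaining length is computed from a seekable stream that holds just this page.

```csharp
using System;
using System.IO;

public static class EmptyPageGuardSketch
{
    // Hypothetical stand-in for the RLE/bit-packed decoder call;
    // this is not Parquet.Net's actual signature.
    public delegate int Decoder(BinaryReader reader, int bitWidth, int length, int[] dest, int offset);

    // Reads the bit-width byte, then decodes indexes only if any bytes remain.
    // Returns the number of indexes written to dest.
    public static int ReadDictionaryIndexes(BinaryReader reader, int[] dest, int offset, Decoder decode)
    {
        // One byte is consumed before the indexes (the "last byte" in the call graph).
        int bitWidth = reader.ReadByte();

        // Assumes a seekable stream that contains only this page's data.
        int remaining = (int)(reader.BaseStream.Length - reader.BaseStream.Position);

        // Nothing left: the page holds no indexes (e.g. every value is null),
        // so skip the decoder instead of letting it look for a length prefix.
        if (remaining == 0) return 0;

        return decode(reader, bitWidth, remaining, dest, offset);
    }
}
```

With this shape, a page whose data section is only the single bit-width byte returns 0 without touching the decoder, while non-empty pages go through the decoder as before.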

For reference, the linked lines from `RunLengthBitPackingHybridValuesReader.cs`:

```
public static int ReadRleBitpackedHybrid(BinaryReader reader, int bitWidth, int length, int[] dest, int offset, int pageSize)
{
   if (length == 0) length = reader.ReadInt32();
```
