[BUG]: Very large parquet files cannot be read #635

@NeilMacMullen

Library Version

4.5.1

OS

Windows

OS Architecture

64 bit

How to reproduce?

Create a very large Parquet file with a single row group. For example, I have a Parquet file with 50M rows and a dozen columns.

Attempt to read the file using:

using var fileReader = await ParquetReader.CreateAsync(fileStream);
var rowGroup = await fileReader.ReadEntireRowGroupAsync(0);
foreach (var column in rowGroup)
{
    // ... get data using column.Data
}

The following exception is thrown:

sourceIndex ('-2147483520') must be greater than or equal to '0'. (Parameter 'sourceIndex')
Actual value was -2147483520

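-2147483520 is 0x80000080 when read as a 32-bit two's-complement value (unsigned, that is 2,147,483,776, or 128 past Int32.MaxValue + 1), so it looks like sourceIndex is an Int32 that has wrapped around.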
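The wrap-around reading can be checked with a few lines of C#. This is only a sketch of the arithmetic: `WrapDemo` and `ToInt32Wrapped` are hypothetical names made up for illustration, not part of Parquet.Net, and the snippet makes no claim about where in the library the overflow actually happens.

```csharp
using System;

public static class WrapDemo
{
    // Hypothetical helper: truncates a 64-bit offset to Int32 the way an
    // unchecked cast (or unchecked Int32 arithmetic) would.
    public static int ToInt32Wrapped(long offset) => unchecked((int)offset);

    public static void Main()
    {
        // 128 past Int32.MaxValue + 1, i.e. 2,147,483,776 = 0x80000080.
        long offset = (long)int.MaxValue + 1 + 0x80;

        Console.WriteLine($"0x{offset:X}");        // prints 0x80000080
        Console.WriteLine(ToInt32Wrapped(offset)); // prints -2147483520
    }
}
```

The wrapped value matches the `sourceIndex` in the exception message, which is consistent with an offset just over 2 GiB being held in an Int32.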

It's possible to work around this issue by saving the file in multiple row groups, so the assumption seems to be that no single row group will be larger than 0x80000000 bytes(?).
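As a rough sketch of that workaround, the helper below plans row ranges so that each row group stays under a byte budget. Everything here is an assumption for illustration: `RowGroupPlanner` and `Plan` are hypothetical names, the 2^31 - 1 byte budget is only the limit conjectured above, and the bytes-per-row figure is an estimate the caller has to supply.

```csharp
using System;
using System.Collections.Generic;

public static class RowGroupPlanner
{
    // Hypothetical helper: splits totalRows into (Offset, Count) ranges so that
    // Count * bytesPerRow never exceeds maxBytesPerGroup.
    public static List<(long Offset, int Count)> Plan(
        long totalRows, long bytesPerRow, long maxBytesPerGroup = int.MaxValue)
    {
        if (totalRows < 0) throw new ArgumentOutOfRangeException(nameof(totalRows));
        if (bytesPerRow <= 0) throw new ArgumentOutOfRangeException(nameof(bytesPerRow));
        if (maxBytesPerGroup < bytesPerRow) throw new ArgumentOutOfRangeException(nameof(maxBytesPerGroup));

        // Largest row count whose estimated size still fits in the budget.
        long rowsPerGroup = Math.Min(maxBytesPerGroup / bytesPerRow, int.MaxValue);

        var ranges = new List<(long Offset, int Count)>();
        for (long offset = 0; offset < totalRows; offset += rowsPerGroup)
        {
            int count = (int)Math.Min(rowsPerGroup, totalRows - offset);
            ranges.Add((offset, count));
        }
        return ranges;
    }
}
```

For example, assuming twelve 8-byte columns (96 bytes per row, an assumption not stated in the report), `RowGroupPlanner.Plan(50_000_000, 96)` yields three ranges. Each range would then be written as its own row group, for instance with one `CreateRowGroup()` call per range on the writer.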

If the reader wants to maintain this assumption, it would be useful if the write path could throw when asked to write a row group that is too large, so as to avoid accidentally building up a library of unreadable files!

Labels

🔮 future improvement: This issue will take some time to integrate.
