Skip to content

offset overflow at large (>2.4GB) string or binary columns#294

Closed
skirpichenko wants to merge 3 commits into
wesm:masterfrom
skirpichenko:master
Closed

offset overflow at large (>2.4GB) string or binary columns#294
skirpichenko wants to merge 3 commits into
wesm:masterfrom
skirpichenko:master

Conversation

@skirpichenko

Copy link
Copy Markdown

This PR closes the issue #232.

Feather saves string and binary columns as a flat buffer preposed by an array of offsets to the beginning and ending point of each field. The types of offsets are int32_t (32-bit signed integer). This leads to overflow when column size is bigger than 2.4GB and makes impossible to work with large data tables. The problem can be fixed without changing offset type and binary file format. The difference between the ending and beginning points still gives correct field length even with overflowed offsets. Thus the correct location of each field can be reconstructed summing up the lengths of all previous fields.

@codecov-io

Copy link
Copy Markdown

Codecov Report

Merging #294 into master will decrease coverage by 27.82%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master     #294       +/-   ##
===========================================
- Coverage   87.19%   59.37%   -27.83%     
===========================================
  Files           5        6        +1     
  Lines         414     2449     +2035     
===========================================
+ Hits          361     1454     +1093     
- Misses         53      995      +942
Impacted Files Coverage Δ
python/feather/__init__.py 55.23% <0%> (-44.77%) ⬇️
python/feather/api.py 71.42% <0%> (-26.2%) ⬇️
integration-tests/util.py 22.13% <0%> (ø)
python/feather/compat.py 70.65% <0%> (+9.36%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8872294...eb44c65. Read the comment docs.

@wesm

wesm commented Apr 1, 2017

Copy link
Copy Markdown
Owner

I'll have a closer look at this.

I just opened https://issues.apache.org/jira/browse/ARROW-750 -- note that the C++ code base here is effectively deprecated at this point. I'm looking for R developers to help with building more general Arrow Rcpp bindings, and continuing to support Feather in the Apache Arrow repo. Would you or anyone else from the R community like to get involved? There will be a lot of long term value in having high quality R bindings to the Arrow libraries (e.g. it would give you Parquet file support with relatively little effort)

@jameslamb jameslamb mentioned this pull request Apr 18, 2017
@terrytangyuan

Copy link
Copy Markdown

@wesm Have you got a chance to look at this yet?

@terrytangyuan

Copy link
Copy Markdown

@wesm Regarding your comment on high quality R bindings, do you guys have a list of potential items or roadmap? We should probably open a new ticket with more details so more people from R community can comment on it and discuss designs. cc: @hadley @kevinushey @krlmlr

@wesm

wesm commented Jun 8, 2017

Copy link
Copy Markdown
Owner

Hi,

We should start with R bindings for the Arrow C++ shared library that provide the same Feather support we have now (using the Rcpp code here), then expand to Arrow's more general stream and file formats (which are like "chunked Feather supporting nested data and more logical data type"). From there, it should be use case driven, e.g. supporting better Spark interoperability for sparklyr and SparkR. Does that make sense?

@wesm

wesm commented Jun 11, 2017

Copy link
Copy Markdown
Owner

Closing this as Won't Fix. There is a JIRA about adding LargeBinary and LargeString (UTF-8) types to the Arrow metadata, so that will be the approach to fix this issue: https://issues.apache.org/jira/browse/ARROW-750. I would love to have some R developers get involved with the Apache Arrow project to carry on Feather development.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants