AWS: Change S3FileIO to use SHA1 based checksums #10293
singhpk234
left a comment
A couple of questions:
- Are there scenarios in which we would prefer the checksum to be MD5 instead of SHA1 from an S3 perspective?
- I was reading that MD5 is faster than SHA1, so does it make sense to expose the algorithm selection as a parameter rather than changing the default? We could then specify in the Iceberg docs that SHA1 should be used with S3 Express.
amogh-jahagirdar
left a comment
I have similar questions to @singhpk234. I think changing the checksum default to SHA1 validation may make it more expensive without any significant benefit. TLS would already provide a more modern HMAC integrity check anyway.
I also think we should avoid adding any net new configuration to S3FileIO just to keep things simpler.
One possible path may be to deprecate the checksumEnabled path and add a new configuration for checksumAlgorithm, where the possible values are none, md5, sha1, etc. The default would be none to preserve the existing behavior. That way the number of configurations remains the same in the long run, and we could also support the S3 Express case.
What do you think @muddyfish @singhpk234?
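The proposal above could look roughly like the following. This is a minimal sketch, not code from the PR: the class name, the enum, the `resolve` helper, and the `s3.checksum-algorithm` property name are all hypothetical, and mapping the deprecated `s3.checksum-enabled=true` flag to MD5 is an assumption based on the existing MD5 behavior described in this PR.

```java
import java.util.Locale;
import java.util.Map;

public class ChecksumConfigSketch {
  // Hypothetical set of values, mirroring "none, md5, sha1" from the proposal.
  public enum ChecksumAlgorithm { NONE, MD5, SHA1 }

  // Existing property named in this PR.
  public static final String CHECKSUM_ENABLED = "s3.checksum-enabled";
  // Hypothetical property name, used here for illustration only.
  public static final String CHECKSUM_ALGORITHM = "s3.checksum-algorithm";

  public static ChecksumAlgorithm resolve(Map<String, String> props) {
    String algorithm = props.get(CHECKSUM_ALGORITHM);
    if (algorithm != null) {
      // Throws IllegalArgumentException for unknown values.
      return ChecksumAlgorithm.valueOf(algorithm.trim().toUpperCase(Locale.ROOT));
    }
    // Assumption: the deprecated boolean flag keeps meaning "MD5" while it still exists.
    if (Boolean.parseBoolean(props.get(CHECKSUM_ENABLED))) {
      return ChecksumAlgorithm.MD5;
    }
    // Default of none preserves the existing behavior.
    return ChecksumAlgorithm.NONE;
  }

  public static void main(String[] args) {
    System.out.println(resolve(Map.of()));
    System.out.println(resolve(Map.of(CHECKSUM_ENABLED, "true")));
    System.out.println(resolve(Map.of(CHECKSUM_ALGORITHM, "sha1")));
  }
}
```

With this shape, an S3 Express user would set the new property to sha1, while users who never set either property see no change.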
I think that's a reasonable comment, and I'd be happy going for that path forward.
Sounds good, @amogh-jahagirdar!
Checksumming input streams on the v2 SDK kills performance, no matter what you use. Might be best to turn it off. See apache/hadoop#6441
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.
Issue
Using S3 Express with Iceberg previously failed the multipart upload integration tests, and also failed whenever s3.checksum-enabled was set. This was because S3 Express doesn't support MD5 based checksums.
Fix
Changes S3FileIO to use SHA1 based checksums when checksumming is enabled.
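To illustrate what changes at the data level, here is a minimal sketch that computes base64-encoded MD5 and SHA-1 digests of the same bytes using only the JDK. It is not the S3FileIO implementation: the class name `ChecksumSketch` and the helper `base64Digest` are hypothetical, and how the value is attached to an upload request is deliberately left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class ChecksumSketch {
  // Hypothetical helper: digest the bytes with the named JDK algorithm
  // ("MD5" or "SHA-1") and base64-encode the raw digest.
  public static String base64Digest(String algorithm, byte[] data)
      throws NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance(algorithm);
    return Base64.getEncoder().encodeToString(digest.digest(data));
  }

  public static void main(String[] args) throws Exception {
    byte[] part = "example part contents".getBytes(StandardCharsets.UTF_8);
    // MD5 yields a 16-byte digest, SHA-1 a 20-byte digest.
    System.out.println("MD5:  " + base64Digest("MD5", part));
    System.out.println("SHA1: " + base64Digest("SHA-1", part));
  }
}
```

The only difference between the two values is the digest algorithm; the sketch says nothing about the relative cost raised in the review comments.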
Testing
Ran S3FileIO integration tests against both a regular S3 bucket as well as an S3 Express bucket. For regular S3, all tests pass, and for S3 Express, the checksumming errors are now resolved.