AWS: Change S3FileIO to use SHA1 based checksums#10293

Closed
muddyfish wants to merge 1 commit into apache:main from muddyfish:fix/s3fileio-express-mpu

@muddyfish

Issue

S3FileIO previously failed against S3 Express, both in the multipart upload integration tests and whenever s3.checksum-enabled was set. This is because S3 Express doesn't support MD5-based checksums.

Fix

Changes S3FileIO to use SHA1-based checksums when checksumming is enabled.

Testing

Ran the S3FileIO integration tests against both a regular S3 bucket and an S3 Express bucket. For regular S3, all tests pass, and for S3 Express, the checksumming errors are now resolved.
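
As an illustration of the change described under Fix, here is a minimal sketch of computing an MD5 and a SHA1 digest of the same payload, Base64-encoded, which is the encoding S3 checksum headers generally expect. `ChecksumSketch` and `base64Digest` are hypothetical names for this example only, not S3FileIO code, and how the value gets attached to an upload request is not shown in this PR page, so it is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical helper for illustration only; not part of S3FileIO.
final class ChecksumSketch {
  private ChecksumSketch() {}

  // Returns the Base64-encoded digest of the payload for a JCA algorithm
  // name such as "MD5" or "SHA-1".
  static String base64Digest(String algorithm, byte[] payload) {
    try {
      MessageDigest digest = MessageDigest.getInstance(algorithm);
      return Base64.getEncoder().encodeToString(digest.digest(payload));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("Digest algorithm not available: " + algorithm, e);
    }
  }

  public static void main(String[] args) {
    byte[] part = "abc".getBytes(StandardCharsets.UTF_8);
    // Only the algorithm name differs between the two variants.
    System.out.println("MD5:  " + base64Digest("MD5", part));
    System.out.println("SHA1: " + base64Digest("SHA-1", part));
  }
}
```

On the client side the only difference between the two variants is the algorithm name, so the compatibility problem lies in which checksum the target bucket type accepts.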

@github-actions github-actions Bot added the AWS label May 9, 2024

@singhpk234 singhpk234 left a comment

Contributor

A couple of questions:

  1. Are there scenarios in which we would prefer the checksum to be MD5 instead of SHA1 from an S3 perspective?
  2. I was reading that MD5 is faster than SHA1, so does it make sense to expose the algorithm selection as a parameter rather than making SHA1 the default, and to specify in the Iceberg docs that SHA1 should be used with S3 Express?

@amogh-jahagirdar amogh-jahagirdar left a comment

Contributor

I have similar questions to @singhpk234. I think changing the checksum default to SHA1 validation may make it more expensive without any significant benefit. TLS would already have a more modern HMAC integrity check anyway.

I also think we should avoid adding any net new configuration to S3FileIO just to keep things simpler.

One possible path may be to deprecate the checksumEnabled path and add a new configuration, checksumAlgorithm, where the possible values are none, md5, sha1, etc. The default would be none to preserve the existing behavior. That way the number of configurations remains the same in the long run, and we could also support the S3 Express case.

What do you think @muddyfish @singhpk234?
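
The checksumAlgorithm proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the property key `s3.checksum-algorithm` and the `ChecksumAlgorithm` enum are hypothetical names, and mapping the legacy `s3.checksum-enabled=true` flag to MD5 is an assumption based on the PR description, not confirmed behavior.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of the proposed setting; not actual Iceberg code.
enum ChecksumAlgorithm {
  NONE,
  MD5,
  SHA1;

  // Assumed key for the new setting; the thread does not name one.
  static final String ALGORITHM_KEY = "s3.checksum-algorithm";
  // Existing flag mentioned in the PR description.
  static final String LEGACY_ENABLED_KEY = "s3.checksum-enabled";

  static ChecksumAlgorithm fromProperties(Map<String, String> props) {
    String value = props.get(ALGORITHM_KEY);
    if (value != null) {
      // Accepts "none", "md5", "sha1" in any letter case;
      // throws IllegalArgumentException for anything else.
      return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
    // Assumption: fall back to the deprecated boolean flag so existing
    // configurations keep their MD5 behavior; the default stays NONE.
    return Boolean.parseBoolean(props.get(LEGACY_ENABLED_KEY)) ? MD5 : NONE;
  }
}
```

With this shape, an S3 Express user would set the assumed key to `sha1`, while anyone with no checksum settings keeps today's behavior of no checksumming.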

@muddyfish

Author

I think that's a reasonable comment, and I'd be happy going for that path forward.

@singhpk234

Contributor

What do you think @muddyfish @singhpk234?

Sounds good, @amogh-jahagirdar !

@steveloughran

Contributor

Checksumming input streams on the v2 SDK kills performance, no matter what you use. Might be best to turn it off. See apache/hadoop#6441

@github-actions

github-actions Bot commented Nov 2, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Nov 2, 2024
@muddyfish muddyfish closed this Nov 6, 2024