Skip to content

ARROW-14893: [C++] Allow creating GCS filesystem from URI#12475

Closed
emkornfield wants to merge 8 commits into
apache:masterfrom
emkornfield:gcs_python
Closed

ARROW-14893: [C++] Allow creating GCS filesystem from URI#12475
emkornfield wants to merge 8 commits into
apache:masterfrom
emkornfield:gcs_python

Conversation

@emkornfield

Copy link
Copy Markdown
Contributor

This is mostly adapted from s3. One place that it differs
is it doesn't allow for username and password. The only thing
that I could see supporting in the future is taking username as
user to impersonate, but we can see if there is demand for that feature.

I also added in two useful features in s3:

  • Option for setting a default location to create a bucket in.
  • Default key-value metadata to use for object creation if none
    is specified.

Also fixed a typo in s3fs error message.

This is mostly adapted from s3. One place that it differs
is it doesn't allow for username and password. The only thing
that I could see supporting in the future is taking username as
user to impersonate, but we can see if there is demand for that feature.

I also added in two useful features in s3:
- Option for setting a default location to create a bucket in.
- Default key-value metadata to use for object creation if none
  is specified.

Also fixed a typo in s3fs error message.
@emkornfield

Copy link
Copy Markdown
Contributor Author

CC @coryan

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@coryan coryan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all the fixes.

@emkornfield emkornfield requested a review from pitrou February 22, 2022 03:39

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emkornfield . A couple minor comments and questions.

Comment thread cpp/src/arrow/filesystem/filesystem.cc Outdated
}
return std::make_shared<LocalFileSystem>(options, io_context);
}
if (scheme == "gs") {

@pitrou pitrou Feb 22, 2022

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can/should we also accept "gcs" for compatibility with https://gcsfs.readthedocs.io/en/latest/#integration ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment thread cpp/src/arrow/filesystem/gcsfs.cc Outdated
"https"};
"https",
/*default_bucket_location=*/{},
/*default_metadata=*/nullptr};

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of spelling all default parameter values explicitly, it would probably more readable and maintainable to set the non-default attributes individually e.g.;

GcsOptions options{};
options.credentials = std::make_shared<GcsCredentials>(google::cloud::MakeInsecureCredentials());
options.schema = "http";
return options;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment thread cpp/src/arrow/filesystem/gcsfs.cc
Comment thread cpp/src/arrow/filesystem/gcsfs.h Outdated
Comment thread cpp/src/arrow/filesystem/gcsfs.h Outdated

std::string endpoint_override;
std::string scheme;
// \brief Location to use for creating buckets.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of value can a location have? Is it a "region" like in S3?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are regions. I went back and forth on whether to call this location or region, but chose location because it corresponded with the API. I don't have a strong preference. @coryan any thoughts on this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are locations that are not regions:

https://cloud.google.com/storage/docs/locations#location-mr

Probably not of much interest for analytics workloads, but who knows.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always thought of multi-regions as magic regions instead of a different concept. Unfortunately, people do use multi-regions for analytics workloads and experience the associated pain points.

Comment thread cpp/src/arrow/filesystem/gcsfs.h Outdated
emkornfield and others added 4 commits February 22, 2022 09:55
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
This reverts commit 30bd844.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks for the update @emkornfield

@pitrou pitrou closed this in 8244568 Feb 23, 2022
@ursabot

ursabot commented Feb 23, 2022

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = 53e1296 and contender = 8244568. 8244568 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.42% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.68% ⬆️0.34%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@grisaitis

Copy link
Copy Markdown

Is any additional work required to make this accessible from pyarrow / pyarrow.fs.GcsFileSystem? Would love to start using this from pyarrow.

@emkornfield

Copy link
Copy Markdown
Contributor Author

This change hadn't been released. I think with this change datsets should work. I'll hopefully have a PR up this week for pyarrow wrappers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants