Add MongoToGCSOperator to copy MongoDB collections to GCS#66013
|
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide
|
|
I don't think these test failures are related to my changes. Could you please take a look? |
|
Yeah, they aren't. Please rebase onto main |
7a82ce4 to f31d000
|
Sorry, rebased onto latest main |
df9e79f to 2a22b0e
2a22b0e to b3b8215
|
@seungoh-lee A few things need addressing before review — see our Pull Request quality criteria.
No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting |
b3b8215 to 85705b3
I've addressed your feedback. Could you check it one more time? |
shahar1
left a comment
Apologies that it took so long to review.
Looks solid overall, couple of comments to resolve and I think that we're good :)
Introduces a new transfer operator under the Google provider that exports MongoDB documents to Google Cloud Storage in JSON, CSV, or Parquet format. The operator extends BaseSQLToGCSOperator and accepts either a find filter or an aggregation pipeline through mongo_query, with optional projection and allowDiskUse controls. A small cursor adapter wraps the pymongo cursor as a DB-API style cursor so the parent operator chunking, schema inference, and upload flow can be reused. BSON-specific values (ObjectId, Decimal128, bytes) are converted to BigQuery-friendly types. Includes provider.yaml registration, mongo provider as a transitive extra, unit tests, system test example, and how-to documentation.
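To illustrate the cursor adapter idea from the commit message above, here is a minimal sketch rather than the code from this PR. It assumes documents arrive as plain dicts and that the column names are known up front; the class name `DocumentCursorAdapter` and its constructor arguments are hypothetical, and BSON value conversion is left out.

```python
from collections.abc import Iterable, Iterator
from typing import Any


class DocumentCursorAdapter:
    """Hypothetical adapter: exposes dict documents through a DB-API-like cursor."""

    def __init__(self, documents: Iterable[dict[str, Any]], field_names: list[str]) -> None:
        self._documents = iter(documents)
        self._field_names = field_names
        # DB-API cursors describe each column as a 7-item sequence;
        # only the name is filled in for this sketch.
        self.description = [(name, None, None, None, None, None, None) for name in field_names]

    def __iter__(self) -> Iterator[tuple[Any, ...]]:
        return self

    def __next__(self) -> tuple[Any, ...]:
        document = next(self._documents)  # StopIteration propagates when exhausted
        # Missing fields become None so every row has the same width.
        return tuple(document.get(name) for name in self._field_names)


# A plain list stands in for a pymongo cursor here (assumption for the demo).
docs = [{"_id": 1, "name": "a"}, {"_id": 2}]
cursor = DocumentCursorAdapter(docs, ["_id", "name"])
column_names = [column[0] for column in cursor.description]
rows = list(cursor)
```

Because the adapter yields fixed-width tuples and carries a `description`, code written against SQL cursors can iterate it without knowing the rows came from documents.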
- mongo_to_gcs.py: declare cursor as Any so the find/aggregate branches don't trip mypy's stricter type narrowing.
- test_selective_checks: add mongo to google-related expected provider lists, now that mongo is a google cross-provider dep.
- get_provider_info.py: register the mongo_to_gcs transfer so the provider build-files check stays in sync with provider.yaml.
- uv.lock: refresh to add the mongo extra and pick up an unrelated dev/registry tomli dep that was already on main.
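The first bullet above (declaring the cursor as `Any`) can be sketched roughly as follows. This is an illustration under assumptions, not the operator's code: `FindResult`, `AggregateResult`, and `open_cursor` are hypothetical stand-ins for the two cursor types that the find and aggregate branches would produce.

```python
from typing import Any, Union


class FindResult:
    """Hypothetical stand-in for the cursor returned by a find call."""

    def __init__(self, query_filter: dict[str, Any]) -> None:
        self.query_filter = query_filter


class AggregateResult:
    """Hypothetical stand-in for the cursor returned by an aggregate call."""

    def __init__(self, pipeline: list[dict[str, Any]]) -> None:
        self.pipeline = pipeline


def open_cursor(query: Union[dict[str, Any], list[dict[str, Any]]]) -> Any:
    # Declaring the variable as Any up front means mypy does not pin it to
    # the type assigned in the first branch and then reject the second one.
    cursor: Any
    if isinstance(query, list):
        cursor = AggregateResult(query)  # a list is treated as a pipeline
    else:
        cursor = FindResult(query)  # a dict is treated as a find filter
    return cursor


find_cursor = open_cursor({"active": True})
aggregate_cursor = open_cursor([{"$match": {"active": True}}])
```

The trade-off is that `Any` switches off type checking for that variable, so it is best confined to the small scope where the two branches meet.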
The update-providers-build-files prek hook also regenerates the google provider docs/index.rst from provider.yaml. The previous fixup commit added mongo to provider.yaml/pyproject.toml but did not include the docs/index.rst line, so the CI hook kept failing.
85705b3 to 1597207
- Document why MongoToGCSOperator reuses the SQL-to-GCS base class despite MongoDB being a NoSQL store, and note a possible future BaseNoSQLToGCSOperator.
- Drop the inherited but unused `sql` template field (and the `.sql` template extension / sql renderer) so the empty `sql` value is no longer exposed in rendered templates; this operator is driven by `mongo_query`.
- Use `autospec=True` when patching MongoHook/GCSHook in the unit tests.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
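As a rough illustration of why `autospec=True` helps in the last bullet above, the sketch below patches a method on a stand-in class. `StandInHook` and its `find` signature are hypothetical and are not the real MongoHook API; only `unittest.mock` from the standard library is used.

```python
from unittest import mock


class StandInHook:
    """Hypothetical stand-in for a hook class; not the real MongoHook."""

    def find(self, collection, query):
        raise RuntimeError("would talk to a real database")


def run_checks():
    results = {}
    # autospec=True makes the mock enforce the signature of the real method.
    with mock.patch.object(StandInHook, "find", autospec=True) as mock_find:
        mock_find.return_value = [{"_id": 1}]
        hook = StandInHook()
        results["docs"] = hook.find("users", {"active": True})
        try:
            # An argument the real method does not accept is rejected,
            # whereas a plain MagicMock would silently accept it.
            hook.find("users", {"active": True}, bogus=1)
        except TypeError:
            results["bad_call_rejected"] = True
        else:
            results["bad_call_rejected"] = False
    return results


results = run_checks()
```

With a non-autospecced mock, the second call would pass, so a test could keep passing after the hook's real signature changed.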
1597207 to c2f7d71
|
Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions. |
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.7) following the guidelines