Add MongoToGCSOperator to copy MongoDB collections to GCS #66013

Merged
potiuk merged 4 commits into apache:main from seungoh-lee:feature/mongo-to-gcs-operator
Jun 9, 2026

Conversation

@seungoh-lee

@seungoh-lee seungoh-lee commented Apr 28, 2026

Contributor

Introduces a new transfer operator under the Google provider that exports MongoDB documents to Google Cloud Storage in JSON, CSV, or Parquet format. The operator extends BaseSQLToGCSOperator and accepts either a find filter or an aggregation pipeline through mongo_query, with optional projection and allowDiskUse controls.
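
As a rough illustration of the find-versus-aggregate split described above, the sketch below assumes that a list-valued `mongo_query` is treated as an aggregation pipeline and a dict-valued one as a find filter. The helper name `run_mongo_query` and its parameter names are hypothetical and not taken from the PR. `collection` stands for a pymongo `Collection`; only its `find` and `aggregate` methods are called.

```python
from __future__ import annotations

from typing import Any


def run_mongo_query(
    collection: Any,
    mongo_query: dict | list,
    projection: dict | list | None = None,
    allow_disk_use: bool = False,
) -> Any:
    """Hypothetical helper: pick find() or aggregate() from the query's shape."""
    if isinstance(mongo_query, list):
        # Assumption: a list is an aggregation pipeline (a sequence of stages).
        # allowDiskUse is forwarded to the aggregate command as a keyword option.
        return collection.aggregate(mongo_query, allowDiskUse=allow_disk_use)
    # Assumption: anything else is a find filter, optionally with a projection.
    return collection.find(mongo_query, projection=projection)
```

Whether the merged operator dispatches exactly this way is not stated in the description; the sketch only shows one plausible reading of it.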

A small cursor adapter wraps the pymongo cursor as a DB-API style cursor so the parent operator chunking, schema inference, and upload flow can be reused. BSON-specific values (ObjectId, Decimal128, bytes) are converted to BigQuery-friendly types.
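
To make the adapter idea concrete, here is a minimal sketch of how an iterable of documents could be exposed through a DB-API style interface (a `description` attribute plus row iteration). The class name `DocumentCursorAdapter`, the explicit `fields` list, and the conversion rules in `to_plain` are all assumptions made for illustration; the PR's actual adapter and type mapping may differ.

```python
from __future__ import annotations

import base64
from collections.abc import Iterable, Iterator
from typing import Any


def to_plain(value: Any) -> Any:
    """Convert a value to something JSON/CSV friendly (illustrative rules only)."""
    if isinstance(value, (bytes, bytearray)):
        # Assumption: binary payloads are exported as base64 text.
        return base64.b64encode(bytes(value)).decode("ascii")
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    # Assumption: ObjectId, Decimal128 and similar BSON types fall back to str().
    return str(value)


class DocumentCursorAdapter:
    """Hypothetical adapter: iterate documents as DB-API style row tuples."""

    def __init__(self, documents: Iterable[dict[str, Any]], fields: list[str]) -> None:
        self._documents = iter(documents)
        self._fields = fields
        # DB-API cursors describe each column with a 7-item sequence;
        # only the column name is filled in here.
        self.description = [(name, None, None, None, None, None, None) for name in fields]

    def __iter__(self) -> Iterator[tuple[Any, ...]]:
        return self

    def __next__(self) -> tuple[Any, ...]:
        doc = next(self._documents)  # StopIteration propagates when exhausted
        # Missing keys become None so that every row has the same width.
        return tuple(to_plain(doc.get(name)) for name in self._fields)
```

Because the adapter accepts any iterable of dicts, a pymongo cursor can be passed in directly, and a plain list works for tests.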

Includes provider.yaml registration, mongo provider as a transitive extra, unit tests, system test example, and how-to documentation.


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@seungoh-lee seungoh-lee requested a review from shahar1 as a code owner April 28, 2026 10:24
@boring-cyborg boring-cyborg Bot added area:providers kind:documentation provider:google Google (including GCP) related issues labels Apr 28, 2026
@boring-cyborg

boring-cyborg Bot commented Apr 28, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It is a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@seungoh-lee

Contributor Author

I don't think these test failures are related to my changes. Could you please take a look?

@amoghrajesh

Contributor

Yeah, they aren't. Please rebase onto main.

@seungoh-lee seungoh-lee force-pushed the feature/mongo-to-gcs-operator branch from 7a82ce4 to f31d000 Compare May 7, 2026 06:18
@seungoh-lee

Contributor Author

Sorry, rebased onto latest main

@seungoh-lee seungoh-lee force-pushed the feature/mongo-to-gcs-operator branch from df9e79f to 2a22b0e Compare May 13, 2026 09:16
@seungoh-lee seungoh-lee force-pushed the feature/mongo-to-gcs-operator branch from 2a22b0e to b3b8215 Compare May 21, 2026 09:20
@potiuk

potiuk commented May 24, 2026

Member

@seungoh-lee A few things need addressing before review — see our Pull Request quality criteria.

  • CI fails: CI image checks / Build documentation (--spellcheck-only) (and possibly other checks — see the Checks tab for the full list).

No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@seungoh-lee seungoh-lee force-pushed the feature/mongo-to-gcs-operator branch from b3b8215 to 85705b3 Compare May 26, 2026 02:02
@seungoh-lee

Contributor Author

I've addressed the points you raised.

Could you take another look?

@shahar1 shahar1 left a comment

Contributor

Apologies that it took too long to review.
Looks solid overall, couple of comments to resolve and I think that we're good :)

Comment thread providers/google/src/airflow/providers/google/cloud/transfers/mongo_to_gcs.py Outdated
Comment thread providers/google/tests/unit/google/cloud/transfers/test_mongo_to_gcs.py Outdated
- mongo_to_gcs.py: declare cursor as Any so the find/aggregate
  branches don't trip mypy's stricter type narrowing.
- test_selective_checks: add mongo to google-related expected
  provider lists, now that mongo is a google cross-provider dep.
- get_provider_info.py: register the mongo_to_gcs transfer so the
  provider build-files check stays in sync with provider.yaml.
- uv.lock: refresh to add the mongo extra and pick up an
  unrelated dev/registry tomli dep that was already on main.
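
The `Any` declaration mentioned in the first bullet above follows a common mypy pattern: when two branches assign values of different types to the same name, mypy infers the variable's type from the first assignment and reports the second as incompatible. Declaring the name as `Any` before the branches avoids that. A minimal sketch with stand-in classes (`FindCursor`, `AggregateCursor`, and `open_cursor` are hypothetical, not names from the PR):

```python
from typing import Any


class FindCursor:
    """Stand-in for the cursor type returned by a find call."""


class AggregateCursor:
    """Stand-in for the cursor type returned by an aggregate call."""


def open_cursor(use_pipeline: bool) -> Any:
    # Without this annotation mypy would infer AggregateCursor from the first
    # assignment and flag the FindCursor assignment as incompatible.
    cursor: Any
    if use_pipeline:
        cursor = AggregateCursor()
    else:
        cursor = FindCursor()
    return cursor
```

The trade-off is that `Any` switches off type checking for later uses of `cursor`; a `Union` of the two cursor types would be a stricter alternative.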
The update-providers-build-files prek hook also regenerates the
google provider docs/index.rst from provider.yaml. The previous
fixup commit added mongo to provider.yaml/pyproject.toml but did
not include the docs/index.rst line, so the CI hook kept failing.
@seungoh-lee seungoh-lee force-pushed the feature/mongo-to-gcs-operator branch from 85705b3 to 1597207 Compare May 29, 2026 15:09
- Document why MongoToGCSOperator reuses the SQL-to-GCS base class despite
  MongoDB being a NoSQL store, and note a possible future BaseNoSQLToGCSOperator.
- Drop the inherited but unused `sql` template field (and the `.sql` template
  extension / sql renderer) so the empty `sql` value is no longer exposed in
  rendered templates; this operator is driven by `mongo_query`.
- Use `autospec=True` when patching MongoHook/GCSHook in the unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
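
Dropping an inherited template field, as the second bullet above describes, can be done by redefining the class attributes on the subclass. The sketch below uses stand-in classes rather than the real Airflow operators, and the field names other than `sql` and `mongo_query` are invented for illustration. Whether the merged operator lists `mongo_query` as a template field is also an assumption here.

```python
class BaseTransfer:
    """Stand-in for a parent operator that templates a `sql` field."""

    template_fields = ("sql", "bucket", "filename")
    template_ext = (".sql",)
    template_fields_renderers = {"sql": "sql"}


class MongoTransfer(BaseTransfer):
    """Hypothetical subclass driven by `mongo_query` instead of `sql`."""

    # Keep every inherited field except `sql`, then add the Mongo-specific one.
    template_fields = tuple(
        name for name in BaseTransfer.template_fields if name != "sql"
    ) + ("mongo_query",)
    # No `.sql` files should be loaded as templates for this operator.
    template_ext = ()
    # Remove the renderer that only applied to the dropped `sql` field.
    template_fields_renderers = {
        name: renderer
        for name, renderer in BaseTransfer.template_fields_renderers.items()
        if name != "sql"
    }
```

Deriving the tuple from the parent, rather than hard-coding it, means fields added to the parent later are picked up automatically.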
@seungoh-lee seungoh-lee force-pushed the feature/mongo-to-gcs-operator branch from 1597207 to c2f7d71 Compare May 30, 2026 03:50

@shahar1 shahar1 left a comment

Contributor

Well done!!!

@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label Jun 8, 2026
@potiuk potiuk merged commit d896110 into apache:main Jun 9, 2026
109 checks passed
@boring-cyborg

boring-cyborg Bot commented Jun 9, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

Labels

area:providers kind:documentation provider:google Google (including GCP) related issues ready for maintainer review Set after triaging when all criteria pass.

4 participants