
[DCP Ingestion] Automate the DB initialization, Dockerize Ingestion Helper, and fix lock acquisition issue #1971

Merged

gmechali merged 9 commits into datacommonsorg:master from gmechali:dockerize on Apr 24, 2026

Conversation

@gmechali
Contributor

This PR does 3 main things:

  • It dockerizes the ingestion helper so that we can run it in a Cloud Run server. Today for base DC we deploy it as a Cloud Run function, but that isn't ideal for DCP. For DCP, we will deploy a Cloud Run server on the datacommon-ingestion-helper image. I have added a Cloud Build trigger in datcom-ci to rebuild the image every time we make a change to the ingestion-helper directory on the master branch.
  • Adds the Initialize Database action to the IngestionHelper function, which initializes all the tables in spanner_schema.sql (Edge, Node, Observation, ImportStatus, IngestionLock, ImportVersionHistory, IngestionHistory).
  • Modifies the Lock Acquisition step to insert the lock row the first time it runs.
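
The lock-acquisition fix can be sketched roughly as follows. This is a minimal in-memory stand-in for the Spanner IngestionLock table, not the actual implementation; the row shape and function names are assumptions for illustration only.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class LockRow:
    holder: Optional[str]   # who holds the lock; None means free
    acquired_at: float      # epoch seconds of the last acquisition


# In-memory stand-in for the Spanner IngestionLock table.
lock_table: Dict[str, LockRow] = {}


def acquire_lock(lock_id: str, caller: str) -> bool:
    """Acquire the ingestion lock, inserting the row on first use.

    Mirrors the fix in this PR: a freshly initialized database has no
    lock row yet, so the acquisition step must insert it rather than
    fail on a missing read.
    """
    row = lock_table.get(lock_id)
    if row is None:
        # First run ever: insert the lock row and take the lock.
        lock_table[lock_id] = LockRow(holder=caller, acquired_at=time.time())
        return True
    if row.holder is None:
        # Row exists but the lock is free: take it.
        row.holder = caller
        row.acquired_at = time.time()
        return True
    return False  # lock held by someone else


def release_lock(lock_id: str) -> None:
    """Release the lock by clearing its holder, if the row exists."""
    row = lock_table.get(lock_id)
    if row is not None:
        row.holder = None
```

In the real helper this would be a Spanner read-write transaction so the read-then-insert is atomic; the dict here only illustrates the control flow.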

As a follow up we should do the following:

  • Migrate the Base DC deployed Cloud Run function to use this dockerized image and deploy it as a Cloud Run server.
  • Remove all the logic from the ingestion pipeline that initializes the database.
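
The Initialize Database action described above has to turn the statements in spanner_schema.sql into the list that Spanner's UpdateDatabaseDdl API expects. A minimal sketch of that splitting step, assuming a simple semicolon-delimited schema file with no semicolons inside statements (a hypothetical helper, not the PR's actual code):

```python
def split_ddl(schema_sql: str) -> list[str]:
    """Split a schema file into individual DDL statements.

    Spanner's UpdateDatabaseDdl API takes a list of statements rather
    than one SQL script, so the helper splits on semicolons and drops
    empty fragments (trailing newlines, blank lines between tables).
    """
    statements = []
    for raw in schema_sql.split(";"):
        stmt = raw.strip()
        if stmt:
            statements.append(stmt)
    return statements
```

The resulting list would then be passed to the database admin client's DDL-update call inside the Initialize Database action.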

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a database initialization feature for the ingestion helper, including Docker and Cloud Build configurations, a Spanner schema, and logic to apply DDL statements. The locking mechanism was also updated to handle creation of new lock rows. The feedback includes a suggestion for better DatabaseAdminClient configuration.

Comment thread import-automation/workflow/ingestion-helper/spanner_client.py
Comment thread import-automation/workflow/ingestion-helper/schema.sql
@gmechali gmechali requested review from dwnoble, keyurva and vish-cs April 20, 2026 17:06
Comment thread import-automation/workflow/ingestion-helper/main.py
Comment thread import-automation/workflow/ingestion-helper/schema.sql
@gmechali gmechali requested a review from vish-cs April 21, 2026 12:36

@dwnoble left a comment


Thanks Gabe!

I think this tooling may be better suited to live in the DCP repo at https://github.com/datacommonsorg/datacommons

This PR converts the import automation cloud functions to a REST service, but if we're going to do that, I think it makes sense to consolidate it with the DCP FastAPI service so we don't end up adding a separate process.

Then this action could look something like: POST https://<dcp-service>/api/admin/initialize-database

Regardless, I think we should also:

  • convert this toolset from requirements.txt files to uv so we can have consistent builds with the uv.lock files
  • right now the functions in main.py are tied to Cloud Run functions; we should migrate these to a plain Python module, with the Cloud Run part as a minimal wrapper on top of it
  • expose these functions via a command-line tool. If we migrate to the DCP repo, this might look like: $ datacommons admin initialize-database or $ datacommons admin release-ingestion-lock.
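
The command-line tool suggested above could be sketched with argparse roughly like this. The subcommand names follow the examples in this comment; the handler bodies are placeholders, not real implementations:

```python
import argparse


def initialize_database() -> str:
    # Placeholder: the real version would apply the schema DDL to Spanner.
    return "initialized"


def release_ingestion_lock() -> str:
    # Placeholder: the real version would clear the IngestionLock row.
    return "lock released"


def build_parser() -> argparse.ArgumentParser:
    """Build a `datacommons admin <action>` style parser."""
    parser = argparse.ArgumentParser(prog="datacommons")
    groups = parser.add_subparsers(dest="group", required=True)

    admin = groups.add_parser("admin", help="administrative actions")
    actions = admin.add_subparsers(dest="action", required=True)
    actions.add_parser("initialize-database").set_defaults(
        func=initialize_database)
    actions.add_parser("release-ingestion-lock").set_defaults(
        func=release_ingestion_lock)
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    return args.func()
```

The same plain-Python functions could then back both this CLI and a thin Cloud Run wrapper, which is the separation the bullet points above are asking for.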

cc @vish-cs

Comment thread import-automation/workflow/ingestion-helper/spanner_client.py
Comment thread import-automation/workflow/ingestion-helper/Dockerfile
Comment thread import-automation/workflow/ingestion-helper/README.md
@gmechali
Contributor Author

> Thanks Gabe!
>
> I think this tooling may be better suited to live in the DCP repo at https://github.com/datacommonsorg/datacommons
>
> This PR converts the import automation cloud functions to a REST service, but if we're going to do that, I think it makes sense to consolidate it with the DCP FastAPI service so we don't end up adding a separate process.
>
> Then this action could look something like: POST https://<dcp-service>/api/admin/initialize-database
>
> Regardless, I think we should also:
>
> • convert this toolset from requirements.txt files to uv so we can have consistent builds with the uv.lock files
> • right now the functions in main.py are tied to Cloud Run functions; we should migrate these to a plain Python module, with the Cloud Run part as a minimal wrapper on top of it
> • expose these functions via a command-line tool. If we migrate to the DCP repo, this might look like: $ datacommons admin initialize-database or $ datacommons admin release-ingestion-lock.
>
> cc @vish-cs

The other comments are essentially replies about moving to the DCP repo. Maybe one day, but I think it's premature, and splitting up the ingestion code also seems controversial.
Converting to uv: done!
Migrating from cloud functions to a plain Python module: agreed, but I don't think there's an immediate need for it. Can we track this post-M1?
Command-line tool: again, I think this is well past M1, right? All nice-to-haves, but are we aligned that these are future improvements, not for MVP?

@gmechali gmechali requested a review from dwnoble April 22, 2026 17:22

@dwnoble left a comment


> The other comments are essentially replies about moving to the DCP repo. Maybe one day, but I think it's premature, and splitting up the ingestion code also seems controversial.
> Converting to uv: done!
> Migrating from cloud functions to a plain Python module: agreed, but I don't think there's an immediate need for it. Can we track this post-M1?
> Command-line tool: again, I think this is well past M1, right? All nice-to-haves, but are we aligned that these are future improvements, not for MVP?

Adding the python module & CLI later SGTM.

For converting to uv, we should also migrate from requirements.txt to pyproject.toml (example: https://github.com/datacommonsorg/datacommons/blob/main/packages/datacommons-db/pyproject.toml)

but that can also happen in a follow-up!
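
A minimal pyproject.toml for that migration might look something like the following; the project name and dependency list here are assumptions for illustration, not the actual file:

```toml
[project]
name = "ingestion-helper"          # hypothetical name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "google-cloud-spanner",
]
```

With uv, the dependencies are then pinned with `uv lock` (producing the uv.lock file mentioned above) and installed reproducibly with `uv sync`.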

Comment thread import-automation/workflow/ingestion-helper/README.md
Comment thread import-automation/workflow/ingestion-helper/Dockerfile
@gmechali
Contributor Author

> > The other comments are essentially replies about moving to the DCP repo. Maybe one day, but I think it's premature, and splitting up the ingestion code also seems controversial.
> > Converting to uv: done!
> > Migrating from cloud functions to a plain Python module: agreed, but I don't think there's an immediate need for it. Can we track this post-M1?
> > Command-line tool: again, I think this is well past M1, right? All nice-to-haves, but are we aligned that these are future improvements, not for MVP?
>
> Adding the python module & CLI later SGTM.
>
> For converting to uv, we should also migrate from requirements.txt to pyproject.toml (example: https://github.com/datacommonsorg/datacommons/blob/main/packages/datacommons-db/pyproject.toml)
>
> but that can also happen in a follow-up!

Changed to pyproject.toml!

@gmechali gmechali merged commit 3d7705b into datacommonsorg:master Apr 24, 2026
9 checks passed
@gmechali gmechali deleted the dockerize branch April 24, 2026 18:44

5 participants