Broke infra/dcp terraform script up into smaller modules#36

Open
dwnoble wants to merge 4 commits into datacommonsorg:main from dwnoble:terraform-modularize

Conversation

Contributor

@dwnoble dwnoble commented Apr 27, 2026

Refactored the infra/dcp terraform folders by flattening the directory structure and breaking down the cdc and dcp modules into a set of smaller sub-modules.

Changes

  • Flattening: Moved all sub-modules directly under infra/dcp/modules/.
  • stack: Orchestrates sub-modules based on feature toggles (modules/stack); see the sketch after this list.
  • cdc_data_ingestion_job: Ingestion Cloud Run v2 Job.
  • cdc_iam: IAM and Secret Manager config for CDC.
  • cdc_mysql: Cloud SQL MySQL instance and databases.
  • cdc_network: VPC and serverless access connectors.
  • cdc_redis: Memorystore Redis instance.
  • cdc_services: CDC Cloud Run v2 web services (modules/cdc_services).
  • cdc_storage: GCS buckets for I/O.
  • dcp_dataflow_ingestion: Dataflow and helper service for DCP.
  • dcp_ingestion_workflow: Cloud Workflows for orchestration.
  • dcp_service: DCP Cloud Run service.
  • dcp_storage: GCS bucket for DCP pipeline.
  • spanner: Shared Cloud Spanner instance and databases.
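
A minimal sketch of how the stack module wires these together (the toggle variable names enable_cdc / enable_dcp and the project_id input are illustrative, not necessarily the final interface):

# modules/stack/main.tf (sketch only)
module "cdc_services" {
  source = "../cdc_services"
  count  = var.enable_cdc ? 1 : 0

  project_id = var.project_id
  # ... remaining CDC inputs ...
}

module "dcp_service" {
  source = "../dcp_service"
  count  = var.enable_dcp ? 1 : 0

  project_id = var.project_id
  # ... remaining DCP inputs ...
}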

Tested in the datcom-website-dev project under the danmod namespace.

@dwnoble dwnoble requested a review from gmechali April 27, 2026 04:47
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the Data Commons Platform infrastructure into a modular architecture, separating the DCP and Custom Data Commons (CDC) stacks and introducing a central orchestration module. Feedback focuses on improving infrastructure stability and security, specifically recommending against the use of 'force_destroy' on data buckets, the ':latest' container tag, and authoritative IAM bindings. Additionally, the use of 'timestamp()' in environment variables was identified as a cause for unnecessary resource redeployments.

resource "google_storage_bucket" "data_bucket" {
name = local.gcs_data_bucket_name
location = var.gcs_data_bucket_location
force_destroy = true

security-high

Setting force_destroy = true on a data storage bucket is risky as it allows Terraform to delete the bucket even if it contains data. It is safer to set this to false or tie it to a deletion protection toggle to prevent accidental data loss.

  force_destroy = false
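
If a toggle is preferred instead, a minimal sketch (the variable name is an assumption, not part of this PR):

variable "enable_bucket_deletion_protection" {
  type    = bool
  default = true
}

resource "google_storage_bucket" "data_bucket" {
  name     = local.gcs_data_bucket_name
  location = var.gcs_data_bucket_location
  # Only allow deleting a non-empty bucket when protection is explicitly disabled.
  force_destroy = !var.enable_bucket_deletion_protection
}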


template {
  containers {
    image = "gcr.io/datcom-ci/datacommons-ingestion-helper:latest"

high

Using the :latest tag for container images can lead to non-deterministic deployments and makes it difficult to track which version of the code is currently running. It is recommended to use a specific version tag or an image digest for production-ready infrastructure.

      image = "gcr.io/datcom-ci/datacommons-ingestion-helper:stable"

Comment on lines +67 to 73
resource "google_cloud_run_service_iam_binding" "public_invoker" {
count = var.make_service_public ? 1 : 0
location = google_cloud_run_v2_service.dcp_service.location
service = google_cloud_run_v2_service.dcp_service.name
role = "roles/run.invoker"
members = ["allUsers"]
}

security-medium

Using google_cloud_run_service_iam_binding is authoritative for the given role and will overwrite any existing members of that role in the service's IAM policy. It is generally safer to use google_cloud_run_v2_service_iam_member (non-authoritative) to manage access. Additionally, per repository rules, public access to Cloud Run services must be controlled by the make_dc_web_service_public boolean variable.

resource "google_cloud_run_v2_service_iam_member" "public_invoker" {
  count    = var.make_dc_web_service_public ? 1 : 0
  location = google_cloud_run_v2_service.dcp_service.location
  name     = google_cloud_run_v2_service.dcp_service.name
  role     = "roles/run.invoker"
  member   = "allUsers"
}
References
  1. Public access to Cloud Run services should be controlled by a boolean variable (e.g., make_dc_web_service_public) that defaults to false to avoid unintentional public exposure.

},
{
  name  = "FORCE_RESTART"
  value = "${timestamp()}"

medium

Using timestamp() in an environment variable causes the resource to be updated on every terraform apply, even if no other changes are present. This leads to unnecessary redeployments of the Cloud Run services. Consider using a build ID or a specific version variable to trigger redeployments only when the image actually changes.
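
A sketch of that alternative, assuming a build identifier is threaded through as a variable (the name build_id is illustrative):

variable "build_id" {
  description = "Identifier of the build being deployed; a new value triggers a new revision."
  type        = string
}

# Then, in place of the timestamp()-based entry:
      {
        name  = "FORCE_RESTART"
        value = var.build_id
      },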

Contributor

@gmechali gmechali left a comment


Left a couple of comments - thanks for the modularization, this is such an improvement!

location_id             = var.redis_location_id
alternative_location_id = var.redis_alternative_location_id
redis_version           = "REDIS_6_X"
display_name            = "Data Commons Redis Instance"

Should we use the name prefix here?

@@ -0,0 +1,107 @@
locals {

Super nit: can this module be dcp_ingestion_dataflow, to keep the name parallel with dcp_ingestion_workflow?

resource "google_service_account" "dcp_ingestion_runner" {
count = var.deploy ? 1 : 0
account_id = "${local.name_prefix}dcp-ingestion-sa"
display_name = "Data Commons Platform Ingestion Runner"

nit: Add name prefix?

member = "serviceAccount:${google_service_account.dcp_ingestion_runner[0].email}"
}

resource "google_cloud_run_v2_service" "ingestion_helper" {

I'm curious what you think here about further modularization.

Should we have an ingestion module, with submodules for:

  • Dataflow
  • Ingestion helper
  • Workflow

Right now you have 2 modules:

  • Dataflow & Ingestion helper
  • Workflow
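
For illustration, the nesting I have in mind would look roughly like this (paths and inputs are just a sketch):

# modules/dcp_ingestion/main.tf (sketch only)
module "dataflow" {
  source = "./dataflow"
  # Dataflow pipeline inputs
}

module "ingestion_helper" {
  source = "./ingestion_helper"
  # ingestion helper Cloud Run service inputs
}

module "workflow" {
  source = "./workflow"
  # Cloud Workflows inputs
}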

@@ -0,0 +1,3 @@
output "ingestion_orchestrator_id" {

Should the ingestion helper URL also be an output here?
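
Something like this, assuming the outputs file lives next to the ingestion_helper resource (just a sketch):

output "ingestion_helper_url" {
  description = "URL of the ingestion helper Cloud Run service."
  # If the resource is gated behind a count toggle, index it with [0] / one().
  value       = google_cloud_run_v2_service.ingestion_helper.uri
}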

@@ -1,3 +1,18 @@
locals {

Fine to leave as is, but do you think we should leave this with the other terraforms for now, given this DCP runner won't be in the first deliverables?

Fine to leave in this PR, but we may want to move it out later to give customers only things that are ready.

@@ -0,0 +1,11 @@
locals {

Should we have separate DCP storage and CDC storage modules? This feels redundant; we could just use the same one, right?

@@ -0,0 +1,283 @@
data "google_compute_network" "default" {

I'm trying to wrap my head around what this stack module represents, and I think what I'm seeing is that the stack is the control for how we recommend setting things up, depending on whether it's a CDC or DCP deployment.

I'm trying to think through whether there's a cleaner way to put the modules together, but I don't have an answer. Would love to chat a bit more about this specific part!
