[Do not submit] Timeseries backfill avro#501

Draft
rohitkumarbhagat wants to merge 4 commits into datacommonsorg:master from rohitkumarbhagat:timeseries_backfill_avro

Conversation

@rohitkumarbhagat (Contributor)

No description provided.

@codacy-production (Bot) commented Apr 24, 2026

Not up to standards ⛔

🔴 Issues: 1 high · 9 medium · 47 minor

Alerts:
⚠ 57 issues (≤ 0 issues of at least minor severity)

Results: 57 new issues

| Category | Results |
| --- | --- |
| BestPractice | 5 medium |
| Security | 1 high |
| CodeStyle | 47 minor |
| Complexity | 4 medium |

View in Codacy

🟢 Metrics: 249 complexity · 22 duplication

| Metric | Results |
| --- | --- |
| Complexity | 249 |
| Duplication | 22 |

View in Codacy


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces the timeseries-backfill module, which provides Beam/Dataflow pipelines and a standalone validator for backfilling normalized timeseries tables from legacy Spanner data. The implementation includes support for both live Spanner sources and Avro exports. Review feedback focuses on improving the portability and maintainability of the module by addressing hardcoded local file paths in the documentation and scripts, replacing non-standard shell utilities with more common alternatives, and consolidating redundant command-line examples in the README.

Comment thread pipeline/timeseries-backfill/recreate_timeseries_tables.sh Outdated
Comment thread pipeline/timeseries-backfill/README.md Outdated

Destination schema:

- [rk-experiments/mixer/spanner/bq_spanner_ingestion/timeseries_schema.sql](/home/rohitrkumar_google_com/Documents/dc/github/rohitkumarbhagat/dc_local/rk-experiments/mixer/spanner/bq_spanner_ingestion/timeseries_schema.sql)

medium

The link to timeseries_schema.sql points to a local file path. This link will be broken for other users. Please update it to a relative path within the repository if the file is checked in, or to a public URL. This same issue is present on line 259.

Comment on lines +196 to +201

```bash
mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputFiles=/usr/local/google/home/rohitrkumar/Documents/dc/github/rohitkumarbhagat/import/pipeline/Observation.avro-00042-of-00303 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DirectRunner"
```


medium

This code block contains a duplicated and incorrectly indented mvn command that uses a local file path. This appears to be a leftover from local testing and could be confusing. It would be best to remove it to keep the documentation clean and clear.

Comment on lines +228 to +249
```bash
mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputExportDir=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/<export_subdir> --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --variableMeasured=Count_Person --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2"

mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputFiles=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/dc-kg-test-dc_graph_2026_01_27-2026-04-23_05_47_24-8439747614048276587/Observation.avro-00042-of-00303 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2"

mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputExportDir=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/dc-kg-test-dc_graph_2026_01_27-2026-04-23_05_47_24-8439747614048276587 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2"

mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputExportDir=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/dc-kg-test-dc_graph_2026_01_27-2026-04-23_05_47_24-8439747614048276587 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2

```

medium

This section contains multiple, duplicated, and inconsistently formatted command examples, which can be confusing. It's recommended to consolidate these into one or two clear examples for running with --inputExportDir and --inputFiles. Also, please use placeholders for user-specific values like GCS paths.

Here is a suggested cleaned-up version for the --inputExportDir case:

```bash
mvn -Pgit-worktree compile exec:java \
  -pl timeseries-backfill -am \
  -Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
  -Dexec.args="--project=<gcp-project> \
    --spannerInstanceId=<instance-id> \
    --spannerDatabaseId=<database-id> \
    --inputExportDir=gs://<your-bucket>/<path-to-export-dir> \
    --destinationTimeSeriesTableName=TimeSeries_rk \
    --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk \
    --destinationStatVarObservationTableName=StatVarObservation_rk \
    --variableMeasured=Count_Person \
    --runner=DataflowRunner \
    --region=us-central1 \
    --tempLocation=gs://<your-bucket>/temp \
    --stagingLocation=gs://<your-bucket>/staging \
    --numWorkers=20 \
    --maxNumWorkers=100 \
    --workerMachineType=n2-custom-4-32768 \
    --numberOfWorkerHarnessThreads=2"
```
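A matching sketch for the `--inputFiles` case could follow the same placeholder pattern (the placeholders are illustrative assumptions, not values from this PR; `--inputFiles` accepts a file pattern per the flag's use elsewhere in this README):

```bash
mvn -Pgit-worktree compile exec:java \
  -pl timeseries-backfill -am \
  -Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
  -Dexec.args="--project=<gcp-project> \
    --spannerInstanceId=<instance-id> \
    --spannerDatabaseId=<database-id> \
    --inputFiles=gs://<your-bucket>/<path-to-export-dir>/Observation.avro-* \
    --destinationTimeSeriesTableName=TimeSeries_rk \
    --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk \
    --destinationStatVarObservationTableName=StatVarObservation_rk \
    --runner=DataflowRunner \
    --region=us-central1 \
    --tempLocation=gs://<your-bucket>/temp \
    --stagingLocation=gs://<your-bucket>/staging"
```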

```bash
append_if_present() {
  local pattern="$1"
  local ddl="$2"
  if rg -q "${pattern}" "${current_schema_file}"; then
```

medium

The script uses rg (ripgrep), which is not a standard command-line utility and may not be installed on all systems. For better portability, please use a more common tool like grep. The -q flag in grep provides the same "quiet" functionality.

Suggested change:

```diff
-if rg -q "${pattern}" "${current_schema_file}"; then
+if grep -q "${pattern}" "${current_schema_file}"; then
```
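To illustrate the suggestion, here is a minimal self-contained sketch of how the portable `grep`-based version of `append_if_present` might read in full. The file names and the append target (`ddl_output_file`) are assumptions for illustration only; the PR shows just the function's opening lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed file names for illustration; the PR's script defines
# current_schema_file itself, and its append target is not shown.
current_schema_file="current_schema.sql"
ddl_output_file="recreate_ddl.sql"

# Append a DDL statement to the output file only if the current schema
# mentions the given pattern. grep -q exits 0 on the first match and
# suppresses output, mirroring rg -q, and is available on any POSIX system.
append_if_present() {
  local pattern="$1"
  local ddl="$2"
  if grep -q "${pattern}" "${current_schema_file}"; then
    printf '%s\n' "${ddl}" >> "${ddl_output_file}"
  fi
}
```

Since `grep -q` and `rg -q` have the same exit-status contract, the swap changes no behavior for fixed-string-like patterns; the only caveat is that ripgrep patterns using Rust-regex extensions would need adjusting for POSIX `grep`.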
