[Do not submit] Timeseries backfill avro#501

Draft
rohitkumarbhagat wants to merge 4 commits into datacommonsorg:master from rohitkumarbhagat:timeseries_backfill_avro

Conversation

@rohitkumarbhagat (Contributor)

No description provided.

@codacy-production (Bot) commented Apr 24, 2026

Not up to standards ⛔

🔴 Issues: 1 high · 9 medium · 47 minor

Alerts:
⚠ 57 issues (≤ 0 issues of at least minor severity)

Results: 57 new issues

| Category | Results |
| --- | --- |
| BestPractice | 5 medium |
| Security | 1 high |
| CodeStyle | 47 minor |
| Complexity | 4 medium |

View in Codacy

🟢 Metrics: 249 complexity · 22 duplication

| Metric | Results |
| --- | --- |
| Complexity | 249 |
| Duplication | 22 |

View in Codacy


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces the timeseries-backfill module, which provides Beam/Dataflow pipelines and a standalone validator for backfilling normalized timeseries tables from legacy Spanner data. The implementation includes support for both live Spanner sources and Avro exports. Review feedback focuses on improving the portability and maintainability of the module by addressing hardcoded local file paths in the documentation and scripts, replacing non-standard shell utilities with more common alternatives, and consolidating redundant command-line examples in the README.

Comment thread pipeline/timeseries-backfill/recreate_timeseries_tables.sh Outdated
Comment thread pipeline/timeseries-backfill/README.md Outdated

Destination schema:

- [rk-experiments/mixer/spanner/bq_spanner_ingestion/timeseries_schema.sql](/home/rohitrkumar_google_com/Documents/dc/github/rohitkumarbhagat/dc_local/rk-experiments/mixer/spanner/bq_spanner_ingestion/timeseries_schema.sql)

medium

The link to timeseries_schema.sql points to a local file path. This link will be broken for other users. Please update it to a relative path within the repository if the file is checked in, or to a public URL. This same issue is present on line 259.

Comment on lines +196 to +201

```bash
mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputFiles=/usr/local/google/home/rohitrkumar/Documents/dc/github/rohitkumarbhagat/import/pipeline/Observation.avro-00042-of-00303 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DirectRunner"
```


medium

This code block contains a duplicated and incorrectly indented mvn command that uses a local file path. This appears to be a leftover from local testing and could be confusing. It would be best to remove it to keep the documentation clean and clear.

Comment on lines +228 to +249
```bash
mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputExportDir=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/<export_subdir> --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --variableMeasured=Count_Person --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2"

mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputFiles=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/dc-kg-test-dc_graph_2026_01_27-2026-04-23_05_47_24-8439747614048276587/Observation.avro-00042-of-00303 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2"

mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputExportDir=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/dc-kg-test-dc_graph_2026_01_27-2026-04-23_05_47_24-8439747614048276587 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2"

mvn -Pgit-worktree compile exec:java \
-pl timeseries-backfill -am \
-Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
-Dexec.args="--project=datcom-store --spannerInstanceId=dc-kg-test --spannerDatabaseId=dc_graph_2026_01_27 --inputExportDir=gs://rohitrkumar-dataflow/spanner_obs_dump_2026_04_21/dc-kg-test-dc_graph_2026_01_27-2026-04-23_05_47_24-8439747614048276587 --destinationTimeSeriesTableName=TimeSeries_rk --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk --destinationStatVarObservationTableName=StatVarObservation_rk --runner=DataflowRunner --region=us-central1 --tempLocation=gs://keyurs-dataflow/temp --stagingLocation=gs://keyurs-dataflow/temp --numWorkers=20 --maxNumWorkers=100 --workerMachineType=n2-custom-4-32768 --numberOfWorkerHarnessThreads=2

```

medium

This section contains multiple, duplicated, and inconsistently formatted command examples, which can be confusing. It's recommended to consolidate these into one or two clear examples for running with --inputExportDir and --inputFiles. Also, please use placeholders for user-specific values like GCS paths.

Here is a suggested cleaned-up version for the --inputExportDir case:

```bash
mvn -Pgit-worktree compile exec:java \
  -pl timeseries-backfill -am \
  -Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
  -Dexec.args="--project=<gcp-project> \
    --spannerInstanceId=<instance-id> \
    --spannerDatabaseId=<database-id> \
    --inputExportDir=gs://<your-bucket>/<path-to-export-dir> \
    --destinationTimeSeriesTableName=TimeSeries_rk \
    --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk \
    --destinationStatVarObservationTableName=StatVarObservation_rk \
    --variableMeasured=Count_Person \
    --runner=DataflowRunner \
    --region=us-central1 \
    --tempLocation=gs://<your-bucket>/temp \
    --stagingLocation=gs://<your-bucket>/staging \
    --numWorkers=20 \
    --maxNumWorkers=100 \
    --workerMachineType=n2-custom-4-32768 \
    --numberOfWorkerHarnessThreads=2"
```
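A matching sketch for the `--inputFiles` case could follow the same placeholder pattern (the placeholders are illustrative assumptions, not values from this PR; `--inputFiles` accepts a file pattern per the flag's use elsewhere in this README):

```bash
mvn -Pgit-worktree compile exec:java \
  -pl timeseries-backfill -am \
  -Dexec.mainClass=org.datacommons.ingestion.timeseries.TimeseriesBackfillAvroPipeline \
  -Dexec.args="--project=<gcp-project> \
    --spannerInstanceId=<instance-id> \
    --spannerDatabaseId=<database-id> \
    --inputFiles=gs://<your-bucket>/<path-to-export-dir>/Observation.avro-* \
    --destinationTimeSeriesTableName=TimeSeries_rk \
    --destinationTimeSeriesAttributeTableName=TimeSeriesAttribute_rk \
    --destinationStatVarObservationTableName=StatVarObservation_rk \
    --runner=DataflowRunner \
    --region=us-central1 \
    --tempLocation=gs://<your-bucket>/temp \
    --stagingLocation=gs://<your-bucket>/staging"
```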

```bash
append_if_present() {
  local pattern="$1"
  local ddl="$2"
  if rg -q "${pattern}" "${current_schema_file}"; then
```

medium

The script uses rg (ripgrep), which is not a standard command-line utility and may not be installed on all systems. For better portability, please use a more common tool like grep. The -q flag in grep provides the same "quiet" functionality.

Suggested change:

```diff
-if rg -q "${pattern}" "${current_schema_file}"; then
+if grep -q "${pattern}" "${current_schema_file}"; then
```
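To illustrate the suggestion, here is a minimal self-contained sketch of how the portable `grep`-based version of `append_if_present` might read in full. The file names and the append target (`ddl_output_file`) are assumptions for illustration only; the PR shows just the function's opening lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed file names for illustration; the PR's script defines
# current_schema_file itself, and its append target is not shown.
current_schema_file="current_schema.sql"
ddl_output_file="recreate_ddl.sql"

# Append a DDL statement to the output file only if the current schema
# mentions the given pattern. grep -q exits 0 on the first match and
# suppresses output, mirroring rg -q, and is available on any POSIX system.
append_if_present() {
  local pattern="$1"
  local ddl="$2"
  if grep -q "${pattern}" "${current_schema_file}"; then
    printf '%s\n' "${ddl}" >> "${ddl_output_file}"
  fi
}
```

Since `grep -q` and `rg -q` have the same exit-status contract, the swap changes no behavior for fixed-string-like patterns; the only caveat is that ripgrep patterns using Rust-regex extensions would need adjusting for POSIX `grep`.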
