feat: Add counter-based validations and tests #1977
SandeepTuniki wants to merge 2 commits into master from
Conversation
Code Review
This pull request introduces a new 'counters' data source and associated validation rules to the import validation tool. Key changes include updating the ValidationRunner to load JSON counter reports, adding command-line flags, and implementing five new validation methods in the Validator class: zero check, max threshold, ratio threshold, sum integrity, and min yield. Feedback focuses on improving error handling by raising exceptions for malformed JSON instead of logging, and enhancing validation robustness by removing defensive default values in favor of explicit key access and numeric casting.
```python
try:
    with open(counters_report, 'r') as f:
        self.data_sources['counters'] = json.load(f)
except Exception as e:
    logging.error(
        f"JSON parse error while reading counters report at {counters_report}: {e}"
    )
```
It is recommended to raise an exception if the counters report fails to parse, rather than just logging an error. Continuing with an empty counters dictionary could lead to incorrect validation results (e.g., COUNTER_ZERO_CHECK passing when it should have failed because the data was missing). Raising a ValueError ensures the user is aware that the validation process is incomplete due to corrupted input data, enforcing the correctness of the validation process.
Suggested change:

```diff
 try:
     with open(counters_report, 'r') as f:
         self.data_sources['counters'] = json.load(f)
-except Exception as e:
-    logging.error(
-        f"JSON parse error while reading counters report at {counters_report}: {e}"
-    )
+except (json.JSONDecodeError, OSError) as e:
+    raise ValueError(
+        f"Failed to parse counters report at {counters_report}: {e}"
+    ) from e
```
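The fail-fast behavior this suggestion produces can be exercised directly; the helper name and report path below are illustrative, not from the PR:

```python
import json

def load_counters(text, source="counters.json"):
    """Parse a counters report, refusing to continue on malformed input."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # Surface corrupt input instead of validating against an empty dict
        raise ValueError(f"Failed to parse counters report at {source}: {e}") from e

print(load_counters('{"records_count": 42}'))  # {'records_count': 42}
try:
    load_counters("{not json")
except ValueError as e:
    print("rejected:", type(e).__name__)
```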
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
```python
counter_name = params['counter_name']
value = counters.get(counter_name, 0)
```
Consider casting the counter value to a numeric type (e.g., float) to ensure robustness against JSON files where counts might be represented as strings. In alignment with repository practices, we avoid using defensive default values (like 0) for missing keys to ensure that missing required data is caught as an error. This is consistent with how other validators in this class handle report data (e.g., validate_lint_errors_count).
Suggested change:

```diff
-value = counters.get(counter_name, 0)
+value = float(counters[counter_name])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
- When naming a variable for a count of items, prefer the pattern plural_noun_count (e.g., records_count) over singular_noun_counts (e.g., record_counts).
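The difference between the defensive default and the explicit-access pattern the review asks for can be seen in a toy example (the counter names here are made up):

```python
counters = {"records_count": "128"}  # JSON reports may carry counts as strings

# Defensive default: a typo'd or missing key silently reads as 0,
# so a zero check would "pass" on absent data.
masked = counters.get("record_count", 0)      # 0

# Explicit access plus cast: missing keys raise KeyError,
# and string counts are normalized to a numeric type.
value = float(counters["records_count"])      # 128.0
```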
```python
counter_name = params['counter_name']
threshold = params['threshold']
value = counters.get(counter_name, 0)
```
Consider casting the counter value to a numeric type for robustness against string values in the JSON report. Avoid using defensive default values for missing keys to ensure input correctness.
Suggested change:

```diff
-value = counters.get(counter_name, 0)
+value = float(counters[counter_name])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
```python
subset_value = counters.get(subset_counter, 0)
total_value = counters.get(total_counter, 0)
```
Consider casting the counter values to numeric types for robustness and to ensure correct division and comparison. Avoid using defensive default values for missing keys to ensure the validation fails if expected data is absent.
Suggested change:

```diff
-subset_value = counters.get(subset_counter, 0)
-total_value = counters.get(total_counter, 0)
+subset_value = float(counters[subset_counter])
+total_value = float(counters[total_counter])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
```python
constituent_sum = sum(
    counters.get(c, 0) for c in constituent_counters)
```
Consider casting the counter values to numeric types to ensure the sum and comparison operations are robust against string values in the JSON report. Avoid using defensive default values for missing keys.
Suggested change:

```diff
-constituent_sum = sum(
-    counters.get(c, 0) for c in constituent_counters)
+total_value = float(counters[total_counter])
+constituent_sum = sum(
+    float(counters[c]) for c in constituent_counters)
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
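Casting matters most in the sum-integrity check: if the JSON report stores counts as strings, summing them raises before any comparison even happens. A toy demonstration (counter names and values are invented):

```python
counters = {"a_count": "3", "b_count": "4"}

# Without casting, sum() starts from 0 and 0 + "3" is undefined:
try:
    sum(counters.get(c, 0) for c in ["a_count", "b_count"])
except TypeError:
    pass  # unsupported operand types for +: 'int' and 'str'

# Casting each value first makes the arithmetic well-defined:
constituent_sum = sum(float(counters[c]) for c in ["a_count", "b_count"])
print(constituent_sum)  # 7.0
```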
```python
counter_name = params['counter_name']
min_yield = params['min_yield']
value = counters.get(counter_name, 0)
```
Consider casting the counter value to a numeric type for robustness. Avoid using defensive default values for missing keys.
Suggested change:

```diff
-value = counters.get(counter_name, 0)
+value = float(counters[counter_name])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
/gemini review
Code Review
This pull request introduces counter-based validation rules to the import validation tool, including functionality to load counter data from CSV files and several new validation checks such as zero-value, threshold, ratio, and sum integrity checks. The feedback identifies a flaw in the key cleaning logic that incorrectly truncates counter names containing underscores and suggests using more specific exception handling when parsing CSV files. Additionally, a logic improvement was suggested for the zero-check validator to ensure it correctly identifies any non-zero value, including negative ones.
```python
def clean_key(x):
    if not isinstance(x, str):
        return x
    if ':' in x:
        x = x.split(':', 1)[1]
    if '_' in x:
        x = x.rsplit('_', 1)[-1]
    return x
```
The clean_key logic is too aggressive and will break counter names that contain underscores. For example, a counter named records_count (following the recommended naming convention) would be incorrectly truncated to count. Additionally, rsplit('_', 1) only works if the counter name itself has no underscores. A more robust approach would be to use a regular expression to strip the specific stage prefix pattern (e.g., digit:string_).
```python
import re

def clean_key(x):
    if not isinstance(x, str):
        return x
    # Strip a stage prefix like "2:prepare_output_"
    return re.sub(r'^\d+:[^:]+_', '', x)
```

References
- When naming a variable for a count of items, prefer the pattern plural_noun_count (e.g., records_count) over singular_noun_counts (e.g., record_counts).
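The truncation described above is easy to reproduce with a standalone copy of the original clean_key (the counter names are illustrative):

```python
def clean_key(x):
    # Copy of the original implementation under review
    if not isinstance(x, str):
        return x
    if ':' in x:
        x = x.split(':', 1)[1]
    if '_' in x:
        x = x.rsplit('_', 1)[-1]
    return x

# A name following the plural_noun_count convention loses its noun:
print(clean_key("records_count"))            # count
# A stage-prefixed counter is truncated the same way:
print(clean_key("2:prepare_records_count"))  # count
```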
```python
except Exception as e:
    logging.error(
        f"CSV parse error while reading counters report at {counters_report}: {e}"
    )
```
Avoid catching broad exceptions. It is better to catch specific errors that pd.read_csv might raise, such as pd.errors.ParserError or pd.errors.EmptyDataError.
Suggested change:

```diff
-except Exception as e:
+except (pd.errors.ParserError, pd.errors.EmptyDataError) as e:
     logging.error(
         f"CSV parse error while reading counters report at {counters_report}: {e}"
     )
```
```python
counter_name = params['counter_name']
value = counters.get(counter_name, 0)

if value > 0:
```
No description provided; per the review summary, the intended fix is to compare with `value != 0` instead of `value > 0` so that negative counter values are also flagged.
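The zero-check adjustment described in the review summary (flagging negative values too) can be sketched as follows; the function names are invented for illustration:

```python
def is_nonzero_gt(value):
    # Original comparison: misses negative counter values
    return value > 0

def is_nonzero_ne(value):
    # Comparison implied by the review summary: flags any non-zero value
    return value != 0

print(is_nonzero_gt(-5))  # False -- a negative count slips through
print(is_nonzero_ne(-5))  # True  -- correctly flagged as non-zero
```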