Skip to content

feat: Add counter-based validations and tests#1977

Draft
SandeepTuniki wants to merge 2 commits intomasterfrom
counter-validations
Draft

feat: Add counter-based validations and tests#1977
SandeepTuniki wants to merge 2 commits intomasterfrom
counter-validations

Conversation

@SandeepTuniki
Copy link
Copy Markdown
Contributor

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new 'counters' data source and associated validation rules to the import validation tool. Key changes include updating the ValidationRunner to load JSON counter reports, adding command-line flags, and implementing five new validation methods in the Validator class: zero check, max threshold, ratio threshold, sum integrity, and min yield. Feedback focuses on improving error handling by raising exceptions for malformed JSON instead of logging, and enhancing validation robustness by removing defensive default values in favor of explicit key access and numeric casting.

Comment on lines +184 to +190
try:
with open(counters_report, 'r') as f:
self.data_sources['counters'] = json.load(f)
except Exception as e:
logging.error(
f"JSON parse error while reading counters report at {counters_report}: {e}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

It is recommended to raise an exception if the counters report fails to parse, rather than just logging an error. Continuing with an empty counters dictionary could lead to incorrect validation results (e.g., COUNTER_ZERO_CHECK passing when it should have failed because the data was missing). Raising a ValueError ensures the user is aware that the validation process is incomplete due to corrupted input data, enforcing the correctness of the validation process.

Suggested change
try:
with open(counters_report, 'r') as f:
self.data_sources['counters'] = json.load(f)
except Exception as e:
logging.error(
f"JSON parse error while reading counters report at {counters_report}: {e}"
)
try:
with open(counters_report, 'r') as f:
self.data_sources['counters'] = json.load(f)
except (json.JSONDecodeError, OSError) as e:
raise ValueError(
f"Failed to parse counters report at {counters_report}: {e}"
) from e
References
  1. When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.

)

counter_name = params['counter_name']
value = counters.get(counter_name, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider casting the counter value to a numeric type (e.g., float) to ensure robustness against JSON files where counts might be represented as strings. In alignment with repository practices, we avoid using defensive default values (like 0) for missing keys to ensure that missing required data is caught as an error. This is consistent with how other validators in this class handle report data (e.g., validate_lint_errors_count).

Suggested change
value = counters.get(counter_name, 0)
value = float(counters[counter_name])
References
  1. When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
  2. When naming a variable for a count of items, prefer the pattern plural_noun_count (e.g., records_count) over singular_noun_counts (e.g., record_counts).


counter_name = params['counter_name']
threshold = params['threshold']
value = counters.get(counter_name, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider casting the counter value to a numeric type for robustness against string values in the JSON report. Avoid using defensive default values for missing keys to ensure input correctness.

Suggested change
value = counters.get(counter_name, 0)
value = float(counters[counter_name])
References
  1. When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.

Comment on lines +1144 to +1145
subset_value = counters.get(subset_counter, 0)
total_value = counters.get(total_counter, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider casting the counter values to numeric types for robustness and to ensure correct division and comparison. Avoid using defensive default values for missing keys to ensure the validation fails if expected data is absent.

Suggested change
subset_value = counters.get(subset_counter, 0)
total_value = counters.get(total_counter, 0)
subset_value = float(counters[subset_counter])
total_value = float(counters[total_counter])
References
  1. When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.

Comment on lines +1205 to +1207
constituent_sum = sum(
counters.get(c, 0) for c in constituent_counters)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider casting the counter values to numeric types to ensure the sum and comparison operations are robust against string values in the JSON report. Avoid using defensive default values for missing keys.

Suggested change
constituent_sum = sum(
counters.get(c, 0) for c in constituent_counters)
total_value = float(counters[total_counter])
constituent_sum = sum(
float(counters[c]) for c in constituent_counters)
References
  1. When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.


counter_name = params['counter_name']
min_yield = params['min_yield']
value = counters.get(counter_name, 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Consider casting the counter value to a numeric type for robustness. Avoid using defensive default values for missing keys.

Suggested change
value = counters.get(counter_name, 0)
value = float(counters[counter_name])
References
  1. When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.

@SandeepTuniki
Copy link
Copy Markdown
Contributor Author

/gemini review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces counter-based validation rules to the import validation tool, including functionality to load counter data from CSV files and several new validation checks such as zero-value, threshold, ratio, and sum integrity checks. The feedback identifies a flaw in the key cleaning logic that incorrectly truncates counter names containing underscores and suggests using more specific exception handling when parsing CSV files. Additionally, a logic improvement was suggested for the zero-check validator to ensure it correctly identifies any non-zero value, including negative ones.

Comment on lines +191 to +198
def clean_key(x):
if not isinstance(x, str):
return x
if ':' in x:
x = x.split(':', 1)[1]
if '_' in x:
x = x.rsplit('_', 1)[-1]
return x
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The clean_key logic is too aggressive and will break counter names that contain underscores. For example, a counter named records_count (following the recommended naming convention) would be incorrectly truncated to count. Additionally, rsplit('_', 1) only works if the counter name itself has no underscores. A more robust approach would be to use a regular expression to strip the specific stage prefix pattern (e.g., digit:string_).

                def clean_key(x):
                    if not isinstance(x, str):
                        return x
                    import re
                    # Strip stage prefix like "2:prepare_output_"
                    return re.sub(r'^\d+:[^:]+_', '', x)
References
  1. When naming a variable for a count of items, prefer the pattern plural_noun_count (e.g., records_count) over singular_noun_counts (e.g., record_counts).

Comment on lines +205 to +208
except Exception as e:
logging.error(
f"CSV parse error while reading counters report at {counters_report}: {e}"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid catching broad exceptions. It is better to catch specific errors that pd.read_csv might raise, such as pd.errors.ParserError or pd.errors.EmptyDataError.

Suggested change
except Exception as e:
logging.error(
f"CSV parse error while reading counters report at {counters_report}: {e}"
)
except pd.errors.ParserError as e:
logging.error(
f"CSV parse error while reading counters report at {counters_report}: {e}"
)

counter_name = params['counter_name']
value = counters.get(counter_name, 0)

if value > 0:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check value > 0 might pass if a counter somehow becomes negative, which would still violate the 'strictly zero' requirement. Using value != 0 is safer and more accurate for a zero check.

Suggested change
if value > 0:
if value != 0:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant