feat: Add counter-based validations and tests #1977
SandeepTuniki wants to merge 2 commits into master from
Conversation
Code Review
This pull request introduces a new 'counters' data source and associated validation rules to the import validation tool. Key changes include updating the ValidationRunner to load JSON counter reports, adding command-line flags, and implementing five new validation methods in the Validator class: zero check, max threshold, ratio threshold, sum integrity, and min yield. Feedback focuses on improving error handling by raising exceptions for malformed JSON instead of logging, and enhancing validation robustness by removing defensive default values in favor of explicit key access and numeric casting.
```python
try:
    with open(counters_report, 'r') as f:
        self.data_sources['counters'] = json.load(f)
except Exception as e:
    logging.error(
        f"JSON parse error while reading counters report at {counters_report}: {e}"
    )
```
It is recommended to raise an exception if the counters report fails to parse, rather than just logging an error. Continuing with an empty counters dictionary could lead to incorrect validation results (e.g., COUNTER_ZERO_CHECK passing when it should have failed because the data was missing). Raising a ValueError ensures the user is aware that the validation process is incomplete due to corrupted input data, enforcing the correctness of the validation process.
Suggested change:

```diff
 try:
     with open(counters_report, 'r') as f:
         self.data_sources['counters'] = json.load(f)
-except Exception as e:
-    logging.error(
-        f"JSON parse error while reading counters report at {counters_report}: {e}"
-    )
+except (json.JSONDecodeError, OSError) as e:
+    raise ValueError(
+        f"Failed to parse counters report at {counters_report}: {e}"
+    ) from e
```
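The fail-fast behavior this suggestion produces can be exercised directly; the helper name and report path below are illustrative, not from the PR:

```python
import json

def load_counters(text, source="counters.json"):
    """Parse a counters report, refusing to continue on malformed input."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # Surface corrupt input instead of validating against an empty dict
        raise ValueError(f"Failed to parse counters report at {source}: {e}") from e

print(load_counters('{"records_count": 42}'))  # {'records_count': 42}
try:
    load_counters("{not json")
except ValueError as e:
    print("rejected:", type(e).__name__)
```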
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
```python
counter_name = params['counter_name']
value = counters.get(counter_name, 0)
```
Consider casting the counter value to a numeric type (e.g., float) to ensure robustness against JSON files where counts might be represented as strings. In alignment with repository practices, we avoid using defensive default values (like 0) for missing keys to ensure that missing required data is caught as an error. This is consistent with how other validators in this class handle report data (e.g., validate_lint_errors_count).
Suggested change:

```diff
-value = counters.get(counter_name, 0)
+value = float(counters[counter_name])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
- When naming a variable for a count of items, prefer the pattern plural_noun_count (e.g., records_count) over singular_noun_counts (e.g., record_counts).
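The difference between the defensive default and the explicit-access pattern the review asks for can be seen in a toy example (the counter names here are made up):

```python
counters = {"records_count": "128"}  # JSON reports may carry counts as strings

# Defensive default: a typo'd or missing key silently reads as 0,
# so a zero check would "pass" on absent data.
masked = counters.get("record_count", 0)      # 0

# Explicit access plus cast: missing keys raise KeyError,
# and string counts are normalized to a numeric type.
value = float(counters["records_count"])      # 128.0
```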
```python
counter_name = params['counter_name']
threshold = params['threshold']
value = counters.get(counter_name, 0)
```
Consider casting the counter value to a numeric type for robustness against string values in the JSON report. Avoid using defensive default values for missing keys to ensure input correctness.
Suggested change:

```diff
-value = counters.get(counter_name, 0)
+value = float(counters[counter_name])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
```python
subset_value = counters.get(subset_counter, 0)
total_value = counters.get(total_counter, 0)
```
Consider casting the counter values to numeric types for robustness and to ensure correct division and comparison. Avoid using defensive default values for missing keys to ensure the validation fails if expected data is absent.
Suggested change:

```diff
-subset_value = counters.get(subset_counter, 0)
-total_value = counters.get(total_counter, 0)
+subset_value = float(counters[subset_counter])
+total_value = float(counters[total_counter])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
```python
constituent_sum = sum(
    counters.get(c, 0) for c in constituent_counters)
```
Consider casting the counter values to numeric types to ensure the sum and comparison operations are robust against string values in the JSON report. Avoid using defensive default values for missing keys.
Suggested change:

```diff
-constituent_sum = sum(
-    counters.get(c, 0) for c in constituent_counters)
+total_value = float(counters[total_counter])
+constituent_sum = sum(
+    float(counters[c]) for c in constituent_counters)
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
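Casting matters most in the sum-integrity check: if the JSON report stores counts as strings, summing them raises before any comparison even happens. A toy demonstration (counter names and values are invented):

```python
counters = {"a_count": "3", "b_count": "4"}

# Without casting, sum() starts from 0 and 0 + "3" is undefined:
try:
    sum(counters.get(c, 0) for c in ["a_count", "b_count"])
except TypeError:
    pass  # unsupported operand types for +: 'int' and 'str'

# Casting each value first makes the arithmetic well-defined:
constituent_sum = sum(float(counters[c]) for c in ["a_count", "b_count"])
print(constituent_sum)  # 7.0
```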
```python
counter_name = params['counter_name']
min_yield = params['min_yield']
value = counters.get(counter_name, 0)
```
Consider casting the counter value to a numeric type for robustness. Avoid using defensive default values for missing keys.
Suggested change:

```diff
-value = counters.get(counter_name, 0)
+value = float(counters[counter_name])
```
References
- When a configuration dictionary like resource_limits is provided, it is acceptable to assume it contains all required keys and allow it to fail on a KeyError if a key is missing, rather than defensively using default values. This enforces configuration correctness.
/gemini review
Code Review
This pull request introduces counter-based validation rules to the import validation tool, including functionality to load counter data from CSV files and several new validation checks such as zero-value, threshold, ratio, and sum integrity checks. The feedback identifies a flaw in the key cleaning logic that incorrectly truncates counter names containing underscores and suggests using more specific exception handling when parsing CSV files. Additionally, a logic improvement was suggested for the zero-check validator to ensure it correctly identifies any non-zero value, including negative ones.
```python
def clean_key(x):
    if not isinstance(x, str):
        return x
    if ':' in x:
        x = x.split(':', 1)[1]
    if '_' in x:
        x = x.rsplit('_', 1)[-1]
    return x
```
The clean_key logic is too aggressive and will break counter names that contain underscores. For example, a counter named records_count (following the recommended naming convention) would be incorrectly truncated to count. Additionally, rsplit('_', 1) only works if the counter name itself has no underscores. A more robust approach would be to use a regular expression to strip the specific stage prefix pattern (e.g., digit:string_).
```python
import re

def clean_key(x):
    if not isinstance(x, str):
        return x
    # Strip a stage prefix like "2:prepare_output_"
    return re.sub(r'^\d+:[^:]+_', '', x)
```

References
- When naming a variable for a count of items, prefer the pattern plural_noun_count (e.g., records_count) over singular_noun_counts (e.g., record_counts).
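The truncation described above is easy to reproduce with a standalone copy of the original clean_key (the counter names are illustrative):

```python
def clean_key(x):
    # Copy of the original implementation under review
    if not isinstance(x, str):
        return x
    if ':' in x:
        x = x.split(':', 1)[1]
    if '_' in x:
        x = x.rsplit('_', 1)[-1]
    return x

# A name following the plural_noun_count convention loses its noun:
print(clean_key("records_count"))            # count
# A stage-prefixed counter is truncated the same way:
print(clean_key("2:prepare_records_count"))  # count
```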
```python
except Exception as e:
    logging.error(
        f"CSV parse error while reading counters report at {counters_report}: {e}"
    )
```
Avoid catching broad exceptions. It is better to catch specific errors that pd.read_csv might raise, such as pd.errors.ParserError or pd.errors.EmptyDataError.
Suggested change:

```diff
-except Exception as e:
+except (pd.errors.ParserError, pd.errors.EmptyDataError) as e:
     logging.error(
         f"CSV parse error while reading counters report at {counters_report}: {e}"
     )
```
```python
counter_name = params['counter_name']
value = counters.get(counter_name, 0)

if value > 0:
```
No description provided; per the review summary, the intended fix is to compare with `value != 0` instead of `value > 0` so that negative counter values are also flagged.
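The zero-check adjustment described in the review summary (flagging negative values too) can be sketched as follows; the function names are invented for illustration:

```python
def is_nonzero_gt(value):
    # Original comparison: misses negative counter values
    return value > 0

def is_nonzero_ne(value):
    # Comparison implied by the review summary: flags any non-zero value
    return value != 0

print(is_nonzero_gt(-5))  # False -- a negative count slips through
print(is_nonzero_ne(-5))  # True  -- correctly flagged as non-zero
```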