Skip to content

Add CI validation for harbor tasks#11

Open
gb-vmax wants to merge 1 commit intoVmaxAI:mainfrom
gb-vmax:ci/harbor-validation
Open

Add CI validation for harbor tasks#11
gb-vmax wants to merge 1 commit intoVmaxAI:mainfrom
gb-vmax:ci/harbor-validation

Conversation

@gb-vmax
Copy link
Copy Markdown

@gb-vmax gb-vmax commented Feb 26, 2026

Summary

  • Tier 1 (lint): Fast schema validation (~1s for 1241 tasks) — checks required files, task.toml structure, Dockerfile FROM, shebangs, instruction.md non-empty. Runs on every PR.
  • Tier 2 (docker-build): Verifies Dockerfiles actually build for changed tasks. Runs on PRs.
  • Tier 3 (verify-solutions): Builds Docker image, runs solution, runs tests, checks reward=1. Uses a checksum cache (.validated-solutions.json) so each task only needs to pass once — re-runs only if the task's files change. Runs on push to main or manual trigger.

Scripts are also usable locally:

# Lint all tasks
python scripts/validate_harbor_task.py data/

# Dry-run to see what needs verification
python scripts/verify_solutions.py --dry-run data/

# Verify specific tasks
python scripts/verify_solutions.py --platform linux/amd64 data/some_task_id

Test plan

  • Linter validates all 1241 existing tasks successfully
  • Linter catches errors on deliberately malformed tasks (missing files, bad toml, no shebang, etc.)
  • Solution verification builds Docker, runs solution, checks reward=1 (tested on chalk tasks)
  • Checksum cache correctly skips already-verified tasks on re-run
  • Verify GitHub Actions workflow triggers correctly on PR with data/ changes

🤖 Generated with Claude Code


What changed?

Added GitHub Actions workflow (.github/workflows/validate-tasks.yml):

  • Tier 1 (lint): Fast schema validation running on every PR that touches data/ — validates task.toml structure, required files, Dockerfile syntax, shebang presence
  • Tier 2 (docker-build): Builds Dockerfiles for changed tasks on PRs to catch build errors early
  • Tier 3 (verify-solutions): Full integration testing (build + run solution + check reward=1) that runs on push to main or manual trigger, with checksum-based caching to skip already-verified tasks

Added validation script (scripts/validate_harbor_task.py):

  • Checks all required files exist (task.toml, instruction.md, Dockerfile, solve.sh, test.sh)
  • Validates task.toml structure (verifier/agent/environment sections, timeout_sec, cpus, memory fields)
  • Ensures Dockerfile starts with FROM instruction
  • Verifies shell scripts have shebangs
  • Supports JSON output and verbose mode

Added solution verification script (scripts/verify_solutions.py):

  • Builds Docker image for each task
  • Runs solution script inside container
  • Executes tests and verifies reward.txt equals 1
  • Maintains .validated-solutions.json with file checksums to skip unchanged tasks
  • Supports parallel execution (--jobs flag), platform selection (--platform), and forced re-verification (--force)

Validation

  • Linter validates all 1241 existing tasks successfully
  • Linter catches errors on deliberately malformed tasks (missing files, bad toml, no shebang, etc.)
  • Solution verification builds Docker, runs solution, checks reward=1 (tested on chalk tasks)
  • Checksum cache correctly skips already-verified tasks on re-run
  • Verify GitHub Actions workflow triggers correctly on PR with data/ changes

Description generated by Mesa. Update settings

Three-tier validation pipeline:
- Tier 1 (lint): schema validation of task.toml, required files, Dockerfile, shebangs
- Tier 2 (docker-build): verify Dockerfiles actually build for changed tasks
- Tier 3 (verify-solutions): build + run solutions, check reward=1, with checksum
  cache so each task only needs to pass once

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@mesa-dot-dev mesa-dot-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performed full review of 3542d8d...b076f0a

Analysis

• Race condition in checksum file updates: multiple threads can corrupt JSON during parallel solution verification. Must implement threading lock or collect all results before writing once at completion.

• Git push in PR workflows will fail for fork-based PRs due to permissions. Restrict checksum commits to main branch pushes only, or use a bot account/artifacts mechanism for PR workflows.

• Change detection logic using cut -d'/' -f1-2 is fragile and assumes rigid directory structure. Will break if task hierarchy changes; migrate to robust walk-up approach from changed files to find task.toml.

• Docker tag sanitization can create collisions between tasks with similar names post-character-replacement. Incorporate hash into tag for uniqueness guarantee.

Tip

Help

Slash Commands:

  • /review - Request a full code review
  • /review latest - Review only changes since the last review
  • /describe - Generate PR description. This will update the PR body or issue comment depending on your configuration
  • /help - Get help with Mesa commands and configuration options

3 files reviewed | 8 comments | Edit Agent SettingsRead Docs

else:
info(f"task.toml has {section}")
else:
try:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium

Multiple bare try/except blocks are used throughout this file (lines 78-83, 81-83). Consider allowing specific exceptions to propagate or handling them more explicitly. Catching all exceptions with bare except Exception can hide bugs and make debugging difficult. For example, if the TOML file has permission issues vs. syntax errors, these should be handled differently.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: scripts/validate_harbor_task.py#L78
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
Multiple bare try/except blocks are used throughout this file (lines 78-83, 81-83). Consider allowing specific exceptions to propagate or handling them more explicitly. Catching all exceptions with bare `except Exception` can hide bugs and make debugging difficult. For example, if the TOML file has permission issues vs. syntax errors, these should be handled differently.

# Environment checks
env = toml_data.get("environment", {})
if "cpus" not in env:
err("task.toml [environment] missing cpus")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low

The validation checks for either memory or memory_mb in the environment section, but doesn't validate which format is actually expected by the harbor specification. Consider documenting which field is canonical or validating that only one is present to avoid ambiguity.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: scripts/validate_harbor_task.py#L118
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
The validation checks for either `memory` or `memory_mb` in the environment section, but doesn't validate which format is actually expected by the harbor specification. Consider documenting which field is canonical or validating that only one is present to avoid ambiguity.

CHECKSUM_FILE = ".validated-solutions.json"


def compute_task_hash(task_dir: str) -> str:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium

The compute_task_hash() function reads all files in a task directory into memory for hashing. For tasks with large binary files or embedded tarballs (mentioned in validate_harbor_task.py line 165), this could cause memory issues. Consider streaming files in chunks for hashing instead of filepath.read_bytes().

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: scripts/verify_solutions.py#L36
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
The `compute_task_hash()` function reads all files in a task directory into memory for hashing. For tasks with large binary files or embedded tarballs (mentioned in validate_harbor_task.py line 165), this could cause memory issues. Consider streaming files in chunks for hashing instead of `filepath.read_bytes()`.

"""
task_id = os.path.basename(task_dir)
# Docker tags: [a-z0-9] with single separators, no slashes (local image)
safe_tag = re.sub(r'[^a-z0-9]', '-', task_id.lower())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium

The Docker tag sanitization uses a simple regex replacement that could result in collisions. For example, task__ABC and task.ABC would both become hci-task-abc. Consider including a hash of the original task_id in the tag to guarantee uniqueness, or validate that the sanitized tag is unique.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: scripts/verify_solutions.py#L88
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
The Docker tag sanitization uses a simple regex replacement that could result in collisions. For example, `task__ABC` and `task.ABC` would both become `hci-task-abc`. Consider including a hash of the original task_id in the tag to guarantee uniqueness, or validate that the sanitized tag is unique.


if success:
results["passed"].append(tid)
checksums[tid] = current_hash
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High

The checksum file is saved after each successful verification (line 266), which is good for incremental progress. However, this creates a race condition in parallel execution - multiple threads could try to update the same file simultaneously. Consider using a lock or collecting all updates and saving once at the end.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: scripts/verify_solutions.py#L264
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
The checksum file is saved after each successful verification (line 266), which is good for incremental progress. However, this creates a race condition in parallel execution - multiple threads could try to update the same file simultaneously. Consider using a lock or collecting all updates and saving once at the end.

"bash", "/runner.sh",
]
result = subprocess.run(
run_cmd,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low

The timeout for subprocess.run is set to the task timeout (default 600s), but this doesn't account for the time already spent in Docker build (which has its own 300s timeout). A task could theoretically take up to 900s total. Consider if this is intentional or if the timeouts should be managed differently.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: scripts/verify_solutions.py#L166
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
The timeout for `subprocess.run` is set to the task timeout (default 600s), but this doesn't account for the time already spent in Docker build (which has its own 300s timeout). A task could theoretically take up to 900s total. Consider if this is intentional or if the timeouts should be managed differently.

fi

# Find changed task directories
CHANGED=$(git diff --name-only "$BASE" "$HEAD" -- 'data/' \
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium

The change detection uses git diff --name-only and extracts the first two path components with cut -d'/' -f1-2. This will break for task directories nested deeper than data/task_id/ or if the data directory structure changes. Consider using a more robust method to identify task directories, such as finding all changed files and then identifying their parent task directory by looking for task.toml.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: .github/workflows/validate-tasks.yml#L46
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
The change detection uses `git diff --name-only` and extracts the first two path components with `cut -d'/' -f1-2`. This will break for task directories nested deeper than `data/task_id/` or if the data directory structure changes. Consider using a more robust method to identify task directories, such as finding all changed files and then identifying their parent task directory by looking for `task.toml`.

- name: Detect changed task directories
id: detect
run: |
if [ "${{ github.event_name }}" = "pull_request" ]; then
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium

For pull requests from forks, github.event.pull_request.base.sha may not be available or may point to an outdated commit if the base branch has moved. This could cause incorrect change detection. Consider using github.event.pull_request.base.ref and fetching the latest commit from that ref, or using the tj-actions/changed-files action which handles these edge cases.

Fix in Cursor • Fix in Claude

Prompt for Agent
Task: Address review feedback left on GitHub.
Repository: VmaxAI/tasks#11
File: .github/workflows/validate-tasks.yml#L36
Action: Open this file location in your editor, inspect the highlighted code, and resolve the issue described below.

Feedback:
For pull requests from forks, `github.event.pull_request.base.sha` may not be available or may point to an outdated commit if the base branch has moved. This could cause incorrect change detection. Consider using `github.event.pull_request.base.ref` and fetching the latest commit from that ref, or using the `tj-actions/changed-files` action which handles these edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant