Add CI validation for harbor tasks by gb-vmax · Pull Request #11 · VmaxAI/tasks

gb-vmax · 2026-02-26T01:10:15Z

Summary

Tier 1 (lint): Fast schema validation (~1s for 1241 tasks) — checks required files, task.toml structure, Dockerfile FROM, shebangs, instruction.md non-empty. Runs on every PR.
Tier 2 (docker-build): Verifies Dockerfiles actually build for changed tasks. Runs on PRs.
Tier 3 (verify-solutions): Builds Docker image, runs solution, runs tests, checks reward=1. Uses a checksum cache (.validated-solutions.json) so each task only needs to pass once — re-runs only if the task's files change. Runs on push to main or manual trigger.

Scripts are also usable locally:

# Lint all tasks
python scripts/validate_harbor_task.py data/

# Dry-run to see what needs verification
python scripts/verify_solutions.py --dry-run data/

# Verify specific tasks
python scripts/verify_solutions.py --platform linux/amd64 data/some_task_id

Test plan

Linter validates all 1241 existing tasks successfully
Linter catches errors on deliberately malformed tasks (missing files, bad toml, no shebang, etc.)
Solution verification builds Docker, runs solution, checks reward=1 (tested on chalk tasks)
Checksum cache correctly skips already-verified tasks on re-run
Verify GitHub Actions workflow triggers correctly on PR with data/ changes

🤖 Generated with Claude Code

What changed?

Added GitHub Actions workflow (.github/workflows/validate-tasks.yml):

Tier 1 (lint): Fast schema validation running on every PR that touches data/ — validates task.toml structure, required files, Dockerfile syntax, shebang presence
Tier 2 (docker-build): Builds Dockerfiles for changed tasks on PRs to catch build errors early
Tier 3 (verify-solutions): Full integration testing (build + run solution + check reward=1) that runs on push to main or manual trigger, with checksum-based caching to skip already-verified tasks

Added validation script (scripts/validate_harbor_task.py):

Checks all required files exist (task.toml, instruction.md, Dockerfile, solve.sh, test.sh)
Validates task.toml structure (verifier/agent/environment sections, timeout_sec, cpus, memory fields)
Ensures Dockerfile starts with FROM instruction
Verifies shell scripts have shebangs
Supports JSON output and verbose mode

Added solution verification script (scripts/verify_solutions.py):

Builds Docker image for each task
Runs solution script inside container
Executes tests and verifies reward.txt equals 1
Maintains .validated-solutions.json with file checksums to skip unchanged tasks
Supports parallel execution (--jobs flag), platform selection (--platform), and forced re-verification (--force)
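
The skip decision behind the checksum cache can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `needs_verification` and the shape of the `checksums` mapping (task id to stored hash) are assumptions.

```python
def needs_verification(
    task_id: str,
    current_hash: str,
    checksums: dict[str, str],
    force: bool = False,
) -> bool:
    """Return True when a task must be (re-)verified.

    A task is skipped only if its stored hash matches the hash of its
    current files and re-verification was not forced.
    """
    if force:
        return True
    return checksums.get(task_id) != current_hash
```

With this shape, a task that has never passed has no stored hash and is always verified, while `--force` would bypass the cache entirely.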



Three-tier validation pipeline:
- Tier 1 (lint): schema validation of task.toml, required files, Dockerfile, shebangs
- Tier 2 (docker-build): verify Dockerfiles actually build for changed tasks
- Tier 3 (verify-solutions): build + run solutions, check reward=1, with checksum cache so each task only needs to pass once

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mesa-dot-dev

Performed full review of 3542d8d...b076f0a

Analysis

• Race condition in checksum file updates: multiple threads can corrupt JSON during parallel solution verification. Must implement threading lock or collect all results before writing once at completion.

• Git push in PR workflows will fail for fork-based PRs due to permissions. Restrict checksum commits to main branch pushes only, or use a bot account/artifacts mechanism for PR workflows.

• Change detection logic using cut -d'/' -f1-2 is fragile and assumes rigid directory structure. Will break if task hierarchy changes; migrate to robust walk-up approach from changed files to find task.toml.

• Docker tag sanitization can create collisions between tasks with similar names post-character-replacement. Incorporate hash into tag for uniqueness guarantee.

3 files reviewed | 8 comments

mesa-dot-dev · 2026-02-26T01:14:06Z

scripts/validate_harbor_task.py

+                    else:
+                        info(f"task.toml has {section}")
+        else:
+            try:


Multiple broad try/except blocks are used throughout this file (lines 78-83, 81-83). Consider allowing specific exceptions to propagate or handling them more explicitly. Catching everything with except Exception can hide bugs and make debugging difficult. For example, a TOML file with permission issues and one with syntax errors should be handled differently.

Location: scripts/validate_harbor_task.py#L78

mesa-dot-dev · 2026-02-26T01:14:07Z

scripts/validate_harbor_task.py

+                # Environment checks
+                env = toml_data.get("environment", {})
+                if "cpus" not in env:
+                    err("task.toml [environment] missing cpus")


The validation checks for either memory or memory_mb in the environment section, but doesn't validate which format is actually expected by the harbor specification. Consider documenting which field is canonical or validating that only one is present to avoid ambiguity.
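
A minimal sketch of the "only one is present" option follows. It does not settle which field the harbor specification treats as canonical; the function name `check_memory_field` and the error wording are illustrative assumptions.

```python
def check_memory_field(env: dict) -> list[str]:
    """Return error messages for the [environment] memory setting.

    Accepts either key, but rejects a task that sets both or neither.
    """
    present = [key for key in ("memory", "memory_mb") if key in env]
    if not present:
        return ["task.toml [environment] missing memory or memory_mb"]
    if len(present) > 1:
        return ["task.toml [environment] sets both memory and memory_mb; use one"]
    return []
```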

Location: scripts/validate_harbor_task.py#L118

mesa-dot-dev · 2026-02-26T01:14:08Z

scripts/verify_solutions.py

+CHECKSUM_FILE = ".validated-solutions.json"
+
+
+def compute_task_hash(task_dir: str) -> str:


The compute_task_hash() function reads all files in a task directory into memory for hashing. For tasks with large binary files or embedded tarballs (mentioned in validate_harbor_task.py line 165), this could cause memory issues. Consider streaming files in chunks for hashing instead of filepath.read_bytes().
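
The chunked approach suggested here could look roughly like this. It is a sketch under assumptions: SHA-256 as the digest, a 1 MiB chunk size, and relative paths mixed into the hash. The name `hash_task_dir_streaming` is hypothetical, and the output is not meant to match hashes produced by the existing function.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary choice


def hash_task_dir_streaming(task_dir: str) -> str:
    """Hash every file under task_dir without loading whole files into memory."""
    digest = hashlib.sha256()
    root = Path(task_dir)
    # Sort so the result does not depend on directory iteration order.
    for filepath in sorted(p for p in root.rglob("*") if p.is_file()):
        # Mix in the relative path so renames also change the hash.
        digest.update(filepath.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        with filepath.open("rb") as fh:
            # Read fixed-size chunks until read() returns an empty bytes object.
            for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Peak memory then stays near the chunk size regardless of how large an embedded tarball is.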

Location: scripts/verify_solutions.py#L36

mesa-dot-dev · 2026-02-26T01:14:09Z

scripts/verify_solutions.py

+    """
+    task_id = os.path.basename(task_dir)
+    # Docker tags: [a-z0-9] with single separators, no slashes (local image)
+    safe_tag = re.sub(r'[^a-z0-9]', '-', task_id.lower())


The Docker tag sanitization uses a simple regex replacement that could result in collisions. For example, task__ABC and task.ABC would both become hci-task-abc. Consider including a hash of the original task_id in the tag to guarantee uniqueness, or validate that the sanitized tag is unique.
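
A possible shape for the hash-suffix idea is shown below. The `hci` prefix is taken from the example in this comment; the function name `make_image_tag` and the 8-character suffix are assumptions. A truncated hash makes collisions very unlikely rather than strictly impossible.

```python
import hashlib
import re


def make_image_tag(task_id: str, prefix: str = "hci") -> str:
    """Build a readable, Docker-safe tag that stays distinct per original task_id."""
    # Collapse every run of disallowed characters into a single hyphen.
    safe = re.sub(r"[^a-z0-9]+", "-", task_id.lower()).strip("-") or "task"
    # The suffix is derived from the unsanitized id, so ids that sanitize
    # to the same text still end up with different tags.
    suffix = hashlib.sha256(task_id.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{safe}-{suffix}"
```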

Location: scripts/verify_solutions.py#L88

mesa-dot-dev · 2026-02-26T01:14:10Z

scripts/verify_solutions.py

+
+            if success:
+                results["passed"].append(tid)
+                checksums[tid] = current_hash


The checksum file is saved after each successful verification (line 266), which is good for incremental progress. However, this creates a race condition in parallel execution - multiple threads could try to update the same file simultaneously. Consider using a lock or collecting all updates and saving once at the end.
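
The lock-based option might be sketched as follows, keeping the incremental saves. The helper name `record_checksum` is hypothetical, and writing through a temporary file plus `os.replace` is an extra assumption added so that a crash mid-write cannot leave a truncated JSON file.

```python
import json
import os
import tempfile
import threading

_checksum_lock = threading.Lock()


def record_checksum(checksums: dict, task_id: str, task_hash: str, path: str) -> None:
    """Update the shared mapping and persist it, one thread at a time."""
    with _checksum_lock:
        checksums[task_id] = task_hash
        directory = os.path.dirname(os.path.abspath(path))
        # Write to a temp file in the same directory, then swap it into place.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(checksums, fh, indent=2, sort_keys=True)
        os.replace(tmp_path, path)
```

Holding the lock across both the dictionary update and the write means the file always reflects a consistent snapshot, at the cost of serializing saves.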

Location: scripts/verify_solutions.py#L264

mesa-dot-dev · 2026-02-26T01:14:11Z

scripts/verify_solutions.py

+                "bash", "/runner.sh",
+            ]
+            result = subprocess.run(
+                run_cmd,


The timeout for subprocess.run is set to the task timeout (default 600s), but this doesn't account for the time already spent in Docker build (which has its own 300s timeout). A task could theoretically take up to 900s total. Consider if this is intentional or if the timeouts should be managed differently.
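
If a single overall budget is wanted, one option is a shared deadline that each step draws from. This is only a sketch of that alternative, not a statement of what the script should do; `Deadline` and `run_with_budget` are hypothetical names.

```python
import subprocess
import sys
import time


class Deadline:
    """Tracks how much of a single time budget is left."""

    def __init__(self, total_sec: float) -> None:
        self._end = time.monotonic() + total_sec

    def remaining(self) -> float:
        return max(0.0, self._end - time.monotonic())


def run_with_budget(commands: list[list[str]], total_sec: float) -> list[int]:
    """Run commands in order; together they may not exceed total_sec."""
    deadline = Deadline(total_sec)
    codes = []
    for cmd in commands:
        # Each step only gets whatever the previous steps left over.
        result = subprocess.run(cmd, timeout=deadline.remaining())
        codes.append(result.returncode)
    return codes
```

Here a slow build would shrink the time left for the run step, and `subprocess.TimeoutExpired` is raised once the shared budget is used up.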

Location: scripts/verify_solutions.py#L166

mesa-dot-dev · 2026-02-26T01:14:12Z

.github/workflows/validate-tasks.yml

+          fi
+
+          # Find changed task directories
+          CHANGED=$(git diff --name-only "$BASE" "$HEAD" -- 'data/' \


The change detection uses git diff --name-only and extracts the first two path components with cut -d'/' -f1-2. This will break for task directories nested deeper than data/task_id/ or if the data directory structure changes. Consider using a more robust method to identify task directories, such as finding all changed files and then identifying their parent task directory by looking for task.toml.

Location: .github/workflows/validate-tasks.yml#L46

mesa-dot-dev · 2026-02-26T01:14:13Z

.github/workflows/validate-tasks.yml

+      - name: Detect changed task directories
+        id: detect
+        run: |
+          if [ "${{ github.event_name }}" = "pull_request" ]; then


For pull requests from forks, github.event.pull_request.base.sha may not be available or may point to an outdated commit if the base branch has moved. This could cause incorrect change detection. Consider using github.event.pull_request.base.ref and fetching the latest commit from that ref, or using the tj-actions/changed-files action which handles these edge cases.

Location: .github/workflows/validate-tasks.yml#L36

gb-vmax wants to merge 1 commit into VmaxAI:main from gb-vmax:ci/harbor-validation
