Skip to content

Fix flaky TestConcurrentDeletes e2e test#1001

Merged
efiacor merged 1 commit into
kptdev:mainfrom
Nordix:fix-concurrent-deletes-e2e
May 20, 2026
Merged

Fix flaky TestConcurrentDeletes e2e test#1001
efiacor merged 1 commit into
kptdev:mainfrom
Nordix:fix-concurrent-deletes-e2e

Conversation

@liamfallon
Copy link
Copy Markdown
Collaborator

Fix flaky TestConcurrentDeletes e2e test


Description

  • What changed: The TestConcurrentDeletes e2e test fails often
  • Why it’s needed: It breaks the CI with false positivies
  • How it works: The test check is relaxed to check for "another request is already in progress" or "not found"

Type of Change

  • Bug fix
  • New feature
  • Enhancement
  • Refactor
  • Documentation
  • Tests
  • Other: ________

Checklist

  • Code follows project style guidelines
  • Self-reviewed changes
  • Tests added/updated
  • Documentation added/updated
  • All tests and gating checks pass

AI Disclosure

  • I have used AI in the creation of this PR.

I used the solution suggested by CoPilot in Github when I asked it to "fix with Copilot"

Copilot dialogue transcript

The failure is in TestE2E/TestConcurrentDeletes in test/e2e/api/concurrent_test.go at #L343-L349, triggered by the E2E workflow job definition in .github/workflows/porch-e2e-ci-jobs.yaml#L175-L176.
What failed

The test currently assumes that concurrent deletes must produce:

at least one success, and
at least one error containing "another request is already in progress"

But the actual results in the log were:

nil
Internal error occurred: sql: no rows in result set
packagerevisions.porch.kpt.dev "... not found"
package ...

So one request succeeded, but the losing requests no longer consistently return the expected conflict string. That makes the test too strict and flaky against the current backend behavior, especially in the DB-cache path.
Best fix

Make the delete concurrency assertion accept the set of valid losing outcomes for concurrent delete:

conflict / in-progress
not found
sql no rows in result set

This keeps the test focused on the real invariant: exactly one delete wins, and the object is gone at the end.
Suggested code change

Update assertConcurrentResults to allow operation-specific acceptable errors, or special-case delete.

Example minimal patch:
Go

func (t *PorchSuite) assertConcurrentResults(results []any, operation string) {
assert.Contains(t, results, nil, "expected one %s request to succeed, but did not happen - results: %v", operation, results)

validFailurePresent := slices.ContainsFunc(results, func(eachResult any) bool {
	if eachResult == nil {
		return false
	}
	err := eachResult.(error).Error()

	if strings.Contains(err, conflictErrorMessage) {
		return true
	}

	if operation == "delete" {
		return strings.Contains(err, "not found") ||
			strings.Contains(err, "sql: no rows in result set")
	}

	return false
})

assert.True(t, validFailurePresent, "expected one %s request to fail with an expected concurrent outcome, but did not happen - results: %v", operation, results)

}

Better version

If you want cleaner semantics, split delete into its own assertion:
Go

func (t *PorchSuite) assertConcurrentDeleteResults(results []any) {
assert.Contains(t, results, nil, "expected one delete request to succeed, but did not happen - results: %v", results)

validFailurePresent := slices.ContainsFunc(results, func(eachResult any) bool {
	if eachResult == nil {
		return false
	}
	err := eachResult.(error).Error()
	return strings.Contains(err, conflictErrorMessage) ||
		strings.Contains(err, "not found") ||
		strings.Contains(err, "sql: no rows in result set")
})

assert.True(t, validFailurePresent, "expected losing delete requests to fail with conflict/not-found semantics - results: %v", results)

}

Then change TestConcurrentDeletes from:
Go

t.assertConcurrentResults(results, "delete")

to:
Go

t.assertConcurrentDeleteResults(results)

Why this is the right solution

For concurrent delete, once one request removes the object, other in-flight requests may observe:

a lock/conflict,
the row already missing in storage,
the Kubernetes resource already gone.

Those are all valid race outcomes. Requiring only one exact error string makes the test assert implementation details instead of behavior.
Files to update

Failing test: [test/e2e/api/concurrent_test.go](https://github.com/kptdev/porch/blob/3a79bd0dbee1bd9cacf278f7e53ef64c6705a2bf/test/e2e/api/concurrent_test.go)
Job definition: [.github/workflows/porch-e2e-ci-jobs.yaml](https://github.com/kptdev/porch/blob/3a79bd0dbee1bd9cacf278f7e53ef64c6705a2bf/.github/workflows/porch-e2e-ci-jobs.yaml)

Recommended final adjustment

Keep these two assertions in TestConcurrentDeletes:

one delete succeeded
t.MustNotExist(&draft) passes

Relax only the expected loser error shape. That should make the failing job stable without weakening the actual concurrency guarantee.

Signed-off-by: liamfallon <liam.fallon@est.tech>
Copilot AI review requested due to automatic review settings May 20, 2026 10:54
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Relaxes the TestConcurrentDeletes e2e test expectations to reduce CI flakiness by accepting additional valid outcomes from losing concurrent delete requests (e.g., NotFound / DB “no rows” style errors) while still requiring at least one successful delete and verifying the resource is gone.

Changes:

  • Switch TestConcurrentDeletes to a delete-specific concurrency assertion helper.
  • Add assertConcurrentDeleteResults to treat conflict, not-found, and “sql: no rows in result set” as acceptable losing outcomes for concurrent deletes.

Comment thread test/e2e/api/concurrent_test.go
Comment thread test/e2e/api/concurrent_test.go
@efiacor efiacor merged commit 94ad17d into kptdev:main May 20, 2026
23 checks passed
@efiacor efiacor deleted the fix-concurrent-deletes-e2e branch May 20, 2026 11:31
thc1006 added a commit to thc1006/porch that referenced this pull request May 21, 2026
Sync with latest main to pick up kptdev#1001 (flaky TestConcurrentDeletes
fix) and trigger a fresh CI run after the flaky rpkg-get CLI E2E
timeout on the DB cache backend.

Signed-off-by: thc1006 <84045975+thc1006@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants