
test: refactor test suite to (hopefully) fix flakiness #65

Merged: mccutchen merged 26 commits into main from flaky-tests-2 on Oct 25, 2025


@mccutchen (Owner) commented Oct 7, 2025

The theory here is that the new, more rigorous closing handshake tests added in #63 have caused significantly more flakiness because setupRawConnWithHandler wraps the underlying net.Conn in a throwaway bufio.Reader just to read the HTTP 101 response for the opening handshake.

This causes a race in tests where the server may write websocket data to the connection immediately after the handshake (e.g. when a test closes the connection right away). While reading the HTTP response, the bufio.Reader is likely to buffer some (possibly partial) websocket data as well. That data is thrown away along with the reader, so subsequent reads see only partial or incomplete frames and block until the test suite times out.
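A minimal sketch of the failure mode and the fix, assuming a server that writes the HTTP 101 response and a websocket frame back to back. `readAfterHandshake` is a hypothetical helper, not the suite's actual setupRawConnWithHandler. The point is that every read after `http.ReadResponse` has to go through the same `*bufio.Reader`, because it may already hold the start of the next frame.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"net/http"
)

// readAfterHandshake (hypothetical helper) starts a loopback server that
// writes an HTTP 101 response immediately followed by payload, reads the
// handshake as a client, and returns the bytes that follow the response.
func readAfterHandshake(payload []byte) ([]byte, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		// Headers and websocket data in a single write, so the client is
		// likely to receive them in the same read.
		resp := "HTTP/1.1 101 Switching Protocols\r\n" +
			"Upgrade: websocket\r\nConnection: Upgrade\r\n\r\n"
		conn.Write(append([]byte(resp), payload...))
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	br := bufio.NewReader(conn)
	resp, err := http.ReadResponse(br, nil)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusSwitchingProtocols {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}

	// Keep reading through br, not conn: bytes that br already buffered
	// would otherwise be lost, and a read on conn could block forever.
	out := make([]byte, len(payload))
	if _, err := io.ReadFull(br, out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// Unmasked server close frame: FIN+close opcode, length 2, code 1000.
	closeFrame := []byte{0x88, 0x02, 0x03, 0xe8}
	got, err := readAfterHandshake(closeFrame)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("% x\n", got)
}
```

Reading the frame from `conn` directly instead of `br` reproduces the bug described above: whatever `br` pulled in while parsing the headers never reaches the caller.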

I'm very curious to see if this fixes the tests. If it works, credit goes to Claude Code for helping identify the race condition. If it doesn't, well, obviously LLMs are useless.


Update: while that issue was likely a source of flakiness, fixing it did not actually make the test suite any less flaky.

Ended up doing a much larger refactor that sets up complete client and server connections, instead of using a single connection to simulate both a client and a server reading from and writing to the same conn.

This seems to have helped a bit, maybe, but we're still seeing intermittent test failures due to timeouts.

@github-actions bot commented Oct 7, 2025

🔥 Run benchmarks comparing 856dd89 against main:

gh workflow run bench.yaml -f pr_number=65

Note: this comment will update with each new commit.

@mccutchen (Owner, Author) commented:

No dice on the more minimal changes; let's see if dropping net.Pipe helps us out here.

@codecov bot commented Oct 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.99%. Comparing base (ca6c1e4) to head (856dd89).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #65      +/-   ##
==========================================
+ Coverage   96.66%   96.99%   +0.32%     
==========================================
  Files           2        2              
  Lines         570      432     -138     
==========================================
- Hits          551      419     -132     
+ Misses         14        7       -7     
- Partials        5        6       +1     


@mccutchen (Owner, Author) commented:

Ended up doing a much bigger refactor to ensure all tests were using real TCP connections between a client and a real server instead of reading from and writing to a single shared connection.

Unfortunately … we're still failing in confusing and weird ways:

--- FAIL: TestProtocolErrors (0.00s)
    --- FAIL: TestProtocolErrors/max_message_size (0.02s)
        websocket_test.go:612: expected error "message paylaod too large", got "client frame must be masked" (*websocket.Error vs *websocket.Error)
        websocket_test.go:605: incorrect close status code:
            want: 1009
             got: 1002
coverage: 87.0% of statements
panic: test timed out after 1m0s
	running tests:
		TestCloseFrameValidation (1m0s)
		TestCloseFrameValidation/invalid_close_code_(too_high)) (1m0s)

@mccutchen changed the title from "test: fix flaky tests (hopefully)" to "test: refactor test suite to (hopefully) fix flakiness" on Oct 19, 2025
@mccutchen enabled auto-merge (squash) on October 25, 2025 at 10:31
@mccutchen merged commit 8744466 into main on Oct 25, 2025 (10 checks passed)
@mccutchen deleted the flaky-tests-2 branch on October 25, 2025 at 10:45