Alter logic for yielding from halfjoin #8818

Merged: frankmcsherry merged 1 commit into MaterializeInc:main from frankmcsherry:address_halfjoin_spinning on Oct 28, 2021

@frankmcsherry frankmcsherry commented Oct 27, 2021

Contributor

This PR replaces HalfJoin's optimistically aggressive 10ms yield time with a more conservative threshold of 1M records. The new threshold ensures that at least some work gets done before the operator yields, at the expense of potentially stalling for longer. This could mitigate some of the "spinning" behavior that folks are seeing with this operator.

It is also possible that there is just a bug in the HalfJoin implementation, rather than the yielding policy. At the moment we don't have a reproduction of this issue, so it is hard to know. This PR is meant to be something they can use to test.

fixes MaterializeInc/database-issues#2699

@philip-stoev philip-stoev self-requested a review October 27, 2021 21:22

frankmcsherry commented Oct 27, 2021

Contributor Author

This reportedly fixes the issue for at least one instance of 100% spinning with no progress.

Some more diagnosis: because of how half join is implemented, the removed 10ms yielding policy has a defect. The operator can yield right after an initial consolidation of the work to do, and if that consolidation takes 10ms or more, no work gets done at all. The consolidation shouldn't need to be expensive once things are sorted, but you still need to go and do a whole bunch of memory dereferences to confirm everything.
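To make the failure mode concrete, here is a minimal sketch with a simulated clock. It is not the half_join code: `YieldPolicy`, `run_activation`, and the microsecond cost parameters are all hypothetical names invented for this illustration. It only assumes what is described above, namely that consolidation is paid for before any record is processed and that the yield check happens afterwards.

```rust
/// Hypothetical yield policies (illustrative only, not Materialize's API).
#[derive(Clone, Copy)]
enum YieldPolicy {
    /// Yield once the simulated elapsed time reaches `budget_us` microseconds.
    Time { budget_us: u64 },
    /// Yield once `limit` records have been processed in this activation.
    Records { limit: usize },
}

/// Simulates one operator activation and returns how many records it processed.
///
/// `consolidate_us` models the up-front cost of consolidating pending work;
/// `per_record_us` models the cost of processing a single record.
fn run_activation(
    policy: YieldPolicy,
    pending: usize,
    consolidate_us: u64,
    per_record_us: u64,
) -> usize {
    // Consolidation is charged before any record is touched.
    let mut elapsed_us = consolidate_us;
    let mut processed = 0;
    while processed < pending {
        let should_yield = match policy {
            YieldPolicy::Time { budget_us } => elapsed_us >= budget_us,
            YieldPolicy::Records { limit } => processed >= limit,
        };
        if should_yield {
            break;
        }
        elapsed_us += per_record_us;
        processed += 1;
    }
    processed
}
```

With a 10ms budget (`Time { budget_us: 10_000 }`) and a consolidation that costs 12ms, the activation returns 0 no matter how much work is pending, and every later activation does the same. With `Records { limit: 1_000_000 }` the same activation processes 1M records regardless of how long consolidation took.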

The issue could also be ameliorated in the half_join implementation by stashing consolidated data in a representation that makes it clear that it has been consolidated (e.g. an enum). That had no value when we wouldn't revisit the work, but it makes sense now that we will repeatedly yield and revisit it. Consider 1B inbound records: reviewing all of them each time we pluck off 1M work items means a massive amount of redundant work.
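One way the enum idea might look, as a sketch only: `Stash`, `ensure_consolidated`, `take_work`, and the `(T, i64)` update shape are assumptions made for this example and are not taken from the half_join source. The point is that the variant records whether consolidation has already happened, so repeated activations skip the re-check.

```rust
/// Sorts updates by key, sums the diffs of equal keys, and drops zero diffs.
fn consolidate<T: Ord>(mut data: Vec<(T, i64)>) -> Vec<(T, i64)> {
    data.sort_by(|a, b| a.0.cmp(&b.0));
    let mut out: Vec<(T, i64)> = Vec::with_capacity(data.len());
    for (item, diff) in data {
        if let Some(last) = out.last_mut() {
            if last.0 == item {
                last.1 += diff;
                continue;
            }
        }
        out.push((item, diff));
    }
    out.retain(|(_, diff)| *diff != 0);
    out
}

/// Hypothetical stash whose variant records whether the data is consolidated.
enum Stash<T> {
    Raw(Vec<(T, i64)>),
    Consolidated(Vec<(T, i64)>),
}

impl<T: Ord> Stash<T> {
    /// Consolidates only if needed; returns true if work was actually done.
    fn ensure_consolidated(&mut self) -> bool {
        if let Stash::Raw(data) = self {
            let data = std::mem::take(data);
            *self = Stash::Consolidated(consolidate(data));
            true
        } else {
            false
        }
    }

    /// Removes up to `limit` updates from the tail of the consolidated data.
    fn take_work(&mut self, limit: usize) -> Vec<(T, i64)> {
        self.ensure_consolidated();
        match self {
            Stash::Consolidated(data) => {
                let start = data.len().saturating_sub(limit);
                data.split_off(start)
            }
            Stash::Raw(_) => unreachable!("ensure_consolidated leaves Consolidated"),
        }
    }
}
```

In this sketch the first `take_work` call pays for consolidation once, and later calls go straight to `split_off`, so plucking off 1M items at a time no longer re-reviews everything that remains.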

@frankmcsherry frankmcsherry marked this pull request as ready for review October 27, 2021 22:13
@frankmcsherry

Contributor Author

Merging with permission from @philip-stoev. Follow-up work will add tests to the nightly stress tests to look for this sort of issue, and others like it.

@frankmcsherry frankmcsherry merged commit fbcfa06 into MaterializeInc:main Oct 28, 2021
@frankmcsherry frankmcsherry deleted the address_halfjoin_spinning branch March 8, 2022 13:28