[DO NOT MERGE] Repro bb8 issues #5351

Draft: smklein wants to merge 1 commit into main from attempt-to-repro-bb8-oddities

smklein commented Mar 29, 2024

Collaborator

As part of #5172, I hit a bug during RSS handoff.

In particular:

  • Sled Agent called "initialization-completed"
  • Nexus started a transaction to update a bunch of data
  • Nexus incorrectly sent a request on a "new connection", rather than the "transaction connection"

Expected behavior:

  • Initialization succeeds, but without safe transaction semantics? Or perhaps bb8 complains that its connections are exhausted, if it's blocking anywhere? I'm not totally sure I understand the semantics yet.

Observed behavior:

  • Several callers, even unrelated background jobs, started getting "Timed out in bb8" errors when attempting to check out new connections. Furthermore, the "initialization-completed" request appeared to hang indefinitely.

This PR attempts to act as a reproduction case for that class of issues.

I'm also working on adding a reproduction case to https://github.com/oxidecomputer/async-bb8-diesel, but I haven't managed that quite yet.

smklein commented Mar 29, 2024

Collaborator Author

If this reproduces, I think I'm going to take the following tactics:

  • Try to create a smaller reproduction. This is a big beefy transaction that needs a lot of stars aligned to work. Would be nice if we could re-create this with a smaller endpoint, or ideally, without a dropshot endpoint at all.
  • Add a lot more inspection in bb8. What is the status of our transaction pool? How many connections do we think are open?
  • Add more inspection on the Cockroach side. Is there any way to see which transactions are still open?

I have theories about how things could be going wrong, but need more data to validate.

  • Is this triggered purely by "getting a new connection while inside a transaction that already holds one"? -> I don't think so. That pattern alone doesn't reproduce the problem minimally (I tried, in async-bb8-diesel), and I think it is already happening implicitly with all the auth check calls that aren't using the same connection as the higher-level transaction.
  • Could there be a deadlock here between transactions? -> This seems possible to me? Perhaps the transaction touches rows that are getting poked at by the "independently-checked out connection", and the inability of the transaction to complete hangs both?
  • Could local state in Diesel (or bb8) be corrupted? -> This is possible, but I believe the "Transaction manager" semantics are "per-connection", so this seems unlikely to me?
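
To make the first theory concrete, here is a rough, std-only sketch of how nested checkouts could starve a bounded pool. Everything in it is hypothetical: `ToyPool`, `ToyConn`, and `nested_checkout_timeouts` are illustrative names, not bb8, Diesel, or Nexus APIs, and threads stand in for async tasks. It only models the pool-exhaustion shape; it is not a confirmed explanation of the hang.

```rust
use std::sync::{Arc, Barrier, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Toy stand-in for a bounded connection pool (hypothetical; not bb8's API).
struct ToyPool {
    available: Mutex<usize>,
    returned: Condvar,
}

/// A checked-out "connection"; gives its slot back to the pool on drop.
struct ToyConn {
    pool: Arc<ToyPool>,
}

impl ToyPool {
    fn new(size: usize) -> Arc<Self> {
        Arc::new(ToyPool {
            available: Mutex::new(size),
            returned: Condvar::new(),
        })
    }

    /// Wait up to `timeout` for a free slot; `None` models a checkout timeout.
    fn get(pool: &Arc<Self>, timeout: Duration) -> Option<ToyConn> {
        let guard = pool.available.lock().unwrap();
        let (mut guard, _wait_result) = pool
            .returned
            .wait_timeout_while(guard, timeout, |free| *free == 0)
            .unwrap();
        if *guard == 0 {
            return None;
        }
        *guard -= 1;
        Some(ToyConn { pool: Arc::clone(pool) })
    }
}

impl Drop for ToyConn {
    fn drop(&mut self) {
        *self.pool.available.lock().unwrap() += 1;
        self.pool.returned.notify_one();
    }
}

/// Each worker holds one connection (its "transaction") and then asks the
/// pool for a second one, like a helper that ignores the connection it was
/// handed. Returns how many workers timed out on the nested checkout.
fn nested_checkout_timeouts(pool_size: usize, workers: usize, timeout: Duration) -> usize {
    // Assumption of this sketch: every worker can get its outer connection.
    assert!(workers <= pool_size);
    let pool = ToyPool::new(pool_size);
    let barrier = Arc::new(Barrier::new(workers));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let pool = Arc::clone(&pool);
            let barrier = Arc::clone(&barrier);
            thread::spawn(move || {
                let _txn_conn = ToyPool::get(&pool, timeout).expect("outer checkout");
                // Wait until every worker is holding its outer connection.
                barrier.wait();
                let nested = ToyPool::get(&pool, timeout);
                let timed_out = nested.is_none();
                // Keep holding connections until all workers have tried.
                barrier.wait();
                timed_out
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|timed_out| *timed_out)
        .count()
}
```

In this toy model, `nested_checkout_timeouts(2, 2, Duration::from_millis(100))` reports 2 timeouts, because every slot is held by a worker that is itself waiting for another slot, while a pool of 4 with 2 workers reports 0. Any unrelated caller asking the exhausted pool for a connection during that window would time out as well, which at least resembles the "Timed out in bb8" errors from background jobs.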
