This repository was archived by the owner on Jun 18, 2025. It is now read-only.

fix: invoke reconnect when readInternal return error #188

Merged
YoEight merged 13 commits into kurrent-io:master from AwaedFintech:master
Nov 12, 2024

Conversation

@itgram
Contributor

@itgram itgram commented Oct 22, 2024

Fixed: Trigger a reconnect when readInternal returns an error.

@CLAassistant

CLAassistant commented Oct 22, 2024

CLA assistant check
All committers have signed the CLA.

@itgram
Contributor Author

itgram commented Oct 23, 2024

@w1am

Could you please merge this PR?

Let me know if this works for you!

@YoEight
Contributor

YoEight commented Oct 23, 2024

Hey @itgram, thanks for your contribution.

Why is that change needed in the first place? Could you expand on the reason a little?

@itgram
Contributor Author

itgram commented Oct 23, 2024

@YoEight

In a cluster-mode setup of EventStoreDB, when one of the cluster nodes drops and we attempt to use the ReadStream method, it raises an error and doesn’t rebalance with the updated cluster members.

This issue occurs here: https://github.com/EventStore/EventStore-Client-Go/blob/8206ac84067e5a6d591f71fafb82758015fb9f3c/esdb/client.go#L291.

If we call ReadStream again, it fails repeatedly because client.grpcClient.handleError is not invoked in case of an error, which prevents the reconnection process from being triggered.
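The failure mode can be sketched as a toy simulation. None of the types below are the real esdb internals; they only model the pattern the PR addresses: on error, `readInternal`'s caller must route the error through the connection error handler (the real `client.grpcClient.handleError`) so a retry reconnects instead of reusing a dead channel.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the client's internals.
type conn struct{ alive bool }

type client struct{ current *conn }

// handleError mimics grpcClient.handleError: it discards the broken
// connection so the next call establishes a fresh one.
func (c *client) handleError(err error) { c.current = nil }

// getConnection reconnects lazily when no live connection is cached,
// standing in for node rediscovery after a leader change.
func (c *client) getConnection() *conn {
	if c.current == nil {
		c.current = &conn{alive: true}
	}
	return c.current
}

// readInternal fails while the cached connection points at a dropped node.
func (c *client) readInternal() error {
	if !c.getConnection().alive {
		return errors.New("node unavailable")
	}
	return nil
}

// ReadStream is the patched entry point: on error it invokes handleError
// (the call the PR adds), so the retry triggers reconnection.
func (c *client) ReadStream() error {
	if err := c.readInternal(); err != nil {
		c.handleError(err)
		return err
	}
	return nil
}

func main() {
	c := &client{current: &conn{alive: false}} // cached connection to a dead node
	err1 := c.ReadStream()                     // fails, but drops the dead connection
	err2 := c.ReadStream()                     // succeeds after reconnect
	fmt.Println(err1 != nil, err2 == nil)      // prints: true true
}
```

Without the `handleError` call in `ReadStream`, every retry in this sketch would hit the same dead connection forever, which matches the repeated failures described above.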

@YoEight
Contributor

YoEight commented Oct 23, 2024

Could you come up with a test that shows your patch is fixing the behavior you are talking about?

Contributor

@YoEight YoEight left a comment


Thanks for your contribution.

As-is, I don't think your test reproduces the scenario you described earlier. I'd really like to replicate all the steps that lead to your issue before applying your patch.

Comment thread esdb/cluster_test.go Outdated
Comment thread esdb/cluster_test.go Outdated
Comment thread esdb/cluster_test.go
@itgram itgram requested a review from YoEight October 26, 2024 11:28
@itgram
Contributor Author

itgram commented Oct 27, 2024

@YoEight

I couldn’t find a way to stop the leader node of EventStoreDB programmatically from the code.

@YoEight
Contributor

YoEight commented Oct 27, 2024

If you already know the leader node of the cluster, what you can do is call these endpoints on it; the order matters:

  1. POST /admin/node/priority/-1
  2. POST /admin/node/resign

This will not stop the leader per se, but it will trigger a new election cycle and guarantee (in the context of your test) that a new leader is elected.

@itgram
Contributor Author

itgram commented Oct 27, 2024

> If you already know the leader node of the cluster, what you can do is call these endpoints on it; the order matters:
>
>   1. POST /admin/node/priority/-1
>   2. POST /admin/node/resign
>
> This will not stop the leader per se, but it will trigger a new election cycle and guarantee (in the context of your test) that a new leader is elected.

Interesting, but as you know, I haven't been able to determine the leader node programmatically yet.

@YoEight
Contributor

YoEight commented Oct 28, 2024

The best way to achieve this is to leverage the GET /gossip endpoint. It gives you something like this:

{
  "members": [
    {
      "instanceId": "cf2e423a-a664-43ed-806f-10745c6d0ea0",
      "timeStamp": "2024-10-28T00:53:05.55955Z",
      "state": "Leader",
      "isAlive": true,
      "internalTcpIp": "127.0.0.1",
      "internalTcpPort": 1112,
      "internalSecureTcpPort": 0,
      "externalTcpIp": "127.0.0.1",
      "externalTcpPort": 0,
      "externalSecureTcpPort": 0,
      "internalHttpEndPointIp": "127.0.0.1",
      "internalHttpEndPointPort": 2113,
      "httpEndPointIp": "127.0.0.1",
      "httpEndPointPort": 2113,
      "lastCommitPosition": 891,
      "writerCheckpoint": 1063,
      "chaserCheckpoint": 1063,
      "epochPosition": 0,
      "epochNumber": 0,
      "epochId": "ded879b8-d7e6-45a6-87b2-2f3ef8c1f47b",
      "nodePriority": 0,
      "isReadOnlyReplica": false,
      "esVersion": "24.10.0-prerelease"
    },
   {...}
  ]
}

The leader node will have its state property set to Leader.
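A minimal sketch of picking the leader out of that payload, assuming the JSON shape shown above (the client code shown later in this thread does the equivalent over gRPC with `gossip.MemberInfo_Leader`; `findLeader` and the struct names here are mine):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// member decodes only the fields needed to locate the leader in the
// GET /gossip response shown above.
type member struct {
	State            string `json:"state"`
	HttpEndPointIp   string `json:"httpEndPointIp"`
	HttpEndPointPort int    `json:"httpEndPointPort"`
}

type clusterInfo struct {
	Members []member `json:"members"`
}

// findLeader returns the member whose state is "Leader", if any.
func findLeader(gossipJSON []byte) (member, bool) {
	var info clusterInfo
	if err := json.Unmarshal(gossipJSON, &info); err != nil {
		return member{}, false
	}
	for _, m := range info.Members {
		if m.State == "Leader" {
			return m, true
		}
	}
	return member{}, false
}

func main() {
	payload := []byte(`{"members":[
		{"state":"Follower","httpEndPointIp":"127.0.0.1","httpEndPointPort":2111},
		{"state":"Leader","httpEndPointIp":"127.0.0.1","httpEndPointPort":2113}]}`)
	leader, ok := findLeader(payload)
	fmt.Println(ok, leader.HttpEndPointIp, leader.HttpEndPointPort) // prints: true 127.0.0.1 2113
}
```

The returned endpoint is what a test would then target with the admin calls discussed earlier.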

@YoEight
Contributor

YoEight commented Oct 29, 2024

Hey @itgram

I came up with a test case that I believe replicates the steps that you described earlier:

func readStreamNotLeaderExceptionButWorkAfterRetry(t *testing.T) {
	db := CreateClient("esdb://admin:changeit@localhost:2111,localhost:2112,localhost:2113?nodepreference=leader&tlsverifycert=false", t)
	defer db.Close()

	ctx := context.Background()
	streamID := NAME_GENERATOR.Generate()

	_, err := db.AppendToStream(ctx, streamID, esdb.AppendToStreamOptions{}, createTestEvent())

	assert.Nil(t, err)

	members, err := db.Gossip(ctx)

	assert.Nil(t, err)

	for _, member := range members {
		if member.State != gossip.MemberInfo_Leader {
			continue
		}

		url := fmt.Sprintf("https://%s:%d/admin/shutdown", member.HttpEndPoint.Address, member.HttpEndPoint.Port)
		req, err := http.NewRequest("POST", url, nil)
		assert.Nil(t, err)

		req.SetBasicAuth("admin", "changeit")
		client := &http.Client{
			Transport: &http.Transport{
				TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
			},
		}
		resp, err := client.Do(req)

		assert.Nil(t, err)
		resp.Body.Close()

		break
	}

	_, err = db.ReadStream(ctx, streamID, esdb.ReadStreamOptions{
		From:           esdb.Start{},
		RequiresLeader: true,
	}, math.MaxUint64)

	assert.Nil(t, err)
}

The Gossip function I used is one that reuses our internal gRPC gossip client:

func (client *Client) Gossip(ctx context.Context) ([]*gossip.MemberInfo, error) {
	handle, err := client.grpcClient.getConnectionHandle()

	if err != nil {
		return nil, err
	}

	gossipClient := gossip.NewGossipClient(handle.Connection())
	clusterInfo, err := gossipClient.Read(ctx, &shared.Empty{})

	if err != nil {
		return nil, err
	}

	return clusterInfo.Members, nil
}

I wasn’t able to reproduce the issue on my end. While I can’t rule out the possibility of a bug, it may help to explore other potential causes. Do you have any additional leads or context that could help narrow down the source?

@itgram itgram marked this pull request as draft October 29, 2024 17:25
@itgram itgram marked this pull request as ready for review October 29, 2024 17:33
@itgram
Contributor Author

itgram commented Oct 31, 2024

@YoEight
Please review the test case once more and share your feedback.

@itgram
Contributor Author

itgram commented Nov 7, 2024

Hello @YoEight, do you have any concerns?

@YoEight
Contributor

YoEight commented Nov 8, 2024

Hey @itgram, I haven’t forgotten about your PR! I’m finishing up a few other tasks first, then I’ll jump into reviewing it. Apologies for the delay, my time has been tighter than usual as we’re gearing up for the upcoming release. Thanks for your patience!

Contributor

@YoEight YoEight left a comment


I confirm your patch fixed the issue, nice work!

@w1am w1am self-requested a review November 12, 2024 15:50
w1am
w1am previously approved these changes Nov 12, 2024
Comment thread esdb/client.go Outdated
Co-authored-by: Yo Eight <yo.eight@gmail.com>
@YoEight YoEight merged commit c29aa7d into kurrent-io:master Nov 12, 2024
