More logging for raft/processInternalRaftRequest by anshulpundir · Pull Request #2389 · moby/swarmkit

anshulpundir · 2017-09-26T01:27:29Z

Adding a new manager node or promoting a worker to a manager fails with "XXX: node lost leader status". After the failure, the leadership does not actually change. Here, my hypothesis is that the reason for raft proposals to fail may not just be ErrLostLeadership.

codecov · 2017-09-26T01:38:29Z

Codecov Report

Merging #2389 into master will decrease coverage by 0.01%.
The diff coverage is 70%.

@@            Coverage Diff             @@
##           master    #2389      +/-   ##
==========================================
- Coverage   60.37%   60.36%   -0.02%     
==========================================
  Files         128      128              
  Lines       26260    26275      +15     
==========================================
+ Hits        15855    15860       +5     
- Misses       9010     9019       +9     
- Partials     1395     1396       +1

aaronlehmann

What is swarmkit#9393? Is there a problem you're trying to debug?

aaronlehmann · 2017-09-26T05:15:58Z

 	case x, ok := <-ch:
 		if !ok {
-			return nil, ErrLostLeadership
+			if err, ok := x.(error); ok {


If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.

If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.

Thanks for pointing this out!
I guess I need to read the code again. The reason I'm doing this is because I think assuming ErrLostLeadership here may not accurate (please correct me I'm wrong). So, I'm trying to see if a more meaningful error can be returned.

I believe it is accurate. That's the only thing that causes proposals to get canceled.

This is in investigating a case where proposal fails with ErrLostLeadership but the leader does not actually lose leadership. Unfortunately, its not reproducible any more.

anshulpundir · 2017-09-26T07:36:07Z

What is swarmkit#9393? Is there a problem you're trying to debug?

Updated description.

nishanttotla · 2017-09-26T23:36:26Z

 					// cancelAll, or by its own check of signalledLeadership.
 					n.wait.cancelAll()
 				} else if !wasLeader && rd.SoftState.RaftState == raft.StateLeader {
+					log.G(ctx).Infof("Manager is now a leader.", n.opts.ID)


This message doesn't contain the format specifier in the message.

nishanttotla · 2017-09-26T23:39:20Z

+			// Wait notification channel was closed. This should only happen if the wait was cancelled.
+			log.G(ctx).Errorf("Wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID)
+			if atomic.LoadUint32(&n.signalledLeadership) == 1 {
+				log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID)


@anshulpundir based on our discussion, let's update these message to remove "wait cancelled". This will allow us to distinguish this case from the one below.

Changed to lowercase. The log above (line 1713) can help us differentiate this from the case below.

nishanttotla · 2017-09-26T23:42:03Z

@@ -630,6 +632,7 @@ func (n *Node) Run(ctx context.Context) error {
 					// cancelAll, or by its own check of signalledLeadership.
 					n.wait.cancelAll()


@anshulpundir do we need to put a log message just before this call to cancelAll?

See log on line 615. Lemme know if you think we need another log.

My bad, we don't. Please ignore this comment.

GordonTheTurtle · 2017-09-26T23:49:15Z

Please sign your commits following these rules:
https://github.com/moby/moby/blob/master/CONTRIBUTING.md#sign-your-work
The easiest way to do this is to amend the last commit:

$ git clone -b "log" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354263464
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f

Amending updates the existing PR. You DO NOT need to open a new one.

nishanttotla · 2017-09-26T23:54:30Z

+			// Wait notification channel was closed. This should only happen if the wait was cancelled.
+			log.G(ctx).Errorf("wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID)
+			if atomic.LoadUint32(&n.signalledLeadership) == 1 {
+				log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID)


I don't mean to pontificate on this too much, but can we make this message textually different from the one below, instead of just using lower and upper case as the difference? By convention, all messages are lower case.

cc @aaronlehmann if you have ideas.

The sadness of not including file names/line numbers in log messages :(

BTW, if your concern is to differentiate the two cases, it is still possible because there's another log before this one. @nishanttotla

anshulpundir · 2017-09-26T23:57:50Z

@@ -630,6 +632,7 @@ func (n *Node) Run(ctx context.Context) error {
 					// cancelAll, or by its own check of signalledLeadership.
 					n.wait.cancelAll()


See log on line 615. Lemme know if you think we need another log.

anshulpundir · 2017-09-26T23:57:59Z

 			if rd.SoftState != nil {
 				if wasLeader && rd.SoftState.RaftState != raft.StateLeader {
 					wasLeader = false
+					log.G(ctx).Infof("soft state changed for node %x. Manager no longer a leader. Cancelling all waits.", n.opts.ID)


@nishanttotla

anshulpundir · 2017-09-27T00:06:29Z

+		// position and cancelling the transaction. This entry still needs
+		// to be commited since other nodes have already created a new
 		// transaction to commit the data.
-


Removing this since the only way we can get here is if the wait item was removed by calling cancelAll(). Please let me know if you think otherwise @aaronlehmann thx!

nishanttotla

LGTM

aaronlehmann · 2017-09-27T03:12:24Z

I'm not sure this is the right code path to instrument. If the problem occurs on joining a manager node and promoting a worker, I think you are seeing the ErrLostLeadership returned here: https://github.com/anshulpundir/swarmkit/blob/2780e221f84d2b9a8b10c4c8040bfd94dd276995/manager/state/raft/raft.go#L922-L924

To the question of whether ErrLostLeadership is the right one to return from processInternalRaftRequest, it's fairly straightforward to look at all the code paths where these waits are cancelled. Just search for n.wait.cancel and n.wait.cancelAll.

We only care about instances outside processInternalRaftRequest, because ones inside the function can't execute concurrently with that select statement.

There is one call to cancelAll in Run, in the situation where the node does lose leadership.

There are some calls to cancel in configure, but these are cancelling a special wait registered there, and it will never be one that processInternalRaftRequest is waiting for.

So I really think that the logic in processInternalRaftRequest is correct and we don't need this extra logging. Also, some of the new logs are redundant with leadership-related logs produced by etcd/raft, and some of them expose internal details that I don't think should go in user-facing logs (for example, stuff about a channel being closed).

nishanttotla · 2017-09-27T21:18:45Z

@aaronlehmann on a related note, in processInternalRaftRequest, in the select loop for the case of waitCtx.Done(), why are we calling wait.cancel again?

anshulpundir · 2017-09-28T00:40:53Z

I'm not sure this is the right code path to instrument. If the problem occurs on joining a manager node and promoting a worker, I think you are seeing the ErrLostLeadership returned here: https://github.com/anshulpundir/swarmkit/blob/2780e221f84d2b9a8b10c4c8040bfd94dd276995/manager/state/raft/raft.go#L922-L924

I'm pretty sure its not the path you just pointed out, because the error code is different.

To the question of whether ErrLostLeadership is the right one to return from processInternalRaftRequest, it's fairly straightforward to look at all the code paths where these waits are cancelled. Just search for n.wait.cancel and n.wait.cancelAll.
We only care about instances outside processInternalRaftRequest, because ones inside the function can't execute concurrently with that select statement.
There is one call to cancelAll in Run, in the situation where the node does lose leadership.
There are some calls to cancel in configure, but these are cancelling a special wait registered there, and it will never be one that processInternalRaftRequest is waiting for.

Agreed

So I really think that the logic in processInternalRaftRequest is correct and we don't need this extra logging.

I also think the logic is correct, there are maybe some redundancies .e.g. the call to
n.wait.cancel(r.ID) when the context might already be cancelled. I still suspect it is this code path that is returning ErrLostLeadership.
I don't entirely agree with not needing extra logging. I think its better to be verbose here rather than not since this code path is not commonly executed anyway, and having more logs for better debugability.

Also, some of the new logs are redundant with leadership-related logs produced by etcd/raft, and some of them expose internal details that I don't think should go in user-facing logs (for example, stuff about a channel being closed).

I'll try to address the redundancy. The reason I put the bit about the channel being closed was to differentiate that select case (since there's no file names/line numbers in the logs, which is a pain btw). Since the logs are primarily used by engineers for debugging, I think it should be ok to expose whatever needs to be exposed for debugging.

aaronlehmann · 2017-09-28T03:01:10Z

on a related note, in processInternalRaftRequest, in the select loop for the case of waitCtx.Done(), why are we calling wait.cancel again?

It's doing that to remove the wait entry, now that the function will no longer be waiting for the entry. It's probably not necessary in the waitCtx.Done case, but it doesn't hurt either, and having it there avoids making assumptions about what other code will do.

Since the logs are primarily used by engineers for debugging, I think it should be ok to expose whatever needs to be exposed for debugging.

That's true for debug-level logs, not for other log levels though.

anshulpundir · 2017-09-28T17:13:09Z

That's true for debug-level logs, not for other log levels though.

I'll see if we can run with debug on in test runs and change ones that expose internal details to debug level.

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>

aaronlehmann · 2017-09-29T14:50:32Z

+		// If we can read from the channel, wait item was triggered. Otherwise it was cancelled.
 		x, ok := <-ch
 		if !ok {
+			log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)


likelyly is a misspelling.

aaronlehmann · 2017-09-29T14:52:23Z

 		if !ok {
+			log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)
+			if atomic.LoadUint32(&n.signalledLeadership) == 1 {
+				log.G(ctx).Errorf("wait context cancelled but node %x is still a leader", n.opts.ID)


This message may appear at shutdown, because that's when the context gets cancelled.

Thanks! Will adjust the comment.

On a related node, we don't wait for all transaction for complete during shutdown ?

No, transactions can take an arbitrarily long time to reach consensus.

aaronlehmann · 2017-09-29T14:52:33Z

 	}

 	if !n.wait.trigger(r.ID, r) {
+		log.G(ctx).Errorf("wait not found for raft id %x", r.ID)


"proposal id"?

My bad, I'll fix this.

anshulpundir

Thanks again for reviewing! @aaronlehmann

anshulpundir · 2017-09-29T22:02:55Z

 	}

 	if !n.wait.trigger(r.ID, r) {
+		log.G(ctx).Errorf("wait not found for raft id %x", r.ID)


My bad, I'll fix this.

anshulpundir · 2017-09-29T22:04:36Z

 		if !ok {
+			log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID)
+			if atomic.LoadUint32(&n.signalledLeadership) == 1 {
+				log.G(ctx).Errorf("wait context cancelled but node %x is still a leader", n.opts.ID)


Thanks! Will adjust the comment.

On a related node, we don't wait for all transaction for complete during shutdown ?

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>

GordonTheTurtle added the dco/no label Sep 26, 2017

anshulpundir force-pushed the log branch from a71c3e5 to f4e0360 Compare September 26, 2017 01:28

GordonTheTurtle added the dco/no label Sep 26, 2017

anshulpundir force-pushed the log branch from f4e0360 to f02bbf5 Compare September 26, 2017 01:29

GordonTheTurtle removed the dco/no label Sep 26, 2017

anshulpundir force-pushed the log branch from f02bbf5 to 39f9acc Compare September 26, 2017 03:49

aaronlehmann reviewed Sep 26, 2017

View reviewed changes

anshulpundir changed the title ~~Cleaning out processInternalRaftRequest() error logic. Added logs and errors for swarmkit#9393.~~ Cleaning out processInternalRaftRequest() error logic. Added logs and errors for orca#9393. Sep 26, 2017

moby deleted a comment from GordonTheTurtle Sep 26, 2017

nishanttotla added the area/raft label Sep 26, 2017

anshulpundir changed the title ~~Cleaning out processInternalRaftRequest() error logic. Added logs and errors for orca#9393.~~ Cleaning out processInternalRaftRequest() error logic. Added logs and errors. Sep 26, 2017

anshulpundir force-pushed the log branch from 39f9acc to 4940eff Compare September 26, 2017 21:28

anshulpundir changed the title ~~Cleaning out processInternalRaftRequest() error logic. Added logs and errors.~~ More logging for raft/processInternalRaftRequest Sep 26, 2017

nishanttotla reviewed Sep 26, 2017

View reviewed changes

GordonTheTurtle added the dco/no label Sep 26, 2017

nishanttotla reviewed Sep 26, 2017

View reviewed changes

anshulpundir commented Sep 26, 2017

View reviewed changes

anshulpundir force-pushed the log branch from 263b192 to dfe2618 Compare September 27, 2017 00:02

GordonTheTurtle removed the dco/no label Sep 27, 2017

anshulpundir commented Sep 27, 2017

View reviewed changes

anshulpundir force-pushed the log branch from dfe2618 to 2780e22 Compare September 27, 2017 00:11

nishanttotla approved these changes Sep 27, 2017

View reviewed changes

anshulpundir force-pushed the log branch 2 times, most recently from ddbda82 to 1602550 Compare September 29, 2017 01:08

More logging for raft/processInternalRaftRequest

094ee95

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>

anshulpundir force-pushed the log branch from 1602550 to 094ee95 Compare September 29, 2017 04:53

anshulpundir merged commit 144ddc5 into moby:master Sep 29, 2017

anshulpundir deleted the log branch September 29, 2017 06:30

aaronlehmann reviewed Sep 29, 2017

View reviewed changes

anshulpundir commented Sep 29, 2017

View reviewed changes

andrewhsu mentioned this pull request Oct 11, 2017

[17.10] re-vndr libnetwork to 6c51292 and swarmkit to 872861d docker-archive/docker-ce#264

Merged

denverdino pushed a commit to AliyunContainerService/swarmkit that referenced this pull request Nov 5, 2017

More logging for raft/processInternalRaftRequest (moby#2389)

3125409

Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>

		@@ -630,6 +632,7 @@ func (n *Node) Run(ctx context.Context) error {
		// cancelAll, or by its own check of signalledLeadership.
		n.wait.cancelAll()

Conversation

anshulpundir commented Sep 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Sep 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

aaronlehmann left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anshulpundir Sep 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anshulpundir commented Sep 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nishanttotla Sep 26, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

GordonTheTurtle commented Sep 26, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nishanttotla left a comment

Choose a reason for hiding this comment

Uh oh!

aaronlehmann commented Sep 27, 2017

Uh oh!

nishanttotla commented Sep 27, 2017

Uh oh!

anshulpundir commented Sep 28, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

aaronlehmann commented Sep 28, 2017

Uh oh!

anshulpundir commented Sep 28, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anshulpundir left a comment

Choose a reason for hiding this comment

anshulpundir commented Sep 26, 2017 •

edited

Loading

codecov Bot commented Sep 26, 2017 •

edited

Loading

anshulpundir Sep 26, 2017 •

edited

Loading

anshulpundir commented Sep 26, 2017 •

edited

Loading

nishanttotla Sep 26, 2017 •

edited

Loading

anshulpundir commented Sep 28, 2017 •

edited

Loading