More logging for raft/processInternalRaftRequest#2389
Codecov Report
@@            Coverage Diff             @@
##           master    #2389      +/-   ##
==========================================
- Coverage   60.37%   60.36%   -0.02%
==========================================
  Files         128      128
  Lines       26260    26275      +15
==========================================
+ Hits        15855    15860       +5
- Misses       9010     9019       +9
- Partials     1395     1396       +1
aaronlehmann
left a comment
There was a problem hiding this comment.
What is swarmkit#9393? Is there a problem you're trying to debug?
| case x, ok := <-ch: | ||
| if !ok { | ||
| return nil, ErrLostLeadership | ||
| if err, ok := x.(error); ok { |
There was a problem hiding this comment.
If ok is false it means the channel was closed and nothing was received, so trying to use x is wrong. Also, errors are never sent over this channel.
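To make the point about `ok` concrete, here is a minimal standalone sketch (not code from this PR; the `receive` helper is made up for illustration). It shows that a receive from a closed, empty channel yields the zero value, so a type assertion on `x` cannot succeed in that branch.

```go
package main

import "fmt"

// receive is a hypothetical helper that performs a comma-ok receive.
func receive(ch <-chan interface{}) (interface{}, bool) {
	x, ok := <-ch
	return x, ok
}

func main() {
	// A value was buffered before the close: ok is true and x holds the value.
	sent := make(chan interface{}, 1)
	sent <- "applied"
	close(sent)
	x, ok := receive(sent)
	fmt.Println(x, ok) // applied true

	// Closed with nothing buffered: ok is false and x is the zero value (nil),
	// so an assertion such as x.(error) can never succeed here.
	closed := make(chan interface{})
	close(closed)
	x, ok = receive(closed)
	_, isErr := x.(error)
	fmt.Println(x, ok, isErr) // <nil> false false
}
```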
There was a problem hiding this comment.
Thanks for pointing this out!
I guess I need to read the code again. The reason I'm doing this is that I think assuming ErrLostLeadership here may not be accurate (please correct me if I'm wrong). So, I'm trying to see if a more meaningful error can be returned.
There was a problem hiding this comment.
I believe it is accurate. That's the only thing that causes proposals to get canceled.
There was a problem hiding this comment.
This is for investigating a case where a proposal fails with ErrLostLeadership but the leader does not actually lose leadership. Unfortunately, it's not reproducible any more.
Updated description.
| // cancelAll, or by its own check of signalledLeadership. | ||
| n.wait.cancelAll() | ||
| } else if !wasLeader && rd.SoftState.RaftState == raft.StateLeader { | ||
| log.G(ctx).Infof("Manager is now a leader.", n.opts.ID) |
There was a problem hiding this comment.
This message doesn't contain a format specifier for the n.opts.ID argument.
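For reference, a small sketch of what a missing verb does with Go's `fmt` formatting, assuming the logger's `Infof` ends up formatting the way `fmt.Sprintf` does. The `render` helper and the second message text are made up for illustration; they are not a proposed wording for this PR.

```go
package main

import "fmt"

// render is a hypothetical stand-in for a printf-style log call.
func render(format string, id uint64) string {
	return fmt.Sprintf(format, id)
}

func main() {
	id := uint64(0x1f)

	// No verb for the argument: fmt appends an EXTRA marker to the output.
	fmt.Println(render("Manager is now a leader.", id))
	// Manager is now a leader.%!(EXTRA uint64=31)

	// With a %x verb the ID is rendered in hex as intended.
	fmt.Println(render("node %x is now a leader", id))
	// node 1f is now a leader
}
```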
| // Wait notification channel was closed. This should only happen if the wait was cancelled. | ||
| log.G(ctx).Errorf("Wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID) | ||
| if atomic.LoadUint32(&n.signalledLeadership) == 1 { | ||
| log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID) |
There was a problem hiding this comment.
@anshulpundir based on our discussion, let's update these messages to remove "wait cancelled". This will allow us to distinguish this case from the one below.
There was a problem hiding this comment.
Changed to lowercase. The log above (line 1713) can help us differentiate this from the case below.
| @@ -630,6 +632,7 @@ func (n *Node) Run(ctx context.Context) error { | |||
| // cancelAll, or by its own check of signalledLeadership. | |||
| n.wait.cancelAll() | |||
There was a problem hiding this comment.
@anshulpundir do we need to put a log message just before this call to cancelAll?
There was a problem hiding this comment.
See log on line 615. Lemme know if you think we need another log.
There was a problem hiding this comment.
My bad, we don't. Please ignore this comment.
|
Please sign your commits following these rules:
$ git clone -b "log" git@github.com:anshulpundir/swarmkit.git somewhere
$ cd somewhere
$ git rebase -i HEAD~842354263464
editor opens
change each 'pick' to 'edit'
save the file and quit
$ git commit --amend -s --no-edit
$ git rebase --continue # and repeat the amend for each commit
$ git push -f
Amending updates the existing PR. You DO NOT need to open a new one.
| // Wait notification channel was closed. This should only happen if the wait was cancelled. | ||
| log.G(ctx).Errorf("wait cancelled, likely because node %x lost leader position. Wait channel closed with nothing to read.", n.opts.ID) | ||
| if atomic.LoadUint32(&n.signalledLeadership) == 1 { | ||
| log.G(ctx).Errorf("Wait cancelled but node %x is still a leader.", n.opts.ID) |
There was a problem hiding this comment.
I don't mean to pontificate on this too much, but can we make this message textually different from the one below, instead of just using lower and upper case as the difference? By convention, all messages are lower case.
cc @aaronlehmann if you have ideas.
There was a problem hiding this comment.
The sadness of not including file names/line numbers in log messages :(
There was a problem hiding this comment.
BTW, if your concern is to differentiate the two cases, it is still possible because there's another log before this one. @nishanttotla
| if rd.SoftState != nil { | ||
| if wasLeader && rd.SoftState.RaftState != raft.StateLeader { | ||
| wasLeader = false | ||
| log.G(ctx).Infof("soft state changed for node %x. Manager no longer a leader. Cancelling all waits.", n.opts.ID) |
| // position and cancelling the transaction. This entry still needs | ||
| // to be commited since other nodes have already created a new | ||
| // transaction to commit the data. | ||
|
|
There was a problem hiding this comment.
Removing this since the only way we can get here is if the wait item was removed by calling cancelAll(). Please let me know if you think otherwise @aaronlehmann thx!
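For readers following along, a simplified sketch of the register/trigger/cancelAll pattern under discussion. The `waitList` type below is hypothetical and written only for illustration; the real wait implementation may differ in its details. It shows why `trigger` finds no entry once `cancelAll` has run, and why the waiter then sees a closed channel with nothing to read.

```go
package main

import (
	"fmt"
	"sync"
)

// waitList is a hypothetical, simplified registry of pending proposals.
type waitList struct {
	mu sync.Mutex
	m  map[uint64]chan interface{}
}

func newWaitList() *waitList {
	return &waitList{m: make(map[uint64]chan interface{})}
}

// register creates the buffered channel that a proposer blocks on.
func (w *waitList) register(id uint64) <-chan interface{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan interface{}, 1)
	w.m[id] = ch
	return ch
}

// trigger delivers x to the waiter and reports whether an entry was found.
func (w *waitList) trigger(id uint64, x interface{}) bool {
	w.mu.Lock()
	ch, ok := w.m[id]
	delete(w.m, id)
	w.mu.Unlock()
	if ok {
		ch <- x
		close(ch)
	}
	return ok
}

// cancelAll removes every entry and closes its channel without sending.
func (w *waitList) cancelAll() {
	w.mu.Lock()
	defer w.mu.Unlock()
	for id, ch := range w.m {
		delete(w.m, id)
		close(ch)
	}
}

func main() {
	w := newWaitList()
	applied := w.register(1)
	cancelled := w.register(2)

	fmt.Println(w.trigger(1, "entry 1")) // true
	w.cancelAll()
	fmt.Println(w.trigger(2, "entry 2")) // false: the wait is gone

	x, ok := <-applied
	fmt.Println(x, ok) // entry 1 true
	x, ok = <-cancelled
	fmt.Println(x, ok) // <nil> false
}
```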
|
I'm not sure this is the right code path to instrument. If the problem occurs on joining a manager node and promoting a worker, I think you are seeing the To the question of whether We only care about instances outside There is one call to There are some calls to So I really think that the logic in |
|
@aaronlehmann on a related note, in |
I'm pretty sure it's not the path you just pointed out, because the error code is different.
Agreed
I also think the logic is correct, though there may be some redundancies, e.g. the call to
I'll try to address the redundancy. The reason I put the bit about the channel being closed was to differentiate that select case (since there's no file names/line numbers in the logs, which is a pain btw). Since the logs are primarily used by engineers for debugging, I think it should be ok to expose whatever needs to be exposed for debugging. |
It's doing that to remove the wait entry, now that the function will no longer be waiting for the entry. It's probably not necessary in the
That's true for debug-level logs, not for other log levels though. |
I'll see if we can run the test runs with debug logging on, and change the messages that expose internal details to debug level.
ddbda82 to 1602550
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
| // If we can read from the channel, wait item was triggered. Otherwise it was cancelled. | ||
| x, ok := <-ch | ||
| if !ok { | ||
| log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID) |
There was a problem hiding this comment.
"likeyly" is a misspelling.
| if !ok { | ||
| log.G(ctx).WithError(waitCtx.Err()).Errorf("wait context cancelled, likeyly because node %x lost leader position", n.opts.ID) | ||
| if atomic.LoadUint32(&n.signalledLeadership) == 1 { | ||
| log.G(ctx).Errorf("wait context cancelled but node %x is still a leader", n.opts.ID) |
There was a problem hiding this comment.
This message may appear at shutdown, because that's when the context gets cancelled.
There was a problem hiding this comment.
Thanks! Will adjust the comment.
On a related note, we don't wait for all transactions to complete during shutdown?
There was a problem hiding this comment.
No, transactions can take an arbitrarily long time to reach consensus.
| } | ||
|
|
||
| if !n.wait.trigger(r.ID, r) { | ||
| log.G(ctx).Errorf("wait not found for raft id %x", r.ID) |
There was a problem hiding this comment.
My bad, I'll fix this.
anshulpundir
left a comment
There was a problem hiding this comment.
Thanks again for reviewing! @aaronlehmann
Signed-off-by: Anshul Pundir <anshul.pundir@docker.com>
Adding a new manager node or promoting a worker to a manager fails with "XXX: node lost leader status". After the failure, the leadership does not actually change. My hypothesis is that losing leadership may not be the only reason raft proposals fail, so ErrLostLeadership may not always be the right error to return.