Data loss protect resending #1937
Conversation
channeldb/channel.go (outdated)
Note: we can write the false marker here as part of the migration to avoid this EOF check.
Currently the addition of the LastChanSyncMsg field is kept in its own commit to show how new fields can easily be added after the presence flags are introduced in the serialization.
Why are we mixing the two approaches? We should either use EOF everywhere (so no migration) or the bool based approach.
Ah scratch that, I see the usage now
This is how you would add new fields in the future without a migration.
channeldb/channel.go (outdated)
#golang pop quiz: Why doesn't this work?

```go
var msg *lnwire.ChannelReestablish
if err := ReadElements(r, &msg); err != nil {
	return nil, err
}
c.LastChanSyncMsg = msg
```
i'm guessing that `**lnwire.ChannelReestablish` and `*lnwire.Message` are distinct concrete types, and the type switch doesn't try to determine whether the pointed-to type `*lnwire.ChannelReestablish` implements the interface `lnwire.Message`
Yes, I think that's right. I think the type switch is checking if the passed interface{} is of concrete type *lnwire.Message, not if it is satisfying the interface. TIL you cannot do that: https://play.golang.org/p/SmL37Hobvvs
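The behavior discussed here can be reproduced with small stand-in types. This is a minimal sketch, not lnd's actual `ReadElements`: `Message` and `ChannelReestablish` below are hypothetical stand-ins for `lnwire.Message` and `lnwire.ChannelReestablish`, and `readElement` stands in for the type switch inside `ReadElements`.

```go
package main

import "fmt"

// Message is a stand-in for the lnwire.Message interface.
type Message interface {
	MsgType() string
}

// ChannelReestablish is a stand-in for lnwire.ChannelReestablish.
type ChannelReestablish struct{}

func (c *ChannelReestablish) MsgType() string { return "channel_reestablish" }

// readElement mimics a type switch like the one in ReadElements: it
// matches on concrete dynamic types, so the *Message case only fires
// for a literal pointer-to-interface, never for **ChannelReestablish,
// even though *ChannelReestablish satisfies Message.
func readElement(element interface{}) string {
	switch element.(type) {
	case *Message:
		return "*Message"
	case **ChannelReestablish:
		return "**ChannelReestablish"
	default:
		return "no match"
	}
}

func main() {
	var msg *ChannelReestablish
	// &msg has dynamic type **ChannelReestablish; if the switch only
	// had a *Message case, this would fall through to "no match".
	fmt.Println(readElement(&msg)) // prints "**ChannelReestablish"

	var m Message
	fmt.Println(readElement(&m)) // prints "*Message"
}
```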
Roasbeef left a comment
Nice work! This, along with static channel backups (incoming), really wraps up the set of items we need in place in lnd to cover all the bases we (currently) can as far as channel recovery. Most of my comments in this initial sweep concern the new migration and whether it's actually needed. Will do another pass on the tests, and also test this out a bit locally myself.
channeldb/channel.go (outdated)
This is to indicate if optional fields are there or not? Why's this better (in terms of set of changes) than just attempting to read where we know fields will be appended? May be worth even just going to something more future proof here.
The problem with omitting optional fields is that it only works for the last field. If there is more than one optional field, you won't know what the present data represents. This way we attempt to make it future-proof.
By adding booleans to indicate presence, we can make sure old code will read the new format (it won't attempt to read any extra data it doesn't understand), and that new code will read the old format (the booleans indicate which fields are present, so it can tell which fields the old format omits).
Yep, realized this later that day. I think in the future, we should do a clean sweep to just move to a TLV format everywhere. You read it all, then at the end check the array/map for the keys you actually know of then parse that. With that, we gain a way to handle an arbitrary number of fields in the future in a uniform manner throughout the codebase.
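The TLV idea can be sketched in a few lines. This is a rough illustration of the "read everything into a map, then pick out the keys you know" approach, not the BOLT TLV format or lnd's tlv package; `parseTLV`/`appendTLV` and the single-byte type and length are assumptions for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// parseTLV decodes a stream of type-length-value records into a map.
// Callers then look up only the types they understand; unknown types
// are carried along harmlessly, which is what makes TLV future-proof.
func parseTLV(b []byte) (map[uint8][]byte, error) {
	records := make(map[uint8][]byte)
	for len(b) > 0 {
		if len(b) < 2 {
			return nil, errors.New("truncated TLV header")
		}
		typ, length := b[0], int(b[1])
		if len(b) < 2+length {
			return nil, errors.New("truncated TLV value")
		}
		records[typ] = b[2 : 2+length]
		b = b[2+length:]
	}
	return records, nil
}

// appendTLV appends one record to the stream.
func appendTLV(b []byte, typ uint8, value []byte) []byte {
	return append(append(b, typ, uint8(len(value))), value...)
}

func main() {
	var stream []byte
	stream = appendTLV(stream, 1, []byte("known"))
	stream = appendTLV(stream, 99, []byte("future field"))

	records, err := parseTLV(stream)
	if err != nil {
		panic(err)
	}
	// Old code only knows type 1 and is unaffected by type 99.
	fmt.Printf("%s\n", records[1]) // prints "known"
}
```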
ooooh, yeah that would be nice!
channeldb/channel.go (outdated)
This seems to break the current serialization more than necessary. The minimal patch set would be to append the chan reest we need and nothing more.
Won't work for the case where cs.RemoteNextRevocation == nil and chanSyncMsg != nil, as the deserialization code will attempt to read the chanSyncMsg as RemoteNextRevocation.
peer.go (outdated)
channeldb/migrations.go (outdated)
See comments re if this is really worth yet another migration.
I first attempted to do this without a migration, but it turned out to be complex: 14648b2
Ah ok, yeah that's pretty convoluted.
lnd_test.go (outdated)
Perhaps this should be split out into another test, or a sub-test? As is, this test is nearly 400 lines. Not a blocker though, just something we should start to think about re the growth of the integration tests in the future.
Least blocking route would be a comment that delineates this new testing scenario.
Improved the comments.
Agreed that some of the tests would benefit from being split up. I think this might cause our integration tests to run even longer though, because of the added overhead. Might be worth looking into in combination with running the integration tests in parallel (perhaps using containers?).
👍
Yeah, we'll need to whiteboard a better architecture as we add more backends, and also eventually start supporting platforms other than macOS.
contractcourt/chain_watcher.go (outdated)
Unrelated to this PR, but shouldn't the BRAR be the one writing this DB state here? Re the whole reliable hand-off thing.
the brar writes the retributions to disk, then acks back. after that the chain watcher knows it's safe to mark the channel as pending closed
We could let the BRAR have that responsibility, as we could get rid of ACK logic.
the cleaner solution IMO is just to write resolutions in CloseChannel. then there is no need for a reliable handoff. the resolutions can be passed in memory to kick off the process, but if we crash the brar would look for all breached channels in the db and load resolutions from there
contractcourt/chain_watcher.go (outdated)
This can eventually be hooked up into the channel life cycle management stuff. So to request a notification once a new node connects. Actually we already have this for use in the funding manager, so perhaps it can be used here?
yeah i think we can use NotifyWhenOnline to just try if they ever reconnect
I actually like the current polling more, since we might need some time to process the ChannelReestablish after they reconnect. In that case the first lookup after they connect might fail, so we have to wait a little bit anyway.
If we don't wait before retrying at that point, we risk the peer coming online, sending us the ChannelReestablish, and disconnecting, leaving us stalled here waiting for them to connect again.
good point, leaving this decoupled from the peer lifecycle makes sense to me then
Force-pushed a34d016 to 4124c6d.
Comments addressed, PTAL 🐶
lnwallet/channel.go (outdated)
Where do we set the new starting height of a commitment chain? Don't see why this change is necessary atm. There're other spots where we depend on the height being set properly.
We never read it, we just look at the height of the commitment in the chain.
Agreed that the change is not directly relevant to this PR; I just noticed it when transforming the ChanSync method to not be dependent on the chain heights.
Removed and made a separate PR: #2206
lnwallet/channel.go (outdated)
Ah ok I see this now re remote close. We could also shuffle the other spots we populate this information into this package. Don't consider it a blocker though.
Final lingering comments are re whether we should have a lower backoff, and whether a minor refactor is necessary.
channeldb/channel.go (outdated)
Seems equivalent to:

```go
// Write whether the channel sync message is present.
if err := WriteElements(w, cs.LastChanSyncMsg != nil); err != nil {
	return err
}

// Write the channel sync message, if present.
if cs.LastChanSyncMsg != nil {
	if err := WriteElements(w, cs.LastChanSyncMsg); err != nil {
		return err
	}
}
```
Also did the same change for RemoteNextRevocation
lnwallet/channel.go (outdated)
should this have a mutex around it? esp now that this is called in more places than just the switch
also seems like something that should be implemented in channeldb as a method on OpenChannel, since it is otherwise unrelated to anything in lnwallet. would be nice to keep the .ChanSyncMsg() notation and not use a package-level function
Good idea, added mutex.
The reason I didn't make this a method on OpenChannel is that it calls ComputeCommitmentPoint, which is a method from lnwallet, leading to an import cycle. We could however move this method to a new package (lnutil? 😛) if we want this, but I felt it was not worth it.
Also moving it to channeldb felt wrong 🤔
channeldb/migrations.go (outdated)
would be nice to have a unit test for this migration, asserting it behaves properly in all four cases
Migration test added! (only three cases though?)
Force-pushed 4124c6d to 1d1ce73.
This extracts part of the test into a new helper method timeTravel, which can be used to easily reset a node back to a state where channel state is lost.
This commit adds a new file legacy_serialization.go, where a copy of the current deserializeCloseChannelSummary is made, called deserializeCloseChannelSummaryV6. The rationale is to keep old deserialization code around for use during migrations, as it is hard to maintain compatibility with the old format while the code in use keeps changing.
Force-pushed 1d1ce73 to 96bb63e.
Force-pushed 96bb63e to 56ea47b.
This commit adds an optional field LastChanSyncMsg to the CloseChannelSummary, which will be used to save the ChannelReestablish message for the channel at the point of channel close.
This lets us get the channel reestablish message without creating the LightningChannel struct first.
FetchClosedChannelForID is used to find the channel close summary given only a channel ID.
This method is used to fetch channel sync messages for closed channels from the db, and respond to the peer.
We poll the database for the channel commit point with an exponential backoff. This is meant to handle the case where we are in the process of handling a channel sync, and the case where we detect a channel close and must wait for the peer to come online to start channel sync before we can proceed.
This adds the scenario where a channel is closed while the node is offline, and the node loses state before coming back online. In this case the node should attempt to resync the channel, and the peer should resend a channel sync message for the closed channel, such that the node can retrieve its funds.
Force-pushed 56ea47b to a9bd610.
```go
// ChannelCloseSummary.
//
// NOTE: deprecated, only for migration.
func deserializeCloseChannelSummaryV6(r io.Reader) (*ChannelCloseSummary, error) {
```
This was after an idea by @joostjager. Lmk what you think :)
Comments addressed and commits squashed. PTAL.
Roasbeef left a comment
LGTM 🧩
Have been running this on all my nodes (mainnet+testnet) the past few days, and haven't run into any issues. The migration went smoothly, but of course I haven't (yet) hit any instance where my node needed to retransmit the last chan sync message.
This PR makes the latest channel reestablishment message be stored as part of the channel close summary, such that it can be sent even after a channel has been closed.
This makes it possible for nodes to recover in the case where they have been offline, lost state, and come back online, attempting to resync the channel.
Note: This PR contains a DB migration. To ensure deserialization code is not changed in the future, breaking old migrations, a versioned copy of the current deserialization logic is added to the file legacy_serialization.go. Backwards-compatible addition of the new field turned out to be complex because of the presence of optional fields (see the now-removed commit 14648b2).

TODO:
- FetchClosedChannelForID