
Data loss protect resending#1937

Merged
Roasbeef merged 18 commits into lightningnetwork:master from
halseth:data-loss-protect-resending
Nov 26, 2018

Conversation

@halseth
Copy link
Contributor

@halseth halseth commented Sep 19, 2018

This PR makes the latest channel reestablishment message be stored as part of the channel close summary, such that it can be sent even when a channel has been closed.

This makes it possible for nodes to recover in the case where they have been offline, lost state, and come back online, attempting to resync the channel.

Note: This PR contains a DB migration. To ensure the deserialization code is not changed in the future, breaking old migrations, a versioned copy of the current deserialization logic is added to the file legacy_serialization.go. Backwards-compatible addition of the new field turned out to be complex because of the presence of optional fields (see the now-removed commit 14648b2).

TODO:

  • Unit test for FetchClosedChannelForID

Contributor Author

Note: we can write the false marker here as part of the migration to avoid this EOF check.

Contributor Author

Currently the addition of the LastChanSyncMsg field is kept in its own commit to show how new fields can easily be added after the presence flags are introduced in the serialization.

Member

Why are we mixing the two approaches? We should either use EOF everywhere (so no migration) or the bool based approach.

Member

Ah scratch that, I see the usage now

Contributor Author

This is how you would add new fields in the future without a migration.

Contributor Author

#golang pop quiz: Why doesn't this work?

var msg *lnwire.ChannelReestablish
if err := ReadElements(r, &msg); err != nil {
	return nil, err
}
c.LastChanSyncMsg = msg

Contributor

I'm guessing that **lnwire.ChannelReestablish and *lnwire.Message are distinct concrete types, and the type switch doesn't try to determine whether the pointed-to type *lnwire.ChannelReestablish implements lnwire.Message

Contributor Author

Yes, I think that's right. I think the type switch is checking if the passed interface{} is of concrete type *lnwire.Message, not whether it satisfies the interface. TIL you cannot do that: https://play.golang.org/p/SmL37Hobvvs
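The quiz can be reproduced with a minimal, self-contained stand-in. The Message and ChannelReestablish types below are hypothetical substitutes for the lnwire ones, and readElement only mimics the shape of a ReadElements-style type switch:

```go
package main

import "fmt"

// Message stands in for lnwire.Message (hypothetical minimal repro).
type Message interface {
	MsgType() string
}

// ChannelReestablish stands in for lnwire.ChannelReestablish.
type ChannelReestablish struct{}

func (c *ChannelReestablish) MsgType() string { return "channel_reestablish" }

// readElement mimics a ReadElements-style type switch: each case matches an
// exact concrete type, so *Message matches only a pointer-to-interface.
func readElement(element interface{}) error {
	switch e := element.(type) {
	case *Message:
		*e = &ChannelReestablish{}
		return nil
	default:
		return fmt.Errorf("unknown type %T", element)
	}
}

func main() {
	// **ChannelReestablish does not match *Message: the type switch
	// compares concrete types and never checks whether the pointed-to
	// type satisfies the interface.
	var msg *ChannelReestablish
	fmt.Println(readElement(&msg))

	// A pointer to the interface type itself works.
	var m Message
	fmt.Println(readElement(&m))
}
```

The first call fails with an unknown-type error, while the second succeeds, which is why the original snippet had to declare `var msg lnwire.Message` (or read into an intermediate) rather than a concrete pointer.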

@halseth halseth modified the milestones: 0.5.1, 0.5.2 Sep 19, 2018
@Roasbeef Roasbeef added wallet The wallet (lnwallet) which LND uses safety General label for issues/PRs related to the safety of using the software P2 should be fixed if one has time needs review PR needs review by regular contributors interop interop with other implementations labels Sep 21, 2018
Member

@Roasbeef Roasbeef left a comment

Nice work! This, along with static channel backups (incoming), really wraps up the set of items we need in place in lnd to cover all the bases we (currently) can as far as channel recovery. Most of my comments in this initial sweep concern the new migration and whether it's actually needed. Will do another pass on the tests, and also test this out a bit locally myself as well.

Member

This is to indicate if optional fields are there or not? Why's this better (in terms of set of changes) than just attempting to read where we know fields will be appended? May be worth even just going to something more future proof here.

Contributor Author

@halseth halseth Nov 20, 2018

The problem with omitting optional fields is that it only works for the last field. If there is more than one optional field, you won't know which field the present data represents. This way we attempt to make it future proof.

By adding booleans to indicate presence, we can make sure old code will read the new fields (they won't attempt to read any extra data they don't understand) and that new code will read the old format (booleans will indicate whether fields are present, so it can know which fields are not present from the old format).

Member

Yep, realized this later that day. I think in the future, we should do a clean sweep to just move to a TLV format everywhere. You read it all, then at the end check the array/map for the keys you actually know of then parse that. With that, we gain a way to handle an arbitrary number of fields in the future in a uniform manner throughout the codebase.

Contributor Author

ooooh, yeah that would be nice!

Member

This seems to break the current serialization more than necessary. The minimal patch set would be to append the chan reest we need and nothing more.

Contributor Author

Won't work for the case where cs.RemoteNextRevocation == nil and chanSyncMsg != nil, as the deserialization code will attempt to read the chanSyncMsg as RemoteNextRevocation.

peer.go Outdated
Member

;)

Member

See comments re if this is really worth yet another migration.

Contributor Author

I first attempted to do this without a migration, but it turned out to be complex: 14648b2

Member

Ah ok, yeah that's pretty convoluted.

lnd_test.go Outdated
Member

Perhaps this should be split out into another test, or a sub-test? As is, this test is nearly 400 lines. Not a blocker though, just something we should start to think about re the growth of the integration tests in the future.

Member

Least blocking route would be a comment that delineates this new testing scenario.

Contributor Author

Improved the comments.

Agreed that some of the tests would benefit from being split up. I think this might cause our integration tests to run even longer though, because of the added overhead. Might be worth looking into in combination with running the integration tests in parallel (perhaps using containers?).

Member

👍

Yeah, we'll need to whiteboard out a better architecture as we add more backends, and eventually start supporting platforms other than macOS.

Member

Unrelated to this PR, but shouldn't the BRAR be the one that's making this DB state here? Re the whole reliable hand off thing.

Contributor

the brar writes the retributions to disk, then acks back. after that the chain watcher knows it's safe to mark the channel as pending closed

Contributor Author

We could let the BRAR have that responsibility, as then we could get rid of the ACK logic.

Contributor

the cleaner solution IMO is just to write resolutions in CloseChannel. then there is no need for a reliable handoff. the resolutions can be passed in memory to kick off the process, but if we crash the brar would look for all breached channels in the db and load resolutions from there

Member

This can eventually be hooked up to the channel lifecycle management stuff, so as to request a notification once a new node connects. Actually, we already have this for use in the funding manager, so perhaps it can be used here?

Contributor

yeah i think we can use NotifyWhenOnline to just try if they ever reconnect

Contributor Author

I actually like the current polling more, since we might need some time to process the ChannelReestablish after they reconnect. In that case the first lookup after they connect might fail, so we have to wait a little bit anyway.

If we don't wait before retrying at that point, we risk the peer coming online, sending us the ChannelReestablish, and disconnecting, and we will stall here waiting for them to connect again.

Contributor

good point, leaving this decoupled from the peer lifecycle makes sense to me then

@halseth halseth force-pushed the data-loss-protect-resending branch 8 times, most recently from a34d016 to 4124c6d Compare November 20, 2018 14:11
@halseth
Contributor Author

halseth commented Nov 20, 2018

Comments addressed, PTAL 🐶

Member

Where do we set the new starting height of a commitment chain? I don't see why this change is necessary atm. There are other spots where we depend on the height being set properly.

Contributor Author

We never read it, we just look at the height of the commitment in the chain.

Agreed that the change is not directly relevant to this PR; I just noticed it when transforming the ChanSync method to not be dependent on the chain heights.

Removed and made a separate PR: #2206

Member

Ah ok I see this now re remote close. We could also shuffle the other spots we populate this information into this package. Don't consider it a blocker though.

@Roasbeef
Member

Final lingering comments are re whether we should have a lower backoff, and also whether a minor refactor is necessary or not.

Contributor

Seems equivalent to:

	// Write whether the channel sync message is present.
	if err := WriteElements(w, cs.LastChanSyncMsg != nil); err != nil {
		return err
	}
	// Write the channel sync message, if present.
	if cs.LastChanSyncMsg != nil {
		if err := WriteElements(w, cs.LastChanSyncMsg); err != nil {
			return err
		}
	}

Contributor Author

YES! Very nice :)

Contributor Author

@halseth halseth Nov 21, 2018

Also did the same change for RemoteNextRevocation


lnd_test.go Outdated
Contributor

👍

Contributor

should this have a mutex around it? esp now that this is called in more places than just the switch

also seems like something that should be implemented in channeldb as a method of OpenChannel, since it is otherwise unrelated to anything in lnwallet. would be nice to keep the .ChanSyncMsg() notation, and not use a package level function

Contributor Author

Good idea, added mutex.

The reason I didn't make this a method on OpenChannel is that it calls ComputeCommitmentPoint, which is a method from lnwallet, leading to an import cycle. We could however move this method to a new package (lnutil? 😛) if we want this, but I felt it was not worth it.

Also moving it to channeldb felt wrong 🤔

Contributor

would be nice to have a unit test for this migration, asserting it behaves properly in all four cases

Contributor Author

@halseth halseth Nov 21, 2018

Migration test added! (only three cases though?)

@halseth halseth force-pushed the data-loss-protect-resending branch from 4124c6d to 1d1ce73 Compare November 21, 2018 09:09
This extracts part of the test into a new helper method timeTravel,
which can be used to easily reset a node back to a state where channel
state is lost.
This commit adds a new file legacy_serialization.go, where a copy of the
current deserializeCloseChannelSummary is made, called
deserializeCloseChannelSummaryV6.

The rationale is to keep old deserialization code around to be used
during migration, as it is hard maintaining compatibility with the old
format while changing the code in use.
@halseth halseth force-pushed the data-loss-protect-resending branch from 1d1ce73 to 96bb63e Compare November 21, 2018 09:17
@halseth halseth force-pushed the data-loss-protect-resending branch from 96bb63e to 56ea47b Compare November 21, 2018 09:28
This commit adds an optional field LastChanSyncMsg to the
CloseChannelSummary, which will be used to save the ChannelReestablish
message for the channel at the point of channel close.
This lets us get the channel reestablish message without creating the LightningChannel struct first.
FetchClosedChannelForID is used to find the channel close summary given
only a channel ID.
This method is used to fetch channel sync messages for closed channels
from the db, and respond to the peer.
We poll the database for the channel commit point with an exponential
backoff. This is meant to handle the case where we are in the process of
handling a channel sync, and the case where we detect a channel close
and must wait for the peer to come online to start channel sync before
we can proceed.
This adds the scenario where a channel is closed while the node is
offline, the node loses state and comes back online. In this case the
node should attempt to resync the channel, and the peer should resend a
channel sync message for the closed channel, such that the node can
retrieve its funds.
@halseth halseth force-pushed the data-loss-protect-resending branch from 56ea47b to a9bd610 Compare November 21, 2018 09:29
// ChannelCloseSummary.
//
// NOTE: deprecated, only for migration.
func deserializeCloseChannelSummaryV6(r io.Reader) (*ChannelCloseSummary, error) {
Contributor Author

This was after an idea by @joostjager. Lmk what you think :)

@halseth
Contributor Author

halseth commented Nov 21, 2018

Comments addressed and commits squashed. PTAL.

Member

@Roasbeef Roasbeef left a comment

LGTM 🧩

Have been running this on all my nodes (mainnet+testnet) the past few days, and haven't run into any issues. The migration went smoothly, but of course I haven't (yet) run into any instances where my node needed to retransmit the last chan sync message.
