
Data loss protect resending#1937

Merged
Roasbeef merged 18 commits into lightningnetwork:master from
halseth:data-loss-protect-resending
Nov 26, 2018

Conversation

@halseth
Copy link
Contributor

@halseth halseth commented Sep 19, 2018

This PR makes the latest channel reestablishment message be stored as part of the channel close summary, such that it can be sent even when a channel has been closed.

This makes it possible for nodes to recover in the case where they have been offline, lost state, and come back online, attempting to resync the channel.

Note: This PR contains a DB migration. To ensure the deserialization code is not changed in the future, breaking old migrations, a versioned copy of the current deserialization logic is added to the file legacy_serialization.go. Backwards-compatible addition of the new field turned out to be complex because of the presence of optional fields (see the now-removed commit 14648b2).

TODO:

  • Unit test for FetchClosedChannelForID

Contributor Author

Note: we can write the false marker here as part of the migration to avoid this EOF check.

Contributor Author

Currently the addition of the LastChanSyncMsg field is kept in its own commit to show how new fields can easily be added after the presence flags are introduced in the serialization.

Member

Why are we mixing the two approaches? We should either use EOF everywhere (so no migration) or the bool based approach.

Member

Ah scratch that, I see the usage now

Contributor Author

This is how you would add new fields in the future without a migration.

Contributor Author

#golang pop quiz: Why doesn't this work?

var msg *lnwire.ChannelReestablish
if err := ReadElements(r, &msg); err != nil {
	return nil, err
}
c.LastChanSyncMsg = msg

Contributor

I'm guessing that **lnwire.ChannelReestablish and *lnwire.Message are distinct concrete types, and the type switch doesn't try to determine whether the pointed-to type *lnwire.ChannelReestablish implements lnwire.Message

Contributor Author

Yes, I think that's right. I think the type switch is checking if the passed interface{} is of concrete type *lnwire.Message, not whether it satisfies the interface. TIL you cannot do that: https://play.golang.org/p/SmL37Hobvvs
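The quiz can be reproduced with a minimal, self-contained stand-in. The Message and ChannelReestablish types below are hypothetical substitutes for the lnwire ones, and readElement only mimics the shape of a ReadElements-style type switch:

```go
package main

import "fmt"

// Message stands in for lnwire.Message (hypothetical minimal repro).
type Message interface {
	MsgType() string
}

// ChannelReestablish stands in for lnwire.ChannelReestablish.
type ChannelReestablish struct{}

func (c *ChannelReestablish) MsgType() string { return "channel_reestablish" }

// readElement mimics a ReadElements-style type switch: each case matches an
// exact concrete type, so *Message matches only a pointer-to-interface.
func readElement(element interface{}) error {
	switch e := element.(type) {
	case *Message:
		*e = &ChannelReestablish{}
		return nil
	default:
		return fmt.Errorf("unknown type %T", element)
	}
}

func main() {
	// **ChannelReestablish does not match *Message: the type switch
	// compares concrete types and never checks whether the pointed-to
	// type satisfies the interface.
	var msg *ChannelReestablish
	fmt.Println(readElement(&msg))

	// A pointer to the interface type itself works.
	var m Message
	fmt.Println(readElement(&m))
}
```

The first call fails with an unknown-type error, while the second succeeds, which is why the original snippet had to declare `var msg lnwire.Message` (or read into an intermediate) rather than a concrete pointer.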

@halseth halseth modified the milestones: 0.5.1, 0.5.2 Sep 19, 2018
@Roasbeef Roasbeef added wallet The wallet (lnwallet) which LND uses safety General label for issues/PRs related to the safety of using the software P2 should be fixed if one has time needs review PR needs review by regular contributors interop interop with other implementations labels Sep 21, 2018
Member

@Roasbeef Roasbeef left a comment

Nice work! This, along with static channel backups (incoming), really wraps up the set of items we need in place in lnd to cover all the bases we (currently) can as far as channel recovery. Most of my comments in this initial sweep concern the new migration and whether it's actually needed. Will do another pass on the tests, and also test this out a bit locally myself as well.

Member

This is to indicate if optional fields are there or not? Why's this better (in terms of set of changes) than just attempting to read where we know fields will be appended? May be worth even just going to something more future proof here.

Contributor Author

@halseth halseth Nov 20, 2018

The problem with omitting optional fields is that it only works for the last field. If there is more than one optional field, you won't know which field the present data represents. This way we attempt to make it future proof.

By adding booleans to indicate presence, we can make sure old code will read the new fields (they won't attempt to read any extra data they don't understand) and that new code will read the old format (booleans will indicate whether fields are present, so it can know which fields are not present from the old format).

Member

Yep, realized this later that day. I think in the future, we should do a clean sweep to just move to a TLV format everywhere. You read it all, then at the end check the array/map for the keys you actually know of then parse that. With that, we gain a way to handle an arbitrary number of fields in the future in a uniform manner throughout the codebase.

Contributor Author

ooooh, yeah that would be nice!

Member

This seems to break the current serialization more than necessary. The minimal patch set would be to append the chan reest we need and nothing more.

Contributor Author

Won't work for the case where cs.RemoteNextRevocation == nil and chanSyncMsg != nil, as the deserialization code will attempt to read the chanSyncMsg as RemoteNextRevocation.

peer.go Outdated
Member

;)

Member

See comments re if this is really worth yet another migration.

Contributor Author

I first attempted to do this without a migration, but it turned out to be complex: 14648b2

Member

Ah ok, yeah that's pretty convoluted.

lnd_test.go Outdated
Member

Perhaps this should be split out into another test, or a sub-test? As is, this test is nearly 400 lines. Not a blocker though, just something we should start to think about re the growth of the integration tests in the future.

Member

Least blocking route would be a comment that delineates this new testing scenario.

Contributor Author

Improved the comments.

Agreed that some of the tests would benefit from being split up. I think this might cause our integration tests to run even longer though, because of the added overhead. Might be worth looking into in combination with running the integration tests in parallel (perhaps using containers?).

Member

👍

Yeah, we'll need to whiteboard out a better architecture as we add more backends, and eventually start supporting platforms other than macOS.

Member

Unrelated to this PR, but shouldn't the BRAR be the one that's making this DB state here? Re the whole reliable hand off thing.

Contributor

the brar writes the retributions to disk, then acks back. after that the chain watcher knows it's safe to mark the channel as pending closed

Contributor Author

We could let the BRAR have that responsibility, as then we could get rid of the ACK logic.

Contributor

the cleaner solution IMO is just to write resolutions in CloseChannel. then there is no need for a reliable handoff. the resolutions can be passed in memory to kick off the process, but if we crash the brar would look for all breached channels in the db and load resolutions from there

Member

This can eventually be hooked up to the channel lifecycle management stuff, so as to request a notification once a new node connects. Actually, we already have this for use in the funding manager, so perhaps it can be used here?

Contributor

yeah i think we can use NotifyWhenOnline to just try if they ever reconnect

Contributor Author

I actually like the current polling more, since we might need some time to process the ChannelReestablish after they reconnect. In that case the first lookup after they connect might fail, so we have to wait a little bit anyway.

If we don't wait before retrying at that point, we risk the peer coming online, sending us the ChannelReestablish, and disconnecting, and we will stall here waiting for them to connect again.

Contributor

good point, leaving this decoupled from the peer lifecycle makes sense to me then

@halseth halseth force-pushed the data-loss-protect-resending branch 8 times, most recently from a34d016 to 4124c6d Compare November 20, 2018 14:11
@halseth
Contributor Author

halseth commented Nov 20, 2018

Comments addressed, PTAL 🐶

Member

Where do we set the new starting height of a commitment chain? I don't see why this change is necessary atm. There are other spots where we depend on the height being set properly.

Contributor Author

We never read it, we just look at the height of the commitment in the chain.

Agreed that the change is not directly relevant to this PR; I just noticed it when transforming the ChanSync method to not be dependent on the chain heights.

Removed and made a separate PR: #2206

Member

Ah ok I see this now re remote close. We could also shuffle the other spots we populate this information into this package. Don't consider it a blocker though.

@Roasbeef
Member

Final lingering comments are re whether we should have a lower backoff, and also whether a minor refactor is necessary or not.

Contributor

Seems equivalent to:

	// Write whether the channel sync message is present.
	if err := WriteElements(w, cs.LastChanSyncMsg != nil); err != nil {
		return err
	}
	// Write the channel sync message, if present.
	if cs.LastChanSyncMsg != nil {
		if err := WriteElements(w, cs.LastChanSyncMsg); err != nil {
			return err
		}
	}

Contributor Author

YES! Very nice :)

Contributor Author

@halseth halseth Nov 21, 2018

Also did the same change for RemoteNextRevocation


lnd_test.go Outdated
Contributor

👍

Contributor

should this have a mutex around it? esp now that this is called in more places than just the switch

also seems like something that should be implemented in channeldb as a method of OpenChannel, since it is otherwise unrelated to anything in lnwallet. would be nice to keep the .ChanSyncMsg() notation, and not use a package level function

Contributor Author

Good idea, added mutex.

The reason I didn't make this a method on OpenChannel is that it calls ComputeCommitmentPoint, which is a method from lnwallet, leading to an import cycle. We could however move this method to a new package (lnutil? 😛) if we want this, but I felt it was not worth it.

Also moving it to channeldb felt wrong 🤔

Contributor

would be nice to have a unit test for this migration, asserting it behaves properly in all four cases

Contributor Author

@halseth halseth Nov 21, 2018

Migration test added! (only three cases though?)

@halseth halseth force-pushed the data-loss-protect-resending branch from 4124c6d to 1d1ce73 Compare November 21, 2018 09:09
This extracts part of the test into a new helper method timeTravel,
which can be used to easily reset a node back to a state where channel
state is lost.
This commit adds a new file legacy_serialization.go, where a copy of the
current deserializeCloseChannelSummary is made, called
deserializeCloseChannelSummaryV6.

The rationale is to keep old deserialization code around to be used
during migration, as it is hard maintaining compatibility with the old
format while changing the code in use.
@halseth halseth force-pushed the data-loss-protect-resending branch from 1d1ce73 to 96bb63e Compare November 21, 2018 09:17
@halseth halseth force-pushed the data-loss-protect-resending branch from 96bb63e to 56ea47b Compare November 21, 2018 09:28
This commit adds an optional field LastChanSyncMsg to the
CloseChannelSummary, which will be used to save the ChannelReestablish
message for the channel at the point of channel close.
This lets us get the channel reestablish message without creating the LightningChannel struct first.
FetchClosedChannelForID is used to find the channel close summary given
only a channel ID.
This method is used to fetch channel sync messages for closed channels
from the db, and respond to the peer.
We poll the database for the channel commit point with an exponential
backoff. This is meant to handle the case where we are in the process of
handling a channel sync, and the case where we detect a channel close
and must wait for the peer to come online to start channel sync before
we can proceed.
This adds the scenario where a channel is closed while the node is
offline, the node loses state and comes back online. In this case the
node should attempt to resync the channel, and the peer should resend a
channel sync message for the closed channel, such that the node can
retrieve its funds.
@halseth halseth force-pushed the data-loss-protect-resending branch from 56ea47b to a9bd610 Compare November 21, 2018 09:29
// ChannelCloseSummary.
//
// NOTE: deprecated, only for migration.
func deserializeCloseChannelSummaryV6(r io.Reader) (*ChannelCloseSummary, error) {
Contributor Author

This was after an idea by @joostjager. Lmk what you think :)

@halseth
Contributor Author

halseth commented Nov 21, 2018

Comments addressed and commits squashed. PTAL.

Member

@Roasbeef Roasbeef left a comment

LGTM 🧩

Have been running this on all my nodes (mainnet+testnet) the past few days, and haven't run into any issues. The migration went smoothly, but of course I haven't (yet) run into any instances where my node needed to retransmit the last chan sync message.
