From: Matt Corallo Date: Thu, 21 Apr 2022 02:30:16 +0000 (+0000) Subject: Reorder the BP loop to make manager persistence more reliable X-Git-Tag: v0.0.107~55^2 X-Git-Url: http://git.bitcoin.ninja/index.cgi?a=commitdiff_plain;h=refs%2Fheads%2F2022-04-event-process-try-lock;p=rust-lightning Reorder the BP loop to make manager persistence more reliable The main loop of the background processor has this line: `peer_manager.process_events(); // Note that this may block on ChannelManager's locking` which does, indeed, sometimes block waiting on the `ChannelManager` to finish whatever its doing. Specifically, its the only place in the background processor loop that we block waiting on the `ChannelManager`, so if the `ChannelManager` is relatively busy, we may end up being blocked there most of the time. This should be fine, except today we had a user who's node was particularly slow in processing some channel updates, resulting in the background processor being blocked there (as expected). Then, when the channel updates were completed (and persisted) the next thing the background processor did was hand the user events to process, creating yet more channel updates. Ultimately, the users' node crashed before finishing the event processing. This left us with an updated monitor on disk and an outdated manager, and they lost the channel on startup. Here we simply move the above quoted line to after the normal event processing, ensuring the next thing we do after blocking on `ChannelManager` locks is persist the manager, prior to event handling. --- diff --git a/lightning-background-processor/src/lib.rs b/lightning-background-processor/src/lib.rs index 6beee915b..255fb245e 100644 --- a/lightning-background-processor/src/lib.rs +++ b/lightning-background-processor/src/lib.rs @@ -203,10 +203,22 @@ impl BackgroundProcessor { let mut have_pruned = false; loop { - peer_manager.process_events(); // Note that this may block on ChannelManager's locking channel_manager.process_pending_events(&event_handler); chain_monitor.process_pending_events(&event_handler); + // Note that the PeerManager::process_events may block on ChannelManager's locks, + // hence it comes last here. When the ChannelManager finishes whatever it's doing, + // we want to ensure we get into `persist_manager` as quickly as we can, especially + // without running the normal event processing above and handing events to users. + // + // Specifically, on an *extremely* slow machine, we may see ChannelManager start + // processing a message effectively at any point during this loop. In order to + // minimize the time between such processing completing and persisting the updated + // ChannelManager, we want to minimize methods blocking on a ChannelManager + // generally, and as a fallback place such blocking only immediately before + // persistence. + peer_manager.process_events(); + // We wait up to 100ms, but track how long it takes to detect being put to sleep, // see `await_start`'s use below. let await_start = Instant::now();