Blame outbound channel on UPDATE onion failure with 0-len update
We've run into this several times in the wild, likely due to
https://github.com/ElementsProject/lightning/issues/6200 wherein a node on the
path will error with 0x1000 but not provide a channel update (a spec
violation).
Previously, we would blame the inbound edge even though the buggy peer wanted
us to blame the outbound edge. Since this issue seems to be recurring and our
blaming the inbound edge is causing us to punish innocent channels, trust the
peer that the outbound edge is the one to blame.
Fix PaymentPathFailed::payment_failed_permanently on blinded path fail
Previously this value would be incorrectly set to true because we wouldn't
account for blinded hops when determining if we were processing the last hop's
failure packet.
Our ultimate goal with this field is to set
PaymentPathFailed::payment_failed_permanently, so use this name rather than
flipping a bool back and forth across methods.
Matt Corallo [Thu, 14 Sep 2023 20:02:46 +0000 (20:02 +0000)]
Add an `UnrecoverableError` variant to `ChannelMonitorUpdateStatus`
While there is no great way to handle a true failure to persist a
`ChannelMonitorUpdate`, it is confusing for users for there to be
no error variant at all on an I/O operation.
Thus, here we re-add the error variant removed over the past
handful of commits, but rather than handle it in a truly unsafe
way, we simply panic, optimizing for maximum mutex poisoning to
ensure any future operations fail and return immediately.
In the future, we may consider changing the handling of this to
instead set some "disconnect all peers and fail all operations"
bool to give the user a better chance to shutdown in a semi-orderly
fashion, but there's only so much that can be done in lightning if
we truly cannot persist new updates.
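A minimal sketch of how a call site might treat the new variant, assuming the existing `Completed` and `InProgress` variants; this is not the real `ChainMonitor` code:

    enum ChannelMonitorUpdateStatus { Completed, InProgress, UnrecoverableError }

    fn on_monitor_update_status(status: ChannelMonitorUpdateStatus) {
        match status {
            // Persisted synchronously; continue normal channel operation.
            ChannelMonitorUpdateStatus::Completed => {},
            // Async persistence; pause the channel until the update completes.
            ChannelMonitorUpdateStatus::InProgress => {},
            // We cannot continue safely; panic so poisoned mutexes halt
            // further operations rather than risking funds.
            ChannelMonitorUpdateStatus::UnrecoverableError =>
                panic!("Failed to persist ChannelMonitorUpdate"),
        }
    }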
Matt Corallo [Sun, 10 Sep 2023 20:21:50 +0000 (20:21 +0000)]
Drop doc comments on `ChainMonitor` trait impl methods
In general, doc comments on trait impl blocks are not very visible
in rustdoc output, and unless they provide useful information they
should be elided.
Here we drop useless doc comments on `ChainMonitor`'s `Watch` impl
methods.
Matt Corallo [Sun, 10 Sep 2023 20:05:43 +0000 (20:05 +0000)]
Drop error handling in `handle_new_monitor_update`
Now that `handle_new_monitor_update` can no longer return an error,
it similarly no longer needs any code to handle errors. Here we
remove that code, substantially reducing macro variants.
Matt Corallo [Fri, 8 Sep 2023 00:05:56 +0000 (00:05 +0000)]
Update `ChannelMonitorUpdateStatus` documentation with async support
Since we now (almost) support async monitor update persistence, the
documentation on `ChannelMonitorUpdateStatus` can be updated to no
longer suggest users must keep a local copy that persists before
returning. However, because there are still a few remaining issues,
we note that async support is currently beta and explicitly warn of
the potential for funds loss.
Matt Corallo [Wed, 30 Aug 2023 18:22:33 +0000 (18:22 +0000)]
Rename `MonitorEvent::CommitmentTxConfirmed` to `HolderForceClosed`
The `MonitorEvent::CommitmentTxConfirmed` has always been a result
of us force-closing the channel, not the counterparty doing so.
Thus, it was always a bit of a misnomer. Worse, it carried over
into the channel's `ClosureReason` in the event API.
Here we simply rename it and use the proper `ClosureReason`.
Matt Corallo [Sun, 10 Sep 2023 17:14:32 +0000 (17:14 +0000)]
Drop the `ChannelMonitorUpdateStatus::PermanentFailure` variant
When a `ChannelMonitorUpdate` fails to apply, it generally means
we cannot reach our storage backend. This, in general, is a
critical issue, but is often only a transient issue.
Sadly, users see the failure variant and return it on any I/O
error, resulting in channel force-closures due to transient issues.
Users don't generally expect force-closes in most cases, and
luckily with async `ChannelMonitorUpdate`s supported we don't take
any risk by "delaying" the `ChannelMonitorUpdate` indefinitely.
Thus, here we drop the `PermanentFailure` variant entirely, making
all failures instead be "the update is in progress, but won't ever
complete", which is equivalent if we do not close the channel
automatically.
Matt Corallo [Sun, 10 Sep 2023 22:11:56 +0000 (22:11 +0000)]
Rewrite failure payment retry tests to avoid perm-fail storage
Two tests in the payment tests currently rely on failing to persist
ChannelMonitorUpdates as their method of failing payments before
they even get out the door.
In the coming commits we'll drop the persist failure error codes,
so here rewrite these tests to rely on trying to send more than is
available in a channel.
Matt Corallo [Thu, 21 Sep 2023 01:44:23 +0000 (01:44 +0000)]
Use `Default::default()` to construct `()` as a test scoring param
In bindings, we can't use unbounded generic types, and thus have to
rip out the `ScoreParams` and replace them with static
`ProbabilisticScoringFeeParameters` universally. To make this easier,
using `Default::default()` everywhere allows the type to change out
from under the test without the test needing to change.
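A toy illustration of why this helps (not LDK's actual test helpers): a call site generic over the params type keeps compiling whether that type is `()` or a concrete struct, as long as it passes `&Default::default()`:

    fn score_candidate<P: Default>(score: impl Fn(&P) -> u64) -> u64 {
        // Works for P = () and for any params struct implementing Default.
        score(&Default::default())
    }

    #[derive(Default)]
    struct ExampleFeeParams { base_penalty_msat: u64 }

    fn demo() {
        let _flat = score_candidate(|_: &()| 0);
        let _rich = score_candidate(|p: &ExampleFeeParams| p.base_penalty_msat);
    }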
Matt Corallo [Mon, 24 Apr 2023 01:52:41 +0000 (01:52 +0000)]
Add an option to make the success probability estimation nonlinear
Our "what is the success probability of paying over a channel with
the given liquidity bounds" calculation currently assumes the
probability of where the liquidity lies in a channel is constant
across the entire capacity of a channel. This is obviously a
somewhat dubious assumption given most nodes don't materially
rebalance and flows within the network often push liquidity
"towards the edges".
Here we add an option to consider this when scoring channels during
routefinding. Specifically, if a new `linear_success_probability`
flag is unset on `ProbabilisticScoringFeeParameters`, rather than
assuming a PDF of `1` (across the channel's capacity scaled from 0
to 1), we use `(x - 0.5)^2`.
This assumes liquidity is likely to be near the edges, which
matches experimental results. Further, calculating the CDF (i.e.
integral) between arbitrary points on the PDF is trivial, which we
do as our main scoring function.
While this (finally) introduces floats in our scoring, it's not
practical to exponentiate using fixed-precision, and benchmarks
show this is a performance regression, but not a huge one, more
than made up for by the increase in payment success rates.
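A simplified sketch of the math described above (not LDK's exact fixed-point implementation): with the unnormalized PDF (x - 0.5)^2 over liquidity scaled to [0, 1], the antiderivative (x - 0.5)^3 / 3 gives the CDF mass between any two points cheaply:

    fn primitive(x: f64) -> f64 {
        let d = x - 0.5;
        d * d * d / 3.0
    }

    /// `min`, `amount`, and `max` are fractions of channel capacity in [0, 1],
    /// with min <= amount <= max. Returns the probability that the channel's
    /// liquidity lies above `amount`, given that it lies within [min, max].
    fn nonlinear_success_probability(min: f64, amount: f64, max: f64) -> f64 {
        (primitive(max) - primitive(amount)) / (primitive(max) - primitive(min))
    }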
Matt Corallo [Sat, 16 Sep 2023 18:39:03 +0000 (18:39 +0000)]
Score in-flight amounts as amounts, not a capacity reduction
When we started considering the in-flight amounts when scoring, we
took the approach of considering the in-flight amount as an
effective reduction in the channel's total capacity. When we were
scoring using a flat success probability PDF, that was fine,
however in the next commit we'll move to a highly nonlinear one,
which makes this a pretty confusing heuristic.
Here, instead, we move to considering the in-flight amount as
simply an extension of the amount we're trying to send over the
channel, which is equivalent for the flat success probability PDF,
but makes much more sense in a nonlinear world.
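In rough terms, the change is (names illustrative):

    // Old heuristic: score `amount_msat` against a capacity shrunk by the
    // in-flight amount.
    // New heuristic: score the in-flight amount as if it were part of the
    // payment itself, leaving the capacity untouched.
    fn amount_to_score_msat(amount_msat: u64, in_flight_msat: u64) -> u64 {
        amount_msat.saturating_add(in_flight_msat)
    }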
Matt Corallo [Mon, 24 Apr 2023 22:53:28 +0000 (22:53 +0000)]
Scale the success probability of channels without info down by 75%
If we are examining a channel for which we have no information at
all, we traditionally assume the HTLC success probability is
proportional to the channel's capacity. While this may be the case,
it is not the case that a tiny payment over a huge channel is
guaranteed to succeed, as we assume. Rather, the probability of
such success is likely closer to 50% than 100%.
Here we try to capture this by simply scaling the success
probability for channels where we have no information down
linearly. We pick 75% as the upper bound rather arbitrarily - while
50% may be more accurate, it's possible it would lead to an
over-reliance on channels which we have paid through in the past,
which aren't necessarily always the best candidates.
Note that we only do this scaling for the historical bucket
tracker, as there we can be confident we've never seen a successful
HTLC completion on the given channel. If we were to apply the same
scaling to the simple liquidity bounds based scoring we'd penalize
channels we've never tried over those we've only ever failed to pay
over, which is obviously not a good outcome.
Matt Corallo [Mon, 24 Apr 2023 05:16:51 +0000 (05:16 +0000)]
Split out success probability calculation to allow for changes
Our "what is the success probability of paying over a channel with
the given liquidity bounds" calculation is reused in a few places,
and is a key assumption across our main score calculation and the
historical bucket score calculations.
Here we break it out into a function to make it easier to
experiment with different success probability calculations.
Note that this drops the numerator +1 in the liquidity scorer,
which was added to compensate for the divisor + 1 (which exists to
avoid divide-by-zero), making the new math slightly less correct
but not by any material amount.
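Roughly, the flat-PDF probability goes from (max - amount + 1) / (max - min + 1) to the following (illustrative, not the exact LDK expression), keeping the +1 only in the divisor to avoid division by zero:

    fn flat_success_probability(
        min_liquidity_msat: u64, amount_msat: u64, max_liquidity_msat: u64,
    ) -> f64 {
        let numerator = max_liquidity_msat.saturating_sub(amount_msat);
        let denominator = max_liquidity_msat.saturating_sub(min_liquidity_msat) + 1;
        numerator as f64 / denominator as f64
    }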
Account for existing input amounts throughout coin selection
We'd previously ignore the existing amount transactions were already
attempting to spend when deciding whether we should add more inputs
throughout coin selection. This would result in us attaching more inputs
than necessary to satisfy our target amount. In the case of HTLC
transactions, we'd burn the HTLC amount completely, since the pre-signed
transaction has zero fee (input amount == output amount).
Along the way, we also fix the slight overpayment in anchor
transactions. We now properly account for the fees the transaction
already paid for, simply by pretending the fees are part of the anchor
input amount.
Since we don't know the total input amount of an external claim (those
which come from anchor channels), we can't limit our feerate bumps by the
amount of funds we have available to use. Instead, we choose to limit it
by a margin of the new feerate estimate.
Elias Rohrer [Mon, 11 Sep 2023 11:50:10 +0000 (13:50 +0200)]
Probe up to second-to-last hop if last was provided by route hint
If the last hop was provided by a route hint, we assume it's not an announced channel.
If, furthermore, only a single route hint is provided, we refrain from probing
all the way to the end and instead probe up to the second-to-last channel.
Optimally we'd do this not based on above mentioned assumption but
rather by checking inclusion in our network graph. However, we don't
have access to our graph in `ChannelManager`.
Elias Rohrer [Mon, 28 Aug 2023 13:03:04 +0000 (15:03 +0200)]
Add preflight probing capabilities
We add a `ChannelManager::send_preflight_probes` method that can be used
to send pre-flight probes given some [`RouteParameters`]. Additionally,
we add convenience methods for sending spontaneous probes and for
sending pre-flight probes for a given invoice.
As pre-flight probes might take up some of the available liquidity, we
here introduce a limit: channels whose available liquidity is less than
the required amount times
`UserConfig::preflight_probing_liquidity_limit_multiplier` won't be used
to send pre-flight probes.
This commit is more or less a carbon copy of the pre-flight
probing code recently added to LDK Node.
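A sketch of the liquidity filter being introduced, with illustrative names (only the config field name above is taken from the text):

    fn usable_for_preflight_probe(
        available_liquidity_msat: u64,
        probe_amount_msat: u64,
        preflight_probing_liquidity_limit_multiplier: u64,
    ) -> bool {
        // Don't lock up channels whose spendable balance is below
        // amount * multiplier just to send a probe.
        available_liquidity_msat
            >= probe_amount_msat.saturating_mul(preflight_probing_liquidity_limit_multiplier)
    }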
Elias Rohrer [Tue, 12 Sep 2023 13:51:37 +0000 (15:51 +0200)]
Include `maybe_announced` field in `RouteHop`
When sending preflight probes, we want to exclude last hops that are
possibly announced. To this end, we here include a new field in
`RouteHop` that will be `true` when we either definitely know the hop
to be announced, or when there exist public channels between the hop's
counterparties that this hop might refer to (i.e., be an alias for).
Matt Corallo [Fri, 15 Sep 2023 23:52:40 +0000 (23:52 +0000)]
Correct syn pinning on cargo 1.48
Sadly the pinning introduced in 050f5a90297678f2cad36feee9a1db2367e
was brittle in the face of any further syn updates, and has already
broken.
Here we fix it by looking up the actual version of syn to pin.
Note that this dependency is somewhat nonsensical as it's actually only
a `criterion` dependency, pulled in even though we haven't set the
bench flag (as we aren't yet using `resolver = 2`).
Matt Corallo [Fri, 15 Sep 2023 19:17:34 +0000 (19:17 +0000)]
Avoid unnecessarily cloning unsigned Transaction when broadcasting
Our `Trusted*` wrappers in `chan_utils` expose additional inner
fields by reference. However, because they were not explicitly
marked as returning a reference with the wrapped struct's
lifetimes, rustc was considering them to return a reference with
the wrapper struct's lifetime.
Matt Corallo [Thu, 7 Sep 2023 22:34:30 +0000 (22:34 +0000)]
Move to a constant for "bucket one" in the scoring buckets
Scoring buckets are stored as fixed point ints, with a 5-bit
fractional part (i.e. a value of 1.0 is stored as "32"). Now that
we also have 32 buckets, this leads to the codebase having many
references to 32 which could reasonably be confused for each other.
Thus, we add a constant here for the value 1.0 in our fixed-point
scheme.
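For illustration (the constant name here is assumed, not necessarily the one added):

    /// 1.0 in the buckets' fixed-point scheme: a 5-bit fractional part means
    /// 1.0 is stored as 1 << 5 == 32.
    const BUCKET_FIXED_POINT_ONE: u16 = 32;

    fn bucket_value_as_float(stored: u16) -> f64 {
        stored as f64 / BUCKET_FIXED_POINT_ONE as f64
    }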
Matt Corallo [Tue, 6 Jun 2023 04:08:32 +0000 (04:08 +0000)]
Decay `historical_estimated_channel_liquidity_*` result to `None`
`historical_estimated_channel_liquidity_probabilities` previously
decayed to `Some(([0; 8], [0; 8]))`. This was thought to be useful
in that it allowed identification of cases where data was previously
available but is now decayed away vs cases where data was never
available. However, with the introduction of
`historical_estimated_payment_success_probability` (which uses the
existing scoring routines so will decay to `None`) this is
unnecessarily confusing.
Given data which has decayed to zero will also not be used anyway,
there's little reason to keep the old behavior, and we now decay to
`None`.
We also take this opportunity to split the overloaded
`get_decayed_buckets`, removing unnecessary code during scoring.
Matt Corallo [Sat, 19 Aug 2023 18:43:45 +0000 (18:43 +0000)]
Special-case the 0th minimum bucket in historical scoring
Points in the 0th minimum bucket either indicate we sent a payment
which is < 1/16,384th of the channel's capacity or, more likely,
we failed to send a payment. In either case, averaging the success
probability across the full range of upper-bounds doesn't make a
whole lot of sense - if we've never managed to send a "real"
payment over a channel, we should be considering it quite poor.
To address this, we special-case the 0th minimum bucket and only
look at the largest-offset max bucket when calculating the success
probability.
Matt Corallo [Sun, 16 Apr 2023 04:03:08 +0000 (04:03 +0000)]
Track "steady-state" channel balances in history buckets not live
The lower bounds of the scoring history buckets generally never get
used - if we try to send a payment and it fails, we don't learn
a new lower-bound for the liquidity of a channel, and if we
successfully send a payment we only learn a lower-bound that
applied *before* we sent the payment, not after it completed.
If we assume channels have some "steady-state" liquidity, then
tracking our liquidity estimates *after* a payment doesn't really
make sense - we're not super likely to make a second payment across
the same channel immediately (or, if we are, we can use our
un-decayed liquidity estimates for that). By the time we do go to
use the same channel again, we'd assume that it's back at its
"steady-state" and the impacts of our payment have been lost.
To combat both of these effects, here we "subtract" the impact of
any just-successful payments from our liquidity estimates prior to
updating the historical buckets.
Matt Corallo [Fri, 8 Sep 2023 17:48:50 +0000 (17:48 +0000)]
Move the historical bucket tracker to 32 unequal sized buckets
Currently we store our historical estimates of channel liquidity in
eight evenly-sized buckets, each representing a full octile of the
channel's total capacity. This lacks precision, especially at the
edges of channels where liquidity is expected to lie.
To mitigate this, we'd originally checked if a payment lies within
a bucket by comparing it to a sliding scale of 64ths of the
channel's capacity. This allowed us to assign penalties to payments
that fall above the bottom 64th or below the top 64th of a channel's
capacity.
However, this still lacks material precision - on a 1 BTC channel
we could only consider failures for HTLCs above 1.5 million sats.
With today's lightning usage often including 1-100 sat payments in
tips, this is a rather significant lack of precision.
Here we rip out the existing buckets and replace them with 32
*unequal* sized buckets. This allows us to focus our precision at
the edges of a channel (where the liquidity is likely to lie, and
where precision helps the most).
We set the size of the edge buckets to 1/16,384th of the channel,
with the size increasing exponentially until it approaches the
inner buckets. For backwards compatibility, the new buckets divide
evenly into the original eight, allowing us to convert the existing
buckets into the new ones cleanly.
This allows us to consider HTLCs down to 6,000 sats for 1 BTC
channels. In order to avoid failing to penalize channels which have
always failed, we drop the sliding scale for comparisons and simply
check if the payment is above the minimum bucket we're analyzing and
below *or in* the maximum one. This generates somewhat more
pessimistic scores, but fixes the lower bound where we suddenly
assign a 0% failure probability.
While this does represent a regression in routing performance, in
some cases the impact of not having to examine as many nodes
dominates, leading to a performance increase.
On a Xeon E3-1220 v5, the `large_mpp_routes` benchmark shows a 15%
performance increase, while the more stable benchmarks show an 8%
and 15% performance regression.
Matt Corallo [Thu, 14 Sep 2023 21:49:14 +0000 (21:49 +0000)]
Pin memchr in our release dependency list due to `core2` using it
We're working with rust-bitcoin to remove the `core2` dependency
at https://github.com/rust-bitcoin/rust-bitcoin/pull/2066 but until
that lands and we can upgrade rust-bitcoin we're stuck with it. In
the mean time, we should still pass our MSRV tests.
Matt Corallo [Thu, 14 Sep 2023 21:47:07 +0000 (21:47 +0000)]
Drop `test_esplora_connects_to_public_server`
`blockstream.info` is currently down, causing our CI to fail. This
shouldn't really be a thing, so we drop the blockstream.info-based
test here.
More generally, I'm not really a fan of having tests which run
(outside of CI) and call out to external servers - a developer
working on LDK shouldn't have to have internet access to run our
test suite and shouldn't be registering their presence with a third
party to run our tests.
Elias Rohrer [Mon, 28 Aug 2023 11:57:29 +0000 (13:57 +0200)]
Set `payment_secret` when sending probes
Previously, we'd leave the payment secret field empty while sending
probes, which resulted in having them rejected
with `(PERM|invalid_onion_payload)` by Eclair nodes.
In order to mitigate the issue, we just set a random payment secret.
Elias Rohrer [Tue, 12 Sep 2023 11:37:57 +0000 (13:37 +0200)]
Cleanup `ChannelId` re-export
`ChannelId` was weirdly listed in the re-export section of the docs and
reachable via multiple paths. Here we opt to make the `channel_id`
module private and leave only the `ChannelId` struct itself exposed.
Matt Corallo [Mon, 28 Aug 2023 01:25:36 +0000 (01:25 +0000)]
Restrict `ChannelManager` persist in fuzzing to when we're told to
In the `chanmon_consistency` fuzz, we currently "persist" the
`ChannelManager` on each loop iteration. With the new logic in the
past few commits to reduce the frequency of `ChannelManager`
persistences, this behavior now leaves a gap in our test coverage -
missing persistence notifications.
In order to catch (common-case) persistence misses, we update the
`chanmon_consistency` fuzzer to no longer persist the
`ChannelManager` unless the waker was woken and signaled to
persist, possibly reloading with a previous `ChannelManager` if we
were not signaled.
Matt Corallo [Mon, 28 Aug 2023 01:35:16 +0000 (01:35 +0000)]
Remove largely useless checks in chanmon_consistency fuzzer
When reloading nodes A or C, the chanmon_consistency fuzzer
currently calls `get_and_clear_pending_msg_events` on the node,
potentially causing additional `ChannelMonitor` or `ChannelManager`
updates, just to check that no unexpected messages are generated.
There's not much reason to do so, the fuzzer could always swap for
a different command to call the same method, and the additional
checking requires some weird monitor persistence introspection.
Here we simplify the fuzzer by simply removing this logic.
Matt Corallo [Thu, 24 Aug 2023 20:02:08 +0000 (20:02 +0000)]
Skip persistence in the usual case handling channel_reestablish
When we handle an inbound `channel_reestablish` from our peers it
generally doesn't change any state and thus doesn't need a
`ChannelManager` persistence. Here we avoid said persistence where
possible.
Matt Corallo [Thu, 24 Aug 2023 19:57:45 +0000 (19:57 +0000)]
Always persist the `ChannelManager` on a failed ChannelUpdate
If we receive a `ChannelUpdate` message which was invalid, it can
cause us to force-close the channel, which should result in a
`ChannelManager` persistence, though it's not critical to do so.
Matt Corallo [Sun, 10 Sep 2023 23:10:03 +0000 (23:10 +0000)]
Avoid persisting `ChannelManager` in response to peer connection
When a peer connects and we send some `channel_reestablish`
messages or create a `per_peer_state` entry there's really no
reason to need to persist the `ChannelManager`. None of the
possible actions we take immediately result in a change to the
persisted contents of a `ChannelManager`, only the peer's later
`channel_reestablish` message does.
Matt Corallo [Thu, 24 Aug 2023 19:36:58 +0000 (19:36 +0000)]
Move a handful of channel messages to notify-without-persist
Many channel related messages don't actually change the channel
state in a way that changes the persisted channel. For example,
an `update_add_htlc` or `update_fail_htlc` message simply adds the
change to a queue, changing the channel state when we receive a
`commitment_signed` message.
In these cases there's really no reason to wake the background
processor at all - there's no response message and there's no state
update. However, note that if we close the channel we should
persist the `ChannelManager`. If we send an error message without
closing the channel, we should wake the background processor
without persisting.
Here we move to the appropriate `NotifyOption` on some of the
simpler channel message handlers.
Matt Corallo [Thu, 24 Aug 2023 18:37:18 +0000 (18:37 +0000)]
Update `channelmanager::NotifyOption` to indicate persist or event
As we now signal events-available from persistence-needed
separately, the `NotifyOption` enum should include a separate
variant for events-but-no-persistence, which we add here.
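Schematically, the enum now looks something like the following (variant names assumed from the description):

    enum NotifyOption {
        /// Persist the ChannelManager and process any pending events.
        DoPersist,
        /// Process pending events, but skip re-persisting the manager.
        SkipPersistHandleEvents,
        /// Neither persistence nor event processing is needed.
        SkipPersistNoEvents,
    }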
Matt Corallo [Fri, 8 Sep 2023 20:26:29 +0000 (20:26 +0000)]
Separate ChannelManager needing persistence from having events
Currently, when a ChannelManager generates a notification for the
background processor, any pending events are handled and the
ChannelManager is always re-persisted.
Many channel related messages don't actually change the channel
state in a way that changes the persisted channel. For example,
an `update_add_htlc` or `update_fail_htlc` message simply adds the
change to a queue, changing the channel state when we receive a
`commitment_signed` message.
In these cases we shouldn't be re-persisting the ChannelManager as
it hasn't changed (persisted) state at all. In anticipation of
doing so in the next few commits, here we make the public API
handle the two concepts (somewhat) separately. The notification
still goes out via a single waker, however whether or not to
persist is now handled via a separate atomic bool.
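A minimal model of the mechanism (not the real `ChannelManager` internals):

    use std::sync::atomic::{AtomicBool, Ordering};

    struct NotificationState {
        needs_persist: AtomicBool,
        // plus the existing waker/notifier, elided here
    }

    impl NotificationState {
        fn note_event_only(&self) {
            // Wake the background processor without requesting a persist.
        }
        fn note_persistable_change(&self) {
            self.needs_persist.store(true, Ordering::Release);
            // ...and wake the background processor.
        }
        fn take_persistence_needed(&self) -> bool {
            self.needs_persist.swap(false, Ordering::AcqRel)
        }
    }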
Matt Corallo [Mon, 11 Sep 2023 03:38:14 +0000 (03:38 +0000)]
Make it harder to forget to call CM::process_background_events
Prior to any actions which may generate a `ChannelMonitorUpdate`,
and in general after startup,
`ChannelManager::process_background_events` must be called. This is
mostly accomplished by doing so on taking the
`total_consistency_lock` via the `PersistenceNotifierGuard`. In
order to skip this call in block connection logic, the
`PersistenceNotifierGuard::optionally_notify` constructor did not
call the `process_background_events` method.
However, this is very easy to misuse - `optionally_notify` does not
convey to the reader that they need to call
`process_background_events` at all.
Here we fix this by adding a separate
`optionally_notify_skipping_background_events` method, making the
requirements much clearer to callers.
Matt Corallo [Wed, 5 Jul 2023 16:15:59 +0000 (16:15 +0000)]
Test monitor update completion actions on pre-startup completion
This adds a test for monitor update actions being completed on
startup if a monitor update completed "while we were shut down"
(or, really, the manager didn't get persisted after the update
completed).
Matt Corallo [Mon, 21 Aug 2023 18:44:22 +0000 (18:44 +0000)]
Update tests to test re-claiming of forwarded HTLCs on startup
Because some of these tests require connecting blocks without
calling `get_and_clear_pending_msg_events`, we need to split up
the block connection utilities to only optionally call
sanity-checks.
`expect_payment_forwarded` takes a bool to indicate that the
inbound channel on which we received a forwarded payment has been
closed, but then ignores it in favor of looking at the fee in the
event. While this is generally correct, it is incorrect in cases where
we process an event which was generated before a channel closed but
only handle it after the channel has closed.
Instead, we examine the bool we already passed and use that.
Matt Corallo [Thu, 7 Sep 2023 02:22:52 +0000 (02:22 +0000)]
Block the mon update removing a preimage until upstream mon writes
When we forward a payment and receive an `update_fulfill_htlc`
message from the downstream channel, we immediately claim the HTLC
on the upstream channel, before even doing a `commitment_signed`
dance on the downstream channel. This implies that our
`ChannelMonitorUpdate`s "go out" in the right order - first we
ensure we'll get our money by writing the preimage down, then we
write the update that resolves giving money on the downstream node.
This is safe as long as `ChannelMonitorUpdate`s complete in the
order in which they are generated, but of course looking forward we
want to support asynchronous updates, which may complete in any
order.
Thus, here, we enforce the correct ordering by blocking the
downstream `ChannelMonitorUpdate` until the upstream one completes.
Like the `PaymentSent` event handling we do so only for the
`revoke_and_ack` `ChannelMonitorUpdate`, ensuring the
preimage-containing upstream update has a full RTT to complete
before we actually manage to slow anything down.
Matt Corallo [Thu, 31 Aug 2023 19:06:34 +0000 (19:06 +0000)]
Clean up test handling of resending responding commitment_signed
When we need to rebroadcast a `commitment_signed` on reconnect in
response to a previous update (i.e. not one which contains any
updates) we previously hacked in support for it by passing a `-1`
for the number of expected update_add_htlcs. This is a mess, and
with the introduction of `ReconnectArgs` we can now clean it up
easily with a new bool.