Re: [PATCH] libceph: handle EADDRNOTAVAIL more gracefully
From: Ilya Dryomov
Date: Tue Feb 10 2026 - 07:26:01 EST
On Tue, Feb 10, 2026 at 8:19 AM Ionut Nechita (Wind River)
<ionut.nechita@xxxxxxxxxxxxx> wrote:
>
> Hi Ilya,
>
> Thank you for the thorough review and the good questions. You're right
> to challenge the "1-2 seconds" claim -- looking at the dmesg data more
> carefully, that was misleading in the commit message.
>
> > I'm missing how an error that is typically transient and goes away in
> > 1-2s can cause a delay of 15+ seconds against a 250ms, 500ms, 1s, 2s,
> > 4s, 8s, 15s backoff loop.
>
> You're absolutely right that if the address became valid in 1-2s, the
> third or fourth attempt would succeed. The problem is that in our
> environment, EADDRNOTAVAIL does NOT resolve in 1-2 seconds. That was
> an incorrect generalization from simple DAD scenarios.
>
> From the production dmesg (6.12.0-1-rt-amd64, StarlingX on Dell
> PowerEdge R720, IPv6-only Ceph cluster), the EADDRNOTAVAIL condition
> persists for much longer:
>
> 13:20:52 - mon0 session lost, hunting begins, first error -99
> 13:57:03 - mon0 session finally re-established
>
> That's approximately 36 minutes of continuous EADDRNOTAVAIL on all
> source addresses. This happens during a StarlingX rolling upgrade,
> where the platform reconfigures the network stack extensively (interface
> teardown/rebuild, address reassignment, routing changes).
Hi Ionut,
For how much of those 36 minutes was EADDRNOTAVAIL actually being
returned from kernel_connect()? I'm trying to separate the time for
which the external condition persisted from the time that it took the
client to reestablish the session after the resolution. The
"approximately 36 minutes of continuous EADDRNOTAVAIL on all source
addresses" makes it sound like kernel_connect() was returning
EADDRNOTAVAIL all that time. If so, it would mean that the client
managed to reestablish the monitor session in 13:57:03 - 13:20:52
= 0:36:11 (i.e. just some double-digit seconds on top of the error
disappearing), which would seem acceptable.
>
> The reason the delays compound beyond the simple backoff sequence is
> that there are two independent backoff mechanisms stacking:
>
> 1) Connection-level backoff (con_fault in messenger.c):
> 250ms -> 500ms -> 1s -> 2s -> 4s -> 8s -> 15s (MAX_DELAY_INTERVAL)
>
> 2) Monitor hunt-level backoff (mon_client.c delayed_work):
> 3s * hunt_mult, where hunt_mult doubles each cycle up to 10x max,
> so the hunt interval grows: 3s -> 6s -> 12s -> 24s -> 30s (capped)
>
> At steady state, each monitor gets ~30 seconds of attempts before
> the hunt timer switches to the next one. Within those 30 seconds,
> the connection goes through the full exponential backoff (several
> attempts up to the 15s max delay). The round-trip through both
> monitors takes ~60 seconds at max backoff.
What is meant by both monitors? IIRC the client only tries to connect
to a single monitor at a time. How many monitors does this cluster have
configured?
>
> > How many attempts do you see per session and in total for the event
> > before and after this patch?
>
> Before the patch (from the dmesg):
> - Total error-99 messages: ~470 connect attempts over 36 minutes
> - Per monitor session (one hunt cycle at steady state): ~8 attempts
> (immediate x3, +1s, +2s, +3s, +5s, +8s before hunt switches)
> - The sync task was blocked for 983+ seconds (over 16 minutes),
> triggering repeated hung task warnings:
> 12:52:11 - "task sync blocked for more than 122 seconds"
> 13:31:05 - "task sync blocked for more than 122 seconds" (new sync)
> 13:33:08 - 245 seconds
> 13:35:11 - 368 seconds
> ...continued up to 983+ seconds at 13:45:26
>
> After the patch:
> - The ADDRNOTAVAIL_DELAY (HZ/10 = 100ms) replaces the exponential
> backoff for EADDRNOTAVAIL failures specifically, so retries happen
> at a fixed 100ms interval instead of growing to 15s
> - In testing with the same rolling upgrade scenario, the total
> reconnection time dropped from 36 minutes to under 3 seconds once
> the address became available, because the client was retrying every
> 100ms rather than waiting 15s between attempts at the connection
> level
This is where I'm getting lost again. It's stated above that in this
environment EADDRNOTAVAIL doesn't resolve in 1-2 seconds. If it takes
minutes for the underlying error to disappear in this scenario, how
could the patch result in total reconnection time dropping to under
3 seconds?
> - Total attempts per event: similar count, but compressed into a
> much shorter window with faster recovery once the address is valid
>
> I should correct the commit message -- the "1-2 seconds" claim was
> wrong. The accurate description is that the duration of EADDRNOTAVAIL
> varies widely depending on the environment: it can be brief (simple
> DAD) or very long (complex network reconfiguration during rolling
> upgrades). The patch helps in both cases by keeping the retry interval
> short so that recovery happens as soon as the address becomes
> available, rather than potentially waiting up to 15 seconds for the
> next connection attempt.
Is this setup experiencing the brief or the very long case? Or is it
both, heavily intermixed? Speaking generally, if the address doesn't
become available for tens of minutes, waiting for up to 30 seconds on
top of that isn't a problem IMO.
>
> I will also note that the connection-level backoff delay does NOT
> reset when the monitor client switches monitors via reopen_session(),
> because ceph_con_open() sets con->delay = 0 but the new connection
> immediately hits EADDRNOTAVAIL and con_fault() sets it right back
> into exponential backoff.
... but when the new connection goes back into backoff, it starts at
250ms, right? I'd call that a reset since the new connection doesn't
inherit e.g. 15s delay from the old connection. Are you observing
something different there?
Thanks,
Ilya