[PATCH v1 00/13] ceph/libceph: fix hung tasks and connection recovery during network disruptions

From: Ionut Nechita (Wind River)

Date: Thu Mar 12 2026 - 04:18:03 EST

During Rook-Ceph rolling upgrades (e.g., Ceph 18.2.2 -> 18.2.5) with
active CephFS workloads (ReadWriteMany PVCs with continuous I/O), the
kernel CephFS client encounters multiple cascading failures that leave
the filesystem completely unresponsive:

1. Persistent EADDRNOTAVAIL (-99) on all connections

When monitor/MDS/OSD pods restart during the upgrade, they receive
new IP addresses from the CNI plugin. The kernel client's cached
source address (learned via process_hello/process_banner from the
initial connection) may become invalid -- e.g., a Calico-assigned
pod address that was removed, with a blackhole route installed for
the old address range. All subsequent kernel_connect() calls fail
with EADDRNOTAVAIL at ip6_dst_lookup_flow() before even sending a
TCP SYN.

The existing exponential backoff (250ms -> 15s) compounds with
the monitor hunt backoff (3s * hunt_mult, up to 30s), making
recovery take 30+ minutes even after the network issue resolves.

Observed: ~470 failed connect attempts over ~36 minutes, with the
client sitting idle for up to 15s between attempts.
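
To make the cost of the existing backoff concrete, it can be modelled in
userspace. This is a minimal sketch, not the messenger code: the 250ms
base and 15s cap are taken from the description above, while the plain
doubling rule and every identifier below are assumptions made for
illustration.

```c
#include <assert.h>

#define BASE_DELAY_MS   250u    /* first retry delay, from the text above */
#define MAX_DELAY_MS  15000u    /* backoff cap, from the text above */

/*
 * Delay before the next connect attempt, given the previous delay.
 * prev_ms == 0 means no attempt has failed yet. Doubling on every
 * failure is an assumption of this model.
 */
unsigned int next_backoff_ms(unsigned int prev_ms)
{
    unsigned int next;

    assert(prev_ms <= MAX_DELAY_MS);   /* only feed back our own output */

    if (prev_ms == 0)
        return BASE_DELAY_MS;
    next = prev_ms * 2;
    return next > MAX_DELAY_MS ? MAX_DELAY_MS : next;
}

/* Total idle time accumulated over the first n consecutive failures. */
unsigned long total_idle_ms(unsigned int n)
{
    unsigned long total = 0;
    unsigned int delay = 0;

    while (n--) {
        delay = next_backoff_ms(delay);
        total += delay;
    }
    return total;
}
```

Under these assumptions the cap is reached on the seventh consecutive
failure, after which every further attempt costs a full 15s of idle time
even though the connect itself fails immediately.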

2. Indefinite hangs in the sync() path

ceph_mdsc_sync() and ceph_osdc_sync() use wait_event() and
wait_for_completion() with no timeout. When MDS/OSD connections
are down, sync tasks block indefinitely in D state, triggering
hung_task warnings that escalate: 122s, 245s, 368s, ... 983s+.

Stack traces show:
  ceph_mdsc_sync -> wait_caps_flush (indefinite wait_event)
  ceph_mdsc_sync -> flush_mdlog_and_wait_mdsc_unsafe_requests
                    (indefinite wait_for_completion)
  ceph_osdc_sync -> wait_for_completion (indefinite)

3. Stale mdsmap causing permanent MDS reconnection failure

The kernel client caches the mdsmap and subscribes for incremental
updates (start=current_epoch+1). If the monitor subscription is lost
during the upgrade (which the EADDRNOTAVAIL failures above can cause),
the client never receives updated maps. The mdsmap was observed stuck
at epoch 53 while the cluster had progressed to epoch 90, and the
client retried connections to the old MDS address indefinitely.

Two scenarios lead to this:
  a) Active connection failures: mds_con_ops has no .fault callback,
     so the MDS client is never notified
  b) Silent connection death: the messenger enters STANDBY and the
     session transitions to HUNG via TTL, but no mdsmap refresh is
     triggered

4. I/O operations hung in unkillable D state

ceph_start_io_write() and related I/O lock functions use
inode_dio_wait() and wait_on_inode_writeback(), which sleep in
TASK_UNINTERRUPTIBLE. During MDS failover, these waits can block
indefinitely and the tasks cannot be killed, so D-state processes
accumulate.

Test results (20 iterations of MDS kill during active I/O):
12 passed, 8 failed with hung tasks in ceph_start_io_write,
__ceph_get_caps, ceph_fsync.

5. Wrong network namespace captured at mount time

ceph_messenger_init() captures current->nsproxy->net_ns. In CSI
environments, mount() may be invoked from a pod namespace that
lacks routes to Ceph monitors, causing permanent EADDRNOTAVAIL.

This series addresses all five issues:

Connection layer (patches 1, 11-12):
- Bypass exponential backoff for EADDRNOTAVAIL, use fixed 100ms retry
- After 30 consecutive EADDRNOTAVAIL failures, reset the cached source
  address to blank so process_hello() re-learns it from the monitor
- Force immediate monitor reconnect during persistent EADDRNOTAVAIL,
  reset hunt_mult to prevent accumulated backoff
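
The first two bullets can be sketched as a small userspace model of the
retry decision. This is an illustration under stated assumptions, not the
patch itself: the 100ms retry, the threshold of 30, and the 250ms/15s
backoff bounds come from this letter, while the struct, the function
name, and the choice to restart the counter after a reset are
hypothetical. The monitor reconnect and hunt_mult reset from the third
bullet are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define BASE_DELAY_MS           250u  /* normal backoff start */
#define MAX_DELAY_MS          15000u  /* normal backoff cap */
#define ADDRNOTAVAIL_RETRY_MS   100u  /* fixed retry for EADDRNOTAVAIL */
#define ADDRNOTAVAIL_RESET_AT    30u  /* consecutive failures before reset */

/* Hypothetical per-connection state; not the kernel's ceph_connection. */
struct con_model {
    unsigned int delay_ms;            /* current exponential backoff delay */
    unsigned int addrnotavail_count;  /* consecutive EADDRNOTAVAIL failures */
    bool         src_addr_blank;      /* cached source address was cleared */
};

/*
 * Record a failed connect (err is a negative errno) and return the
 * delay in milliseconds before the next attempt.
 */
unsigned int con_model_fault(struct con_model *con, int err)
{
    assert(con != NULL && err < 0);

    if (err == -EADDRNOTAVAIL) {
        if (++con->addrnotavail_count >= ADDRNOTAVAIL_RESET_AT) {
            con->src_addr_blank = true;   /* re-learn from the peer */
            con->addrnotavail_count = 0;  /* assumption: start over */
        }
        return ADDRNOTAVAIL_RETRY_MS;     /* backoff state is untouched */
    }

    /* Any other error breaks the streak and backs off as before. */
    con->addrnotavail_count = 0;
    if (con->delay_ms == 0) {
        con->delay_ms = BASE_DELAY_MS;
    } else {
        con->delay_ms *= 2;
        if (con->delay_ms > MAX_DELAY_MS)
            con->delay_ms = MAX_DELAY_MS;
    }
    return con->delay_ms;
}
```

In this model an EADDRNOTAVAIL streak never inflates delay_ms, so once
the address problem clears, the next unrelated failure starts again from
the 250ms base rather than from an accumulated 15s delay.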

Sync path timeouts (patches 2-3, 5-7):
- Add mount_timeout-based timeouts to all indefinite waits in
  the sync path: wait_caps_flush(), ceph_osdc_sync(),
  flush_mdlog_and_wait_mdsc_unsafe_requests(),
  ceph_lock_wait_for_completion(), __ceph_get_caps()
- Set a default timeout for MDS requests
- On timeout, pending operations are NOT discarded -- they remain in
  memory and complete when connectivity is restored
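
The intended semantics (give up waiting, but leave the pending operation
alone) can be illustrated with a userspace analogue built on
pthread_cond_timedwait(). This is only a sketch: the kernel offers
timeout variants such as wait_event_timeout() and
wait_for_completion_timeout(), but this letter does not say which
primitives the patches use, and all names below are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for one pending flush that a sync waits on. */
struct flush_waiter {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;   /* set when the flush has been acknowledged */
};

/*
 * Wait up to timeout_ms for the flush to finish.
 * Returns 0 if it finished, -ETIMEDOUT otherwise. The waiter is left
 * untouched on timeout, so a later completion is still observed.
 */
int wait_flush_timeout(struct flush_waiter *w, long timeout_ms)
{
    struct timespec deadline;
    bool done;
    int rc = 0;

    assert(w != NULL && timeout_ms >= 0);

    /* Condition variables use CLOCK_REALTIME deadlines by default. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&w->lock);
    while (!w->done && rc == 0)   /* loop guards against spurious wakeups */
        rc = pthread_cond_timedwait(&w->cond, &w->lock, &deadline);
    done = w->done;
    pthread_mutex_unlock(&w->lock);

    return done ? 0 : -ETIMEDOUT;
}

/* Called when connectivity returns and the flush is acknowledged. */
void flush_complete(struct flush_waiter *w)
{
    pthread_mutex_lock(&w->lock);
    w->done = true;
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->lock);
}
```

A caller that gets -ETIMEDOUT can return an error to userspace instead of
sitting in D state; a later flush_complete() still marks the same waiter
done, which mirrors the "not discarded" behaviour described above.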

Race condition fix (patch 4):
- Fix race in cleanup_session_requests() where the request list can
  be modified concurrently during MDS reconnection

I/O killability (patches 8-9):
- Make ceph_start_io_write() and related I/O lock functions killable
  (TASK_KILLABLE instead of TASK_UNINTERRUPTIBLE)
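
A TASK_KILLABLE sleep is woken by fatal signals only, which changes the
contract of the lock function: it can now fail, and the caller must bail
out without holding the lock. The single-threaded model below sketches
that contract. It is not the patch: the types, the event callbacks, and
the -EINTR return value are assumptions for illustration, and the real
functions may return a different error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the task and inode state involved. */
struct model_task  { bool fatal_signal_pending; };
struct model_inode { int inflight_dio; bool io_locked; };

/* One wakeup of the sleeping task; simulates whatever event woke it. */
typedef void (*wakeup_fn)(struct model_inode *, struct model_task *);

/* Example events: outstanding direct I/O finishes, or SIGKILL arrives. */
void event_dio_completes(struct model_inode *inode, struct model_task *task)
{
    (void)task;
    inode->inflight_dio--;
}

void event_sigkill(struct model_inode *inode, struct model_task *task)
{
    (void)inode;
    task->fatal_signal_pending = true;
}

/*
 * Killable variant of "wait for in-flight direct I/O, then take the
 * I/O lock". Returns 0 with the lock held, or -EINTR without it.
 * An uninterruptible variant would have no signal check and would
 * keep sleeping for as long as inflight_dio stays above zero.
 */
int model_start_io_write(struct model_inode *inode, struct model_task *task,
                         wakeup_fn on_wakeup)
{
    assert(inode != NULL && task != NULL && on_wakeup != NULL);

    while (inode->inflight_dio > 0) {
        if (task->fatal_signal_pending)
            return -EINTR;           /* give up; lock is NOT held */
        on_wakeup(inode, task);      /* sleep until the next event */
    }
    inode->io_locked = true;
    return 0;
}
```

The practical consequence for callers is that every call site has to
check the return value and skip the matching unlock on failure.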

MDS map refresh (patch 10):
- Add .fault callback to mds_con_ops to detect persistent MDS
  connection failures
- Force fresh mdsmap subscription (start=0) after 10 consecutive
  failures or when a session becomes HUNG
- Reset failure counter on successful session message
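
The refresh rule in these bullets amounts to a small counter-driven state
machine, sketched here as a userspace model. The threshold of 10, the
start=0 subscription, and the HUNG trigger come from this letter; the
struct, the function names, and the decision to clear the counter once a
refresh has been forced are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MDS_FAULT_THRESHOLD 10u   /* consecutive failures before refresh */

enum session_state { SESSION_OPEN, SESSION_HUNG };   /* simplified */

/* Hypothetical model of the MDS client state involved. */
struct mdsc_model {
    uint32_t     mdsmap_epoch;   /* epoch of the cached mdsmap */
    uint32_t     sub_start;      /* start value of the last subscription */
    unsigned int fault_count;    /* consecutive MDS connection failures */
};

/* Normal case: ask only for maps newer than the cached one. */
void model_want_next_mdsmap(struct mdsc_model *m)
{
    m->sub_start = m->mdsmap_epoch + 1;
}

/* Recovery case: start=0 asks for the map regardless of cached epoch. */
void model_force_full_mdsmap(struct mdsc_model *m)
{
    m->sub_start = 0;
    m->fault_count = 0;          /* assumption: begin a new streak */
}

/* Connection fault notification; returns true if a refresh was forced. */
bool model_mds_con_fault(struct mdsc_model *m)
{
    assert(m != NULL);

    if (++m->fault_count < MDS_FAULT_THRESHOLD)
        return false;
    model_force_full_mdsmap(m);
    return true;
}

/* A session going HUNG forces a refresh without waiting for faults. */
void model_session_state_changed(struct mdsc_model *m, enum session_state s)
{
    if (s == SESSION_HUNG)
        model_force_full_mdsmap(m);
}

/* Any successful session message ends the failure streak. */
void model_session_msg_received(struct mdsc_model *m)
{
    m->fault_count = 0;
}
```

With the cached map at epoch 53, the incremental subscription asks for
start=54; after ten consecutive faults, or a HUNG transition, the model
switches to start=0 so a client with a badly stale map can catch up.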

Network namespace (patch 13):
- Always use init_net in ceph_messenger_init() instead of the
  caller's namespace
- With this change, mon, mds, and osd connections all use the host
  network namespace, so the client can reconnect to all Ceph daemons
  after the upgrade regardless of which namespace triggered the mount

Tested on kernel 6.12.x with Rook-Ceph (Ceph Reef 18.2.5), IPv6-only
cluster, during rolling upgrades with active CephFS workloads. The
patches resolve all five failure modes described above.

Ionut Nechita (13):
  libceph: handle EADDRNOTAVAIL more gracefully
  ceph: add timeout protection to ceph_mdsc_sync() path
  ceph: add timeout protection to ceph_osdc_sync() path
  ceph: fix race condition in cleanup_session_requests()
  ceph: add timeout protection to ceph_lock_wait_for_completion()
  ceph: set default timeout for MDS requests
  ceph: add timeout to caps wait in __ceph_get_caps()
  ceph: make ceph_start_io_write() killable
  ceph: make remaining I/O lock functions killable
  ceph: force mdsmap refresh on persistent MDS connection failures
  libceph: reset source address on persistent EADDRNOTAVAIL
  libceph: force monitor reconnect on persistent EADDRNOTAVAIL
  libceph: force host network namespace for kernel CephFS mounts

 fs/ceph/caps.c                  |  16 +++-
 fs/ceph/file.c                  |  34 ++++++--
 fs/ceph/io.c                    |  37 +++++---
 fs/ceph/io.h                    |   6 +-
 fs/ceph/locks.c                 |  14 ++-
 fs/ceph/mds_client.c            | 148 +++++++++++++++++++++++++++++---
 fs/ceph/mds_client.h            |   4 +-
 fs/ceph/super.c                 |   9 +-
 include/linux/ceph/messenger.h  |  31 +++++++
 include/linux/ceph/osd_client.h |   2 +-
 net/ceph/messenger.c            | 133 +++++++++++++++++++++++++++-
 net/ceph/messenger_v1.c         |   7 ++
 net/ceph/messenger_v2.c         |  12 +++
 net/ceph/mon_client.c           |  39 ++++++++-
 net/ceph/osd_client.c           |  15 +++-
 15 files changed, 457 insertions(+), 50 deletions(-)

base-commit: 8a243ecde1f6447b8e237f2c1c67c0bb67d16d67
--
2.53.0