Re: [PATCH v1 03/13] ceph: add timeout protection to ceph_osdc_sync() path

From: Viacheslav Dubeyko

Date: Thu Mar 12 2026 - 15:28:21 EST

On Thu, 2026-03-12 at 10:16 +0200, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
>
> When a Ceph OSD becomes unreachable (e.g., due to IPv6 EADDRNOTAVAIL
> during DAD or network transitions), the sync syscall can block
> indefinitely in ceph_osdc_sync(). This function iterates over all
> in-flight write requests and calls wait_for_completion() with no
> timeout on each one. The hung_task detector fires repeatedly with
> stack traces showing:
> ceph_osdc_sync [libceph]
> ceph_sync_fs [ceph]
> iterate_supers
> ksys_sync
> Since ceph_osdc_sync() is called before ceph_mdsc_sync() in
> ceph_sync_fs(), an OSD hang prevents the MDS timeout protection
> from commit e789e5252fda ("ceph: add timeout protection to
> ceph_mdsc_sync() path") from ever being reached.
> This is particularly problematic in containerized environments with
> PREEMPT_RT kernels where Ceph storage pods undergo rolling updates
> and IPv6 network reconfigurations cause temporary OSD unavailability.
> Fix this by adding mount_timeout-based timeout to the blocking wait,
> following the existing pattern used by wait_request_timeout() in the
> same file:
> - ceph_osdc_sync(): use wait_for_completion_timeout() with
> mount_timeout instead of indefinite wait_for_completion()
> - Change return type from void to int, return -ETIMEDOUT on timeout
> - ceph_sync_fs(): propagate OSD sync error, short-circuit before
> MDS sync on failure
> On timeout, pending OSD requests are NOT cancelled - they remain
> in-flight and complete when the OSD reconnects. The timeout simply
> unblocks the calling task. If mount_timeout is set to 0,
> ceph_timeout_jiffies() returns MAX_SCHEDULE_TIMEOUT, preserving the
> original infinite-wait behavior.
> Signed-off-by: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
> ---
> fs/ceph/super.c | 4 +++-
> include/linux/ceph/osd_client.h | 2 +-
> net/ceph/osd_client.c | 15 +++++++++++++--
> 3 files changed, 17 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ceph/super.c b/fs/ceph/super.c
> index b52960402d68e..6f4ee457c1b52 100644
> --- a/fs/ceph/super.c
> +++ b/fs/ceph/super.c
> @@ -133,7 +133,9 @@ static int ceph_sync_fs(struct super_block *sb, int wait)
> }
>
> doutc(cl, "(blocking)\n");
> - ceph_osdc_sync(&fsc->client->osdc);
> + ret = ceph_osdc_sync(&fsc->client->osdc);
> + if (ret)
> + return ret;
> ret = ceph_mdsc_sync(fsc->mdsc);
> doutc(cl, "(blocking) done\n");
> return ret;
> diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
> index d7941478158cd..871827e2dd983 100644
> --- a/include/linux/ceph/osd_client.h
> +++ b/include/linux/ceph/osd_client.h
> @@ -587,7 +587,7 @@ void ceph_osdc_start_request(struct ceph_osd_client *osdc,
> extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
> extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
> struct ceph_osd_request *req);
> -extern void ceph_osdc_sync(struct ceph_osd_client *osdc);
> +extern int ceph_osdc_sync(struct ceph_osd_client *osdc);
>
> extern void ceph_osdc_flush_notifies(struct ceph_osd_client *osdc);
> void ceph_osdc_maybe_request_map(struct ceph_osd_client *osdc);
> diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
> index abac770bc0b4c..7d5e4a078fb10 100644
> --- a/net/ceph/osd_client.c
> +++ b/net/ceph/osd_client.c
> @@ -4734,10 +4734,13 @@ EXPORT_SYMBOL(ceph_osdc_wait_request);
> /*
> * sync - wait for all in-flight requests to flush. avoid starvation.
> */
> -void ceph_osdc_sync(struct ceph_osd_client *osdc)
> +int ceph_osdc_sync(struct ceph_osd_client *osdc)
> {
> + struct ceph_options *opts = osdc->client->options;
> + unsigned long timeout = ceph_timeout_jiffies(opts->mount_timeout);

The opts->mount_timeout value could be configured unreasonably, for example
to a very short interval.
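
One possible guard, purely as an illustration: clamp the configured value to
sane bounds before waiting on it. This is a userspace sketch, not kernel code.
The helper name and both bounds are hypothetical and do not exist in libceph;
ULONG_MAX merely stands in for MAX_SCHEDULE_TIMEOUT.

```c
#include <limits.h>

/* Hypothetical bounds in milliseconds; not existing libceph constants. */
#define SYNC_TIMEOUT_MIN_MS	1000UL
#define SYNC_TIMEOUT_MAX_MS	(300UL * 1000UL)

/* Stand-in for MAX_SCHEDULE_TIMEOUT, i.e. "wait forever". */
#define SYNC_TIMEOUT_INFINITE	ULONG_MAX

/*
 * Hypothetical helper: map a configured timeout to the value the sync
 * path would wait for. A configured value of 0 keeps the "wait forever"
 * meaning that the patch describes for mount_timeout == 0; any other
 * value is clamped into [SYNC_TIMEOUT_MIN_MS, SYNC_TIMEOUT_MAX_MS].
 */
static unsigned long sync_timeout_clamp_ms(unsigned long configured_ms)
{
	if (configured_ms == 0)
		return SYNC_TIMEOUT_INFINITE;
	if (configured_ms < SYNC_TIMEOUT_MIN_MS)
		return SYNC_TIMEOUT_MIN_MS;
	if (configured_ms > SYNC_TIMEOUT_MAX_MS)
		return SYNC_TIMEOUT_MAX_MS;
	return configured_ms;
}
```

Whether clamping is appropriate at all, and which bounds would make sense, is
open for discussion.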

> struct rb_node *n, *p;
> u64 last_tid = atomic64_read(&osdc->last_tid);
> + unsigned long left;
>
> again:
> down_read(&osdc->lock);
> @@ -4760,7 +4763,14 @@ void ceph_osdc_sync(struct ceph_osd_client *osdc)
> up_read(&osdc->lock);
> dout("%s waiting on req %p tid %llu last_tid %llu\n",
> __func__, req, req->r_tid, last_tid);
> - wait_for_completion(&req->r_completion);
> + left = wait_for_completion_timeout(&req->r_completion,
> + timeout);
> + if (!left) {
> + pr_warn("ceph: osd sync request tid %llu timed out\n",
> + req->r_tid);

I am not sure it is necessary to send this message to the syslog. Maybe debug
output would be better here? My point is: if we are simply postponing an
operation that will be executed later, should we report the elapsed timeout at
all? Imagine that opts->mount_timeout is really short; then we will have a
bunch of messages in the system log.
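
To make the concern concrete, here is a rough userspace model (not kernel
code; every name in it is hypothetical) of how a rate limit bounds the number
of messages when a short timeout keeps expiring. In the kernel, the
corresponding options would be dout() or pr_warn_ratelimited() instead of a
plain pr_warn().

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model of a rate-limited log; illustrative only. */
struct log_ratelimit {
	unsigned long interval;	/* window length, in arbitrary ticks */
	unsigned int burst;	/* messages allowed per window */
	unsigned long begin;	/* first tick of the current window */
	unsigned int printed;	/* messages emitted in the current window */
	unsigned int missed;	/* messages suppressed so far */
};

/* Return true if a message may be emitted at tick 'now'. */
static bool log_ratelimit_ok(struct log_ratelimit *rl, unsigned long now)
{
	/* Open a new window once the current one has elapsed. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}

/*
 * Simulate 'n' sync timeouts, one per tick starting at 'start', and
 * return how many warnings actually reached the log.
 */
static unsigned int warn_sync_timeouts(struct log_ratelimit *rl,
				       unsigned long start, unsigned int n)
{
	unsigned int emitted = 0;
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (log_ratelimit_ok(rl, start + i)) {
			fprintf(stderr, "osd sync request %u timed out\n", i);
			emitted++;
		}
	}
	return emitted;
}
```

With a window of 5 ticks and a burst of 2, ten consecutive timeouts produce
four messages and six suppressed ones, instead of ten lines in the log.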

Thanks,
Slava.

> + ceph_osdc_put_request(req);
> + return -ETIMEDOUT;
> + }
> ceph_osdc_put_request(req);
> goto again;
> }
> @@ -4770,6 +4780,7 @@ void ceph_osdc_sync(struct ceph_osd_client *osdc)
>
> up_read(&osdc->lock);
> dout("%s done last_tid %llu\n", __func__, last_tid);
> + return 0;
> }
> EXPORT_SYMBOL(ceph_osdc_sync);
>