RE: Re: RE: Re: [PATCH] ceph: fix potential stray locked folios during umount

From: Viacheslav Dubeyko

Date: Wed Apr 29 2026 - 14:26:01 EST

On Wed, 2026-04-29 at 14:42 +0000, 李磊 wrote:
>
> > On 2026-04-28 at 05:52, Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
> >
> >
> > On Sun, 2026-04-26 at 15:38 +0000, 李磊 wrote:
> > >
> > > > >
> >
> > <skipped>
> >
> > > > > I understand your concern. This patch is a truly straightforward workaround.
> > > > > So, how about we just abort OSD requests if they take too long to return
> > > > > during unmounting ?
> > > >
> > > > The question here is how to define when OSD requests are taking too long.
> > > > Potentially, processing could be really slow for some reason. From one point of
> > > > view, if we know that the destination OSD is down or we have network partitioning,
> > > > then it doesn't make sense to wait too long. I am thinking about potentially
> > > > checking the number of OSD requests. If this number is going down, then we need
> > > > to wait; otherwise, if this number doesn't change, then we need to finish the
> > > > unmount without waiting. Does it make sense?
> > > >
> > > > >
> > > > > Compared to leaving some locked folios in the system, returning -EIO to those
> > > > > OSD requests which may never return is more reasonable. This is because locked
> > > > > folios left behind after a CephFS unmount may block kcompactd and render the
> > > > > entire system unstable.
> > > > >
> > > >
> > > > I agree. It makes sense. If we know that some OSD requests will never return,
> > > > then we need to manage this situation in a better way. But how could we detect
> > > > that an OSD request will never return?
> > > >
> > > > > Besides, successful unmounting doesn't guarantee dirty buffers are successfully
> > > > > written to the backend. For example, when a buffered write returns, the local
> > > > > filesystem may encounter bad blocks on the local disk and -EIO is returned to
> > > > > the writeback kworkers. Therefore, in our scenario, does it make sense if we
> > > > > treat the OSD requests that have been in flight for a certain period as failed,
> > > > > and return -EIO to the caller?
> > > >
> > > > This is the main question: how to detect that OSD requests have failed?
> > > >
> > > > As far as I can see, if an OSD is down and osd_request_timeout is not set (the
> > > > default), a stalled write can block unmount indefinitely. I assume that you do
> > > > not have osd_request_timeout set. So, maybe, we need to reconsider the policy
> > > > for managing stuck OSD requests during unmount.
> > > >
> > > > Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
> > > > the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
> > > > keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
> > > > write will fail, triggering con_fault().
> > > >
> > > > Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
> > > > requests older than that deadline are aborted with -ETIMEDOUT via
> > > > abort_request().
> > > >
> > > > Homeless requests: requests that can't be mapped to any OSD are also checked
> > > > against osd_request_timeout.
> > > >
> > > > ceph_con_keepalive_expired() uses the timestamp of the last keepalive
> > > > acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
> > > > silent beyond the given interval. When this fires, the connection is considered
> > > > dead and con_fault() is triggered.
> > > >
> > > > So, we need to find a proper approach based on the available
> > > > functionality.
> > >
> > > I agree. Instead of waiting for in-flight requests indefinitely or aborting OSD
> > > requests brutally, you prefer a much more elegant way to deal with this dilemma.
> > > It's cool, but it seems complex, and more time is needed to fix the leakage of
> > > locked folios on the client nodes. Is there any acceptable short-term scheme?
> >
> > Have you tried setting osd_request_timeout to see how the CephFS kernel
> > client behaves afterwards? Does it change anything?
>
> If I apply this patch to wait for stopping blockers to drop to zero, setting osd_request_timeout
> can help abort OSD requests in time and allow the unmount process to proceed. However,
> I think we still have two aspects to discuss.

I think that if osd_request_timeout has the CEPH_OSD_REQUEST_TIMEOUT_DEFAULT value
(infinite timeout), then we probably need to handle this in a special way.
Maybe we need to change the default to a finite value so that OSD requests
are aborted in a reasonable time. What do you think?
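
To illustrate the idea, here is a minimal user-space sketch (not kernel code)
of one possible policy: keep "0 means no timeout" for normal operation, but
fall back to a bounded value while an unmount is in progress. The helper
names and the 60-second fallback are assumptions made up for this example;
nothing like this exists in libceph today.

```c
#include <assert.h>
#include <stdbool.h>

/* 0 means "no timeout", matching the osd_request_timeout default above. */
#define NO_TIMEOUT 0UL

/* Hypothetical fallback used only while unmounting; the value is illustrative. */
#define UMOUNT_FALLBACK_TIMEOUT_SECS 60UL

/*
 * Return the timeout (in seconds) to apply to an in-flight OSD request.
 * A configured value always wins; the fallback only applies when the
 * option is unset AND an unmount is in progress.
 */
static unsigned long effective_osd_timeout(unsigned long configured_secs,
					   bool unmounting)
{
	if (configured_secs != NO_TIMEOUT)
		return configured_secs;
	return unmounting ? UMOUNT_FALLBACK_TIMEOUT_SECS : NO_TIMEOUT;
}

/* True when a request of the given age should be aborted. */
static bool request_expired(unsigned long age_secs, unsigned long timeout_secs)
{
	return timeout_secs != NO_TIMEOUT && age_secs > timeout_secs;
}

/* Example: with the option unset, only an unmount gets a bounded wait. */
static void example(void)
{
	assert(effective_osd_timeout(NO_TIMEOUT, false) == NO_TIMEOUT);
	assert(effective_osd_timeout(NO_TIMEOUT, true) ==
	       UMOUNT_FALLBACK_TIMEOUT_SECS);
	assert(effective_osd_timeout(30UL, true) == 30UL);
}
```

With such a policy the behavior of a mounted filesystem would stay unchanged,
and only the unmount path would stop waiting forever.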

>
> 1. Instead of using mount_timeout, can we use another option to control waiting during
> the unmount process?
>
> It is somewhat confusing that the mount_timeout option decides how long we should wait
> for both dirty_folios and stopping_blockers if they don't drop to zero. As far as I know,
> mount_timeout determines the maximum wait time in open_root_dentry() for loading the root
> inode during the mount operation.
>
> Just for the scenario I described (stop all the OSDs and kill the buffered read), is it
> better to use osd_request_timeout instead?
>
> Or can we wait_for_completion() indefinitely if an OSD request never returns, but create a
> debugfs file (for example 'abort') to trigger aborting all OSD requests and ensure a clean
> and successful unmount?

Probably, you are right, the mount_timeout option could look confusing here.
But, from another point of view, we have an unmount process here, and the
mount_timeout option could be considered a good fit. But we need to wait for
the OSD requests to finish. So, I can agree that osd_request_timeout sounds
like a more appropriate option here.

Also, I started to think that we need to improve the logic. Currently, we have:

	wait_queue_head_t *wq = &mdsc->flush_end_wq;
	long timeleft = wait_event_killable_timeout(*wq,
				atomic64_read(&mdsc->dirty_folios) <= 0,
				fsc->client->options->mount_timeout);
	if (!timeleft) /* timed out */
		pr_warn_client(cl, "umount timed out, %ld\n", timeleft);
	else if (timeleft < 0) /* killed */
		pr_warn_client(cl, "umount was killed, %ld\n", timeleft);

Technically speaking, even if the timeout has elapsed (especially a short
one), it doesn't mean that all dirty folios have been processed. I think
we need a loop in both cases that waits until all dirty folios are processed or
all OSD requests are processed/aborted.
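
A rough user-space model of such a loop is below. It is only a sketch under
stated assumptions, not a proposed patch: wait_round_fn stands in for one
bounded wait (one timeout interval), and the "no progress since the previous
round" check borrows the earlier idea of watching whether the number of
pending requests is going down. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum wait_result {
	WAIT_DONE,	/* the pending counter reached zero */
	WAIT_STALLED,	/* a full round passed without progress */
	WAIT_KILLED,	/* the waiter was interrupted */
};

/*
 * Hypothetical stand-in for one bounded wait round (one timeout interval).
 * Returns a negative value if the waiter was killed; otherwise returns 0
 * and stores the number of still-pending items in *pending.
 */
typedef int (*wait_round_fn)(void *ctx, long *pending);

static enum wait_result wait_until_drained(wait_round_fn wait_round, void *ctx)
{
	long prev = -1;	/* -1: no sample taken yet */

	for (;;) {
		long pending;

		if (wait_round(ctx, &pending) < 0)
			return WAIT_KILLED;
		if (pending <= 0)
			return WAIT_DONE;
		/* Still pending after a full round: give up only if nothing moved. */
		if (prev >= 0 && pending >= prev)
			return WAIT_STALLED;
		prev = pending;
	}
}

/* Test double: replays a fixed sequence of pending counts, one per round. */
struct script {
	const long *counts;
	int len;
	int pos;
};

static int scripted_round(void *ctx, long *pending)
{
	struct script *s = ctx;

	if (s->pos >= s->len)
		return -1;	/* script exhausted: model a kill signal */
	*pending = s->counts[s->pos++];
	return 0;
}
```

For example, a script of pending counts {5, 3, 1, 0} ends with WAIT_DONE,
while {5, 3, 3} ends with WAIT_STALLED on the third round, which is the point
where the unmount path could abort the remaining requests instead of waiting.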

What do you think?

>
> 2. Is killable waiting really suitable here?
>
> Any user-space process may send a kill signal to the unmount process, which may leave
> behind some stray locked folios and degrade system stability. Maybe we should use
> non-killable functions here?
>
>
>

I think if anyone kills the process, then this person expects the process to
die right away. Usually, we kill a process when something is already going
wrong. I am not sure that non-killable functions would be better here.

Thanks,
Slava.