Re: RE: Re: [PATCH] ceph: fix potential stray locked folios during umount

From: Viacheslav Dubeyko

Date: Mon Apr 27 2026 - 17:53:24 EST

On Sun, 2026-04-26 at 15:38 +0000, 李磊 wrote:
>
> > >

<skipped>

> > > I understand your concern. This patch is a truly straightforward workaround.
> > > So, how about we just abort OSD requests if they take too long to return
> > > during unmounting ?
> >
> > The question here is how to define when OSD requests are taking too long.
> > Processing could be really slow for some reason. On the one hand, if we know
> > that the destination OSD is down or we have a network partition, then it
> > doesn't make sense to wait too long. I am thinking about checking the number
> > of OSD requests. If this number is going down, then we need to wait;
> > otherwise, if this number doesn't change, then we need to finish the unmount
> > without waiting. Does it make sense?
> >
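
The progress check described above could be modelled roughly as follows. This
is only a user-space sketch, not kernel code: the names drain_result,
wait_for_drain, sample_fn and scripted_sample are hypothetical and do not exist
in libceph, and a real implementation would sleep between samples instead of
just counting polls.

```c
#include <stddef.h>

/* Hypothetical callback: returns the current number of in-flight requests. */
typedef unsigned long (*sample_fn)(void *ctx);

enum drain_result {
	DRAIN_DONE,	/* counter reached zero */
	DRAIN_STALLED,	/* counter stopped decreasing */
};

/*
 * Poll the counter until it hits zero or shows no decrease for
 * max_stalled_polls consecutive samples.
 */
static enum drain_result wait_for_drain(sample_fn sample, void *ctx,
					unsigned int max_stalled_polls)
{
	unsigned long prev = sample(ctx);
	unsigned int stalled = 0;

	while (prev != 0) {
		unsigned long cur = sample(ctx);

		if (cur < prev)
			stalled = 0;	/* progress: keep waiting */
		else if (++stalled >= max_stalled_polls)
			return DRAIN_STALLED;
		prev = cur;
	}
	return DRAIN_DONE;
}

/* Illustration helper: replays a fixed sequence of counter values. */
struct scripted {
	const unsigned long *vals;
	size_t n;
	size_t pos;
};

static unsigned long scripted_sample(void *ctx)
{
	struct scripted *s = ctx;
	unsigned long v = s->vals[s->pos];

	if (s->pos + 1 < s->n)
		s->pos++;	/* stay on the last value once exhausted */
	return v;
}
```

With a sequence such as 3, 2, 1, 0 the wait finishes normally, while 3, 2, 2, 2
is reported as stalled once the stall limit is reached. The open question of
what poll interval and stall limit are reasonable is not answered by this
sketch.
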
> > >
> > > Compared to leaving some locked folios in the system, returning -EIO to those
> > > OSD requests which may never return is more reasonable. This is because locked
> > > folios left behind after a CephFS unmount may block kcompactd and render the
> > > entire system unstable.
> > >
> >
> > I agree. It makes sense. If we know that some OSD requests will never return,
> > then we need to manage this situation in better way. But how could we detect
> > that OSD request will never return?
> >
> > > Besides, successful unmounting doesn't guarantee dirty buffers are successfully
> > > written to the backend. For example, when a buffered write returns, the local
> > > filesystem may encounter bad blocks on the local disk and -EIO is returned to
> > > the writeback kworkers. Therefore, in our scenario, does it make sense if we
> > > treat the OSD requests that have been in flight for a certain period as failed,
> > > and return -EIO to the caller?
> >
> > This is the main question: how to detect that OSD requests have failed?
> >
> > As far as I can see, if an OSD is down and osd_request_timeout is not set (the
> > default), a stalled write can block unmount indefinitely. I assume that you do
> > not have osd_request_timeout set. So, maybe, we need to reconsider the policy
> > for managing stuck OSD requests during unmount.
> >
> > Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
> > the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
> > keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
> > write will fail, triggering con_fault().
> >
> > Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
> > requests older than that deadline are aborted with -ETIMEDOUT via
> > abort_request().
> >
> > Homeless requests: requests that can't be mapped to any OSD are also checked
> > against osd_request_timeout.
> >
> > The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
> > acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
> > silent beyond interval. When this fires, the connection is considered dead and
> > con_fault() is triggered.
> >
> > So, we need to find a good solution based on the available functionality.
>
> I agree. Instead of waiting for inflight requests infinitely or aborting OSD
> requests brutally, you prefer a much more elegant way to deal with this dilemma.
> It’s cool, but it seems complex and more time is needed to fix locked folios leakage
> on the client nodes. Is there any acceptable short-term scheme?

Have you tried setting osd_request_timeout to see how the CephFS kernel
client behaves afterwards? Does it change anything?
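
To illustrate why the setting matters, here is a simplified user-space model of
the timeout paths quoted above. It is a sketch under assumptions, not the
libceph code: classify_request, struct timeouts and enum req_action are
hypothetical names, and a single request age in seconds stands in for the
timestamps the kernel actually compares.

```c
#include <errno.h>

enum req_action {
	REQ_WAIT,	/* still within both limits */
	REQ_KEEPALIVE,	/* OSD looks laggy: probe the connection */
	REQ_ABORT,	/* complete the request with -ETIMEDOUT */
};

struct timeouts {
	unsigned long keepalive_timeout;	/* seconds */
	unsigned long request_timeout;		/* seconds, 0 = disabled */
};

/*
 * Decide what to do with a request that has been outstanding for
 * 'age' seconds. *err is set to -ETIMEDOUT only when aborting.
 */
static enum req_action classify_request(unsigned long age,
					const struct timeouts *t, int *err)
{
	*err = 0;

	/* The request timeout only applies when it is non-zero. */
	if (t->request_timeout && age > t->request_timeout) {
		*err = -ETIMEDOUT;
		return REQ_ABORT;
	}

	/* Otherwise an old request only marks the OSD as laggy. */
	if (age > t->keepalive_timeout)
		return REQ_KEEPALIVE;

	return REQ_WAIT;
}
```

In this model, with request_timeout left at 0 a request of any age is never
aborted; it only triggers keepalive probing, which matches the indefinite wait
described earlier in the thread.
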

>
> I find it is not easy to work around this issue by merely increasing opt->mount_timeout.
> Both dirty_folios and stopping_blockers wait with TASK_KILLABLE set, which means the
> unmount process’s wait can be interrupted by a kill signal and leave some locked folios
> after unmount regardless of the mount_timeout setting.

If somebody (or something) kills the process, then there is nothing that we
can do. A kill signal can arrive at any point in time, and some locked folios
will continue to exist.

>
> >
> > >
> > > Lastly, I think we can just use stopping blockers to replace dirty_folios to
> > > simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
> > > only need to wait for stopping_blockers count to drop to zero. If a timeout
> > > occurs, we can cancel all the inflight requests and print some warning messages.
> >
> > The dirty_folios is "how much dirty data is still to be flushed," while
> > stopping_blockers is "how many threads are currently inside code that holds an
> > implicit reference to the MDS client." Unmount must drain both in order, and the
> > two counters solve entirely different races.
> >
> > mdsc->dirty_folios is the count of dirty page-cache folios not yet written back.
> > It is incremented in ceph_dirty_folio() at the moment a folio transitions from
> > clean to dirty in the page cache. It is decremented in the OSD write-completion
> > callback after the OSD acknowledges the writeback and end_page_writeback() is
> > called. It represents the number of file-data folios that have been dirtied
> > (modified in the page cache) but whose data has not yet reached an OSD (i.e.,
> > writeback is pending or in flight).
> >
> > mdsc->stopping_blockers counts in-progress MDS/OSD message handlers. It is
> > incremented by ceph_inc_mds_stopping_blocker() / ceph_inc_osd_stopping_blocker()
> > at the entry of any async operation that must not be interrupted mid-flight by
> > shutdown. It is decremented at the exit of the async operations' handlers.
>
> Thanks for the detailed explanation of dirty_folios and stopping_blockers. However,
> it is still a bit confusing that we need to wait for dirty folios to decrease to zero
> after sync_filesystem() in ceph_kill_sb(). It seems that the semantics of
> sync_filesystem() are broken in CephFS? Because in the common situation, sync_filesystem()
> guarantees dirty folios are flushed to the backend, and PG_dirty of these folios
> are cleared after it returns.
>
>

As far as I can see, the sync_filesystem() call uses
filemap_fdatawait_keep_errors(), which waits for pages tagged
PAGECACHE_TAG_WRITEBACK. A dirty page that was found by ceph_writepages_start()
but never submitted stays tagged PAGECACHE_TAG_DIRTY, so sync_filesystem()
returns without ever seeing it. This is why we need to wait for dirty folios to
decrease to zero in ceph_kill_sb().
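
A toy user-space model of that gap, for illustration only. The two booleans
stand in for PAGECACHE_TAG_DIRTY and PAGECACHE_TAG_WRITEBACK, and all model_*
names are hypothetical; none of this is the real page-cache code.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_folio {
	bool dirty;	/* stands in for PAGECACHE_TAG_DIRTY */
	bool writeback;	/* stands in for PAGECACHE_TAG_WRITEBACK */
};

/* Submit a folio for writeback: dirty -> writeback. */
static void model_submit(struct model_folio *f)
{
	if (f->dirty) {
		f->dirty = false;
		f->writeback = true;
	}
}

/*
 * Model of the wait step: only folios tagged writeback are waited on
 * (here they complete immediately). Dirty-only folios are never visited.
 * Returns how many folios were waited on.
 */
static size_t model_wait_writeback(struct model_folio *folios, size_t n)
{
	size_t waited = 0;

	for (size_t i = 0; i < n; i++) {
		if (folios[i].writeback) {
			folios[i].writeback = false;
			waited++;
		}
	}
	return waited;
}

/* Count folios that are still dirty after the wait. */
static size_t model_count_dirty(const struct model_folio *folios, size_t n)
{
	size_t dirty = 0;

	for (size_t i = 0; i < n; i++)
		if (folios[i].dirty)
			dirty++;
	return dirty;
}
```

If three folios are dirty and only two of them get submitted, the wait step
covers those two and the third is still counted as dirty afterwards, which is
the case that the dirty_folios wait is meant to catch.
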

Thanks,
Slava.