Re: Re: RE: Re: [PATCH] ceph: fix potential stray locked folios during umount

From: 李磊

Date: Wed Apr 29 2026 - 10:54:06 EST

> On Apr 28, 2026, at 05:52, Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx> wrote:
>
>
> On Sun, 2026-04-26 at 15:38 +0000, 李磊 wrote:
>>
>>>>
>
> <skipped>
>
>>>> I understand your concern. This patch is a truly straightforward workaround.
>>>> So, how about we just abort OSD requests if they take too long to return
>>>> during unmounting?
>>>
>>> The question here is how to define when OSD requests are taking too long.
>>> Potentially, processing could be really slow for some reason. From one point of
>>> view, if we know that the destination OSD is down or we have network
>>> partitioning, then it doesn't make sense to wait too long. I am thinking about
>>> checking the number of OSD requests. If this number is going down, then unmount
>>> needs to wait; otherwise, if this number doesn't change, then it needs to finish
>>> the unmount without waiting. Does it make sense?
>>>
>>>>
>>>> Compared to leaving some locked folios in the system, returning -EIO to those
>>>> OSD requests which may never return is more reasonable. This is because locked
>>>> folios left behind after a CephFS unmount may block kcompactd and render the
>>>> entire system unstable.
>>>>
>>>
>>> I agree. It makes sense. If we know that some OSD requests will never return,
>>> then we need to manage this situation in a better way. But how could we detect
>>> that an OSD request will never return?
>>>
>>>> Besides, successful unmounting doesn't guarantee dirty buffers are successfully
>>>> written to the backend. For example, when a buffered write returns, the local
>>>> filesystem may encounter bad blocks on the local disk and -EIO is returned to
>>>> the writeback kworkers. Therefore, in our scenario, does it make sense if we
>>>> treat the OSD requests that have been in flight for a certain period as failed,
>>>> and return -EIO to the caller?
>>>
>>> This is the main question: how to detect that OSD requests have failed?
>>>
>>> As far as I can see, if an OSD is down and osd_request_timeout is not set (the
>>> default), a stalled write can block unmount indefinitely. I assume that you do
>>> not have osd_request_timeout set. So, maybe, we need to reconsider the policy
>>> for managing stuck OSD requests during unmount.
>>>
>>> Laggy OSD path: if any request's r_stamp is older than osd_keepalive_timeout,
>>> the OSD goes on a slow_osds list and ceph_con_keepalive() is called, sending a
>>> keepalive byte over TCP. If the TCP connection is silently broken, the keepalive
>>> write will fail, triggering con_fault().
>>>
>>> Timed-out request path: if osd_request_timeout is set (default 0 = disabled),
>>> requests older than that deadline are aborted with -ETIMEDOUT via
>>> abort_request().
>>>
>>> Homeless requests: requests that can't be mapped to any OSD are also checked
>>> against osd_request_timeout.
>>>
>>> The ceph_con_keepalive_expired() uses the timestamp of the last keepalive
>>> acknowledgement (con->last_keepalive_ack) to determine whether the peer has gone
>>> silent beyond interval. When this fires, the connection is considered dead and
>>> con_fault() is triggered.
>>>
>>> So, we need to find a good solution that builds on the available
>>> functionality.
>>
>> I agree. Instead of waiting for inflight requests infinitely or aborting OSD
>> requests brutally, you prefer a much more elegant way to deal with this dilemma.
>> It’s cool, but it seems complex and more time is needed to fix locked folios leakage
>> on the client nodes. Is there any acceptable short-term scheme?
>
> Have you tried to set up the osd_request_timeout and to see how CephFS kernel
> client will behave afterwards? Will it change anything?

If I apply this patch to wait for stopping_blockers to drop to zero, setting
osd_request_timeout can help abort OSD requests in time and allow the unmount process
to proceed. However, I think we still have two aspects to discuss.

1. Instead of using mount_timeout, can we use another option to bound the waiting during
the unmount process?

It is somewhat confusing that the mount_timeout option decides how long we should wait
for both dirty_folios and stopping_blockers if they don’t drop to zero. As far as I know,
mount_timeout determines the maximum wait time in open_root_dentry() while loading the
root inode during the mount operation.

For the scenario I described (stop all the OSDs and kill a buffered read), would it be
better to use osd_request_timeout instead?

Or can we wait_for_completion() indefinitely if an OSD request never returns, but create
a debugfs file (for example ‘abort’) that triggers an abort of all OSD requests, to
ensure a clean and successful unmount?
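
To make the alternative concrete, here is a rough user-space model of the
progress-based policy you suggested earlier (keep waiting while the number of
in-flight OSD requests is going down, give up once it stops changing). It is
only a sketch and not kernel code: wait_for_drain(), the sampled counts and
stall_limit are hypothetical names I made up for illustration, and a real
version would poll the actual request count instead of reading an array.

```c
#include <errno.h>
#include <stddef.h>

/*
 * User-space model only (hypothetical helper, not an existing kernel API).
 *
 * samples[i] is the number of in-flight requests seen at the i-th poll.
 * stall_limit is the number of consecutive polls without a decrease that
 * we tolerate before giving up; it is expected to be at least 1.
 *
 * Returns 0 once the count reaches zero, -ETIMEDOUT when the count has not
 * gone down for stall_limit polls in a row, and -EAGAIN when the samples
 * run out before either of those happens.
 */
int wait_for_drain(const unsigned int *samples, size_t n,
		   unsigned int stall_limit)
{
	unsigned int stalled = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (samples[i] == 0)
			return 0;		/* fully drained */

		if (i > 0) {
			if (samples[i] < samples[i - 1])
				stalled = 0;	/* progress: keep waiting */
			else
				stalled++;	/* no decrease since last poll */
		}

		if (stalled >= stall_limit)
			return -ETIMEDOUT;	/* caller would abort requests here */
	}

	return -EAGAIN;				/* undecided: need more polls */
}
```

With such a policy the wait is bounded by lack of progress rather than by a
fixed mount_timeout, and the -ETIMEDOUT case is where the remaining requests
would be aborted so that their folios get unlocked.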

2. Is killable waiting really suitable here?

Any user-space process may send a kill signal to the unmount process, which may leave
behind some stray locked folios and degrade system stability. Maybe we should use
non-killable wait functions here?

Thanks,
Li

>
>>
>> I find it is not easy to work around this issue by merely increasing opt->mount_timeout.
>> Both dirty_folios and stopping_blockers wait with TASK_KILLABLE set, which means the
>> unmount process’s wait can be interrupted by a kill signal and leave some locked folios
>> after unmount regardless of the mount_timeout setting.
>
> If somebody (or something) would like to kill the process, then there is nothing
> that we can do. The potential kill signal can be received at any point in time
> and some locked folios will continue to exist.
>
>>
>>>
>>>>
>>>> Lastly, I think we can just use stopping blockers to replace dirty_folios to
>>>> simplify the unmounting and wait process. Accordingly, in ceph_kill_sb(), we
>>>> only need to wait for stopping_blockers count to drop to zero. If a timeout
>>>> occurs, we can cancel all the inflight requests and print some warning messages.
>>>
>>> The dirty_folios is "how much dirty data is still to be flushed," while
>>> stopping_blockers is "how many threads are currently inside code that holds an
>>> implicit reference to the MDS client." Unmount must drain both in order, and the
>>> two counters solve entirely different races.
>>>
>>> mdsc->dirty_folios is the count of dirty page-cache folios not yet written back.
>>> It is incremented in ceph_dirty_folio() at the moment a folio transitions from
>>> clean to dirty in the page cache. It is decremented in the OSD write-completion
>>> callback after the OSD acknowledges the writeback and end_page_writeback() is
>>> called. It represents the number of file-data folios that have been dirtied
>>> (modified in the page cache) but whose data has not yet reached an OSD (i.e.,
>>> writeback is pending or in flight).
>>>
>>> mdsc->stopping_blockers is the count of in-progress MDS/OSD message handlers. It
>>> is incremented by ceph_inc_mds_stopping_blocker() / ceph_inc_osd_stopping_blocker()
>>> at the entry of any async operation that must not be interrupted mid-flight by
>>> shutdown. It is decremented at the exit of the async operations' handlers.
>>
>> Thanks for the detailed explanation of dirty_folios and stopping_blockers. However,
>> it is still a bit confusing that we need to wait for dirty folios to decrease to
>> zero after sync_filesystem() in ceph_kill_sb(). It seems that the semantics of
>> sync_filesystem() are broken in CephFS? In the common situation, sync_filesystem()
>> guarantees dirty folios are flushed to the backend, and PG_dirty of these folios
>> is cleared after it returns.
>>
>>
>
> As far as I can see, sync_filesystem() uses filemap_fdatawait_keep_errors(),
> which waits for pages tagged PAGECACHE_TAG_WRITEBACK. A dirty page that was
> found by ceph_writepages_start() but never submitted stays tagged
> PAGECACHE_TAG_DIRTY, and sync_filesystem() returns treating it as invisible.
> So, this is why we need to wait for dirty folios to decrease to zero in
> ceph_kill_sb().
>
> Thanks,
> Slava.