Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

From: Baokun Li

Date: Wed May 13 2026 - 22:55:24 EST


On 2026/5/14 04:36, Tejun Heo wrote:
> Hello,
>
> Resending - earlier send dropped the Cc list. Sorry for the noise.
>
> How rcu_barrier() got out of sync, as best I can reconstruct:
>
> - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
> wb switches going into workqueue when umount", 2019) put the inc
> after call_rcu(); rcu_barrier() worked from then.
>
> - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
> grabbing an inode", 2021) moved the inc back ahead to cover the prep
> window, apparently reopening this gap.
>
> - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
> switching inodes", 2025) replaced call_rcu() with llist_add() +
> queue_work(); rcu_barrier() looks like a no-op for this path since.
>
> Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
> through wb_queue_isw), with cgroup_writeback_umount() keeping the
> counter gate but swapping rcu_barrier() for synchronize_srcu():
>
> 	if (atomic_read(&isw_nr_in_flight)) {
> 		synchronize_srcu(&isw_srcu);
> 		flush_workqueue(isw_wq);
> 	}
>
> Thoughts?

Thanks for the detailed analysis on how rcu_barrier() got out of sync,
that matches my understanding as well.

Regarding the SRCU idea: I considered it, but it has a key drawback.
With a single isw_srcu domain, synchronize_srcu() waits for every reader
already inside a read-side critical section -- it cannot tell which
superblock a given switcher is working on. So if sb A is being unmounted
while unrelated switchers for sb B/C/D hold srcu_read_lock(), umount of
A is delayed unnecessarily. The
global isw_nr_in_flight gate makes this worse: any non-zero count from
any sb triggers synchronize_srcu(), even when the target sb has no
in-flight switches at all.

This is especially problematic in high-density container environments,
where many containers with separate filesystems are being created and
destroyed concurrently. Frequent cgroup migrations across multiple
superblocks keep the global isw_nr_in_flight perpetually non-zero,
causing every single umount to pay the synchronize_srcu() cost even
when the target sb has zero in-flight switches.

The per-sb counter avoids this entirely -- cgroup_writeback_umount()
only waits for switches belonging to its own superblock to drain, and
returns immediately when s_isw_nr_in_flight is zero. The global counter
is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
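
To make the difference between the two gates concrete, here is a rough
userspace model. It is only a sketch and not the patch itself: model_sb,
the model_*() helpers, the *_gate_must_wait() predicates and
MODEL_MAX_IN_FLIGHT (an arbitrary stand-in for WB_FRN_MAX_IN_FLIGHT) are
made up for illustration, and plain C11 atomics replace the kernel's
atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for WB_FRN_MAX_IN_FLIGHT; the value is arbitrary. */
#define MODEL_MAX_IN_FLIGHT 4

/* Global count of in-flight switches, kept only for throttling. */
static atomic_int isw_nr_in_flight;

/* Hypothetical model of a superblock carrying the per-sb counter. */
struct model_sb {
	atomic_int s_isw_nr_in_flight;
};

/* Start a switch on @sb unless the global throttle is exceeded. */
static bool model_switch_begin(struct model_sb *sb)
{
	if (atomic_load(&isw_nr_in_flight) > MODEL_MAX_IN_FLIGHT)
		return false;
	atomic_fetch_add(&isw_nr_in_flight, 1);
	atomic_fetch_add(&sb->s_isw_nr_in_flight, 1);
	return true;
}

/* Finish a switch on @sb; every begin must be paired with one end. */
static void model_switch_end(struct model_sb *sb)
{
	int old_sb = atomic_fetch_sub(&sb->s_isw_nr_in_flight, 1);
	int old_global = atomic_fetch_sub(&isw_nr_in_flight, 1);

	assert(old_sb > 0 && old_global > 0);
	(void)old_sb;		/* silence warnings when NDEBUG is set */
	(void)old_global;
}

/* Global gate: umount of any sb waits if any switch is in flight. */
static bool global_gate_must_wait(void)
{
	return atomic_load(&isw_nr_in_flight) != 0;
}

/* Per-sb gate: umount of @sb waits only for switches on @sb. */
static bool per_sb_gate_must_wait(struct model_sb *sb)
{
	return atomic_load(&sb->s_isw_nr_in_flight) != 0;
}
```

With one switch in flight on sb B, global_gate_must_wait() is true for
an umount of sb A, while per_sb_gate_must_wait(&A) is false, so A can
return immediately.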

The other trade-offs are roughly comparable: both need pairing on all
paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
free, while SRCU needs init/cleanup lifecycle management. SRCU's cheap
per-CPU read side is not an advantage here since wb switching is
infrequent.

So I went with the per-sb counter for its precision and simplicity.
That said, if you prefer the SRCU approach, I'm happy to spin a new
version using it.


Cheers,
Baokun