Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

From: Baokun Li

Date: Sun May 17 2026 - 02:26:01 EST


在 2026/5/14 21:09, Jan Kara 写道:
> On Thu 14-05-26 10:55:03, Baokun Li wrote:
>> 在 2026/5/14 04:36, Tejun Heo 写道:
>>> Hello,
>>>
>>> Resending - earlier send dropped the Cc list. Sorry for the noise.
>>>
>>> How rcu_barrier() got out of sync, as best I can reconstruct:
>>>
>>> - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
>>> wb switches going into workqueue when umount", 2019) put the inc
>>> after call_rcu(); rcu_barrier() worked from then.
>>>
>>> - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
>>> grabbing an inode", 2021) moved the inc back ahead to cover the prep
>>> window, apparently reopening this gap.
>>>
>>> - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
>>> switching inodes", 2025) replaced call_rcu() with llist_add() +
>>> queue_work(); rcu_barrier() looks like a no-op for this path since.
>>>
>>> Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
>>> through wb_queue_isw), with cgroup_writeback_umount() keeping the
>>> counter gate but swapping rcu_barrier() for synchronize_srcu():
>>>
>>> if (atomic_read(&isw_nr_in_flight)) {
>>> synchronize_srcu(&isw_srcu);
>>> flush_workqueue(isw_wq);
>>> }
>>>
>>> Thoughts?
>> Thanks for the detailed analysis on how rcu_barrier() got out of sync,
>> that matches my understanding as well.
>>
>> Regarding the SRCU idea: I considered it, but it has a key drawback.
>> synchronize_srcu() waits for all read-side critical sections globally
>> -- it cannot distinguish which superblock a given switcher is working
>> on. So if sb A is being unmounted while unrelated switchers for sb B/C/D
>> hold srcu_read_lock(), umount of A gets blocked unnecessarily. The
>> global isw_nr_in_flight gate makes this worse: any non-zero count from
>> any sb triggers synchronize_srcu(), even when the target sb has no
>> in-flight switches at all.
>>
>> This is especially problematic in high-density container environments,
>> where many containers with separate filesystems are being created and
>> destroyed concurrently. Frequent cgroup migrations across multiple
>> superblocks keep the global isw_nr_in_flight perpetually non-zero,
>> causing every single umount to pay the synchronize_srcu() cost even
>> when the target sb has zero in-flight switches.
>>
>> The per-sb counter avoids this entirely -- cgroup_writeback_umount()
>> only waits for switches belonging to its own superblock to drain, and
>> returns immediately when s_isw_nr_in_flight is zero. The global counter
>> is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
>>
>> The other trade-offs are roughly comparable: both need pairing on all
>> paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
>> free, while SRCU needs init/cleanup lifecycle management. The per-cpu
>> read lock advantage doesn't matter here since wb switching is
>> infrequent.
>>
>> So I went with the per-sb counter for its precision and simplicity.
>> That said, if you prefer the SRCU approach, I'm happy to spin a new
>> version using it.
> So I don't think we need a new SRCU. What really needs protection is this
> snippet as you correctly identified in your changelog:
>
> if (!inode_prepare_wbs_switch(inode, new_wb))
> goto out_free;
>
> isw->inodes[0] = inode;
>
> trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1);
> wb_queue_isw(new_wb, isw);
>
> So between we __iget() the inode and we queue work. The rest gets solved
> either by SB_ACTIVE check in inode_prepare_wbs_switch() or
> flush_workqueue() in cgroup_writeback_umount(). Since there's no sleep in
> this part, we can just wrap it within rcu_read_lock() (with appropriate
> comment what's protected) and that fixes the race. We could even downgrade
> rcu_barrier() in cgroup_writeback_umount() to synchronize_rcu() and it
> would still work. So I'd prefer to fix the race like this (also to ease
> backporting to older kernels).
>
> Now the performance optimization to make isw_nr_in_flight per
> superblock makes sense and I'm open to it as well. But it should be a
> separate commit with proper justification and ideally some numbers backing
> it showing the benefit.
>
> Honza


Thanks, that makes sense.  I'll send a v2 split into three patches:

  1) Quick race fix using rcu_read_lock() around the
     [inode_prepare_wbs_switch() ... wb_queue_isw()] window, plus a
     synchronize_rcu() in cgroup_writeback_umount().  rcu_barrier() is
     intentionally kept in this patch so it can be backported unchanged
     to older trees (5.10.y, 6.6.y, ...) that still queue the switch via
     queue_rcu_work() and rely on rcu_barrier() to drain those
     callbacks.  Cc: stable@xxxxxxxxxxxxxxx.

  2) Mainline-only cleanup that drops the now-unnecessary rcu_barrier(),
     with Fixes: e1b849cfa6b6 ("writeback: Avoid contention on
     wb->list_lock when switching inodes") since that's the commit that
     turned it into a no-op.

  3) Reverts the rcu_read_lock() extension from (1) and switches to a
     per-sb s_isw_nr_in_flight counter as a separate optimization.

Performance numbers for (3), full breakdown will be in the v2 cover
letter.  An idle target sb is mounted/umounted in a loop while 4
unrelated sbs run a "create memcg -> dd 1 MiB to a shared file ->
rmdir memcg" churner that keeps the global isw_nr_in_flight non-zero.
N=100 iterations:

  Idle target umount under cross-sb cgwb-switch pressure:
                              p50      p95      p99      max
    (1)+(2) synchronize_rcu  64.4 ms  95.8 ms 101.4 ms 110.5 ms
    (3) per-sb counter        5.3 ms   6.9 ms   7.4 ms   7.7 ms
    no-pressure baseline      5.2 ms   5.9 ms   6.0 ms   6.1 ms

  8 concurrent umounts of idle sbs under the same pressure:
                              p50      p95      max
    (1)+(2) synchronize_rcu  57.9 ms  82.1 ms  90.0 ms
    (3) per-sb counter        7.5 ms   7.8 ms   8.0 ms

  In-kernel cgroup_writeback_umount() time over 286 calls (bpftrace):
    (1)+(2)  8717 ms total (~30 ms / call)
    (3)         1.16 ms total (~4 us / call)


Cheers,
Baokun