Re: [PATCH] writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()

From: Jan Kara

Date: Thu May 14 2026 - 09:10:08 EST

On Thu 14-05-26 10:55:03, Baokun Li wrote:
> 在 2026/5/14 04:36, Tejun Heo 写道:
> > Hello,
> >
> > Resending - earlier send dropped the Cc list. Sorry for the noise.
> >
> > How rcu_barrier() got out of sync, as best I can reconstruct:
> >
> > - ec084de929e4 ("fs/writeback.c: use rcu_barrier() to wait for inflight
> > wb switches going into workqueue when umount", 2019) put the inc
> > after call_rcu(); rcu_barrier() worked from then.
> >
> > - 8826ee4fe750 ("writeback, cgroup: increment isw_nr_in_flight before
> > grabbing an inode", 2021) moved the inc back ahead to cover the prep
> > window, apparently reopening this gap.
> >
> > - e1b849cfa6b6 ("writeback: Avoid contention on wb->list_lock when
> > switching inodes", 2025) replaced call_rcu() with llist_add() +
> > queue_work(); rcu_barrier() looks like a no-op for this path since.
> >
> > Could SRCU work instead? srcu_read_lock around the publish (atomic_inc
> > through wb_queue_isw), with cgroup_writeback_umount() keeping the
> > counter gate but swapping rcu_barrier() for synchronize_srcu():
> >
> > if (atomic_read(&isw_nr_in_flight)) {
> > synchronize_srcu(&isw_srcu);
> > flush_workqueue(isw_wq);
> > }
> >
> > Thoughts?
>
> Thanks for the detailed analysis on how rcu_barrier() got out of sync,
> that matches my understanding as well.
>
> Regarding the SRCU idea: I considered it, but it has a key drawback.
> synchronize_srcu() waits for all read-side critical sections globally
> -- it cannot distinguish which superblock a given switcher is working
> on. So if sb A is being unmounted while unrelated switchers for sb B/C/D
> hold srcu_read_lock(), umount of A gets blocked unnecessarily. The
> global isw_nr_in_flight gate makes this worse: any non-zero count from
> any sb triggers synchronize_srcu(), even when the target sb has no
> in-flight switches at all.
>
> This is especially problematic in high-density container environments,
> where many containers with separate filesystems are being created and
> destroyed concurrently. Frequent cgroup migrations across multiple
> superblocks keep the global isw_nr_in_flight perpetually non-zero,
> causing every single umount to pay the synchronize_srcu() cost even
> when the target sb has zero in-flight switches.
>
> The per-sb counter avoids this entirely -- cgroup_writeback_umount()
> only waits for switches belonging to its own superblock to drain, and
> returns immediately when s_isw_nr_in_flight is zero. The global counter
> is retained solely for throttling (WB_FRN_MAX_IN_FLIGHT).
>
> The other trade-offs are roughly comparable: both need pairing on all
> paths, but the per-sb atomic_t gets zero-initialized by kzalloc for
> free, while SRCU needs init/cleanup lifecycle management. The per-cpu
> read lock advantage doesn't matter here since wb switching is
> infrequent.
>
> So I went with the per-sb counter for its precision and simplicity.
> That said, if you prefer the SRCU approach, I'm happy to spin a new
> version using it.

So I don't think we need a new SRCU. What really needs protection is this
snippet as you correctly identified in your changelog:

if (!inode_prepare_wbs_switch(inode, new_wb))
goto out_free;

isw->inodes[0] = inode;

trace_inode_switch_wbs_queue(inode->i_wb, new_wb, 1);
wb_queue_isw(new_wb, isw);

So between we __iget() the inode and we queue work. The rest gets solved
either by SB_ACTIVE check in inode_prepare_wbs_switch() or
flush_workqueue() in cgroup_writeback_umount(). Since there's no sleep in
this part, we can just wrap it within rcu_read_lock() (with appropriate
comment what's protected) and that fixes the race. We could even downgrade
rcu_barrier() in cgroup_writeback_umount() to synchronize_rcu() and it
would still work. So I'd prefer to fix the race like this (also to ease
backporting to older kernels).

Now the performance optimization to make isw_nr_in_flight per
superblock makes sense and I'm open to it as well. But it should be a
separate commit with proper justification and ideally some numbers backing
it showing the benefit.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR