Re: [syzbot] [mm?] WARNING: bad unlock balance in do_wp_page

From: Andrew Morton

Date: Sun Apr 26 2026 - 13:55:40 EST

On Sun, 26 Apr 2026 23:57:42 +0800 Qi Zheng <qi.zheng@xxxxxxxxx> wrote:

> Hi Andrew,
>
> On 4/26/26 6:49 PM, Andrew Morton wrote:
> > On Sun, 26 Apr 2026 01:17:25 -0700 syzbot <syzbot+7d60b33a8a546263da7c@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> Hello,
> >>
> >> syzbot found the following issue on:
> >>
> >> HEAD commit: 6596a02b2078 Merge tag 'drm-next-2026-04-22' of https://gi..
> >> git tree: upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=12483702580000
> >> kernel config: https://syzkaller.appspot.com/x/.config?x=24c8da4692f901cb
> >> dashboard link: https://syzkaller.appspot.com/bug?extid=7d60b33a8a546263da7c
> >> compiler: gcc (Debian 14.2.0-19) 14.2.0, GNU ld (GNU Binutils for Debian) 2.44
> >> userspace arch: i386
> >>
> >> Unfortunately, I don't have any reproducer for this issue yet.
> >
> > argh, that dreaded sentence.
> >
> > Thanks.
> >
> > Something's definitely amiss. This is at least the fifth report of
> > rcu_read_lock() imbalance post-7.0. Others:
> >
> > https://lore.kernel.org/69eab803.a00a0220.17a17.004a.GAE@xxxxxxxxxx
> > https://lore.kernel.org/69eab803.a00a0220.17a17.004b.GAE@xxxxxxxxxx
> > https://lore.kernel.org/69eafb0e.a00a0220.9259.0031.GAE@xxxxxxxxxx
> > https://lore.kernel.org/69ebcbe2.a00a0220.7773.0005.GAE@xxxxxxxxxx
>
> All the kernel configs mentioned above include 'CONFIG_MEMCG_V1=y'.
>
> Theoretically, a rebind_subsystems() can lead to an RCU imbalance; see my
> previous discussion with Shakeel for details:
>
> https://lore.kernel.org/all/358c60e1-fa91-40a1-9e00-84c93340c04e@xxxxxxxxx/

Right, that looks similar.

The rcu locking under lruvec_stat_mod_folio() is very simple, and that
return in get_non_dying_memcg_end() does look super suspicious. Why
does it omit the unlock?

OTOH, in
https://lore.kernel.org/all/69eafb0e.a00a0220.9259.0031.GAE@xxxxxxxxxx/
we're trying to release an rcu_read_lock() which isn't presently held.
But if cgroup_subsys_on_dfl() were to become false between the
get_non_dying_memcg_start/end pair, that's what would happen.
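
To make the suspected failure mode concrete, here is a small userspace
model. It is only a sketch, not the kernel code: guard_active stands in
for whatever the cgroup_subsys_on_dfl()-based check decides, rcu_depth
stands in for the read-side nesting count, and every model_* name is
made up for illustration. Which direction of the flip gives which
symptom depends on the polarity of the real check, which this model
does not claim to reproduce.

```c
/* Userspace model only; all names here are hypothetical. */
#include <stdbool.h>

static bool guard_active = true; /* models the result of the on_dfl-based check */
static int rcu_depth;            /* models the RCU read-side nesting count */

/* Start helper: takes the "lock" only when the check says so. */
static void model_start(void)
{
	if (!guard_active)
		return;
	rcu_depth++;
}

/* End helper: re-evaluates the check instead of remembering the decision. */
static void model_end(void)
{
	if (!guard_active)
		return; /* early return skips the unlock */
	rcu_depth--;
}

/*
 * Run one start/end pair, flipping the check in between the way a
 * concurrent rebind_subsystems() might. Returns the leftover depth:
 * 0 is balanced, 1 is a leaked lock, -1 is an unlock of a lock that
 * was never taken.
 */
static int model_run(bool before, bool after)
{
	guard_active = before;
	rcu_depth = 0;
	model_start();
	guard_active = after;
	model_end();
	return rcu_depth;
}
```

With the check stable across the pair, model_run() returns 0. A flip in
one direction leaves the depth at 1 (never released), and a flip in the
other leaves it at -1 (released too often), which matches the two
flavours of report we've been seeing.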

So yup, I agree, concurrent rebind_subsystems() activity could cause
all of this. The reports are pretty common - is there some debugging
patch we can temporarily add to confirm this theory? And/or is it
possible to cook up a selftest which will trigger this?
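
One possible shape for such a debugging check, again as a userspace
sketch rather than a proposed patch: latch the decision made at start
time and compare it with a fresh evaluation at end time. The dbg_*
names and the token struct are hypothetical; in a real kernel patch
the mismatch branch might be a WARN_ON_ONCE(), and where the latched
decision would live is left open.

```c
/* Userspace model only; all names here are hypothetical. */
#include <stdbool.h>

static bool dbg_guard_active = true; /* models the result of the on_dfl-based check */
static int dbg_depth;                /* models the RCU read-side nesting count */
static int dbg_mismatches;           /* how often start and end disagreed */

struct dbg_token {
	bool locked; /* decision latched at start time */
};

static struct dbg_token dbg_start(void)
{
	struct dbg_token t = { .locked = dbg_guard_active };

	if (t.locked)
		dbg_depth++;
	return t;
}

static void dbg_end(struct dbg_token t)
{
	/* The check changed between start and end: record it. */
	if (t.locked != dbg_guard_active)
		dbg_mismatches++;

	/* Unlock according to the latched decision, so the model stays balanced. */
	if (t.locked)
		dbg_depth--;
}

/* Run one pair with a flip in between; returns the number of mismatches seen. */
static int dbg_run(bool before, bool after)
{
	struct dbg_token t;

	dbg_guard_active = before;
	dbg_depth = 0;
	dbg_mismatches = 0;
	t = dbg_start();
	dbg_guard_active = after;
	dbg_end(t);
	return dbg_mismatches;
}
```

If a check of this kind ever fired under syzbot, that would be fairly
direct evidence that the condition is changing inside the pair.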

> However, in a production environment, this is practically impossible.

Can you expand on this?

syzbot isn't a production environment ;)

> So Shakeel and I chose to wait for a reproducer at the time. :(
>
> >
> > In some cases we released it too often, in other cases we failed to
> > release it.
> >
> > The first one is slightly more useful in that it tells us that the
> > not-released rcu_read_lock() was taken in folio_lruvec_lock_irqsave().
>
> I double-checked some callers of folio_lruvec_lock_irqsave() (such as
> folios_put_refs()), but didn't find anything suspicious. :(

Right - it's rare and smells of a race condition.