Re: [REGRESSION] Null pointer dereference while shrinking zswap
From: Johannes Weiner
Date: Fri Apr 19 2024 - 10:22:47 EST
On Thu, Apr 18, 2024 at 01:09:22PM -0700, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 5:40 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Wed, Apr 17, 2024 at 07:18:14PM +0200, Christian Heusel wrote:
> > > On 24/04/17 10:33AM, Johannes Weiner wrote:
> > > >
> > > > Christian, can you please test the below patch on top of current
> > > > upstream?
> > > >
> > >
> > > Hey Johannes,
> > >
> > > I have applied your patch on top of 6.9-rc4 and it did solve the crash for
> > > me, thanks for hacking together a fix so quickly! 🤗
> > >
> > > Tested-By: Christian Heusel <christian@xxxxxxxxx>
> >
> > Thanks for confirming it, and sorry about the breakage.
> >
> > Andrew, can you please use the updated changelog below?
> >
> > ---
> >
> > From 52f67f5fab6a743c2aedfc8e04a582a9d1025c28 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Date: Thu, 18 Apr 2024 08:26:28 -0400
> > Subject: [PATCH] mm: zswap: fix shrinker NULL crash with cgroup_disable=memory
> >
> > Christian reports a NULL deref in zswap that he bisected down to the
> > zswap shrinker. The issue also cropped up in the bug trackers of
> > libguestfs [1] and the Red Hat bugzilla [2].
> >
> > The problem is that when memcg is disabled with the boot time flag,
> > the zswap shrinker might get called with sc->memcg == NULL. This is
> > okay in many places, like the lruvec operations. But it crashes in
> > memcg_page_state() - which is only used due to the non-node accounting
> > of cgroup's the zswap memory to begin with.
> >
> > Nhat spotted that the memcg can be NULL in the memcg-disabled case,
> > and I was then able to reproduce the crash locally as well.
> >
> > [1] https://github.com/libguestfs/libguestfs/issues/139
> > [2] https://bugzilla.redhat.com/show_bug.cgi?id=2275252
> >
> > Fixes: b5ba474f3f51 ("zswap: shrink zswap pool based on memory pressure")
> > Cc: stable@xxxxxxxxxxxxxxx [v6.8]
> > Link: https://lkml.kernel.org/r/20240417143324.GA1055428@xxxxxxxxxxx
> > Reported-by: Christian Heusel <christian@xxxxxxxxx>
> > Debugged-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > Suggested-by: Nhat Pham <nphamcs@xxxxxxxxx>
> > Tested-By: Christian Heusel <christian@xxxxxxxxx>
> > Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>
> Thanks for fixing this. A couple of comments/questions below, but anyway LGTM:
>
> Acked-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
>
> > ---
> > mm/zswap.c | 25 ++++++++++++++++---------
> > 1 file changed, 16 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index caed028945b0..6f8850c44b61 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1331,15 +1331,22 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > if (!gfp_has_io_fs(sc->gfp_mask))
> > return 0;
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> > - mem_cgroup_flush_stats(memcg);
> > - nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > - nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > -#else
> > - /* use pool stats instead of memcg stats */
> > - nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
> > - nr_stored = atomic_read(&zswap_nr_stored);
> > -#endif
> > + /*
> > + * For memcg, use the cgroup-wide ZSWAP stats since we don't
> > + * have them per-node and thus per-lruvec. Careful if memcg is
> > + * runtime-disabled: we can get sc->memcg == NULL, which is ok
> > + * for the lruvec, but not for memcg_page_state().
> > + *
> > + * Without memcg, use the zswap pool-wide metrics.
> > + */
> > + if (!mem_cgroup_disabled()) {
>
> With the current shrinker code, it seems like we cannot get sc->memcg
> == NULL unless mem_cgroup_disabled() is true indeed. However, maybe
> it's better to check if sc->memcg is not NULL directly instead?
>
> This would be more resilient in case the shrinker code changes and
> passing sc->memcg == NULL becomes possible in other cases (e.g. during
> global shrinking). It seems like other shrinkers do this, for example
> see count_shadow_nodes() and deferred_split_count().
Eh, I'm not sure it's better or worse, so it's a bit hard to care. We
shouldn't get NULL here when memcg is enabled, and if somebody
introduces that bug it's better to catch it early than run into subtle
priority inversions when the kernel is deployed to millions of hosts.
> I am also wondering if we should also check !mem_cgroup_is_root()
> here? We can avoid the expensive global flush and use the global stats
> directly in this case. I could also send a follow up patch for this if
> that's preferred.
I'd rather not proliferate more memcg internals in this code. If this
is a concern, optimizing it in the flush and stat functions would make
more sense. Reclaim already flushes the subtree before getting here,
so odds are good this is a no-op in most cases.