Re: [PATCH 2/2] mm: zswap: remove unnecessary tree cleanups in zswap_swapoff()
From: Yosry Ahmed
Date: Thu Jan 25 2024 - 17:34:49 EST
On Thu, Jan 25, 2024 at 2:31 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> Hi Yosry,
>
> On Thu, Jan 25, 2024 at 12:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 25, 2024 at 10:55 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > Hi Yosry,
> > >
> > > On Wed, Jan 24, 2024 at 11:59 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jan 24, 2024 at 9:29 PM Chris Li <chriscli@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi Yosry,
> > > > >
> > > > > On Tue, Jan 23, 2024 at 10:58 PM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > > > >
> > > > >
> > > > > > >
> > > > > > > Thanks for the great analysis, I missed the swapoff/swapon race myself :)
> > > > > > >
> > > > > > > The first solution that came to mind for me was refcounting the zswap
> > > > > > > tree with RCU via percpu-refcount, similar to how cgroup refs are
> > > > > > > handled (init in zswap_swapon() and kill in zswap_swapoff()). I think
> > > > > > > the percpu-refcount may be overkill in terms of memory usage though.
> > > > > > > I think we can still do our own refcounting with RCU, but it may be
> > > > > > > more complicated.
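> > > > > > >
> > > > > > > Very roughly, an untested sketch of what I mean (the names below are
> > > > > > > illustrative, not actual zswap code):
> > > > > > >
> > > > > > > struct zswap_tree {
> > > > > > > 	struct rb_root rbroot;
> > > > > > > 	spinlock_t lock;
> > > > > > > 	struct percpu_ref ref;	/* pins the tree against swapoff */
> > > > > > > };
> > > > > > >
> > > > > > > static void zswap_tree_release(struct percpu_ref *ref)
> > > > > > > {
> > > > > > > 	struct zswap_tree *tree = container_of(ref, struct zswap_tree, ref);
> > > > > > >
> > > > > > > 	percpu_ref_exit(&tree->ref);
> > > > > > > 	kfree(tree);
> > > > > > > }
> > > > > > >
> > > > > > > /* zswap_swapon(): hold the initial reference */
> > > > > > > percpu_ref_init(&tree->ref, zswap_tree_release, 0, GFP_KERNEL);
> > > > > > >
> > > > > > > /* writeback: only touch the tree if it is still live */
> > > > > > > if (!percpu_ref_tryget_live(&tree->ref))
> > > > > > > 	return -EAGAIN;	/* raced with swapoff */
> > > > > > > /* ... tree->lock and tree->rbroot are safe to use here ... */
> > > > > > > percpu_ref_put(&tree->ref);
> > > > > > >
> > > > > > > /* zswap_swapoff(): drop the initial ref; the release callback
> > > > > > >  * runs once all outstanding references are gone */
> > > > > > > percpu_ref_kill(&tree->ref);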
> > > > > >
> > > > > > FWIW, I was able to reproduce the problem in a vm with the following
> > > > > > kernel diff:
> > > > >
> > > > > Thanks for the great find.
> > > > >
> > > > > I was worried about the use-after-free situation in this email:
> > > > >
> > > > > https://lore.kernel.org/lkml/CAF8kJuOvOmn7wmKxoqpqSEk4gk63NtQG1Wc+Q0e9FZ9OFiUG6g@xxxxxxxxxxxxxx/
> > > > >
> > > > > Glad you were able to find a reproducible case. That is one of the
> > > > > reasons I changed the free to invalidating entries in my xarray patch.
> > > > >
> > > > > I think the swapoff code should remove the entry from the tree, then
> > > > > just wait for each zswap entry's refcount to drop to zero before
> > > > > freeing it.
> > > >
> > > > This doesn't really help. The swapoff code already removes all the
> > > > entries from the trees, through zswap_invalidate(), before
> > > > zswap_swapoff() is called. The race I described occurs because the
> > > > writeback code accesses the entries through the LRU, not the tree. The
> > >
> > > Why?
> > > An entry not being in the tree is fine. What you describe means the
> > > swapoff code will not see the entry because it is already gone from
> > > the tree. The last holder of the reference count would free it.
> > >
> > > > writeback code could have isolated a zswap entry from the LRU before
> > > > swapoff, then tried to access it after swapoff. Although the zswap
> > >
> > > The writeback path should hold a reference count on the zswap entry.
> > > The entry will not be freed until the LRU side is done and drops the
> > > reference count to zero.
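> > >
> > > Roughly the pattern I mean (simplified; zswap_entry_get()/put() as in
> > > the current code):
> > >
> > > 	zswap_entry_get(entry);		/* writeback pins the entry */
> > > 	/* ... the entry itself is safe to use here ... */
> > > 	spin_lock(&tree->lock);
> > > 	zswap_entry_put(tree, entry);	/* freed only when this hits zero */
> > > 	spin_unlock(&tree->lock);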
> > >
> > > > entry itself is referenced and safe to use, accessing the tree to grab
> > > > the tree lock and check if the entry is still in the tree is the
> > > > problem.
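> > > >
> > > > To spell out the interleaving I mean (a rough timeline, not code):
> > > >
> > > >   writeback				swapoff
> > > >   ---------				-------
> > > >   isolate entry from the zswap LRU
> > > >   (a ref on the entry is held)
> > > > 					zswap_invalidate() empties the tree
> > > > 					zswap_swapoff() frees the tree
> > > >   spin_lock(&tree->lock)  <- UAF: the tree itself, not the entry,
> > > > 					is gone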
> > >
> > > Swapoff should wait until the whole LRU list from that tree has been
> > > consumed before destroying the tree.
> > > Swapoff walks all the process MMs anyway; walking all the memcgs and
> > > finding all the zswap entries in the zswap LRU should solve that
> > > problem.
> >
> > At that point, the entry is isolated from the zswap LRU list as well.
> > So even if swapoff iterates the zswap LRUs, it cannot find it to wait
>
> It just means that we need to defer removing the entry from the LRU,
> removing it only after the writeback has completed its critical
> steps.
>
> > for it. Take a closer look at the race condition I described. The
>
> I took a closer look at the sequence Chengming describes; it has the
> element of delaying the removal of the entry from the LRU as well.
>
> > problem is that after the entry is isolated from the zswap LRU, we
> > need to grab the tree lock to make sure it's still there and get a
> > ref, and just trying to lock the tree may be a UAF if we race with
> > swapoff.
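> >
> > A simplified sketch of the step I mean (illustrative; close to, but
> > not exactly, the current writeback code; offset is the entry's swap
> > offset):
> >
> > 	/* entry was isolated from the zswap LRU above */
> > 	spin_lock(&tree->lock);		/* tree may already be freed here */
> > 	if (zswap_rb_search(&tree->rbroot, offset) != entry) {
> > 		/* raced with invalidation; entry is gone from the tree */
> > 		spin_unlock(&tree->lock);
> > 		return -EAGAIN;
> > 	}
> > 	zswap_entry_get(entry);
> > 	spin_unlock(&tree->lock);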
>
> I feel it is very wrong to have the tree freed while there are
> outstanding entries allocated from the tree still pending.
> I would want to avoid that situation if possible.
This should be the case with Chengming's solution.
>
> >
> > > Anyway, I think it is easier to reason about the behavior that way.
> > > Swapoff will take the extra hit, but normal accesses of the tree
> > > will not.
> >
> > I think this adds a lot of unnecessary complexity tbh. I think the
> > operations reordering that Chengming described may be the way to go
> > here. It does not involve adding more logic or synchronization
>
> It does not require adding the tree reference count, right?
No, just reordering of operations in the writeback path.
>
> > primitives, just reordering operations to be closer to what we do in
> > zswap_load() and leveraging existing synchronization.
>
> The complexity is mostly for avoiding the tree reference count. If we
> don't add the tree refcount and we don't need the extra complexity of
> having the tree wait for the LRU, that sounds great to me.
I think that's what Chengming's proposal provides.
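
To spell out the reordering as I understand it, an untested sketch
(zswap_get_locked_folio() is a made-up placeholder for bringing the
folio into the swap cache, e.g. via __read_swap_cache_async(); the
signatures here are simplified and may not match the code exactly):

static int zswap_writeback_entry(struct zswap_entry *entry,
				 swp_entry_t swpentry)
{
	struct zswap_tree *tree = zswap_trees[swp_type(swpentry)];
	struct folio *folio;

	/*
	 * Step 1: pin the swap slot by bringing the page into the swap
	 * cache, locked, *before* touching the tree. If the slot is
	 * gone (e.g. swapoff already ran), bail out without ever
	 * dereferencing the tree.
	 */
	folio = zswap_get_locked_folio(swpentry);	/* hypothetical */
	if (!folio)
		return -ENOMEM;

	/*
	 * Step 2: a locked folio in the swap cache keeps swapoff
	 * (try_to_unuse()) from completing, which is the same
	 * protection zswap_load() runs under, so the tree is stable
	 * now and we can check that the entry was not invalidated
	 * under us.
	 */
	spin_lock(&tree->lock);
	if (zswap_rb_search(&tree->rbroot, swp_offset(swpentry)) != entry) {
		spin_unlock(&tree->lock);
		folio_put(folio);
		return -EEXIST;
	}
	spin_unlock(&tree->lock);

	/* ... decompress into the folio and submit the writeback ... */
	return 0;
}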