Re: [PATCH 2/2] mm: zswap.c: remove RB tree

From: Chris Li
Date: Fri Jan 19 2024 - 16:32:41 EST


On Fri, Jan 19, 2024 at 11:37 AM Yosry Ahmed <yosryahmed@xxxxxxxxxx> wrote:
> > > I think using the xas_* APIs can be avoided here. The only reason we
> > > need it is that we want to check if there's an existing entry first,
> > > and return -EEXIST. However, in that case, the caller will replace it
> > > anyway (and do some operations on the dupentry):
> >
> > We might be able to avoid it for the insert case if we don't mind
> > changing the code behavior a bit. My original intent was to stay
> > close to the original zswap code and not stir the pot too much in
> > the xarray replacement. We can always make more adjustments once
> > the RB tree is gone.
>
> I don't see how this changes code behavior though. The current code in
> zswap_store() will do the following:

I am referring to the warning and counter update happening after the
zswap mapping has already been updated. Maybe nobody actually cares
about that behavior difference, but in my mind there is a difference.

>
> - Hold the tree lock to make sure no one else modifies it.
> - Try to insert, check if there is already a dupentry at the index and
> return -EEXIST.
> - Warn, increment zswap_duplicate_entry, and invalidate the dupentry.
> - Try to insert again (this should always succeed since we are holding
> the lock).
>
> What I am proposing is:
> - zswap_xa_insert() is a thin wrapper around xa_store() (or we can
> remove it completely).
> - zswap_store() does the following:
> - Use zswap_xa_insert() and check if there is a returned dupentry.
> - Warn, increment zswap_duplicate_entry, and invalidate the dupentry.
>
> Either way, we always place the entry we have in the tree, and if
> there is a dupentry we warn and invalidate it. If anything, the latter
> is more straightforward.
>
> Am I missing something?

No, that is what I would do if I had to use the xa_* APIs.
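For reference, a minimal sketch of that flow (the wrapper name and the
exact call site in zswap_store() are illustrative; locking and error
handling are elided, and xa_store() failure would need xa_is_err()
checking in real code):

```c
/*
 * Sketch only: a thin wrapper around xa_store(), per the proposal
 * above.  xa_store() returns whatever entry previously occupied the
 * index (or NULL), so the dupentry check falls out naturally.
 */
static struct zswap_entry *zswap_xa_insert(struct zswap_tree *tree,
					   struct zswap_entry *entry)
{
	pgoff_t offset = swp_offset(entry->swpentry);

	return xa_store(&tree->xarray, offset, entry, GFP_KERNEL);
}

/* In zswap_store(), hypothetically: */
dupentry = zswap_xa_insert(tree, entry);
if (dupentry) {
	WARN_ON(1);
	zswap_duplicate_entry++;
	zswap_invalidate_entry(tree, dupentry);
}
```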


>
> > >
> > > > }
> > > >
> > > > static bool zswap_erase(struct zswap_tree *tree, struct zswap_entry *entry)
> > > > {
> > > > + struct zswap_entry *e;
> > > > pgoff_t offset = swp_offset(entry->swpentry);
> > > > - if (!RB_EMPTY_NODE(&entry->rbnode)) {
> > > > - struct zswap_entry *old;
> > > > - old = xa_erase(&tree->xarray, offset);
> > > > - BUG_ON(old != entry);
> > > > - rb_erase(&entry->rbnode, &tree->rbroot);
> > > > - RB_CLEAR_NODE(&entry->rbnode);
> > > > - return true;
> > > > - }
> > > > - return false;
> > > > + XA_STATE(xas, &tree->xarray, offset);
> > > > +
> > > > + do {
> > > > + xas_lock_irq(&xas);
> > > > + do {
> > > > + e = xas_load(&xas);
> > > > + } while (xas_retry(&xas, e));
> > > > + if (xas_valid(&xas) && e != entry) {
> > > > + xas_unlock_irq(&xas);
> > > > + return false;
> > > > + }
> > > > + xas_store(&xas, NULL);
> > > > + xas_unlock_irq(&xas);
> > > > + } while (xas_nomem(&xas, GFP_KERNEL));
> > > > + return !xas_error(&xas);
> > > > }
> > >
> > > Same here, I think we just want:
> > >
> > > return !!xa_erase(..);
> >
> > For the erase case it is tricky.
> > The current zswap code does not erase an entry if the tree entry at
> > the same offset has been changed. It is fine if the new entry is
> > NULL; that just means someone else raced us and removed the entry
> > already. However, if the entry is not NULL, then force resetting it
> > to NULL would change behavior compared to the current code.
>
> I see, very good point. I think we can use xa_cmpxchg() and pass in NULL?
>
That is certainly possible. Thanks for bringing it up.
Let me try combining the tree->lock with the xarray lock first. If
xa_cmpxchg() can simplify the result there, I will use it.
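A rough sketch of what the xa_cmpxchg() variant could look like (this
is not from the patch; it only illustrates the compare-and-swap idea
under the assumption that the xarray lock replaces tree->lock here):

```c
/*
 * Sketch only: erase the entry only if it is still the one stored at
 * that offset.  xa_cmpxchg() atomically replaces @entry with NULL and
 * returns the value that was actually there, so a concurrent
 * replacement leaves the tree untouched and we report false.
 */
static bool zswap_erase(struct zswap_tree *tree, struct zswap_entry *entry)
{
	pgoff_t offset = swp_offset(entry->swpentry);
	struct zswap_entry *old;

	old = xa_cmpxchg(&tree->xarray, offset, entry, NULL, GFP_KERNEL);
	return old == entry;
}
```

This preserves the current behavior Chris describes: if another entry
(or NULL, via a racing remover) is found at the offset, nothing is
overwritten.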


> Handling large folios in zswap is a much larger topic that involves a
> lot more than this xa_* vs. xas_* apis dispute. Let's not worry about
> this for now.

Ack. One more reason to use the XAS interface is that zswap currently
does multiple lookups in a typical zswap_load(). It first finds the
entry by offset (lookup one). Then, after the folio is installed into
the swap cache, it performs another lookup to delete the entry (lookup
two). Using XAS might let us cache the node location from the first
lookup and avoid a full tree walk on the second. That is not in my
current patch and can be a later improvement patch as well.
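The two-lookup pattern described above could, as a rough sketch, share
one XA_STATE so the second walk starts from the cached node rather
than the root. This is illustrative only (the intermediate
decompress/swap-cache steps are elided, and the cached state would
need revalidation under lock in real code):

```c
/*
 * Sketch only: reuse one xa_state across both zswap_load() lookups.
 * xas_load() caches the node it landed on; xas_store() can then
 * restart from that node instead of walking from the root, after
 * revalidating under the lock.
 */
XA_STATE(xas, &tree->xarray, offset);
struct zswap_entry *entry;

rcu_read_lock();
entry = xas_load(&xas);		/* lookup one: find the entry */
rcu_read_unlock();

/* ... decompress, install folio into the swap cache ... */

xas_lock_irq(&xas);
xas_store(&xas, NULL);		/* lookup two: reuses cached state */
xas_unlock_irq(&xas);
```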

Chris