Re: [PATCH v7] zswap: replace RB tree with xarray
From: Yosry Ahmed
Date: Wed Mar 20 2024 - 16:12:38 EST
On Wed, Mar 20, 2024 at 04:03:22PM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:34:36PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> > > On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > > > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > > > [..]
> > > > > > > > > - /* map */
> > > > > > > > > - spin_lock(&tree->lock);
> > > > > > > > > /*
> > > > > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > > > > + * This is safe because:
> > > > > > > > > + *
> > > > > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > > > + *
> > > > > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > > > > + * an incoherent entry.
> > > > > > > >
> > > > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > > > so I don't think the publishing order actually matters here. Right?
> > > > > > >
> > > > > > > Right. This comment is explaining why this publishing order does not
> > > > > > > matter. I think we are talking about the same thing here?
> > > > > >
> > > > > > The comment literally says "the publishing order matters.." :)
> > > > > >
> > > > > > I believe Johannes meant that we should only publish the entry to the
> > > > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > > > partially initialized entry.
> > > > > >
> > > > > > What I am saying is that, even if we add a partially initialized entry
> > > > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > > > locked in the swapcache.
> > > > > >
> > > > > > So basically I think the comment should say:
> > > > > >
> > > > > > /*
> > > > > > * We finish initializing the entry while it's already in the
> > > > > > * xarray. This is safe because the folio is locked in the swap
> > > > > > * cache, which should protect against concurrent stores,
> > > > > > * invalidations, and writeback.
> > > > > > */
> > > > > >
> > > > > > Johannes, what do you think?
> > > > >
> > > > > I don't think that's quite right.
> > > > >
> > > > > Writeback will bail on swapcache insert, yes, but it will access the
> > > > > entry before attempting it. If LRU publishing happened before setting
> > > > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > > > it would be safe to rearrange the code like this.
> > > > >
> > > > > So LRU publishing order does matter.
> > > >
> > > > Ah yes, you are right. entry->swpentry should be set to make sure we
> > > > lookup the correct entry in the swapcache and the tree.
> > > >
> > > > Perhaps we should spell this out in the comment and make the
> > > > initialization ordering more explicit? Maybe something like:
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index d8a14b27adcd7..70924b437743a 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> > > > goto put_pool;
> > > >
> > > > insert_entry:
> > > > - entry->swpentry = swp;
> > > > - entry->objcg = objcg;
> > > > -
> > > > old = xa_store(tree, offset, entry, GFP_KERNEL);
> > > > if (xa_is_err(old)) {
> > > > int err = xa_err(old);
> > > > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> > > > if (old)
> > > > zswap_entry_free(old);
> > > >
> > > > + entry->objcg = objcg;
> > > > if (objcg) {
> > > > obj_cgroup_charge_zswap(objcg, entry->length);
> > > > count_objcg_event(objcg, ZSWPOUT);
> > > > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> > > >
> > > > /*
> > > > * We finish initializing the entry while it's already in xarray.
> > > > - * This is safe because:
> > > > - *
> > > > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > + * This is safe because the folio is locked in the swapcache, which
> > > > + * protects against concurrent stores and invalidations.
> > > > *
> > > > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > - * The publishing order matters to prevent writeback from seeing
> > > > - * an incoherent entry.
> > > > + * Concurrent writeback is not possible until we add the entry to the
> > > > + * LRU. We need to at least initialize entry->swpentry *before* adding
> > > > + * the entry to the LRU to make sure writeback looks up the correct
> > > > + * entry in the swapcache.
> > > > */
> > > > if (entry->length) {
> > > > + entry->swpentry = swp;
> > > > INIT_LIST_HEAD(&entry->lru);
> > > > zswap_lru_add(&zswap_list_lru, entry);
> > > > atomic_inc(&zswap_nr_stored);
> > > >
> > > >
> > > > This also got me wondering, do we need a write barrier between
> > > > initializing entry->swpentry and zswap_lru_add()?
> > > >
> > > > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > > > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > > > bug-prone.
> > >
> > > I think it's more robust the way Chris has it now. Writeback only
> > > derefs ->swpentry today, but who knows if somebody wants to make a
> > > changes that relies on a different member. Having submembers follow
> > > different validity rules and timelines is error prone and makes the
> > > code less hackable without buying all that much. The concept of
> > > "publishing" an object like this is more common: if you can see it,
> > > you can expect it to be coherent.
> >
> > Fair enough, but don't we still need a barrier there? Couldn't some
> > initializations still be reorder after zswap_lru_add()?
>
> Only if it were lockless. The LRU unlocking in zswap_store() implies
> RELEASE, the LRU locking in writeback implies ACQUIRE. Those force the
> desired ordering - nothing can bleed after RELEASE, nothing can bleed
> before ACQUIRE.
Ah yes, I found smp_store_release() deep in the spin_unlock() code.
Thanks for pointing this out, all is good in the world.