Re: [PATCH v7] zswap: replace RB tree with xarray
From: Yosry Ahmed
Date: Wed Mar 20 2024 - 15:34:47 EST
On Wed, Mar 20, 2024 at 03:25:58PM -0400, Johannes Weiner wrote:
> On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> > On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > > [..]
> > > > > > > - /* map */
> > > > > > > - spin_lock(&tree->lock);
> > > > > > > /*
> > > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > > + * This is safe because:
> > > > > > > + *
> > > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > > + *
> > > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > > + * an incoherent entry.
> > > > > >
> > > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > > so I don't think the publishing order actually matters here. Right?
> > > > >
> > > > > Right. This comment is explaining why this publishing order does not
> > > > > matter. I think we are talking about the same thing here?
> > > >
> > > > The comment literally says "the publishing order matters.." :)
> > > >
> > > > I believe Johannes meant that we should only publish the entry to the
> > > > LRU once it is fully initialized, to prevent writeback from using a
> > > > partially initialized entry.
> > > >
> > > > What I am saying is that, even if we add a partially initialized entry
> > > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > > locked in the swapcache.
> > > >
> > > > So basically I think the comment should say:
> > > >
> > > > /*
> > > > * We finish initializing the entry while it's already in the
> > > > * xarray. This is safe because the folio is locked in the swap
> > > > * cache, which should protect against concurrent stores,
> > > > * invalidations, and writeback.
> > > > */
> > > >
> > > > Johannes, what do you think?
> > >
> > > I don't think that's quite right.
> > >
> > > Writeback will bail on swapcache insert, yes, but it will access the
> > > entry before attempting it. If LRU publishing happened before setting
> > > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > > it would be safe to rearrange the code like this.
> > >
> > > So LRU publishing order does matter.
> >
> > Ah yes, you are right. entry->swpentry should be set to make sure we
> > lookup the correct entry in the swapcache and the tree.
> >
> > Perhaps we should spell this out in the comment and make the
> > initialization ordering more explicit? Maybe something like:
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index d8a14b27adcd7..70924b437743a 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> > goto put_pool;
> >
> > insert_entry:
> > - entry->swpentry = swp;
> > - entry->objcg = objcg;
> > -
> > old = xa_store(tree, offset, entry, GFP_KERNEL);
> > if (xa_is_err(old)) {
> > int err = xa_err(old);
> > @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> > if (old)
> > zswap_entry_free(old);
> >
> > + entry->objcg = objcg;
> > if (objcg) {
> > obj_cgroup_charge_zswap(objcg, entry->length);
> > count_objcg_event(objcg, ZSWPOUT);
> > @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
> >
> > /*
> > * We finish initializing the entry while it's already in xarray.
> > - * This is safe because:
> > - *
> > - * 1. Concurrent stores and invalidations are excluded by folio lock.
> > + * This is safe because the folio is locked in the swapcache, which
> > + * protects against concurrent stores and invalidations.
> > *
> > - * 2. Writeback is excluded by the entry not being on the LRU yet.
> > - * The publishing order matters to prevent writeback from seeing
> > - * an incoherent entry.
> > + * Concurrent writeback is not possible until we add the entry to the
> > + * LRU. We need to at least initialize entry->swpentry *before* adding
> > + * the entry to the LRU to make sure writeback looks up the correct
> > + * entry in the swapcache.
> > */
> > if (entry->length) {
> > + entry->swpentry = swp;
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap_list_lru, entry);
> > atomic_inc(&zswap_nr_stored);
> >
> >
> > This also got me wondering, do we need a write barrier between
> > initializing entry->swpentry and zswap_lru_add()?
> >
> > I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> > eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> > bug-prone.
>
> I think it's more robust the way Chris has it now. Writeback only
> derefs ->swpentry today, but who knows if somebody wants to make a
> changes that relies on a different member. Having submembers follow
> different validity rules and timelines is error prone and makes the
> code less hackable without buying all that much. The concept of
> "publishing" an object like this is more common: if you can see it,
> you can expect it to be coherent.
Fair enough, but don't we still need a barrier there? Couldn't some
initializations still be reorder after zswap_lru_add()?