Re: [PATCH v7] zswap: replace RB tree with xarray
From: Johannes Weiner
Date: Wed Mar 20 2024 - 15:26:17 EST
On Wed, Mar 20, 2024 at 07:11:38PM +0000, Yosry Ahmed wrote:
> On Wed, Mar 20, 2024 at 06:08:03AM -0400, Johannes Weiner wrote:
> > On Wed, Mar 20, 2024 at 07:24:27AM +0000, Yosry Ahmed wrote:
> > > [..]
> > > > > > - /* map */
> > > > > > - spin_lock(&tree->lock);
> > > > > > /*
> > > > > > - * The folio may have been dirtied again, invalidate the
> > > > > > - * possibly stale entry before inserting the new entry.
> > > > > > + * We finish initializing the entry while it's already in xarray.
> > > > > > + * This is safe because:
> > > > > > + *
> > > > > > + * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > > > + *
> > > > > > + * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > > > + * The publishing order matters to prevent writeback from seeing
> > > > > > + * an incoherent entry.
> > > > >
> > > > > As I mentioned before, writeback is also protected by the folio lock.
> > > > > Concurrent writeback will find the folio in the swapcache and abort. The
> > > > > fact that the entry is not on the LRU yet is just additional protection,
> > > > > so I don't think the publishing order actually matters here. Right?
> > > >
> > > > Right. This comment is explaining why this publishing order does not
> > > > matter. I think we are talking about the same thing here?
> > >
> > > The comment literally says "the publishing order matters.." :)
> > >
> > > I believe Johannes meant that we should only publish the entry to the
> > > LRU once it is fully initialized, to prevent writeback from using a
> > > partially initialized entry.
> > >
> > > What I am saying is that, even if we add a partially initialized entry
> > > to the zswap LRU, writeback will skip it anyway because the folio is
> > > locked in the swapcache.
> > >
> > > So basically I think the comment should say:
> > >
> > > /*
> > > * We finish initializing the entry while it's already in the
> > > * xarray. This is safe because the folio is locked in the swap
> > > * cache, which should protect against concurrent stores,
> > > * invalidations, and writeback.
> > > */
> > >
> > > Johannes, what do you think?
> >
> > I don't think that's quite right.
> >
> > Writeback will bail on swapcache insert, yes, but it will access the
> > entry before attempting it. If LRU publishing happened before setting
> > entry->swpentry e.g., we'd have a problem, while your comment suggets
> > it would be safe to rearrange the code like this.
> >
> > So LRU publishing order does matter.
>
> Ah yes, you are right. entry->swpentry should be set to make sure we
> lookup the correct entry in the swapcache and the tree.
>
> Perhaps we should spell this out in the comment and make the
> initialization ordering more explicit? Maybe something like:
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index d8a14b27adcd7..70924b437743a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1472,9 +1472,6 @@ bool zswap_store(struct folio *folio)
> goto put_pool;
>
> insert_entry:
> - entry->swpentry = swp;
> - entry->objcg = objcg;
> -
> old = xa_store(tree, offset, entry, GFP_KERNEL);
> if (xa_is_err(old)) {
> int err = xa_err(old);
> @@ -1491,6 +1488,7 @@ bool zswap_store(struct folio *folio)
> if (old)
> zswap_entry_free(old);
>
> + entry->objcg = objcg;
> if (objcg) {
> obj_cgroup_charge_zswap(objcg, entry->length);
> count_objcg_event(objcg, ZSWPOUT);
> @@ -1498,15 +1496,16 @@ bool zswap_store(struct folio *folio)
>
> /*
> * We finish initializing the entry while it's already in xarray.
> - * This is safe because:
> - *
> - * 1. Concurrent stores and invalidations are excluded by folio lock.
> + * This is safe because the folio is locked in the swapcache, which
> + * protects against concurrent stores and invalidations.
> *
> - * 2. Writeback is excluded by the entry not being on the LRU yet.
> - * The publishing order matters to prevent writeback from seeing
> - * an incoherent entry.
> + * Concurrent writeback is not possible until we add the entry to the
> + * LRU. We need to at least initialize entry->swpentry *before* adding
> + * the entry to the LRU to make sure writeback looks up the correct
> + * entry in the swapcache.
> */
> if (entry->length) {
> + entry->swpentry = swp;
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap_list_lru, entry);
> atomic_inc(&zswap_nr_stored);
>
>
> This also got me wondering, do we need a write barrier between
> initializing entry->swpentry and zswap_lru_add()?
>
> I guess if we read the wrong swpentry in zswap_writeback_entry() we will
> eventually fail the xa_cmpxchg() and drop it anyway, but it seems
> bug-prone.
I think it's more robust the way Chris has it now. Writeback only
derefs ->swpentry today, but who knows if somebody wants to make a
changes that relies on a different member. Having submembers follow
different validity rules and timelines is error prone and makes the
code less hackable without buying all that much. The concept of
"publishing" an object like this is more common: if you can see it,
you can expect it to be coherent.