RE: [PATCH 7/8] zswap: add to mm/

From: Dan Magenheimer
Date: Wed Jan 02 2013 - 12:09:27 EST

Next message: Jerome Glisse: "Re: radeon 0000:02:00.0: GPU lockup CP stall for more than 10000msec"
Previous message: Stephane Eranian: "Re: [PATCH v4 01/18] perf, x86: Support CPU specific sysfs events"
In reply to: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Next in thread: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

> From: Seth Jennings [mailto:sjenning@xxxxxxxxxxxxxxxxxx]
> Subject: Re: [PATCH 7/8] zswap: add to mm/
>
> > I am eagerly studying one of the details of your zswap "flush"
> > code in this patch to see how you solved a problem or two that
> > I was struggling with for the similar mechanism RFC'ed for zcache
> > (see https://lkml.org/lkml/2012/10/3/558). I like the way
> > that you force the newly-uncompressed to-be-flushed page immediately
> > into a swap bio in zswap_flush_entry via the call to swap_writepage,
> > though I'm not entirely convinced that there aren't some race
> > conditions there. However, won't swap_writepage simply call
> > frontswap_store instead and re-compress the page back into zswap?
>
> I break part of swap_writepage() into a bottom half called
> __swap_writepage() that doesn't include the call to frontswap_store().
> __swap_writepage() is what is called from zswap_flush_entry(). That
> is how I avoid flushed pages recycling back into zswap and the
> potential recursion mentioned.

OK, I missed that. Nice. I will see if I can use the
same with zcache and, if so, would be happy to support
the change to swap_writepage.

In your next version, maybe you could break out that chunk
into a separate distinct patch so it can be pulled separately
into Andrew's tree?

> > A second related issue that concerns me is that, although you
> > are now, like zcache2, using an LRU queue for compressed pages
> > (aka "zpages"), there is no relationship between that queue and
> > physical pageframes. In other words, you may free up 100 zpages
> > out of zswap via zswap_flush_entries, but not free up a single
> > pageframe. This seems like a significant design issue. Or am
> > I misunderstanding the code?
>
> You understand correctly. There is room for optimization here and it
> is something I'm working on right now.
>
> What I'm looking to do is give zswap a little insight into zsmalloc
> internals,

Not to be at all snide, but had you been as eager to break
the zsmalloc abstraction last spring, a lot of unpleasantness
and extra work might have been avoided. :v(

> namely the ability figure out what class size a particular
> allocation is in and, in the event the store can't be satisfied, flush
> an entry from that exact class size so that we can be assured the
> store will succeed with minimal flushing work. In this solution,
> there would be an LRU list per zsmalloc class size tracked in zswap.
> The result is LRU-ish flushing overall with class size being the first
> flush selection criteria and LRU as the second.

Clever and definitely useful, though I think there are two related
problems and IIUC this solves only one of them. The problem it _does_
solve is (A) where to put a new zpage: Move a zpage from the same
class to real-swap-disk and then fill its slot with the new zpage.
The problem it _doesn't_ solve is (B) how to shrink the total number
of pageframes used by zswap, even by a single page. I believe
(though cannot prove right now) that this latter problem will
need to be solved to implement any suitable MM policy for balancing
pages-used-for-compression vs pages-not-used-for-compression.

I fear that problem (B) is the fundamental concern with
using a high-density storage allocator such as zsmalloc, which
is why I abandoned zsmalloc in favor of a more-predictable-but-
less-dense allocator (zbud). However, if you have a solution
for (B) as well, I would gladly abandon zbud in zcache (for _both_
cleancache and frontswap pages) and our respective in-kernel
compression efforts would be more easy to merge into one solution
in the future.

> > A third concern is about scalability... the locking seems very
> > coarse-grained. In zcache, you personally observed and fixed
> > hashbucket contention (see https://lkml.org/lkml/2011/9/29/215).
> > Doesn't zswap's tree_lock essentially use a single tree (per
> > swaptype), i.e. no scalability?
>
> The reason the coarse lock isn't a problem for zswap like the hash
> bucket locks where in zcache is that the lock is not held for long
> periods time as it is in zcache. It is only held while operating on
> the tree, not during compression/decompression and larger memory
> operations.

Hmmm... IIRC, to avoid races in zcache, it was necessary to
update both the data (zpage) and meta-data ("tree" in zswap,
and tmem-data-structure in zcache) atomically. I will need
to study your code more to understand how zswap avoids this
requirement. Or if it is obvious to you, I would be grateful
if you would point it out to me.

Thanks,
Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Jerome Glisse: "Re: radeon 0000:02:00.0: GPU lockup CP stall for more than 10000msec"
Previous message: Stephane Eranian: "Re: [PATCH v4 01/18] perf, x86: Support CPU specific sysfs events"
In reply to: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Next in thread: Seth Jennings: "Re: [PATCH 7/8] zswap: add to mm/"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]