Re: [PATCH 7/8] zswap: add to mm/

From: Seth Jennings
Date: Tue Jan 01 2013 - 12:52:21 EST

On 12/31/2012 05:06 PM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:sjenning@xxxxxxxxxxxxxxxxxx]
>> Subject: [PATCH 7/8] zswap: add to mm/
>> zswap is a thin compression backend for frontswap. It receives
>> pages from frontswap and attempts to store them in a compressed
>> memory pool, resulting in an effective partial memory reclaim and
>> dramatically reduced swap device I/O.
> Hi Seth --
> Happy (almost) New Year!

You too :) Thanks for taking a look at the code.

> I am eagerly studying one of the details of your zswap "flush"
> code in this patch to see how you solved a problem or two that
> I was struggling with for the similar mechanism RFC'ed for zcache
> (see I like the way
> that you force the newly-uncompressed to-be-flushed page immediately
> into a swap bio in zswap_flush_entry via the call to swap_writepage,
> though I'm not entirely convinced that there aren't some race
> conditions there. However, won't swap_writepage simply call
> frontswap_store instead and re-compress the page back into zswap?

I break part of swap_writepage() into a bottom half called
__swap_writepage() that doesn't include the call to frontswap_store().
__swap_writepage() is what is called from zswap_flush_entry(). That
is how I avoid flushed pages recycling back into zswap and the
potential recursion mentioned.

> A second related issue that concerns me is that, although you
> are now, like zcache2, using an LRU queue for compressed pages
> (aka "zpages"), there is no relationship between that queue and
> physical pageframes. In other words, you may free up 100 zpages
> out of zswap via zswap_flush_entries, but not free up a single
> pageframe. This seems like a significant design issue. Or am
> I misunderstanding the code?

You understand correctly. There is room for optimization here and it
is something I'm working on right now.

What I'm looking to do is give zswap a little insight into zsmalloc
internals, namely the ability figure out what class size a particular
allocation is in and, in the event the store can't be satisfied, flush
an entry from that exact class size so that we can be assured the
store will succeed with minimal flushing work. In this solution,
there would be an LRU list per zsmalloc class size tracked in zswap.
The result is LRU-ish flushing overall with class size being the first
flush selection criteria and LRU as the second.

> A third concern is about scalability... the locking seems very
> coarse-grained. In zcache, you personally observed and fixed
> hashbucket contention (see
> Doesn't zswap's tree_lock essentially use a single tree (per
> swaptype), i.e. no scalability?

The reason the coarse lock isn't a problem for zswap like the hash
bucket locks where in zcache is that the lock is not held for long
periods time as it is in zcache. It is only held while operating on
the tree, not during compression/decompression and larger memory

Also, I've done some lockstat checks and the zswap tree lock is way
down on the list contributing <1% of the lock contention wait time on
a 4-core system. The anon_vma lock is the primary bottleneck.


To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at