zsmalloc limitations and related topics

From: Dan Magenheimer
Date: Wed Feb 27 2013 - 18:25:11 EST


Hi all --

I've been doing some experimentation on zsmalloc in preparation
for my topic proposed for LSFMM13 and have run across some
perplexing limitations. Those familiar with the intimate details
of zsmalloc might be well aware of these limitations, but they
aren't documented or immediately obvious, so I thought it would
be worthwhile to air them publicly. I've also included some
measurements from the experimentation and some related thoughts.

(Some of the terms here are unusual and may be used inconsistently
by different developers so a glossary of definitions of the terms
used here is appended.)

ZSMALLOC LIMITATIONS

Zsmalloc is used for two zprojects: zram and the out-of-tree
zswap. Zsmalloc can achieve high density when "full". But:

1) Zsmalloc has a worst-case density of 0.25 (one zpage per
four pageframes).
2) When not full and especially when nearly-empty _after_
being full, density may fall below 1.0 as a result of
fragmentation.
3) Zsmalloc has a density of exactly 1.0 for any number of
zpages with zsize >= 0.8.
4) Zsmalloc contains several compile-time parameters;
the best value of these parameters may be very workload
dependent.

If density == 1.0, that means we are paying the overhead of
compression+decompression for no space advantage. If
density < 1.0, that means using zsmalloc is detrimental,
resulting in worse memory pressure than if it were not used.

WORKLOAD ANALYSIS

These limitations emphasize that the workload used to evaluate
zsmalloc is very important. Benchmarks that measure data
throughput or CPU utilization are of questionable value because
it is the _content_ of the data that is particularly relevant
for compression. Even more precisely, it is the "entropy"
of the data that is relevant, because the amount of
compressibility in the data is related to the entropy:
I.e. an entirely random pagefull of bits will compress poorly
and a highly-regular pagefull of bits will compress well.
Since the zprojects manage a large number of zpages, both
the mean and distribution of zsize of the workload should
be "representative".

The workload most widely used to publish results for
the various zprojects is a kernel-compile using "make -jN"
where N is artificially increased to impose memory pressure.
By adding some debug code to zswap, I was able to analyze
this workload and found the following:

1) The average page compressed by almost a factor of six
(mean zsize == 694, stddev == 474)
2) Almost eleven percent of the pages were zero pages. A
zero page compresses to 28 bytes.
3) On average, 77% of the bytes (3156) in the pages-to-be-
compressed contained a byte-value of zero.
4) Despite the above, mean density of zsmalloc was measured at
3.2 zpages/pageframe, presumably losing nearly half of
available space to fragmentation.

I have no clue if these measurements are representative
of a wide range of workloads over the lifetime of a booted
machine, but I am suspicious that they are not. For example,
the lzo1x compression algorithm claims to compress data by
about a factor of two.

I would welcome ideas on how to evaluate workloads for
"representativeness". Personally I don't believe we should
be making decisions about selecting the "best" algorithms
or merging code without an agreement on workloads.

PAGEFRAME EVACUATION AND RECLAIM

I've repeatedly stated the opinion that managing the number of
pageframes containing compressed pages will be valuable for
managing MM interaction/policy when compression is used in
the kernel. After the experimentation above and some brainstorming,
I still do not see an effective method for zsmalloc evacuating and
reclaiming pageframes, because both are complicated by high density
and page-crossing. In other words, zsmalloc's strengths may
also be its Achilles heels. For zram, as far as I can see,
pageframe evacuation/reclaim is irrelevant except perhaps
as part of mass defragmentation. For zcache and zswap, where
writethrough is used, pageframe evacuation/reclaim is very relevant.
(Note: The writeback implemented in zswap does _zpage_ evacuation
without pageframe reclaim.)

CLOSING THOUGHT

Since zsmalloc and zbud have different strengths and weaknesses,
I wonder if some combination or hybrid might be more optimal?
But unless/until we have and can measure a representative workload,
only intuition can answer that.

GLOSSARY

zproject -- a kernel project using compression (zram, zcache, zswap)
zpage -- a compressed sequence of PAGE_SIZE bytes
zsize -- the number of bytes in a compressed page
pageframe -- the term "page" is widely used both to describe
either (1) PAGE_SIZE bytes of data, or (2) a physical RAM
area with size=PAGE_SIZE which is PAGE_SIZE-aligned,
as represented in the kernel by a struct page. To be explicit,
we refer to (2) as a pageframe.
density -- zpages per pageframe; higher is (presumably) better
zsmalloc -- a slab-based allocator written by Nitin Gupta to
efficiently store zpages and designed to allow zpages
to be split across two non-contiguous pageframes
zspage -- a grouping of N non-contiguous pageframes managed
as a unit by zsmalloc to store zpages for which zsize
falls within a certain range. (The compile-time
default maximum size for N is 4).
zbud -- a buddy-based allocator written by Dan Magenheimer
(specifically for zcache) to predictably store zpages;
no more than two zpages are stored in any pageframe
pageframe evacuation/reclaim -- the process of removing
zpages from one or more pageframes, including pointers/nodes
from any data structures referencing those zpages,
so that the pageframe(s) can be freed for use by
the rest of the kernel
writeback -- the process of transferring zpages from
storage in a zproject to a backing swap device
lzo1x -- a compression algorithm used by default by all the
zprojects; the kernel implementation resides in lib/lzo.c
entropy -- randomness of data to be compressed; higher entropy
means worse data compression
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/