Re: [PATCH 3/4] zram: implement deduplication in zram
From: Minchan Kim
Date: Tue Mar 21 2017 - 19:41:34 EST
Hi Joonsoo,
On Thu, Mar 16, 2017 at 11:46:37AM +0900, js1304@xxxxxxxxx wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
>
> This patch implements deduplication feature in zram. The purpose
> of this work is naturally to save amount of memory usage by zram.
>
> Android is one of the biggest users to use zram as swap and it's
> really important to save amount of memory usage. There is a paper
> that reports that duplication ratio of Android's memory content is
> rather high [1]. And, there is a similar work on zswap that also
> reports that experiments has shown that around 10-15% of pages
> stored in zswp are duplicates and deduplicate them provides some
> benefits [2].
>
> Also, there is a different kind of workload that uses zram as blockdev
> and store build outputs into it to reduce wear-out problem of real
> blockdev. In this workload, deduplication hit is very high
> although I don't know exact reason about it.
Hmm, Isn't it due to binary composed by obj files so that many of
part between binary and object are sharable?
I'd like to clear it out because dedup was not useful for swap workload
for the testing in android unlike papers you mentioned.
Of course, it depends on the workload so someone might find it's
huge useful for his swap workload. However, I want to focus clear
winning case scenario rather than "might be better".
That's why I want to know clear reason the saving. If my assumption
is right(i.e., object file vs. binary file), it would be enough
justfication to merge this feature because user can turn on the feature
with reasonable expectation.
>
> Anyway, if we can detect duplicated content and avoid to store duplicated
> content at different memory space, we can save memory. This patch
> tries to do that.
>
> Implementation is almost simple and intuitive but I should note
> one thing about implementation detail.
>
> To check duplication, this patch uses checksum of the page and
> collision of this checksum could be possible. There would be
> many choices to handle this situation but this patch chooses
> to allow entry with duplicated checksum to be added to the hash,
> but, not to compare all entries with duplicated checksum
> when checking duplication. I guess that checksum collision is quite
Hmm, if there are many duplicated checksum, what a specific checksum
is compared among them?
> rare event and we don't need to pay any attention to such a case.
If it's rare event, why can't we iterate all of entries?
> Therefore, I decided the most simplest way to implement the feature.
> If there is a different opinion, I can accept and go that way.
>
> Following is the result of this patch.
>
> Test result #1 (Swap):
> Android Marshmallow, emulator, x86_64, Backporting to kernel v3.18
>
> orig_data_size: 145297408
> compr_data_size: 32408125
> mem_used_total: 32276480
> dup_data_size: 3188134
> meta_data_size: 1444272
>
> Last two metrics added to mm_stat are related to this work.
> First one, dup_data_size, is amount of saved memory by avoiding
> to store duplicated page. Later one, meta_data_size, is the amount of
> data structure to support deduplication. If dup > meta, we can judge
> that the patch improves memory usage.
>
> In Adnroid, we can save 5% of memory usage by this work.
>
> Test result #2 (Blockdev):
> build the kernel and store output to ext4 FS on zram
>
> <no-dedup>
> Elapsed time: 249 s
> mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
>
> <dedup>
> Elapsed time: 250 s
> mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
>
> There is no performance degradation and save 23% memory.
>
> Test result #3 (Blockdev):
> copy android build output dir(out/host) to ext4 FS on zram
>
> <no-dedup>
> Elapsed time: out/host: 88 s
> mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
>
> <dedup>
> Elapsed time: out/host: 100 s
> mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
>
> It shows performance degradation roughly 13% and save 24% memory. Maybe,
> it is due to overhead of calculating checksum and comparison.
>
> Test result #4 (Blockdev):
> copy android build output dir(out/target/common) to ext4 FS on zram
>
> <no-dedup>
> Elapsed time: out/host: 203 s
> mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
>
> <dedup>
> Elapsed time: out/host: 201 s
> mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
>
> Memory is saved by 42% and performance is the same. Even if there is overhead
> of calculating checksum and comparison, large hit ratio compensate it since
> hit leads to less compression attempt.
Cool! It would help a lot for users have used zram to build output directory.
>
> Anyway, benefit seems to be largely dependent on the workload so
> following patch will make this feature optional. However, this feature
> can help some usecases so is deserved to be merged.
>
> [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> dl.acm.org/citation.cfm?id=2797023
> [2]: zswap: Optimize compressed pool memory utilization,
> lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2
Overall, it looks good. Thanks!
However, I want to change function naming/structuring into my preference style
to maintain because it's non-trival feature. So, please look below.
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> ---
> drivers/block/zram/zram_drv.c | 212 ++++++++++++++++++++++++++++++++++++++----
> drivers/block/zram/zram_drv.h | 18 ++++
> 2 files changed, 214 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index a3d9cbca..012425f 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -32,6 +32,7 @@
> #include <linux/idr.h>
> #include <linux/sysfs.h>
> #include <linux/cpuhotplug.h>
> +#include <linux/jhash.h>
>
> #include "zram_drv.h"
>
> @@ -385,14 +386,16 @@ static ssize_t mm_stat_show(struct device *dev,
> max_used = atomic_long_read(&zram->stats.max_used_pages);
>
> ret = scnprintf(buf, PAGE_SIZE,
> - "%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> + "%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
> orig_size << PAGE_SHIFT,
> (u64)atomic64_read(&zram->stats.compr_data_size),
> mem_used << PAGE_SHIFT,
> zram->limit_pages << PAGE_SHIFT,
> max_used << PAGE_SHIFT,
> (u64)atomic64_read(&zram->stats.same_pages),
> - pool_stats.pages_compacted);
> + pool_stats.pages_compacted,
> + (u64)atomic64_read(&zram->stats.dup_data_size),
zram_dedeup_stat_read?
The reason to use wrapper function is I reallly want to remove any dedup
related code unless users turns on the feature in Kconfig.
Wrapper function would be more clean rather than ifdef.
> + (u64)atomic64_read(&zram->stats.meta_data_size));
> up_read(&zram->init_lock);
>
> return ret;
> @@ -419,29 +422,165 @@ static ssize_t debug_stat_show(struct device *dev,
> static DEVICE_ATTR_RO(mm_stat);
> static DEVICE_ATTR_RO(debug_stat);
>
> -static struct zram_entry *zram_entry_alloc(struct zram_meta *meta,
> +static u32 zram_calc_checksum(unsigned char *mem)
zram_dedup_checksum.
I want to use 'dedup' term all of dedup related to functions.
> +{
> + return jhash(mem, PAGE_SIZE, 0);
> +}
> +
> +static struct zram_entry *zram_entry_alloc(struct zram *zram,
> unsigned int len, gfp_t flags)
> {
> + struct zram_meta *meta = zram->meta;
> struct zram_entry *entry;
> + unsigned long handle;
>
> - entry = kzalloc(sizeof(*entry), flags);
> - if (!entry)
> + handle = zs_malloc(meta->mem_pool, len, flags);
> + if (!handle)
> return NULL;
Separate allocate zram_entry and dedeup.
IOW,
struct zram_entry *entry = zram_entry_alloc(zram, xxx);
if (!zram_dedup_init(zram, entry, xxx))
zram_entry_free(entry);
>
> - entry->handle = zs_malloc(meta->mem_pool, len, flags);
> - if (!entry->handle) {
> - kfree(entry);
> + entry = kzalloc(sizeof(*entry), flags);
> + if (!entry) {
> + zs_free(meta->mem_pool, handle);
> return NULL;
> }
>
> + entry->handle = handle;
> + RB_CLEAR_NODE(&entry->rb_node);
> + entry->refcount = 1;
> + entry->len = len;
> + atomic64_add(sizeof(*entry), &zram->stats.meta_data_size);
> +
> return entry;
> }
>
> -static inline void zram_entry_free(struct zram_meta *meta,
> - struct zram_entry *entry)
> +static void zram_entry_insert(struct zram *zram, struct zram_entry *new,
zram_dedup_insert or add?
> + u32 checksum)
> +{
> + struct zram_meta *meta = zram->meta;
> + struct zram_hash *hash;
> + struct rb_root *rb_root;
> + struct rb_node **rb_node, *parent = NULL;
> + struct zram_entry *entry;
> +
> + new->checksum = checksum;
> + hash = &meta->hash[checksum % meta->hash_size];
> + rb_root = &hash->rb_root;
> +
> + spin_lock(&hash->lock);
> + rb_node = &rb_root->rb_node;
> + while (*rb_node) {
> + parent = *rb_node;
> + entry = rb_entry(parent, struct zram_entry, rb_node);
> + if (checksum < entry->checksum)
> + rb_node = &parent->rb_left;
> + else if (checksum > entry->checksum)
> + rb_node = &parent->rb_right;
> + else
> + rb_node = &parent->rb_left;
> + }
> +
> + rb_link_node(&new->rb_node, parent, rb_node);
> + rb_insert_color(&new->rb_node, rb_root);
> + spin_unlock(&hash->lock);
> +}
> +
> +static bool zram_entry_match(struct zram *zram, struct zram_entry *entry,
Ditto.
zram_dedup_match?
> + unsigned char *mem)
> +{
> + bool match = false;
> + unsigned char *cmem;
> + struct zram_meta *meta = zram->meta;
> + struct zcomp_strm *zstrm;
> +
> + cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO);
> + if (entry->len == PAGE_SIZE) {
> + match = !memcmp(mem, cmem, PAGE_SIZE);
> + } else {
> + zstrm = zcomp_stream_get(zram->comp);
> + if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer))
> + match = !memcmp(mem, zstrm->buffer, PAGE_SIZE);
> + zcomp_stream_put(zram->comp);
> + }
> + zs_unmap_object(meta->mem_pool, entry->handle);
> +
> + return match;
> +}
> +
> +static inline void zram_entry_free(struct zram *zram, struct zram_meta *meta,
> + struct zram_entry *entry)
> {
> zs_free(meta->mem_pool, entry->handle);
> kfree(entry);
> + if (zram)
> + atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size);
> +}
> +
> +static bool zram_entry_put(struct zram *zram, struct zram_meta *meta,
> + struct zram_entry *entry, bool populated)
Please, separate entry and dedup part like zram_entry_alloc.
Then, Can't we remove 'populated'?
> +{
> + struct zram_hash *hash;
> + u32 checksum;
> +
> + if (!populated)
> + goto free;
> +
> + checksum = entry->checksum;
> + hash = &meta->hash[checksum % meta->hash_size];
> +
> + spin_lock(&hash->lock);
> + entry->refcount--;
> + if (!entry->refcount) {
> + populated = false;
> + rb_erase(&entry->rb_node, &hash->rb_root);
> + RB_CLEAR_NODE(&entry->rb_node);
> + }
> + spin_unlock(&hash->lock);
> +
> +free:
> + if (!populated)
> + zram_entry_free(zram, meta, entry);
> +
> + return populated;
> +}
> +
> +static struct zram_entry *zram_entry_get(struct zram *zram,
We can rename it to zram_entry_match.
> + unsigned char *mem, u32 checksum)
> +{
> + struct zram_meta *meta = zram->meta;
> + struct zram_hash *hash;
> + struct zram_entry *entry;
> + struct rb_node *rb_node;
> +
> + hash = &meta->hash[checksum % meta->hash_size];
> +
> + spin_lock(&hash->lock);
> + rb_node = hash->rb_root.rb_node;
> + while (rb_node) {
> + entry = rb_entry(rb_node, struct zram_entry, rb_node);
> + if (checksum == entry->checksum) {
> + entry->refcount++;
> + atomic64_add(entry->len, &zram->stats.dup_data_size);
> + spin_unlock(&hash->lock);
> +
> + if (zram_entry_match(zram, entry, mem))
__zram_entry_match? Or just open-code.
> + return entry;
> +
> + if (zram_entry_put(zram, meta, entry, true)) {
> + atomic64_sub(entry->len,
> + &zram->stats.dup_data_size);
> + }
> +
> + return NULL;
> + }
> +
> + if (checksum < entry->checksum)
> + rb_node = rb_node->rb_left;
> + else
> + rb_node = rb_node->rb_right;
> + }
> + spin_unlock(&hash->lock);
> +
> + return NULL;
> }
>
> static void zram_meta_free(struct zram_meta *meta, u64 disksize)
> @@ -459,18 +598,31 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize)
> if (!entry || zram_test_flag(meta, index, ZRAM_SAME))
> continue;
>
> - zram_entry_free(meta, entry);
> + zram_entry_put(NULL, meta, entry, true);
> }
>
> zs_destroy_pool(meta->mem_pool);
> + vfree(meta->hash);
> vfree(meta->table);
> kfree(meta);
> }
>
> +static void zram_meta_init(struct zram_meta *meta)
> +{
> + int i;
> + struct zram_hash *hash;
> +
> + for (i = 0; i < meta->hash_size; i++) {
> + hash = &meta->hash[i];
> + spin_lock_init(&hash->lock);
> + hash->rb_root = RB_ROOT;
> + }
> +}
> +
> static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
> {
> size_t num_pages;
> - struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL);
> + struct zram_meta *meta = kzalloc(sizeof(*meta), GFP_KERNEL);
>
> if (!meta)
> return NULL;
> @@ -482,15 +634,27 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
> goto out_error;
> }
>
> + meta->hash_size = num_pages >> ZRAM_HASH_SHIFT;
> + meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size);
> + meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size);
> + meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash));
> + if (!meta->hash) {
> + pr_err("Error allocating zram entry hash\n");
> + goto out_error;
> + }
> +
> meta->mem_pool = zs_create_pool(pool_name);
> if (!meta->mem_pool) {
> pr_err("Error creating memory pool\n");
> goto out_error;
> }
>
> + zram_meta_init(meta);
zram_dedup_init?
Can't we put above hash initialization routines to zram_meta_init?
> +
> return meta;
>
> out_error:
> + vfree(meta->hash);
> vfree(meta->table);
> kfree(meta);
> return NULL;
> @@ -520,7 +684,8 @@ static void zram_free_page(struct zram *zram, size_t index)
> if (!entry)
> return;
>
> - zram_entry_free(meta, entry);
> + if (zram_entry_put(zram, meta, entry, true))
> + atomic64_sub(entry->len, &zram->stats.dup_data_size);
Hmm,
I want to put dedup stat logic in dedup functions without exporting
higher level functions like zram_free_page.
So,
zram_dedup_put:
...
...
atomic64_sub(entry->len, &zram->stats.dup_data_size);
zram_free_page:
...
if (zram_dedup_put())
zram_entry_free
>
> atomic64_sub(zram_get_obj_size(meta, index),
> &zram->stats.compr_data_size);
> @@ -631,6 +796,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> struct zcomp_strm *zstrm = NULL;
> unsigned long alloced_pages;
> unsigned long element;
> + u32 checksum;
>
> page = bvec->bv_page;
> if (is_partial_io(bvec)) {
> @@ -674,6 +840,18 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> goto out;
> }
>
> + checksum = zram_calc_checksum(uncmem);
> + if (!entry) {
> + entry = zram_entry_get(zram, uncmem, checksum);
entry = zram_dedup_get
> + if (entry) {
> + if (!is_partial_io(bvec))
> + kunmap_atomic(user_mem);
> +
> + clen = entry->len;
> + goto found_duplication;
> + }
> + }
> +
> zstrm = zcomp_stream_get(zram->comp);
> ret = zcomp_compress(zstrm, uncmem, &clen);
> if (!is_partial_io(bvec)) {
> @@ -708,7 +886,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> * from the slow path and entry has already been allocated.
> */
> if (!entry) {
> - entry = zram_entry_alloc(meta, clen,
> + entry = zram_entry_alloc(zram, clen,
> __GFP_KSWAPD_RECLAIM | __GFP_NOWARN);
> }
>
> @@ -718,7 +896,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>
> atomic64_inc(&zram->stats.writestall);
>
> - entry = zram_entry_alloc(meta, clen, GFP_NOIO);
> + entry = zram_entry_alloc(zram, clen, GFP_NOIO);
> if (entry)
> goto compress_again;
>
> @@ -732,7 +910,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> update_used_max(zram, alloced_pages);
>
> if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> - zram_entry_free(meta, entry);
> + zram_entry_put(zram, meta, entry, false);
> ret = -ENOMEM;
> goto out;
> }
> @@ -750,7 +928,9 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> zcomp_stream_put(zram->comp);
> zstrm = NULL;
> zs_unmap_object(meta->mem_pool, entry->handle);
> + zram_entry_insert(zram, entry, checksum);
>
> +found_duplication:
Nit:
I prefer found_dup.
> /*
> * Free memory associated with this sector
> * before overwriting unused sectors.
> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> index a7ae46c..07d1f8d 100644
> --- a/drivers/block/zram/zram_drv.h
> +++ b/drivers/block/zram/zram_drv.h
> @@ -18,6 +18,7 @@
> #include <linux/rwsem.h>
> #include <linux/zsmalloc.h>
> #include <linux/crypto.h>
> +#include <linux/spinlock.h>
>
> #include "zcomp.h"
>
> @@ -45,6 +46,10 @@
> #define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
> (1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
>
> +/* One slot will contain 128 pages theoretically */
> +#define ZRAM_HASH_SHIFT 7
> +#define ZRAM_HASH_SIZE_MIN (1 << 10)
> +#define ZRAM_HASH_SIZE_MAX (1 << 31)
>
> /*
> * The lower ZRAM_FLAG_SHIFT bits of table.value is for
> @@ -70,6 +75,10 @@ enum zram_pageflags {
> /*-- Data structures */
>
> struct zram_entry {
> + struct rb_node rb_node;
> + u32 len;
> + u32 checksum;
> + unsigned long refcount;
It should be compiled out if users doesn't turn on the feature in Kconfig.
I hope you makes Kconfig option to [4/4].