Re: [PATCH v3] zram: remove global tb_lock with fine grain lock

From: Weijie Yang
Date: Tue Jun 03 2014 - 03:59:13 EST


On Mon, Jun 2, 2014 at 8:43 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> Hello Weijie,
>
> Thanks for resending.
> Below are mostly nitpicks.
>
> On Fri, May 30, 2014 at 04:34:44PM +0800, Weijie Yang wrote:
>> Currently, we use a rwlock, tb_lock, to protect concurrent access to
>> the whole zram meta table. However, according to the actual access model,
>> there is only a small chance that upper-layer users access the same
>> table[index] concurrently, so the current lock granularity is too coarse.
>>
>> The idea of the optimization is to change the lock granularity from the
>> whole meta table to a single table entry (table -> table[index]), so that
>> we still protect concurrent access to the same table[index] while allowing
>> maximum concurrency.
>> With this in mind, several kinds of locks which could be used as a
>> per-entry lock were tested and compared:
>>
>> Test environment:
>> x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
>> kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.
>>
>> iozone test:
>> iozone -t 4 -R -r 16K -s 200M -I +Z
>> (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)
>>
>> Test                  base       CAS  spinlock    rwlock  bit_spinlock
>> ----------------------------------------------------------------------
>> Initial write      1381094   1425435   1422860   1423075       1421521
>> Rewrite            1529479   1641199   1668762   1672855       1654910
>> Read               8468009  11324979  11305569  11117273      10997202
>> Re-read            8467476  11260914  11248059  11145336      10906486
>> Reverse Read       6821393   8106334   8282174   8279195       8109186
>> Stride read        7191093   8994306   9153982   8961224       9004434
>> Random read        7156353   8957932   9167098   8980465       8940476
>> Mixed workload     4172747   5680814   5927825   5489578       5972253
>> Random write       1483044   1605588   1594329   1600453       1596010
>> Pwrite             1276644   1303108   1311612   1314228       1300960
>> Pread              4324337   4632869   4618386   4457870       4500166
>>
>> To increase the chance of concurrent access to the same table[index],
>> zram was given a small disksize (10MB) and the threads ran with a large
>> loop count.
>>
>> fio test:
>> fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
>> --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
>> --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
>> --name=seq-read --rw=read --stonewall --name=seq-readwrite
>> --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
>> (10MB zram raw block device, take the average of 10 tests, KB/s)
>>
>> Test                  base       CAS  spinlock    rwlock  bit_spinlock
>> ----------------------------------------------------------------------
>> seq-write           933789    999357   1003298    995961       1001958
>> seq-read           5634130   6577930   6380861   6243912       6230006
>> seq-rw             1405687   1638117   1640256   1633903       1634459
>> rand-rw            1386119   1614664   1617211   1609267       1612471
>>
>> All the optimization methods perform better than the base; however,
>> it is hard to say which method is the most appropriate.
>>
>> On the other hand, zram is mostly used on small embedded systems, so we
>> don't want to increase the memory footprint.
>>
>> This patch picks the bit_spinlock method and packs the object size and
>> page flags into an unsigned long table.value, so that memory overhead
>> does not increase on either 32-bit or 64-bit systems.
>>
>> In addition, even though different kinds of locks perform differently,
>> we can ignore the difference, because:
>> - if zram is used as a swap device, the swap subsystem prevents
>>   concurrent access to the same swap slot;
>> - if zram is used as a block device with a filesystem on it, the
>>   filesystem and the page cache also mostly prevent concurrent access
>>   to the same block.
>
> Nice description. :)
>
>>
>> Changes since v1: https://lkml.org/lkml/2014/5/5/1
>> - replace CAS method with bit_spinlock method
>> - rename zram_test_flag() to zram_test_zero()
>> - add some comments
>>
>> Changes since v2: https://lkml.org/lkml/2014/5/15/113
>> - change size type from int to size_t in zram_set_obj_size()
>> - refactor zram_set_obj_size() to make it readable
>> - add comments
>>
>> Signed-off-by: Weijie Yang <weijie.yang@xxxxxxxxxxx>
>> ---
>> drivers/block/zram/zram_drv.c | 89 ++++++++++++++++++++++++-----------------
>> drivers/block/zram/zram_drv.h | 22 +++++++---
>> 2 files changed, 68 insertions(+), 43 deletions(-)
>>
>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> index 9849b52..166e882 100644
>> --- a/drivers/block/zram/zram_drv.c
>> +++ b/drivers/block/zram/zram_drv.c
>> @@ -179,23 +179,32 @@ static ssize_t comp_algorithm_store(struct device *dev,
>> return len;
>> }
>>
>> -/* flag operations needs meta->tb_lock */
>> -static int zram_test_flag(struct zram_meta *meta, u32 index,
>> - enum zram_pageflags flag)
>> +static int zram_test_zero(struct zram_meta *meta, u32 index)
>
> Why do you want to create a specific function for zero?
> It is just one use case among various potential flags.
> Do you want to create new functions whenever we define a new flag?
> Or do you have something else in mind?
>

As you can see, this patch adds a new flag, ZRAM_ACCESS, which is accessed
in a different way from ZRAM_ZERO: it is only taken and released through
bit_spin_lock()/bit_spin_unlock(). I think it is hard to use one general
method to access all kinds of flags, so to avoid ambiguity or wrong usage
I use a specific function for each flag.
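
To illustrate the difference, here is a minimal user-space sketch, not the
patch itself. It assumes C11 atomics as a stand-in for the kernel's
bit_spin_lock()/bit_spin_unlock() (preemption handling is not modeled), and
the struct and function names (struct entry, entry_*) are hypothetical. The
lock bit is only ever acquired and released, while the zero flag and the
size are plain state that is read and written under that lock.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define ZRAM_FLAG_SHIFT 24
#define BIT(nr) (1UL << (nr))

enum zram_pageflags {
    ZRAM_ZERO = ZRAM_FLAG_SHIFT + 1,  /* state bit: page is all zeros */
    ZRAM_ACCESS,                      /* lock bit: entry is being accessed */
};

/* Both flags must fit in one unsigned long, even on 32-bit. */
static_assert(ZRAM_ACCESS < sizeof(unsigned long) * CHAR_BIT,
              "flag bits must fit in the packed value");

/* Hypothetical stand-in for struct table; the patch uses a plain unsigned long. */
struct entry {
    unsigned long handle;
    _Atomic unsigned long value;
};

/* ZRAM_ACCESS is used as a lock: set it atomically, succeed if it was clear. */
static bool entry_trylock(struct entry *e)
{
    unsigned long old = atomic_fetch_or_explicit(&e->value, BIT(ZRAM_ACCESS),
                                                 memory_order_acquire);
    return !(old & BIT(ZRAM_ACCESS));
}

static void entry_lock(struct entry *e)
{
    while (!entry_trylock(e))
        ;  /* spin until the current holder releases the entry */
}

static void entry_unlock(struct entry *e)
{
    atomic_fetch_and_explicit(&e->value, ~BIT(ZRAM_ACCESS),
                              memory_order_release);
}

/* ZRAM_ZERO is ordinary state: call these only with the entry locked. */
static bool entry_test_zero(struct entry *e)
{
    return atomic_load(&e->value) & BIT(ZRAM_ZERO);
}

static void entry_set_zero(struct entry *e)
{
    atomic_fetch_or(&e->value, BIT(ZRAM_ZERO));
}

static void entry_clear_zero(struct entry *e)
{
    atomic_fetch_and(&e->value, ~BIT(ZRAM_ZERO));
}

/* The low ZRAM_FLAG_SHIFT bits hold the object size. */
static size_t entry_get_obj_size(struct entry *e)
{
    return atomic_load(&e->value) & (BIT(ZRAM_FLAG_SHIFT) - 1);
}

/* Replace the size while keeping every flag bit, including the lock bit. */
static void entry_set_obj_size(struct entry *e, size_t size)
{
    unsigned long flags = atomic_load(&e->value) >> ZRAM_FLAG_SHIFT;

    atomic_store(&e->value, (flags << ZRAM_FLAG_SHIFT) | size);
}
```

A generic test/set/clear helper taking any flag would also accept
ZRAM_ACCESS, which could clear the lock bit behind the holder's back. That
is the kind of wrong usage the per-flag functions are meant to rule out.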

>
>> {
>> - return meta->table[index].flags & BIT(flag);
>> + return meta->table[index].value & BIT(ZRAM_ZERO);
>> }
>>
>> -static void zram_set_flag(struct zram_meta *meta, u32 index,
>> - enum zram_pageflags flag)
>> +static void zram_set_zero(struct zram_meta *meta, u32 index)
>> {
>> - meta->table[index].flags |= BIT(flag);
>> + meta->table[index].value |= BIT(ZRAM_ZERO);
>> }
>>
>> -static void zram_clear_flag(struct zram_meta *meta, u32 index,
>> - enum zram_pageflags flag)
>> +static void zram_clear_zero(struct zram_meta *meta, u32 index)
>> {
>> - meta->table[index].flags &= ~BIT(flag);
>> + meta->table[index].value &= ~BIT(ZRAM_ZERO);
>> +}
>> +
>> +static size_t zram_get_obj_size(struct zram_meta *meta, u32 index)
>> +{
>> + return meta->table[index].value & (BIT(ZRAM_FLAG_SHIFT) - 1);
>> +}
>> +
>> +static void zram_set_obj_size(struct zram_meta *meta,
>> + u32 index, size_t size)
>> +{
>> + unsigned long flags = meta->table[index].value >> ZRAM_FLAG_SHIFT;
>> +
>> + meta->table[index].value = (flags << ZRAM_FLAG_SHIFT) | size;
>> }
>>
>> static inline int is_partial_io(struct bio_vec *bvec)
>> @@ -255,7 +264,6 @@ static struct zram_meta *zram_meta_alloc(u64 disksize)
>> goto free_table;
>> }
>>
>> - rwlock_init(&meta->tb_lock);
>> return meta;
>>
>> free_table:
>> @@ -304,19 +312,24 @@ static void handle_zero_page(struct bio_vec *bvec)
>> flush_dcache_page(page);
>> }
>>
>> -/* NOTE: caller should hold meta->tb_lock with write-side */
>> +/*
>> + * To protect concurrent access to the same index entry,
>> + * caller should hold this table index entry's bit_spinlock to
>> + * indicate this index entry is accessing.
>> + */
>> static void zram_free_page(struct zram *zram, size_t index)
>> {
>> struct zram_meta *meta = zram->meta;
>> unsigned long handle = meta->table[index].handle;
>> + size_t size;
>>
>> if (unlikely(!handle)) {
>> /*
>> * No memory is allocated for zero filled pages.
>> * Simply clear zero page flag.
>> */
>> - if (zram_test_flag(meta, index, ZRAM_ZERO)) {
>> - zram_clear_flag(meta, index, ZRAM_ZERO);
>> + if (zram_test_zero(meta, index)) {
>> + zram_clear_zero(meta, index);
>> atomic64_dec(&zram->stats.zero_pages);
>> }
>> return;
>> @@ -324,27 +337,28 @@ static void zram_free_page(struct zram *zram, size_t index)
>>
>> zs_free(meta->mem_pool, handle);
>>
>> - atomic64_sub(meta->table[index].size, &zram->stats.compr_data_size);
>> + size = zram_get_obj_size(meta, index);
>> + atomic64_sub(size, &zram->stats.compr_data_size);
>> atomic64_dec(&zram->stats.pages_stored);
>>
>> meta->table[index].handle = 0;
>> - meta->table[index].size = 0;
>> + zram_set_obj_size(meta, index, 0);
>> }
>>
>> static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
>> {
>> - int ret = 0;
>
> Unnecessary change.

I wanted to reduce stack usage, so I put the ret and size variables next
to each other. It will be helpful on 64-bit systems.

>> unsigned char *cmem;
>> struct zram_meta *meta = zram->meta;
>> unsigned long handle;
>> - u16 size;
>
> I'm not sure it's a good idea to use size_t instead of u16, because we apparently
> have a limit on the range of size due to packing it into an unsigned long.
> u16 shows that limitation more clearly, and someone might find a problem
> more easily in the future if we break something subtle.
>
>> + size_t size;
>> + int ret = 0;
>>
>> - read_lock(&meta->tb_lock);
>> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>> handle = meta->table[index].handle;
>> - size = meta->table[index].size;
>> + size = zram_get_obj_size(meta, index);
>>
>> - if (!handle || zram_test_flag(meta, index, ZRAM_ZERO)) {
>> - read_unlock(&meta->tb_lock);
>> + if (!handle || zram_test_zero(meta, index)) {
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>> clear_page(mem);
>> return 0;
>> }
>> @@ -355,7 +369,7 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
>> else
>> ret = zcomp_decompress(zram->comp, cmem, size, mem);
>> zs_unmap_object(meta->mem_pool, handle);
>> - read_unlock(&meta->tb_lock);
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>>
>> /* Should NEVER happen. Return bio error if it does. */
>> if (unlikely(ret)) {
>> @@ -376,14 +390,14 @@ static int zram_bvec_read(struct zram *zram, struct bio_vec *bvec,
>> struct zram_meta *meta = zram->meta;
>> page = bvec->bv_page;
>>
>> - read_lock(&meta->tb_lock);
>> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>> if (unlikely(!meta->table[index].handle) ||
>> - zram_test_flag(meta, index, ZRAM_ZERO)) {
>> - read_unlock(&meta->tb_lock);
>> + zram_test_zero(meta, index)) {
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>> handle_zero_page(bvec);
>> return 0;
>> }
>> - read_unlock(&meta->tb_lock);
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>>
>> if (is_partial_io(bvec))
>> /* Use a temporary buffer to decompress the page */
>> @@ -461,10 +475,10 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> if (page_zero_filled(uncmem)) {
>> kunmap_atomic(user_mem);
>> /* Free memory associated with this sector now. */
>> - write_lock(&zram->meta->tb_lock);
>> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>> zram_free_page(zram, index);
>> - zram_set_flag(meta, index, ZRAM_ZERO);
>> - write_unlock(&zram->meta->tb_lock);
>> + zram_set_zero(meta, index);
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>>
>> atomic64_inc(&zram->stats.zero_pages);
>> ret = 0;
>> @@ -514,12 +528,12 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>> * Free memory associated with this sector
>> * before overwriting unused sectors.
>> */
>> - write_lock(&zram->meta->tb_lock);
>> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>> zram_free_page(zram, index);
>>
>> meta->table[index].handle = handle;
>> - meta->table[index].size = clen;
>> - write_unlock(&zram->meta->tb_lock);
>> + zram_set_obj_size(meta, index, clen);
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>>
>> /* Update stats */
>> atomic64_add(clen, &zram->stats.compr_data_size);
>> @@ -560,6 +574,7 @@ static void zram_bio_discard(struct zram *zram, u32 index,
>> int offset, struct bio *bio)
>> {
>> size_t n = bio->bi_iter.bi_size;
>> + struct zram_meta *meta = zram->meta;
>>
>> /*
>> * zram manages data in physical block size units. Because logical block
>> @@ -584,9 +599,9 @@ static void zram_bio_discard(struct zram *zram, u32 index,
>> * Discard request can be large so the lock hold times could be
>> * lengthy. So take the lock once per page.
>> */
>> - write_lock(&zram->meta->tb_lock);
>> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>> zram_free_page(zram, index);
>> - write_unlock(&zram->meta->tb_lock);
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>> index++;
>> n -= PAGE_SIZE;
>> }
>> @@ -804,9 +819,9 @@ static void zram_slot_free_notify(struct block_device *bdev,
>> zram = bdev->bd_disk->private_data;
>> meta = zram->meta;
>>
>> - write_lock(&meta->tb_lock);
>> + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value);
>> zram_free_page(zram, index);
>> - write_unlock(&meta->tb_lock);
>> + bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
>> atomic64_inc(&zram->stats.notify_free);
>> }
>>
>> diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
>> index 7f21c14..71bc4ad 100644
>> --- a/drivers/block/zram/zram_drv.h
>> +++ b/drivers/block/zram/zram_drv.h
>> @@ -51,10 +51,22 @@ static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
>> #define ZRAM_SECTOR_PER_LOGICAL_BLOCK \
>> (1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
>>
>> -/* Flags for zram pages (table[page_no].flags) */
>> +/*
>> + * The lower ZRAM_FLAG_SHIFT bits of table.value is for
>> + * object size (excluding header), the higher bits is for
>> + * zram_pageflags. By this means, it won't increase any
>> + * memory overhead on both 32-bit and 64-bit system.
>
> The comment "By this means, ~ 64-bit system" is unnecessary, because
> someone who reads this line without knowing the history cannot tell
> what the old structure was.

I will remove it.

>> + * zram is mostly used on small embedded system, so we
>> + * don't want to increase memory footprint. That is why
>> + * we pack size and flag into table.value.
>> + */
>
> IMHO, the following would be clearer, though I'm not sure how it reads to native speakers. ;-)
>
> * zram is mainly used for memory efficiency so we want to keep memory
> * footprint small so we can squeeze size and flags into a field.
> * The lower ZRAM_FLAG_SHIFT bits is for object size (excluding header),
> * the higher bits is for zram_pageflags.
>

That is a better comment.

>> +#define ZRAM_FLAG_SHIFT 24
>
> Why is it 24? We have used for 16-bit for size.
> Do you think it's too small for size?

The reason I chose 24 bits and size_t is that when I checked PAGE_SHIFT
across architectures, I found that on some of them, such as powerpc and
hexagon, PAGE_SHIFT can be 18 or 20, so I think 16 bits could be too small.
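
The arithmetic can be checked with a small user-space sketch. It takes a
full page as the upper bound on the stored object size, which is my
assumption here, and the helper names (page_size_for, field_max,
size_field_fits) are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* The shifts below go up to 24, so size_t must be wider than that. */
static_assert(sizeof(size_t) * CHAR_BIT > 24,
              "size_t too narrow for this sketch");

/* Page size in bytes for a given PAGE_SHIFT. */
static size_t page_size_for(unsigned int page_shift)
{
    return (size_t)1 << page_shift;
}

/* Largest value an unsigned field of the given width can hold. */
static size_t field_max(unsigned int bits)
{
    return ((size_t)1 << bits) - 1;
}

/* True if an object of up to one full page fits in a size field that wide. */
static bool size_field_fits(unsigned int page_shift, unsigned int bits)
{
    return page_size_for(page_shift) <= field_max(bits);
}
```

With PAGE_SHIFT 12 a page is 4096 bytes and fits in 16 bits. With
PAGE_SHIFT 18 it is 262144 bytes, which exceeds the 16-bit maximum of
65535, while 24 bits hold up to 16777215 and so also cover PAGE_SHIFT 20
(1048576 bytes).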

>> +
>> +/* Flags for zram pages (table[page_no].value) */
>> enum zram_pageflags {
>> /* Page consists entirely of zeros */
>> - ZRAM_ZERO,
>> + ZRAM_ZERO = ZRAM_FLAG_SHIFT + 1,
>> + ZRAM_ACCESS, /* page is now accessed */
>>
>> __NR_ZRAM_PAGEFLAGS,
>> };
>> @@ -64,9 +76,8 @@ enum zram_pageflags {
>> /* Allocated for each disk page */
>> struct table {
>> unsigned long handle;
>> - u16 size; /* object size (excluding header) */
>> - u8 flags;
>> -} __aligned(4);
>> + unsigned long value;
>> +};
>>
>> struct zram_stats {
>> atomic64_t compr_data_size; /* compressed size of pages stored */
>> @@ -81,7 +92,6 @@ struct zram_stats {
>> };
>>
>> struct zram_meta {
>> - rwlock_t tb_lock; /* protect table */
>> struct table *table;
>> struct zs_pool *mem_pool;
>> };
>> --
>> 1.7.10.4
>>
>>
>
> --
> Kind regards,
> Minchan Kim