Re: [PATCH] zram: remove global tb_lock by using lock-free CAS
From: Weijie Yang
Date: Wed May 07 2014 - 05:16:58 EST
On Wed, May 7, 2014 at 4:57 PM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> On Wed, May 07, 2014 at 03:51:35PM +0800, Weijie Yang wrote:
>> On Tue, May 6, 2014 at 6:22 AM, Davidlohr Bueso <davidlohr@xxxxxx> wrote:
>> > On Mon, 2014-05-05 at 13:46 -0700, Andrew Morton wrote:
>> >> On Mon, 05 May 2014 11:00:44 -0700 Davidlohr Bueso <davidlohr@xxxxxx> wrote:
>> >>
>> >> > > > @@ -339,12 +338,14 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
>> >> > > > unsigned long handle;
>> >> > > > u16 size;
>> >> > > >
>> >> > > > - read_lock(&meta->tb_lock);
>> >> > > > + while(atomic_cmpxchg(&meta->table[index].state, IDLE, ACCESS) != IDLE)
>> >> > > > + cpu_relax();
>> >> > > > +
>> >> > >
>> >> > > So... this might be dumb question, but this looks like a spinlock
>> >> > > implementation.
>> >> > >
>> >> > > What advantage does this have over a standard spinlock?
>> >> >
>> >> > I was wondering the same thing. Furthermore by doing this you'll loose
>> >> > the benefits of sharing the lock... your numbers do indicate that it is
>> >> > for the better. Also, note that hopefully rwlock_t will soon be updated
>> >> > to be fair and perform up to par with spinlocks, something which is long
>> >> > overdue. So you could reduce the critical region by implementing the
>> >> > same granularity, just don't implement your own locking schemes, like
>> >> > this.
>>
>> Actually, the main reason I use a CAS rather than a standard lock here is
>> that I want to minimize the meta table memory overhead. A tiny reason is
>> my fuzzy memory that CAS is more efficient than spinlock (please correct me
>> if I am wrong).
>>
>> Anyway, I changed the CAS to spinlock and rwlock, re-test them:
>>
>> Test lock-free spinlock rwlock
>> ------------------------------------------------------
>> Initial write 1424141.62 1426372.84 1423019.21
>> Rewrite 1652504.81 1623307.14 1653682.04
>> Read 11404668.35 11242885.05 10938125.00
>> Re-read 11555483.75 11253906.6 10837773.50
>> Reverse Read 8394478.17 8277250.34 7768057.39
>> Stride read 9372229.95 9010498.53 8692871.77
>> Random read 9187221.90 8988080.55 8661184.60
>> Mixed workload 5843370.85 5414729.54 5451055.03
>> Random write 1608947.04 1572276.64 1588866.51
>> Pwrite 1311055.32 1302463.04 1302001.06
>> Pread 4652056.11 4555802.18 4469672.34
>
> I'd like to clear it out.
> The spinlock and rwlock you mentioned is per-meta entry lock like state
> you added or global lock for meta? If it's latter, rwlock means base?
The spinlock and rwlock is per-meta entry lock like state.
Because the base data is added in the first mail, I don't mention
them here.
>>
>> And I cann't say which one is the best, they have the similar performance.
>
> Most popular use of zram is the in-memory swap for small embedded system
> so I don't want to increase memory footprint without good reason although
> it makes synthetic benchmark. Alhought it's 1M for 1G, it isn't small if we
> consider compression ratio and real free memory after boot(But data I have
> an interest is mixed workload enhancement. It would be important for heavy
> memory pressure for latency so it attractives me a lot. Anyway, I need number
> for back up the justification with real swap usecase rather than zram-blk.
> Recently, I have considered per-process reclaim based on zram so maybe I will
> have a test for that).
> But recently, I have received private mail from some server folks to use
> zram-blk, not zram-swap so in case of that, such enhancement would be
> desirable so my point is I'm not saying the drop of the patch and let's
> find proper solution to meet both and gather more data.
>
>>
>> Wait, iozone will create temporary files for every test thread, so there is no
>> possibility that these threads access the same table[index] concurrenctly.
>> So, I use fio to test the raw zram block device.
>> To enhance the possibility of access the same table[index] conflictly, I set zram
>> with a small disksize(10M) and let thread run with large loop count.
>>
>> On the same test machine, the fio test command is:
>> fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
>> --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
>> --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
>> --name=seq-read --rw=read --stonewall --name=seq-readwrite
>> --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
>>
>> Test base lock-free spinlock rwlock
>> ------------------------------------------------------
>> seq-write 935109.2 999580.5 998134.8 994384.6
>> seq-read 5598064.6 6444011.5 6243184.6 6197514.2
>> seq-rw 1403963.0 1635673.0 1633823.0 1635972.2
>> rand-rw 1389864.4 1612520.4 1613403.6 1612129.8
>
> What's the difference between base and rwlock?
> Base means global rwlock while rwlock means per-meta entry rwlock?
>
Base means global rwlock, it is the 3.15.0-rc3 code.
rwlock means per-meta entry rwlock.
Sorry to confuse you.
>>
>> This result(KB/s, average of 5 tests) shows the performance improvement
>> on base version, however, I cann't say which method is the best.
>>
>> >>
>> >> It sounds like seqlocks will match this access pattern pretty well?
>> >
>> > Indeed. And after a closer look, except for zram_slot_free_notify(),
>> > that lock is always shared. So, unless fine graining it implies taking
>> > the lock exclusively like in this patch (if so, that needs to be
>> > explicitly documented in the changelog), we would ideally continue to
>> > share it. That _should_ provide nicer performance numbers when using the
>> > correct lock.
>> >
>>
>> Andrew mentioned seqlocks, however, I think it is hard the use seqlocks here
>> after I recheck the codes. No matter use it as a meta global lock or a
>> table[index] lock. The main reason is the writer will free the handle rather than
>> just change some value.
>> If I misunderstand you, please let me know.
>>
>> Now, I am in a delimma. For minimizing the memory overhead, I like to use CAS.
>> However, it is not a standard way.
>>
>> Any complaint or suggestions are welcomed.
>>
>> Regards,
>>
>> >
>> >
>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>
>
> --
> Kind regards,
> Minchan Kim
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/