Re: [PATCH 2/3] zram: support page-based parallel write
From: Minchan Kim
Date: Fri Oct 07 2016 - 02:33:34 EST
Hi Sergey,
On Thu, Oct 06, 2016 at 05:29:15PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
>
> On (10/05/16 11:01), Minchan Kim wrote:
> [..]
> > 1. just changed the ordering of test execution - hoping to reduce testing time
> >    spent on block population before the first read, or on reading just zero pages
> > 2. used sync_on_close instead of direct I/O
> > 3. Don't use perf to avoid noise
> > 4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO for old behavior
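
(Re item 2: in fio terms that is a buffered job with, presumably, fsync_on_close=1
instead of direct=1 -- a rough sketch with guessed values below, not the exact
template the script generates:

  # buffered I/O, fsync on close instead of O_DIRECT (hypothetical values)
  [seq-write]
  filename=/dev/zram0
  rw=write
  bs=4k
  size=3g
  direct=0
  fsync_on_close=1

so dirty pages reach zram via bdi writeback and the final fsync, rather than
synchronously at submit time; the rand-write/read/mixed jobs would differ only
in rw=.)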
>
> ok, will use it in the tests below.
>
> > 1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
> > 2. modify script to disable aio via /sys/block/zram0/use_aio
> > ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
> >
> >                     no_aio     use_aio    use_aio/no_aio
> >   seq-write          380930      474325       124.52%
> >   rand-write         286183      357469       124.91%
> >   seq-read           266813      265731        99.59%
> >   rand-read          211747      210670        99.49%
> >   mixed-seq(R)       145750      171232       117.48%
> >   mixed-seq(W)       145736      171215       117.48%
> >   mixed-rand(R)      115355      125239       108.57%
> >   mixed-rand(W)      115371      125256       108.57%
>
>                     no_aio          use_aio
>
>   WRITE:        1432.9MB/s       1511.5MB/s
>   WRITE:        1173.9MB/s       1186.9MB/s
>   READ:          912699KB/s       912170KB/s
>   WRITE:         912497KB/s       911968KB/s
>   READ:          725658KB/s       726747KB/s
>   READ:          579003KB/s       594543KB/s
>   READ:          373276KB/s       373719KB/s
>   WRITE:         373572KB/s       374016KB/s
>
>   seconds elapsed   45.399702511    44.280199716
>
> > LZO compression is fast, so with one CPU queueing and three CPUs compressing,
> > it cannot saturate full CPU bandwidth. Nonetheless, it shows a 24% enhancement.
> > It could be more on a slow CPU, like an embedded one.
> >
> > I tested it with deflate. The result is a 300% enhancement.
> >
> >                     no_aio     use_aio    use_aio/no_aio
> >   seq-write           33598      109882       327.05%
> >   rand-write          32815      102293       311.73%
> >   seq-read           154323      153765        99.64%
> >   rand-read          129978      129241        99.43%
> >   mixed-seq(R)        15887       44995       283.22%
> >   mixed-seq(W)        15885       44990       283.22%
> >   mixed-rand(R)       25074       55491       221.31%
> >   mixed-rand(W)       25078       55499       221.31%
> >
> > So, I am curious about your test. Is my test in sync with yours? If you cannot
> > see an enhancement in job1, could you test with deflate? It seems your CPU is
> > really fast.
>
> interesting observation.
>
>                     no_aio          use_aio
>   WRITE:          47882KB/s       158931KB/s
>   WRITE:          47714KB/s       156484KB/s
>   READ:           42914KB/s       137997KB/s
>   WRITE:          42904KB/s       137967KB/s
>   READ:          333764KB/s       332828KB/s
>   READ:          293883KB/s       294709KB/s
>   READ:           51243KB/s       129701KB/s
>   WRITE:          51284KB/s       129804KB/s
>
>   seconds elapsed   480.869169882    181.678431855
>
> yes, looks like with lzo the CPU manages to process bdi writeback fast enough
> to keep the fio-template-static-buffer workers active.
>
> to prove this theory: direct=1 cures zram-deflate.
>
>                     no_aio          use_aio
>   WRITE:          41873KB/s        34257KB/s
>   WRITE:          41455KB/s        34087KB/s
>   READ:           36705KB/s        28960KB/s
>   WRITE:          36697KB/s        28954KB/s
>   READ:          327902KB/s       327270KB/s
>   READ:          316217KB/s       316886KB/s
>   READ:           35980KB/s        28131KB/s
>   WRITE:          36008KB/s        28153KB/s
>
>   seconds elapsed   515.575252170    629.114626795
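
(IIUC, that flips the job from buffered writeback, roughly

  direct=0
  fsync_on_close=1

where dirty pages are flushed later by the single bdi flush kworker, to

  direct=1

where each fio thread opens the device O_DIRECT and submits to zram itself, so
the one flush kworker stops being the bottleneck. Option names are fio's; the
exact values in your template may differ.)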
>
>
>
> as soon as the wb flush kworker can't keep up anymore, things go off the rails.
> most of the time, the fio-template-static-buffer workers are in D state, while
> the biggest bdi flush kworker is doing the job (a lot of it):
>
> PID USER PR NI VIRT RES %CPU %MEM TIME+ S COMMAND
> 6274 root 20 0 0.0m 0.0m 100.0 0.0 1:15.60 R [kworker/u8:1]
> 11169 root 20 0 718.1m 1.6m 16.6 0.0 0:01.88 D fio ././conf/fio-template-static-buffer
> 11171 root 20 0 718.1m 1.6m 3.3 0.0 0:01.15 D fio ././conf/fio-template-static-buffer
> 11170 root 20 0 718.1m 3.3m 2.6 0.1 0:00.98 D fio ././conf/fio-template-static-buffer
>
>
> and still working...
>
> 6274 root 20 0 0.0m 0.0m 100.0 0.0 3:05.49 R [kworker/u8:1]
> 12048 root 20 0 718.1m 1.6m 16.7 0.0 0:01.80 R fio ././conf/fio-template-static-buffer
> 12047 root 20 0 718.1m 1.6m 3.3 0.0 0:01.12 D fio ././conf/fio-template-static-buffer
> 12049 root 20 0 718.1m 1.6m 3.3 0.0 0:01.12 D fio ././conf/fio-template-static-buffer
> 12050 root 20 0 718.1m 1.6m 2.0 0.0 0:00.98 D fio ././conf/fio-template-static-buffer
>
> and working...
>
>
> [ 4159.338731] CPU: 0 PID: 105 Comm: kworker/u8:4
> [ 4159.338734] Workqueue: writeback wb_workfn (flush-254:0)
> [ 4159.338746] [<ffffffffa01d8cff>] zram_make_request+0x4a3/0x67b [zram]
> [ 4159.338748] [<ffffffff810543fe>] ? try_to_wake_up+0x201/0x213
> [ 4159.338750] [<ffffffff810ae9d3>] ? mempool_alloc+0x5e/0x124
> [ 4159.338752] [<ffffffff811a9922>] generic_make_request+0xb8/0x156
> [ 4159.338753] [<ffffffff811a9aaf>] submit_bio+0xef/0xf8
> [ 4159.338755] [<ffffffff81121a97>] submit_bh_wbc.isra.10+0x16b/0x178
> [ 4159.338757] [<ffffffff811223ec>] __block_write_full_page+0x1b2/0x2a6
> [ 4159.338758] [<ffffffff8112403e>] ? bh_submit_read+0x5a/0x5a
> [ 4159.338760] [<ffffffff81120f9a>] ? end_buffer_write_sync+0x36/0x36
> [ 4159.338761] [<ffffffff8112403e>] ? bh_submit_read+0x5a/0x5a
> [ 4159.338763] [<ffffffff811226d8>] block_write_full_page+0xf6/0xff
> [ 4159.338765] [<ffffffff81124342>] blkdev_writepage+0x13/0x15
> [ 4159.338767] [<ffffffff810b498c>] __writepage+0xe/0x26
> [ 4159.338768] [<ffffffff810b65aa>] write_cache_pages+0x28c/0x376
> [ 4159.338770] [<ffffffff810b497e>] ? __wb_calc_thresh+0x83/0x83
> [ 4159.338772] [<ffffffff810b66dc>] generic_writepages+0x48/0x67
> [ 4159.338773] [<ffffffff81124318>] blkdev_writepages+0x9/0xb
> [ 4159.338775] [<ffffffff81124318>] ? blkdev_writepages+0x9/0xb
> [ 4159.338776] [<ffffffff810b6716>] do_writepages+0x1b/0x24
> [ 4159.338778] [<ffffffff8111b12c>] __writeback_single_inode+0x3d/0x155
> [ 4159.338779] [<ffffffff8111b407>] writeback_sb_inodes+0x1c3/0x32c
> [ 4159.338781] [<ffffffff8111b5e1>] __writeback_inodes_wb+0x71/0xa9
> [ 4159.338783] [<ffffffff8111b7ce>] wb_writeback+0x10f/0x1a1
> [ 4159.338785] [<ffffffff8111be32>] wb_workfn+0x1c9/0x24c
> [ 4159.338786] [<ffffffff8111be32>] ? wb_workfn+0x1c9/0x24c
> [ 4159.338788] [<ffffffff8104a2e2>] process_one_work+0x1a4/0x2a7
> [ 4159.338790] [<ffffffff8104ae32>] worker_thread+0x23b/0x37c
> [ 4159.338792] [<ffffffff8104abf7>] ? rescuer_thread+0x2eb/0x2eb
> [ 4159.338793] [<ffffffff8104f285>] kthread+0xce/0xd6
> [ 4159.338794] [<ffffffff8104f1b7>] ? kthread_create_on_node+0x1ad/0x1ad
> [ 4159.338796] [<ffffffff8145ad12>] ret_from_fork+0x22/0x30
>
>
> so the question is -- can we move this parallelization out of zram
> and instead flush the bdi from more than one kthread? how bad would
> that be? can anyone else benefit from this?
Isn't it blk-mq you are referring to? With blk-mq, I have some concerns:
1. read speed degradation
2. it doesn't work with rw_page
3. more memory footprint from bio/request queue allocation
Having said that, it's worth looking into in more detail.
I will make time to study that approach and see what I can do
with it.
Thanks!
>
> [1] https://lwn.net/Articles/353844/
> [2] https://lwn.net/Articles/354852/
>
> -ss