Re: [PATCH 2/3] zram: support page-based parallel write

From: Minchan Kim
Date: Tue Oct 04 2016 - 22:02:28 EST

Next message: Adam Ford: "[PATCH V2] OMAPDSS: Kconfig: Add HDMI for OMAP4 and OMAP5 dependencies"
Previous message: Linus Torvalds: "Re: [RFC] fs: add userspace critical mounts event support"
In reply to: Minchan Kim: "Re: [PATCH 2/3] zram: support page-based parallel write"
Next in thread: Sergey Senozhatsky: "Re: [PATCH 2/3] zram: support page-based parallel write"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Sergey,

On Tue, Oct 04, 2016 at 01:43:14PM +0900, Sergey Senozhatsky wrote:

< snip >

> TEST
> ****
>
> new tests results; same tests, same conditions, same .config.
> 4-way test:
> - BASE zram, fio direct=1
> - BASE zram, fio fsync_on_close=1
> - NEW zram, fio direct=1
> - NEW zram, fio fsync_on_close=1
>
>
>
> and what I see is that:
> - new zram is x3 times slower when we do a lot of direct=1 IO
> and
> - 10% faster when we use buffered IO (fsync_on_close); but not always;
> for instance, test execution time is longer (a reproducible behavior)
> when the number of jobs equals the number of CPUs - 4.
>
>
>
> if flushing is a problem for new zram during direct=1 test, then I would
> assume that writing a huge number of small files (creat/write 4k/close)
> would probably have same fsync_on_close=1 performance as direct=1.
>
>
> ENV
> ===
>
> x86_64 SMP (4 CPUs), "bare zram" 3g, lzo, static compression buffer.
>
>
> TEST COMMAND
> ============
>
> ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX={NEW, OLD} FIO_LOOPS=2 ./zram-fio-test.sh
>
>
> EXECUTED TESTS
> ==============
>
> - [seq-read]
> - [rand-read]
> - [seq-write]
> - [rand-write]
> - [mixed-seq]
> - [mixed-rand]
>
>
> fio-perf-o-meter.sh test-fio-zram-OLD test-fio-zram-OLD-flush test-fio-zram-NEW test-fio-zram-NEW-flush
> Processing test-fio-zram-OLD
> Processing test-fio-zram-OLD-flush
> Processing test-fio-zram-NEW
> Processing test-fio-zram-NEW-flush
>
>           BASE          BASE                 NEW           NEW
>           direct=1      fsync_on_close=1     direct=1      fsync_on_close=1
>
> #jobs1
> READ:     2345.1MB/s    2177.2MB/s           2373.2MB/s    2185.8MB/s
> READ:     1948.2MB/s    1417.7MB/s           1987.7MB/s    1447.4MB/s
> WRITE:    1292.7MB/s    1406.1MB/s           275277KB/s    1521.1MB/s
> WRITE:    1047.5MB/s    1143.8MB/s           257140KB/s    1202.4MB/s
> READ:     429530KB/s    779523KB/s           175450KB/s    782237KB/s
> WRITE:    429840KB/s    780084KB/s           175576KB/s    782800KB/s
> READ:     414074KB/s    408214KB/s           164091KB/s    383426KB/s
> WRITE:    414402KB/s    408539KB/s           164221KB/s    383730KB/s

I tested your benchmark for the jobs1 case on my 4-CPU machine with the diff below.

Nothing else is different.

1. Changed the order of test execution so the write tests run first. I hope
this reduces testing time, because the read tests no longer need a separate
block population pass before the first read and do not read just zero pages.
2. Used fsync_on_close instead of direct IO
3. Did not use perf, to avoid noise
4. echo 0 > /sys/block/zram0/use_aio to test synchronous IO (the old behavior)
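
Step 4 can be wrapped in a small helper so that a run fails early when the
knob is missing. This is only a sketch: set_use_aio is a hypothetical name,
and it assumes use_aio accepts 0 (synchronous) and 1 (asynchronous).

```shell
#!/bin/bash
# Hypothetical helper: select synchronous (0) or asynchronous (1) writes.
# Assumption: the use_aio attribute accepts exactly these two values.
set_use_aio() {
    local value=$1
    # Default to the attribute used in this thread; overridable for testing.
    local attr=${2:-/sys/block/zram0/use_aio}

    case "$value" in
        0|1) ;;
        *) echo "use_aio must be 0 or 1" >&2; return 1 ;;
    esac

    # Fail early if the attribute is absent (patch not applied) or read-only.
    if [ ! -w "$attr" ]; then
        echo "cannot write $attr" >&2
        return 1
    fi

    echo "$value" > "$attr"
}
```

Calling set_use_aio 0 before setting disksize would match the echo added to
create_zram in the diff.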

diff --git a/conf/fio-template-static-buffer b/conf/fio-template-static-buffer
index 1a9a473..22ddee8 100644
--- a/conf/fio-template-static-buffer
+++ b/conf/fio-template-static-buffer
@@ -1,7 +1,7 @@
[global]
bs=${BLOCK_SIZE}k
ioengine=sync
-direct=1
+fsync_on_close=1
nrfiles=${NRFILES}
size=${SIZE}
numjobs=${NUMJOBS}
@@ -14,18 +14,18 @@ new_group
group_reporting
threads=1

-[seq-read]
-rw=read
-
-[rand-read]
-rw=randread
-
[seq-write]
rw=write

[rand-write]
rw=randwrite

+[seq-read]
+rw=read
+
+[rand-read]
+rw=randread
+
[mixed-seq]
rw=rw

diff --git a/zram-fio-test.sh b/zram-fio-test.sh
index 39c11b3..ca2d065 100755
--- a/zram-fio-test.sh
+++ b/zram-fio-test.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash

# Sergey Senozhatsky. sergey.senozhatsky@xxxxxxxxx
@@ -37,6 +37,7 @@ function create_zram
echo $ZRAM_COMP_ALG > /sys/block/zram0/comp_algorithm
cat /sys/block/zram0/comp_algorithm

+ echo 0 > /sys/block/zram0/use_aio
echo $ZRAM_SIZE > /sys/block/zram0/disksize
if [ $? != 0 ]; then
return -1
@@ -137,7 +138,7 @@ function main
echo "#jobs$i fio" >> $LOG

BLOCK_SIZE=4 SIZE=100% NUMJOBS=$i NRFILES=$i FIO_LOOPS=$FIO_LOOPS \
- $PERF stat -o $LOG-perf-stat $FIO ./$FIO_TEMPLATE >> $LOG
+ $FIO ./$FIO_TEMPLATE > $LOG

echo -n "perfstat jobs$i" >> $LOG
cat $LOG-perf-stat >> $LOG

And I got the following results.

1. ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=async FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh
2. modify script to disable aio via /sys/block/zram0/use_aio
ZRAM_SIZE=3G ZRAM_COMP_ALG=lzo LOG_SUFFIX=sync FIO_LOOPS=2 MAX_ITER=1 ./zram-fio-test.sh

seq-write 380930 474325 124.52%
rand-write 286183 357469 124.91%
seq-read 266813 265731 99.59%
rand-read 211747 210670 99.49%
mixed-seq(R) 145750 171232 117.48%
mixed-seq(W) 145736 171215 117.48%
mixed-rand(R) 115355 125239 108.57%
mixed-rand(W) 115371 125256 108.57%

LZO compression is fast, and with one CPU queueing while three CPUs compress,
the test cannot saturate the full CPU bandwidth. Nonetheless, it shows a 24%
improvement for writes. The gain could be larger on a slow CPU, such as an
embedded one.
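
For reference, the percentage column in these tables can be reproduced from
the two throughput columns. A minimal sketch, assuming the percentage is the
second value divided by the first; ratio_pct is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the second throughput as a percentage of the first.
ratio_pct() {
    local base=$1 new=$2
    awk -v b="$base" -v n="$new" 'BEGIN {
        if (b == 0) exit 1          # avoid dividing by zero
        printf "%.2f\n", n * 100 / b
    }'
}

ratio_pct 380930 474325   # seq-write row, lzo:     prints 124.52
ratio_pct 33598 109882    # seq-write row, deflate: prints 327.05
```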

I also tested it with deflate. Write throughput is about 3x.

seq-write 33598 109882 327.05%
rand-write 32815 102293 311.73%
seq-read 154323 153765 99.64%
rand-read 129978 129241 99.43%
mixed-seq(R) 15887 44995 283.22%
mixed-seq(W) 15885 44990 283.22%
mixed-rand(R) 25074 55491 221.31%
mixed-rand(W) 25078 55499 221.31%

So, I am curious about your test.
Is my test in sync with yours? If you cannot see an improvement in the jobs1
case, could you test with deflate? It seems your CPU is really fast.
