Re: [PATCH V6 00/30] block: support multipage bvec

From: Ming Lei
Date: Wed Jun 20 2018 - 21:17:35 EST


On Fri, Jun 15, 2018 at 02:59:19PM +0200, Gi-Oh Kim wrote:
> >
> > - bio size can be increased and it should improve some high-bandwidth IO
> > case in theory[4].
> >
>
> Hi,
>
> I would like to report that your patch set works well on my system based on v4.14.48.
> I thought the multipage bvec could improve the performance of my system.
> (FYI, my system runs v4.14.48 and provides a KVM-based virtualization service.)

Thanks for your test!

>
> So I back-ported your patches to v4.14.48.
> That went without any serious problems.
> I only needed to cherry-pick the "blk-merge: compute
> bio->bi_seg_front_size efficiently" and
> "block: move bio_alloc_pages() to bcache" patches before back-porting
> to prevent conflicts.

I am not sure I understand your point; you have to backport all of the patches.

> And I ran my own test suite to check the features of the md and RAID1 layers.
> There was no problem. All test cases passed.
> (If you want, I will send you the back-ported patches.)
>
> Then I ran two performance tests, as follows.
> To state the conclusion first: I failed to show any performance improvement
> from the patch set.
> Of course, my test cases may not be suitable for testing your patch set.
> Or maybe I ran the tests incorrectly.
> Please tell me which tools are suitable, and I will try them.
>
> 1. fio
>
> First I ran fio against a null_blk device to check the performance of the block layer.
> I am not sure this test is suitable for showing a performance
> improvement or degradation.
> Nevertheless, there was a small (-6%) performance degradation.
>
> If it is not too much trouble, please review my fio options and
> tell me if I used any wrong options.
> Then I will run the test again.
>
> 1.1 The following are my options for fio.
>
> gkim@ib1:~/pb-ltp/benchmark/fio$ cat go_local.sh
> #!/bin/bash
> echo "fio start : $(date)"
> echo "kernel info : $(uname -a)"
> echo "fio version : $(fio --version)"
>
> # set "none" io-scheduler
> modprobe -r null_blk
> modprobe null_blk
> echo "none" > /sys/block/nullb0/queue/scheduler
>
> FIO_OPTION="--direct=1 --rw=randrw:2 --time_based=1 --group_reporting \
> --ioengine=libaio --iodepth=64 --name=fiotest --numjobs=8 \
> --bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4 \
> --fadvise_hint=0 --iodepth_batch_submit=64 \
> --iodepth_batch_complete=64"
> # fio test null_blk device, so it is not necessary to run long.
> fio $FIO_OPTION --filename=/dev/nullb0 --runtime=600
>
> 1.2 Following is the result before porting.
>
> fio start : Mon Jun 11 04:30:01 CEST 2018
> kernel info : Linux ib1 4.14.48-1-pserver
> #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
> x86_64 GNU/Linux
> fio version : fio-2.2.10
> fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K,
> ioengine=libaio, iodepth=64
> ...
> fio-2.2.10
> Starting 8 processes
>
> fiotest: (groupid=0, jobs=8): err= 0: pid=1655: Mon Jun 11 04:40:02 2018
> read : io=7133.2GB, bw=12174MB/s, iops=1342.1K, runt=600001msec
> slat (usec): min=1, max=15750, avg=123.78, stdev=153.79
> clat (usec): min=0, max=15758, avg=24.70, stdev=77.93
> lat (usec): min=2, max=15782, avg=148.49, stdev=167.54
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6],
> | 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107],
> | 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416],
> | 99.99th=[ 2960]
> bw (MB /s): min= 1096, max= 2147, per=12.51%, avg=1522.69, stdev=253.89
> write: io=7131.3GB, bw=12171MB/s, iops=1343.6K, runt=600001msec
> slat (usec): min=1, max=15751, avg=124.73, stdev=154.11
> clat (usec): min=0, max=15758, avg=24.69, stdev=77.84
> lat (usec): min=2, max=15780, avg=149.43, stdev=167.82
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 6],
> | 70.00th=[ 22], 80.00th=[ 36], 90.00th=[ 72], 95.00th=[ 107],
> | 99.00th=[ 173], 99.50th=[ 203], 99.90th=[ 932], 99.95th=[ 1416],
> | 99.99th=[ 2960]
> bw (MB /s): min= 1080, max= 2121, per=12.51%, avg=1522.33, stdev=253.96
> lat (usec) : 2=21.63%, 4=37.80%, 10=2.12%, 20=6.43%, 50=16.70%
> lat (usec) : 100=8.86%, 250=6.07%, 500=0.17%, 750=0.08%, 1000=0.05%
> lat (msec) : 2=0.06%, 4=0.02%, 10=0.01%, 20=0.01%
> cpu : usr=22.39%, sys=64.19%, ctx=15425825, majf=0, minf=97
> IO depths : 1=1.8%, 2=1.8%, 4=8.8%, 8=14.4%, 16=12.3%, 32=41.7%, >=64=19.3%
> submit : 0=0.0%, 4=5.8%, 8=9.7%, 16=15.0%, 32=18.0%, 64=51.5%, >=64=0.0%
> complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
> issued : total=r=805764385/w=806127393/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: io=7133.2GB, aggrb=12174MB/s, minb=12174MB/s, maxb=12174MB/s,
> mint=600001msec, maxt=600001msec
> WRITE: io=7131.3GB, aggrb=12171MB/s, minb=12171MB/s, maxb=12171MB/s,
> mint=600001msec, maxt=600001msec
>
> Disk stats (read/write):
> nullb0: ios=442461761/442546060, merge=363197836/363473703,
> ticks=12280990/12452480, in_queue=2740, util=0.43%
>
> 1.3 Following is the result after porting.
>
> fio start : Fri Jun 15 12:42:47 CEST 2018
> kernel info : Linux ib1 4.14.48-1-pserver-mpbvec+ #12 SMP Fri Jun 15
> 12:21:36 CEST 2018 x86_64 GNU/Linux
> fio version : fio-2.2.10
> fiotest: (g=0): rw=randrw, bs=512-64K/512-64K/512-64K,
> ioengine=libaio, iodepth=64
> ...
> fio-2.2.10
> Starting 8 processes
> Jobs: 4 (f=0): [m(1),_(2),m(1),_(1),m(2),_(1)] [100.0% done]
> [8430MB/8444MB/0KB /s] [961K/963K/0 iops] [eta 00m:00s]
> fiotest: (groupid=0, jobs=8): err= 0: pid=14096: Fri Jun 15 12:52:48 2018
> read : io=6633.8GB, bw=11322MB/s, iops=1246.9K, runt=600005msec
> slat (usec): min=1, max=16939, avg=135.34, stdev=156.23
> clat (usec): min=0, max=16947, avg=26.10, stdev=78.50
> lat (usec): min=2, max=16957, avg=161.45, stdev=168.88
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5],
> | 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115],
> | 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416],
> | 99.99th=[ 2864]
> bw (MB /s): min= 1106, max= 2031, per=12.51%, avg=1416.05, stdev=201.81
> write: io=6631.1GB, bw=11318MB/s, iops=1247.5K, runt=600005msec
> slat (usec): min=1, max=16938, avg=136.48, stdev=156.54
> clat (usec): min=0, max=16947, avg=26.08, stdev=78.43
> lat (usec): min=2, max=16957, avg=162.58, stdev=169.15
> clat percentiles (usec):
> | 1.00th=[ 0], 5.00th=[ 1], 10.00th=[ 1], 20.00th=[ 1],
> | 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 5],
> | 70.00th=[ 23], 80.00th=[ 37], 90.00th=[ 79], 95.00th=[ 115],
> | 99.00th=[ 181], 99.50th=[ 211], 99.90th=[ 948], 99.95th=[ 1416],
> | 99.99th=[ 2864]
> bw (MB /s): min= 1084, max= 2044, per=12.51%, avg=1415.67, stdev=201.93
> lat (usec) : 2=20.98%, 4=38.82%, 10=2.15%, 20=5.08%, 50=16.91%
> lat (usec) : 100=8.75%, 250=6.91%, 500=0.19%, 750=0.09%, 1000=0.05%
> lat (msec) : 2=0.07%, 4=0.02%, 10=0.01%, 20=0.01%
> cpu : usr=21.02%, sys=65.53%, ctx=15321661, majf=0, minf=78
> IO depths : 1=1.9%, 2=1.9%, 4=9.5%, 8=13.6%, 16=11.2%, 32=42.1%, >=64=19.9%
> submit : 0=0.0%, 4=6.3%, 8=10.1%, 16=14.1%, 32=18.2%,
> 64=51.3%, >=64=0.0%
> complete : 0=0.0%, 4=0.1%, 8=0.0%, 16=0.1%, 32=0.1%, 64=100.0%, >=64=0.0%
> issued : total=r=748120019/w=748454509/d=0, short=r=0/w=0/d=0,
> drop=r=0/w=0/d=0
> latency : target=0, window=0, percentile=100.00%, depth=64
>
> Run status group 0 (all jobs):
> READ: io=6633.8GB, aggrb=11322MB/s, minb=11322MB/s, maxb=11322MB/s,
> mint=600005msec, maxt=600005msec
> WRITE: io=6631.1GB, aggrb=11318MB/s, minb=11318MB/s, maxb=11318MB/s,
> mint=600005msec, maxt=600005msec
>
> Disk stats (read/write):
> nullb0: ios=410911387/410974086, merge=337127604/337396176,
> ticks=12482050/12662790, in_queue=1780, util=0.27%
>
>
> 2. Unixbench
>
> Second, I ran Unixbench to check general performance.
> I think there is no difference before and after porting the patches.
> Unixbench might not be suitable for checking performance improvements
> of the block layer.
> If you tell me which tools are suitable, I will try them on my system.
>
> 2.1 Following is the result before porting.
>
> BYTE UNIX Benchmarks (Version 5.1.3)
>
> System: ib1: GNU/Linux
> OS: GNU/Linux -- 4.14.48-1-pserver --
> #4.14.48-1.1+feature+daily+update+20180607.0857+1bbde0b~deb8 SMP
> Machine: x86_64 (unknown)
> Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
> CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> 05:00:01 up 3 days, 16:20, 2 users, load average: 0.00, 0.11,
> 1.11; runlevel 2018-06-07
>
> ------------------------------------------------------------------------
> Benchmark Run: Mon Jun 11 2018 05:00:01 - 05:28:54
> 8 CPUs in system; running 1 parallel copy of tests
>
> Dhrystone 2 using register variables 47158867.7 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 3878.8 MWIPS (15.2 s, 7 samples)
> Execl Throughput 9203.9 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1490834.8 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 388784.2 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 3744780.2 KBps (30.0 s, 2 samples)
> Pipe Throughput 2682620.1 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 263786.5 lps (10.0 s, 7 samples)
> Process Creation 19674.0 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 16121.5 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 5623.5 lpm (60.0 s, 2 samples)
> System Call Overhead 4068991.3 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 47158867.7 4041.0
> Double-Precision Whetstone 55.0 3878.8 705.2
> Execl Throughput 43.0 9203.9 2140.4
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1490834.8 3764.7
> File Copy 256 bufsize 500 maxblocks 1655.0 388784.2 2349.1
> File Copy 4096 bufsize 8000 maxblocks 5800.0 3744780.2 6456.5
> Pipe Throughput 12440.0 2682620.1 2156.4
> Pipe-based Context Switching 4000.0 263786.5 659.5
> Process Creation 126.0 19674.0 1561.4
> Shell Scripts (1 concurrent) 42.4 16121.5 3802.2
> Shell Scripts (8 concurrent) 6.0 5623.5 9372.5
> System Call Overhead 15000.0 4068991.3 2712.7
> ========
> System Benchmarks Index Score 2547.7
>
> ------------------------------------------------------------------------
> Benchmark Run: Mon Jun 11 2018 05:28:54 - 05:57:07
> 8 CPUs in system; running 8 parallel copies of tests
>
> Dhrystone 2 using register variables 234727639.9 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 35350.9 MWIPS (10.7 s, 7 samples)
> Execl Throughput 43811.3 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1401373.1 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 366033.9 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 4360829.6 KBps (30.0 s, 2 samples)
> Pipe Throughput 12875165.6 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 2431725.6 lps (10.0 s, 7 samples)
> Process Creation 97360.8 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 58879.6 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 9232.5 lpm (60.0 s, 2 samples)
> System Call Overhead 9497958.7 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 234727639.9 20113.8
> Double-Precision Whetstone 55.0 35350.9 6427.4
> Execl Throughput 43.0 43811.3 10188.7
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1401373.1 3538.8
> File Copy 256 bufsize 500 maxblocks 1655.0 366033.9 2211.7
> File Copy 4096 bufsize 8000 maxblocks 5800.0 4360829.6 7518.7
> Pipe Throughput 12440.0 12875165.6 10349.8
> Pipe-based Context Switching 4000.0 2431725.6 6079.3
> Process Creation 126.0 97360.8 7727.0
> Shell Scripts (1 concurrent) 42.4 58879.6 13886.7
> Shell Scripts (8 concurrent) 6.0 9232.5 15387.5
> System Call Overhead 15000.0 9497958.7 6332.0
> ========
> System Benchmarks Index Score 7803.5
>
>
> 2.2 Following is the result after porting.
>
> BYTE UNIX Benchmarks (Version 5.1.3)
>
> System: ib1: GNU/Linux
> OS: GNU/Linux -- 4.14.48-1-pserver-mpbvec+ -- #12 SMP Fri Jun 15
> 12:21:36 CEST 2018
> Machine: x86_64 (unknown)
> Language: en_US.utf8 (charmap="UTF-8", collate="UTF-8")
> CPU 0: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 1: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 2: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 3: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 4: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 5: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 6: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> CPU 7: Intel(R) Xeon(R) CPU E3-1245 v5 @ 3.50GHz (7008.0 bogomips)
> Hyper-Threading, x86-64, MMX, Physical Address Ext,
> SYSENTER/SYSEXIT, SYSCALL/SYSRET, Intel virtualization
> 13:16:11 up 50 min, 1 user, load average: 0.00, 1.40, 3.46;
> runlevel 2018-06-15
>
> ------------------------------------------------------------------------
> Benchmark Run: Fri Jun 15 2018 13:16:11 - 13:45:04
> 8 CPUs in system; running 1 parallel copy of tests
>
> Dhrystone 2 using register variables 47103754.6 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 3886.3 MWIPS (15.1 s, 7 samples)
> Execl Throughput 8965.0 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1510285.9 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 395196.9 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 3802788.0 KBps (30.0 s, 2 samples)
> Pipe Throughput 2670169.1 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 275093.8 lps (10.0 s, 7 samples)
> Process Creation 19707.1 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 16046.8 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 5600.8 lpm (60.0 s, 2 samples)
> System Call Overhead 4104142.0 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 47103754.6 4036.3
> Double-Precision Whetstone 55.0 3886.3 706.6
> Execl Throughput 43.0 8965.0 2084.9
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1510285.9 3813.9
> File Copy 256 bufsize 500 maxblocks 1655.0 395196.9 2387.9
> File Copy 4096 bufsize 8000 maxblocks 5800.0 3802788.0 6556.5
> Pipe Throughput 12440.0 2670169.1 2146.4
> Pipe-based Context Switching 4000.0 275093.8 687.7
> Process Creation 126.0 19707.1 1564.1
> Shell Scripts (1 concurrent) 42.4 16046.8 3784.6
> Shell Scripts (8 concurrent) 6.0 5600.8 9334.6
> System Call Overhead 15000.0 4104142.0 2736.1
> ========
> System Benchmarks Index Score 2560.0
>
> ------------------------------------------------------------------------
> Benchmark Run: Fri Jun 15 2018 13:45:04 - 14:13:17
> 8 CPUs in system; running 8 parallel copies of tests
>
> Dhrystone 2 using register variables 237271982.6 lps (10.0 s, 7 samples)
> Double-Precision Whetstone 35186.8 MWIPS (10.7 s, 7 samples)
> Execl Throughput 42557.8 lps (30.0 s, 2 samples)
> File Copy 1024 bufsize 2000 maxblocks 1403922.0 KBps (30.0 s, 2 samples)
> File Copy 256 bufsize 500 maxblocks 367436.5 KBps (30.0 s, 2 samples)
> File Copy 4096 bufsize 8000 maxblocks 4380468.3 KBps (30.0 s, 2 samples)
> Pipe Throughput 12872664.6 lps (10.0 s, 7 samples)
> Pipe-based Context Switching 2451404.5 lps (10.0 s, 7 samples)
> Process Creation 97788.2 lps (30.0 s, 2 samples)
> Shell Scripts (1 concurrent) 58505.9 lpm (60.0 s, 2 samples)
> Shell Scripts (8 concurrent) 9195.4 lpm (60.0 s, 2 samples)
> System Call Overhead 9467372.2 lps (10.0 s, 7 samples)
>
> System Benchmarks Index Values BASELINE RESULT INDEX
> Dhrystone 2 using register variables 116700.0 237271982.6 20331.8
> Double-Precision Whetstone 55.0 35186.8 6397.6
> Execl Throughput 43.0 42557.8 9897.2
> File Copy 1024 bufsize 2000 maxblocks 3960.0 1403922.0 3545.3
> File Copy 256 bufsize 500 maxblocks 1655.0 367436.5 2220.2
> File Copy 4096 bufsize 8000 maxblocks 5800.0 4380468.3 7552.5
> Pipe Throughput 12440.0 12872664.6 10347.8
> Pipe-based Context Switching 4000.0 2451404.5 6128.5
> Process Creation 126.0 97788.2 7761.0
> Shell Scripts (1 concurrent) 42.4 58505.9 13798.6
> Shell Scripts (8 concurrent) 6.0 9195.4 15325.6
> System Call Overhead 15000.0 9467372.2 6311.6
> ========
> System Benchmarks Index Score 7794.3

At least for now, BIO_MAX_PAGES can be fixed at 256 even when
CONFIG_THP_SWAP is enabled; otherwise 2 pages may have to be allocated
to hold the bvec table. So tests with THP_SWAP may show an improvement.

Also, filesystems may support IO to/from THP, and multipage bvec should
improve that case too.

In the long term, there is an opportunity to improve fs code by
allocating a bvec table of only 'nr_segment' entries instead of
'nr_page' entries, because mm often allocates physically contiguous
pages for the same process.

So this patchset is just a start, and at the current stage I am
focusing on making it stable, since storing only the multipage
segment, instead of each page, is the correct approach.

Thanks again for your test.

Thanks,
Ming