Kernel 4.14 + has 100 times higher IO latency than Kernel 4.4 with raid1

From: Jinpu Wang
Date: Fri Aug 02 2019 - 10:49:10 EST


Hi All,

We found a problem regarding much higher IO latency when running
kernel 4.4.131 compare to 4.14.133, tried with latest upstream
5.3-rc2, same result.

Reproducer:
1 create md raid1 with 2 ram disks:
sudo mdadm -C /dev/md0 -l1 -n2 -e1.2 --bitmap=internal /dev/ram[0-1]
2 run fio command with rate_iops:
fio --rw=write --ioengine=libaio --iodepth=32 --size=1000MB
--rate_iops=5000 --direct=1 --numjobs=1 --runtime=20 --time_based
--name=write-test --filename=/dev/md0

result on 4.4 kernel:
root@ib2:~# fio --rw=write --ioengine=libaio --iodepth=32
--size=1000MB --rate_iops=5000 --direct=1 --numjobs=1 --runtime=20
--time_based --name=write-test --filename=/dev/md0
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1), CR=5000/0 IOPS: [W(1)] [100.0% done] [0KB/20008KB/0KB
/s] [0/5002/0 iops] [eta 00m:00s]
write-test: (groupid=0, jobs=1): err= 0: pid=3351: Fri Aug 2 15:31:26 2019
write: io=400004KB, bw=19999KB/s, iops=4999, runt= 20001msec
slat (usec): min=3, max=26, avg= 3.12, stdev= 0.36
clat (usec): min=0, max=116, avg= 2.04, stdev= 1.33
lat (usec): min=3, max=141, avg= 5.19, stdev= 1.39
clat percentiles (usec):
| 1.00th=[ 1], 5.00th=[ 2], 10.00th=[ 2], 20.00th=[ 2],
| 30.00th=[ 2], 40.00th=[ 2], 50.00th=[ 2], 60.00th=[ 2],
| 70.00th=[ 2], 80.00th=[ 2], 90.00th=[ 2], 95.00th=[ 3],
| 99.00th=[ 3], 99.50th=[ 3], 99.90th=[ 3], 99.95th=[ 3],
| 99.99th=[ 86]
bw (KB /s): min=20000, max=20008, per=100.00%, avg=20005.54, stdev= 3.74
lat (usec) : 2=3.37%, 4=96.60%, 10=0.01%, 20=0.01%, 50=0.01%
lat (usec) : 100=0.01%, 250=0.01%
cpu : usr=4.25%, sys=0.00%, ctx=198550, majf=0, minf=11
IO depths : 1=100.0%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=100001/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=400004KB, aggrb=19999KB/s, minb=19999KB/s, maxb=19999KB/s,
mint=20001msec, maxt=20001msec

Disk stats (read/write):
md0: ios=61/99539, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%

result on 5.3 kernel
root@ib1:/home/jwang# fio --rw=write --ioengine=libaio --iodepth=32
--size=1000MB --rate_iops=5 --direct=1 --numjobs=1 --runtime=20
--time_based --name=write-test --filename=/dev/md0
write-test: (g=0): rw=write, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
fio-2.2.10
Starting 1 process
Jobs: 1 (f=1), CR=5/0 IOPS: [W(1)] [100.0% done] [0KB/20KB/0KB /s]
[0/5/0 iops] [eta 00m:00s]
write-test: (groupid=0, jobs=1): err= 0: pid=1651: Fri Aug 2 17:16:18 2019
write: io=413696B, bw=20683B/s, iops=5, runt= 20001msec
slat (usec): min=2, max=51803, avg=1028.62, stdev=7250.96
clat (usec): min=0, max=91, avg=17.76, stdev=28.07
lat (usec): min=3, max=51892, avg=1046.50, stdev=7254.89
clat percentiles (usec):
| 1.00th=[ 0], 5.00th=[ 0], 10.00th=[ 0], 20.00th=[ 0],
| 30.00th=[ 1], 40.00th=[ 1], 50.00th=[ 1], 60.00th=[ 1],
| 70.00th=[ 19], 80.00th=[ 44], 90.00th=[ 68], 95.00th=[ 80],
| 99.00th=[ 88], 99.50th=[ 91], 99.90th=[ 91], 99.95th=[ 91],
| 99.99th=[ 91]
bw (KB /s): min= 20, max= 21, per=100.00%, avg=20.04, stdev= 0.21
lat (usec) : 2=67.33%, 10=0.99%, 20=1.98%, 50=11.88%, 100=17.82%
cpu : usr=0.00%, sys=0.00%, ctx=77, majf=0, minf=10
IO depths : 1=68.3%, 2=2.0%, 4=4.0%, 8=7.9%, 16=15.8%, 32=2.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=0/w=101/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
latency : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
WRITE: io=404KB, aggrb=20KB/s, minb=20KB/s, maxb=20KB/s,
mint=20001msec, maxt=20001msec

Disk stats (read/write):
md0: ios=0/100, merge=0/0, ticks=0/0, in_queue=0, util=0.00%,
aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0,
aggrutil=0.00%
ram0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
ram1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

During the tests the following kernel parameters are applied:
processor.max_cstate=0 idle=poll mitigations=off

Could anyone give us a hint, what could lead to such a huge difference?

Thank you!
Jack Wang