Re: [PATCH 0/6] RFC v2: mm: gup/dma tracking

From: Tom Talpey
Date: Tue Feb 05 2019 - 08:38:16 EST


On 2/5/2019 3:22 AM, John Hubbard wrote:
On 2/4/19 5:41 PM, Tom Talpey wrote:
On 2/4/2019 12:21 AM, john.hubbard@xxxxxxxxx wrote:
From: John Hubbard <jhubbard@xxxxxxxxxx>


Performance: here is an fio run on an NVMe drive, using this for the fio
configuration file:

    [reader]
    direct=1
    ioengine=libaio
    blocksize=4096
    size=1g
    numjobs=1
    rw=read
    iodepth=64
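
To reproduce, save the above as a job file (the name is arbitrary; reader.fio
is used here) and run it directly:

    fio reader.fio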

reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=7011: Sun Feb 3 20:36:51 2019
   read: IOPS=190k, BW=741MiB/s (778MB/s)(1024MiB/1381msec)
    slat (nsec): min=2716, max=57255, avg=4048.14, stdev=1084.10
    clat (usec): min=20, max=12485, avg=332.63, stdev=191.77
     lat (usec): min=22, max=12498, avg=336.72, stdev=192.07
    clat percentiles (usec):
     |  1.00th=[  322],  5.00th=[  322], 10.00th=[  322], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  330], 90.00th=[  330], 95.00th=[  330],
     | 99.00th=[  478], 99.50th=[  717], 99.90th=[ 1074], 99.95th=[ 1090],
     | 99.99th=[12256]

These latencies are concerning. The best results we saw at the end of
November (previous approach) were MUCH flatter. These really start
spiking at three 9's, and are sky-high at four 9's. The "stdev" values
for clat and lat are about 10 times those of the previous run. There's some
kind of serious queuing contention here that wasn't there in November.
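For scale: size=1g at 4 KiB per I/O is 262,144 reads, so the 99.99th
percentile covers roughly the slowest ~26 requests, and those are landing
up around 12 ms.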

Hi Tom,

I think this latency problem is also there in the baseline kernel, but...


   bw (  KiB/s): min=730152, max=776512, per=99.22%, avg=753332.00, stdev=32781.47, samples=2
   iops        : min=182538, max=194128, avg=188333.00, stdev=8195.37, samples=2
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.07%, 500=99.26%, 750=0.38%
  lat (usec)   : 1000=0.02%
  lat (msec)   : 2=0.24%, 20=0.02%
  cpu          : usr=15.07%, sys=84.13%, ctx=10, majf=0, minf=74

System CPU 84% is roughly double the November results of 45%. Ouch.

That's my fault. First, I had a few extra, supposedly minor debug settings
in the .config, which I'm removing now; the run below uses the original
.config file from November. Second, I'm not sure I controlled the run
carefully enough.
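
For what it's worth, the controls I try to hold constant for runs like this
are the CPU frequency governor and which CPU the fio job lands on. Roughly
(the governor setting and CPU number are just illustrative, and reader.fio
is the job file name assumed above):

    sudo cpupower frequency-set -g performance   # fix the frequency governor
    fio --cpus_allowed=0 reader.fio              # pin the job to CPU 0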


Did you re-run the baseline on the new unpatched base kernel and can
we see the before/after?

Doing that now, I see:

-- No significant perf difference between before and after, but
-- Still high clat in the 99.99th

=======================================================================
Before: using commit 8834f5600cf3 ("Linux 5.0-rc5")
===================================================
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1829: Tue Feb 5 00:08:08 2019
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
    slat (nsec): min=1269, max=40309, avg=1493.66, stdev=534.83
    clat (usec): min=127, max=12249, avg=329.83, stdev=184.92
     lat (usec): min=129, max=12256, avg=331.35, stdev=185.06
    clat percentiles (usec):
     |  1.00th=[  326],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  347], 99.50th=[  519], 99.90th=[  529], 99.95th=[  537],
     | 99.99th=[12125]
   bw (  KiB/s): min=755032, max=781472, per=99.57%, avg=768252.00, stdev=18695.90, samples=2
   iops        : min=188758, max=195368, avg=192063.00, stdev=4673.98, samples=2
  lat (usec)   : 250=0.08%, 500=99.18%, 750=0.72%
  lat (msec)   : 20=0.02%
  cpu          : usr=12.30%, sys=46.83%, ctx=253554, majf=0, minf=74
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1359-1359msec

Disk stats (read/write):
  nvme0n1: ios=221246/0, merge=0/0, ticks=71556/0, in_queue=704, util=91.35%

=======================================================================
After:
=======================================================================
reader: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.3
Starting 1 process
Jobs: 1 (f=1)
reader: (groupid=0, jobs=1): err= 0: pid=1803: Mon Feb 4 23:58:07 2019
   read: IOPS=193k, BW=753MiB/s (790MB/s)(1024MiB/1359msec)
    slat (nsec): min=1276, max=41900, avg=1505.36, stdev=565.26
    clat (usec): min=177, max=12186, avg=329.88, stdev=184.03
     lat (usec): min=178, max=12192, avg=331.42, stdev=184.16
    clat percentiles (usec):
     |  1.00th=[  326],  5.00th=[  326], 10.00th=[  326], 20.00th=[  326],
     | 30.00th=[  326], 40.00th=[  326], 50.00th=[  326], 60.00th=[  326],
     | 70.00th=[  326], 80.00th=[  326], 90.00th=[  326], 95.00th=[  326],
     | 99.00th=[  359], 99.50th=[  498], 99.90th=[  537], 99.95th=[  627],
     | 99.99th=[12125]
   bw (  KiB/s): min=754656, max=781504, per=99.55%, avg=768080.00, stdev=18984.40, samples=2
   iops        : min=188664, max=195378, avg=192021.00, stdev=4747.51, samples=2
  lat (usec)   : 250=0.12%, 500=99.40%, 750=0.46%
  lat (msec)   : 20=0.02%
  cpu          : usr=12.44%, sys=47.05%, ctx=252127, majf=0, minf=73
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=262144,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=753MiB/s (790MB/s), 753MiB/s-753MiB/s (790MB/s-790MB/s), io=1024MiB (1074MB), run=1359-1359msec

Disk stats (read/write):
  nvme0n1: ios=221203/0, merge=0/0, ticks=71291/0, in_queue=704, util=91.19%

How's this look to you?

Ok, I'm satisfied the four-9's latency spike is not in your code. :-)
Results look good relative to the baseline. Thanks for double-checking!

Tom.