Re: [bug report] iommu_dma_unmap_sg() is very slow then running IO from remote numa node

From: John Garry
Date: Thu Jul 22 2021 - 06:05:13 EST


On 22/07/2021 08:58, Ming Lei wrote:
> On Wed, Jul 21, 2021 at 12:07:22PM +0100, John Garry wrote:
>> On 21/07/2021 10:59, Ming Lei wrote:
>>>> I have now removed that from the tree, so please re-pull.
>>> Now the kernel can be built successfully, but I don't see an obvious improvement
>>> on the reported issue:
>>>
>>> [root@ampere-mtjade-04 ~]# uname -a
>>> Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_smmu_fix+ #2 SMP Wed Jul 21 05:49:03 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
>>>
>>> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 10 1 /dev/nvme1n1 4k
>>> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=10 --numjobs=1 --rw=randread --name=test --group_reporting
>>> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
>>> fio-3.27
>>> Starting 1 process
>>> Jobs: 1 (f=1): [r(1)][100.0%][r=1503MiB/s][r=385k IOPS][eta 00m:00s]
>>> test: (groupid=0, jobs=1): err= 0: pid=3143: Wed Jul 21 05:58:14 2021
>>> read: IOPS=384k, BW=1501MiB/s (1573MB/s)(14.7GiB/10001msec)
>> I am not sure what baseline you used previously, but you were getting 327K
>> then, so at least this would be an improvement.
> It looks like the improvement isn't from your patches, please see the test result on
> v5.14-rc2:
>
> [root@ampere-mtjade-04 ~]# uname -a
> Linux ampere-mtjade-04.khw4.lab.eng.bos.redhat.com 5.14.0-rc2_linus #3 SMP Thu Jul 22 03:41:24 EDT 2021 aarch64 aarch64 aarch64 GNU/Linux
> [root@ampere-mtjade-04 ~]# taskset -c 0 ~/git/tools/test/nvme/io_uring 20 1 /dev/nvme1n1 4k
> + fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme1n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting
> test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=64
> fio-3.27
> Starting 1 process
> Jobs: 1 (f=1): [r(1)][100.0%][r=1489MiB/s][r=381k IOPS][eta 00m:00s]
> test: (groupid=0, jobs=1): err= 0: pid=3099: Thu Jul 22 03:53:04 2021
> read: IOPS=381k, BW=1487MiB/s (1559MB/s)(29.0GiB/20001msec)

I'm a bit surprised at that.

Anyway, I don't see the issue you are seeing on my system. In general, running from different nodes doesn't make a huge difference. But note that the NVMe device is on NUMA node #2 of my 4-node system, and I assume that the IOMMU is also located in that node.
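In case it helps, this is roughly how I check where the disk sits relative to the CPUs being tested before picking taskset targets (nothing special, just sysfs; swap nvme0n1 and node2 for whatever device/node you have):

# NUMA node the NVMe PCI function is attached to (-1 would mean no affinity reported)
cat /sys/block/nvme0n1/device/device/numa_node

# CPUs local to that node, for pinning the fio job near to (or away from) the device
cat /sys/devices/system/node/node2/cpulist
lscpu | grep NUMA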

sudo taskset -c 0 fio/fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=479k

---
sudo taskset -c 4 fio/fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=307k

---
sudo taskset -c 32 fio/fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=566k

---
sudo taskset -c 64 fio/fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=488k

---
sudo taskset -c 96 fio/fio --bs=4k --ioengine=io_uring --fixedbufs --registerfiles --hipri --iodepth=64 --iodepth_batch_submit=16 --iodepth_batch_complete_min=16 --filename=/dev/nvme0n1 --direct=1 --runtime=20 --numjobs=1 --rw=randread --name=test --group_reporting

read: IOPS=508k


If you check below, you can see that cpu4 services an NVMe irq. From checking htop, that cpu is at 100% load during the test, which is what I put the performance drop (vs cpu0) down to.
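If you want to confirm the same on your side, something like this shows which CPUs are taking the NVMe completion interrupts and how busy a given CPU is while fio runs (assuming the queue irqs show up as nvme*q* in /proc/interrupts and you have sysstat's mpstat installed):

# per-CPU interrupt counts for the nvme queue irqs
grep nvme0q /proc/interrupts

# per-CPU utilisation (incl. %irq/%soft) for cpu4 while the test is running
mpstat -P 4 1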

Here's some system info:

HW queue irq affinities:
PCI name is 81:00.0: nvme0n1
-eirq 298, cpu list 67, effective list 67
-eirq 299, cpu list 32-38, effective list 35
-eirq 300, cpu list 39-45, effective list 39
-eirq 301, cpu list 46-51, effective list 46
-eirq 302, cpu list 52-57, effective list 52
-eirq 303, cpu list 58-63, effective list 60
-eirq 304, cpu list 64-69, effective list 68
-eirq 305, cpu list 70-75, effective list 70
-eirq 306, cpu list 76-80, effective list 76
-eirq 307, cpu list 81-85, effective list 84
-eirq 308, cpu list 86-90, effective list 86
-eirq 309, cpu list 91-95, effective list 92
-eirq 310, cpu list 96-101, effective list 100
-eirq 311, cpu list 102-107, effective list 102
-eirq 312, cpu list 108-112, effective list 108
-eirq 313, cpu list 113-117, effective list 116
-eirq 314, cpu list 118-122, effective list 118
-eirq 315, cpu list 123-127, effective list 124
-eirq 316, cpu list 0-5, effective list 4
-eirq 317, cpu list 6-11, effective list 6
-eirq 318, cpu list 12-16, effective list 12
-eirq 319, cpu list 17-21, effective list 20
-eirq 320, cpu list 22-26, effective list 22
-eirq 321, cpu list 27-31, effective list 28
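That listing comes from a local debug script, but roughly the same information can be pulled straight out of procfs, something like the below (again assuming the nvme0q* naming; effective_affinity_list depends on the kernel exposing it):

for irq in $(grep nvme0q /proc/interrupts | awk -F: '{print $1}'); do
        # configured affinity vs the CPU actually chosen to service the irq
        echo "irq $irq, cpu list $(cat /proc/irq/$irq/smp_affinity_list)," \
             "effective list $(cat /proc/irq/$irq/effective_affinity_list)"
done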


john@ubuntu:~$ lscpu | grep NUMA
NUMA node(s): 4
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
NUMA node2 CPU(s): 64-95
NUMA node3 CPU(s): 96-127

john@ubuntu:~$ lspci | grep -i non
81:00.0 Non-Volatile memory controller: Huawei Technologies Co., Ltd. Device 0123 (rev 45)

cat /sys/block/nvme0n1/device/device/numa_node
2

[ 52.968495] nvme 0000:81:00.0: Adding to iommu group 5
[ 52.980484] nvme nvme0: pci function 0000:81:00.0
[ 52.999881] nvme nvme0: 23/0/0 default/read/poll queues
[ 53.019821] nvme0n1: p1

john@ubuntu:~$ uname -a
Linux ubuntu 5.14.0-rc2-dirty #297 SMP PREEMPT Thu Jul 22 09:23:33 BST 2021 aarch64 aarch64 aarch64 GNU/Linux

Thanks,
John