On 01/07/2024 10:42, Ping Gan wrote:

Hey Ping Gan,

On 26/06/2024 11:28, Ping Gan wrote:
When running nvmf on an SMP platform, the current nvme target's RDMA
and TCP transports use kworkers to handle IO. But if there is other
heavy workload in the system (eg: on kubernetes), the competition
between the kworkers and that workload is fierce. And since the
kworkers are scheduled by the OS randomly, it is difficult to control
OS resources and also to tune performance. If the target supports using
a dedicated polling task to handle IO, it is easier to control OS
resources and to get good performance. So it makes sense to add a
polling task to the nvmet-rdma and nvmet-tcp modules.

This is NOT the way to go here.
Both rdma and tcp are driven from workqueue context, using bound
workqueues.
So there are two ways to go here:
1. Add a generic port cpuset and use that to direct traffic to the
   appropriate set of cores (i.e. select an appropriate comp_vector for
   rdma and add an appropriate steering rule for tcp).
2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users
   to control these UNBOUND workqueues' cpumask via sysfs (see the
   sketch below).
(2) will not steer interrupts away from the other workloads' cpus, but
the handlers may run on a set of dedicated cpus.
(1) is a better solution, but harder to implement.
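
As a rough sketch of what option (2) could look like in nvmet-tcp (the
unbound_wq parameter and the nvmet_tcp_alloc_wq() helper below are made
up for illustration; alloc_workqueue() and the WQ_UNBOUND/WQ_SYSFS
flags are the existing workqueue API):

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *nvmet_tcp_wq;

/* Hypothetical opt-in knob, for illustration only. */
static bool unbound_wq;
module_param(unbound_wq, bool, 0444);

static int nvmet_tcp_alloc_wq(void)
{
	unsigned int flags = WQ_HIGHPRI | WQ_MEM_RECLAIM;

	/*
	 * WQ_UNBOUND detaches the work from the submitting cpu;
	 * WQ_SYSFS exposes the workqueue so its cpumask can be set via
	 * /sys/devices/virtual/workqueue/nvmet_tcp_wq/cpumask.
	 */
	if (unbound_wq)
		flags |= WQ_UNBOUND | WQ_SYSFS;

	nvmet_tcp_wq = alloc_workqueue("nvmet_tcp_wq", flags, 0);
	return nvmet_tcp_wq ? 0 : -ENOMEM;
}

With WQ_SYSFS set, the workqueue's cpumask and nice attributes appear
under /sys/devices/virtual/workqueue/, so an admin could confine the IO
handlers to a dedicated set of cores, although this does not by itself
move the interrupts, as noted for (2) above.
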
You should also look into nvmet-fc (and nvmet-loop for that matter).
I suggest that you focus on that instead of what you proposed.

hi Sagi Grimberg,
Thanks for your reply. Actually we had tried the first approach you
suggested, but we found the performance was poor when using spdk as
the initiator.

What is the source of your poor performance?

Before these patches, we had used linux's RPS to forward the packets
to a fixed cpu set for nvmet-tcp. But even when we did that, we still
could not eliminate the competition between softirq and the workqueue,
since the nvme target's kworker is bound to the socket's cpu, which is
taken from the skb. Besides that, we found the workqueue's wait latency
was very high even when we enabled polling on nvmet-tcp via the
idle_poll_period_usecs module parameter. So when the initiator is in
polling mode, the target's workqueue is the bottleneck. Below is the
work item wait latency histogram from a test on our cluster (each node
has 4 NUMA nodes, 96 cores, 192G memory and one dual-port mellanox
CX4LX (25Gbps x 2) ethernet adapter; randrw with 1M IO size), with RPS
steering to 6 cpu cores. The system's CPU and memory were about 80%
used.
ogden-brown:~ #/usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
01:06:59
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 | |
16 -> 31 : 10 | |
32 -> 63 : 3 | |
64 -> 127 : 2 | |
128 -> 255 : 0 | |
256 -> 511 : 5 | |
512 -> 1023 : 12 | |
1024 -> 2047 : 26 |* |
2048 -> 4095 : 34 |* |
4096 -> 8191 : 350 |************ |
8192 -> 16383 : 625 |******************************|
16384 -> 32767 : 244 |********* |
32768 -> 65535 : 39 |* |
01:07:00
usecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 0 | |
4 -> 7 : 4 | |
8 -> 15 : 3 | |
16 -> 31 : 8 | |
32 -> 63 : 10 | |
64 -> 127 : 3 | |
128 -> 255 : 6 | |
256 -> 511 : 8 | |
512 -> 1023 : 20 |* |
1024 -> 2047 : 19 |* |
2048 -> 4095 : 57 |** |
4096 -> 8191 : 325 |**************** |
8192 -> 16383 : 647 |******************************|
16384 -> 32767 : 228 |*********** |
32768 -> 65535 : 43 |** |
65536 -> 131071 : 1 | |
And the bandwidth of a node was only 3100MB/s. When we used the patch
and enabled 6 polling tasks, the bandwidth reached 4000MB/s. It's a
good improvement.

You know this patch is not only resolving the OS resource competition
issue, but also a perf issue. We have analyzed that if we keep using a
workqueue (kworker) on the target side while the initiator is a polling
driver (eg: spdk), then the workqueue/kworker target is the bottleneck,
since every nvmf request may incur a wait latency from being queued on
the workqueue until it starts being processed.

That is incorrect, the work context polls the cq until it either drains
it completely, or exhausts a quota of IB_POLL_BUDGET_WORKQUEUE (or
NVMET_TCP_IO_WORK_BUDGET). Not every command gets its own workqueue
queuing delay.
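
For context, the nvmet-tcp work handler is roughly shaped like the
sketch below (an abridged paraphrase of drivers/nvme/target/tcp.c, with
error handling and the idle_poll_period_usecs re-arming logic omitted),
which is why one work execution can service many commands:

static void nvmet_tcp_io_work(struct work_struct *w)
{
	struct nvmet_tcp_queue *queue =
		container_of(w, struct nvmet_tcp_queue, io_work);
	bool pending;
	int ops = 0;

	do {
		pending = false;

		/* Receive PDUs/data; 'ops' counts units of work done. */
		if (nvmet_tcp_try_recv(queue, NVMET_TCP_RECV_BUDGET, &ops) > 0)
			pending = true;

		/* Send queued responses/data. */
		if (nvmet_tcp_try_send(queue, NVMET_TCP_SEND_BUDGET, &ops) > 0)
			pending = true;
	} while (pending && ops < NVMET_TCP_IO_WORK_BUDGET);

	/* Budget exhausted but more work remains: requeue ourselves. */
	if (pending)
		queue_work_on(queue_cpu(queue), nvmet_tcp_wq, &queue->io_work);
}

When idle_poll_period_usecs is set, the real handler additionally keeps
requeuing itself for a while even when idle, which is the module
parameter polling mentioned earlier in the thread.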

And, what does the spdk initiator have to do with it? Didn't
understand...

Yes, the target workqueue implementation will poll up to a quota; but
when the workload was high we found that many work items waited too
long (some of them from several ms to hundreds of ms, as shown in the
histogram above). We use the spdk initiator (in polling mode) to send
read/write IOs to nvme disks on a remote node of a kubernetes cluster.
And the latency can be traced with bcc's wqlat tool
(https://github.com/iovisor/bcc/blob/master/tools/wqlat.py).
We think the latency is a disaster for the polling driver's data plane,
right?

If you need a target that polls all the time, you should probably
resort to spdk.
If there is room for optimization in nvmet we'll gladly take it, but
this is not the way to go IMO.

Yes, in the beginning we did use spdk as the polling-mode target
driver, but we suffered from the spdk target not supporting disk hot
plug/unplug well; sometimes it caused data loss when we did disk hot
plug/unplug. So we switched to the kernel target driver, because in
production the customer's data security is the first priority. And the
kernel target has no polling-mode driver, so we implemented these
patches.

So we think adding a polling task mode on the nvmet side to handle IO
really does make sense; what's your opinion about this?

I personally think that adding a polling kthread is questionable.
However there is a precedent, io_uring sqthreads. So please look into
what is done there. I don't mind having something like IB_POLL_IOTASK
(or io_task threads in nvmet-tcp) if it's done correctly (leverages
common code).
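
For reference, the io_uring sqthread precedent works like this: with
IORING_SETUP_SQPOLL the kernel creates a dedicated sq-poll kthread per
ring, optionally pinned to one cpu with IORING_SETUP_SQ_AFF, which
busy-polls the submission queue and goes to sleep after sq_thread_idle
milliseconds of inactivity. A minimal userspace illustration with
liburing follows (only an example of the precedent, not part of the
nvmet patches discussed here):

#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_params p = { 0 };
	int ret;

	p.flags = IORING_SETUP_SQPOLL | IORING_SETUP_SQ_AFF;
	p.sq_thread_cpu = 3;     /* pin the sq-poll kthread to cpu 3 */
	p.sq_thread_idle = 2000; /* spin 2000 ms before going to sleep */

	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue init: %s\n", strerror(-ret));
		return 1;
	}

	/*
	 * SQEs queued with io_uring_get_sqe()/io_uring_submit() are now
	 * picked up by the kernel polling thread; submit() only needs to
	 * wake it after it has gone idle.
	 */
	io_uring_queue_exit(&ring);
	return 0;
}

The point of the precedent is that the polling thread's cpu placement
and idle behaviour are explicit, user-controlled knobs rather than
implementation details.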

Yes, we have studied io_uring's code before implementing the patches.
Actually we followed io_uring's design idea in these patches.