Hey Ping Gan, hi Sagi Grimberg,
On 26/06/2024 11:28, Ping Gan wrote:
When running nvmf on an SMP platform, the current nvme target's RDMA and
TCP transports use kworkers to handle IO. But if there are other heavy
workloads in the system (e.g. on Kubernetes), the contention between the
kworkers and those workloads is severe. And since the kworkers
are scheduled by the OS randomly, it's difficult to control OS resources
and to tune performance. If the target supported a dedicated
polling task to handle IO, it would be easier to control OS resources and
get good performance. So it makes sense to add a polling task to the
nvmet-rdma and nvmet-tcp modules.

This is NOT the way to go here.

Both rdma and tcp are driven from workqueue context, and these are bound
workqueues.
So there are two ways to go here:
1. Add a generic port cpuset and use that to direct traffic to the
   appropriate set of cores (i.e. select an appropriate comp_vector for
   rdma and add an appropriate steering rule for tcp).
2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users to
   control these UNBOUND workqueues' cpumask via sysfs.

(2) will not steer interrupts away from the other workloads' cpus, but
the handlers may run on a set of dedicated cpus.

(1) is a better solution, but harder to implement.

You should also look into nvmet-fc (and nvmet-loop for that matter).
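
To make option (2) concrete, here is a rough user-space sketch of how an
admin might pin an unbound workqueue to a few cpus. It assumes the
workqueue was created with WQ_UNBOUND | WQ_SYSFS so that it shows up under
/sys/devices/virtual/workqueue/; the helper names and any workqueue name
passed in are hypothetical, and nvmet does not expose such a workqueue
today. Only cpus 0..31 are handled, so the mask fits in one 32-bit hex
chunk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build the hex cpumask string for a list of cpu ids.
 * Sketch only: handles cpus 0..31. Returns 0 on success, -1 on bad input
 * or if buf is too small.
 */
int cpus_to_mask(const int *cpus, size_t ncpus, char *buf, size_t buflen)
{
    uint32_t mask = 0;

    assert(buf != NULL && buflen > 0);
    for (size_t i = 0; i < ncpus; i++) {
        if (cpus[i] < 0 || cpus[i] > 31)
            return -1;
        mask |= (uint32_t)1 << cpus[i];
    }
    int n = snprintf(buf, buflen, "%x", (unsigned int)mask);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/*
 * Write the mask to the cpumask attribute of an unbound workqueue.
 * wq_name is a placeholder for whatever name the workqueue was given.
 * Returns 0 on success, -1 on any error (needs root to succeed).
 */
int set_wq_cpumask(const char *wq_name, const char *mask)
{
    char path[256];
    int n = snprintf(path, sizeof(path),
                     "/sys/devices/virtual/workqueue/%s/cpumask", wq_name);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fprintf(f, "%s\n", mask) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, cpus {4, 5, 6, 7} give the mask string "f0", which would then
be passed to set_wq_cpumask() together with the workqueue name.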

Thanks for your reply. We actually tried the first approach you
suggested, but found that performance was poor when using spdk
as the initiator.
This patch is not only about resolving the OS resource
contention issue, but also the performance issue. Our analysis shows that
if the target still uses workqueues (kworkers) while the initiator is a
polling driver (e.g. spdk), the workqueue/kworker target becomes the
bottleneck, since every nvmf request may incur a wait between being
queued on the workqueue and the start of processing.
This latency can be traced with wqlat
from bcc (https://github.com/iovisor/bcc/blob/master/tools/wqlat.py).
We think this latency is a disaster for a polling-driver data plane,
right?
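
To illustrate the quantity being discussed, here is a small sketch that
takes (queued, started) timestamp pairs for work items and buckets the
queue-to-start delay into power-of-two microsecond ranges, in the style
of the histograms bcc latency tools usually print. The function names and
the timestamps are made up for illustration; in practice they could come
from the workqueue_queue_work and workqueue_execute_start tracepoints.
This is not the wqlat implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 32

/*
 * Bucket index for a latency in usecs: bucket 0 holds 0..1, bucket k
 * holds 2^k .. 2^(k+1)-1. Anything larger lands in the last bucket.
 */
unsigned int lat_bucket(uint64_t usecs)
{
    unsigned int b = 0;

    while (usecs > 1 && b < NBUCKETS - 1) {
        usecs >>= 1;
        b++;
    }
    return b;
}

/* Record one work item: the wait between queueing and start of execution. */
void lat_record(uint64_t hist[NBUCKETS], uint64_t queued_us, uint64_t started_us)
{
    assert(started_us >= queued_us);
    hist[lat_bucket(started_us - queued_us)]++;
}

/* Print only the non-empty buckets as "low -> high : count". */
void lat_print(const uint64_t hist[NBUCKETS])
{
    for (unsigned int b = 0; b < NBUCKETS; b++) {
        if (!hist[b])
            continue;
        uint64_t lo = b ? (uint64_t)1 << b : 0;
        uint64_t hi = ((uint64_t)1 << (b + 1)) - 1;
        printf("%10llu -> %-10llu : %llu\n",
               (unsigned long long)lo, (unsigned long long)hi,
               (unsigned long long)hist[b]);
    }
}
```

With a polling initiator, a distribution that sits mostly in the higher
buckets would point to the queue-to-start wait as a significant part of
the per-request latency.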
So we think adding a polling task mode on the nvmet side to handle
IO really does make sense; what's your opinion?
You also mentioned that we should look into nvmet-fc, and I agree.
However, we currently have no nvmet-fc testbed; once we get one,
we will do that.