On 7/4/24 11:10, Ping Gan wrote:
Yes, we used IB_POLL_UNBOUND_WORKQUEUE to create ib CQ. And I observed
I'm a bit surprised that you see ~10% delta here. I would look into
On 02/07/2024 13:02, Ping Gan wrote:
Okay, thanks for your guide.
I'd try a simple unbound CPU case, steer packets to say cores [0-5]
On 01/07/2024 10:42, Ping Gan wrote:
Before these patches, we had used linux's RPS to forward the
I suggest that you focus on that instead of what you proposed.
Hey Ping Gan,
hi Sagi Grimberg,
On 26/06/2024 11:28, Ping Gan wrote:
When running nvmf on an SMP platform, the current nvme target's RDMA and
TCP use kworker to handle IO. But if there is other high workload in the
system (e.g. on kubernetes), the competition between the kworker and the
other workload is very intense. And since the kworker is scheduled by
the OS randomly, it's difficult to control OS resources and also tune
the performance. If the target supports using a dedicated polling task
to handle IO, it's useful for controlling OS resources and gaining good
performance. So it makes sense to add a polling task in the nvmet-rdma
and nvmet-tcp modules.

This is NOT the way to go here.
Both rdma and tcp are driven from workqueue context, which are bound
workqueues.
So there are two ways to go here:
1. Add generic port cpuset and use that to direct traffic to the
   appropriate set of cores (i.e. select an appropriate comp_vector for
   rdma and add an appropriate steering rule for tcp).
2. Add options to rdma/tcp to use UNBOUND workqueues, and allow users to
   control these UNBOUND workqueues cpumask via sysfs.
(2) will not control interrupts to steer to other workloads cpus, but
the handlers may run on a set of dedicated cpus.
(1) is a better solution, but harder to implement.
You also should look into nvmet-fc as well (and nvmet-loop for that
matter).
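
To make option (2) concrete from the user side, here is a minimal Python
sketch of building a cpumask string and the sysfs path it would be
written to. It assumes the workqueue is created with WQ_SYSFS so that it
shows up under /sys/devices/virtual/workqueue/. The workqueue name
"nvmet_tcp_wq" and cores 6-11 are only examples, and the helper names
are hypothetical. The same comma-separated hex bitmap format is also
what rps_cpus takes.

```python
def cpus_to_mask(cpus):
    """Return a kernel-style cpumask: 32-bit hex groups, comma-separated."""
    bits = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU ids must be non-negative")
        bits |= 1 << cpu
    groups = []
    while True:
        groups.append(f"{bits & 0xFFFFFFFF:08x}")
        bits >>= 32
        if bits == 0:
            break
    # Most significant group goes first.
    return ",".join(reversed(groups))


def workqueue_cpumask_path(wq_name):
    # Assumption: the workqueue was allocated with WQ_SYSFS.
    return f"/sys/devices/virtual/workqueue/{wq_name}/cpumask"


def set_workqueue_cpus(wq_name, cpus, dry_run=True):
    """Return (path, mask); write the mask only when dry_run is False."""
    mask = cpus_to_mask(cpus)
    path = workqueue_cpumask_path(wq_name)
    if not dry_run:
        # Needs root and an existing sysfs entry for this workqueue.
        with open(path, "w") as f:
            f.write(mask)
    return path, mask


# Hypothetical example: pin the handlers to cores 6-11 (dry run only).
print(set_workqueue_cpus("nvmet_tcp_wq", range(6, 12)))
```

The dry run default keeps the sketch safe to run on a machine that does
not expose the workqueue in sysfs.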
Thanks for your reply, actually we had tried the first advice you
suggested, but we found the performance was poor when using spdk
as initiator.
What is the source of your poor performance?
packets to a fixed cpu set for nvmet-tcp. But when we did that, we still
could not eliminate the competition between softirq and workqueue, since
the nvme target's kworker is bound to the socket's cpu, which comes from
the skb. Besides that, we found the workqueue's wait latency was very
high even when we enabled polling on nvmet-tcp via the module parameter
idle_poll_period_usecs. So when the initiator is in polling mode, the
target's workqueue is the bottleneck. Below is the work wait latency
trace log of our test on our cluster (each node has 4 NUMA nodes, 96
cores, 192G memory, one dual-port Mellanox CX4LX (25Gbps x 2) ethernet
adapter, and randrw 1M IO size) with RPS to 6 cpu cores. And the
system's CPU and memory were about 80% used.
and assign the cpumask of the unbound workqueue to cores [6-11].
Yes, I remodified the nvmet-tcp/nvmet-rdma code for supporting
I think you will see similar performance with unbound workqueue and
ogden-brown:~ # /usr/share/bcc/tools/wqlat -T -w nvmet_tcp_wq 1 2
01:06:59
usecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 | |
16 -> 31 : 10 | |
32 -> 63 : 3 | |
64 -> 127 : 2 | |
128 -> 255 : 0 | |
256 -> 511 : 5 | |
512 -> 1023 : 12 | |
1024 -> 2047 : 26 |* |
2048 -> 4095 : 34 |* |
4096 -> 8191 : 350 |************ |
8192 -> 16383 : 625 |******************************|
16384 -> 32767 : 244 |********* |
32768 -> 65535 : 39 |* |
01:07:00
usecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 0 | |
4 -> 7 : 4 | |
8 -> 15 : 3 | |
16 -> 31 : 8 | |
32 -> 63 : 10 | |
64 -> 127 : 3 | |
128 -> 255 : 6 | |
256 -> 511 : 8 | |
512 -> 1023 : 20 |* |
1024 -> 2047 : 19 |* |
2048 -> 4095 : 57 |** |
4096 -> 8191 : 325 |**************** |
8192 -> 16383 : 647 |******************************|
16384 -> 32767 : 228 |*********** |
32768 -> 65535 : 43 |** |
65536 -> 131071 : 1 | |
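
As a rough way to quantify the traces above, the following Python sketch
parses wqlat-style histogram rows and reports what share of work items
waited at least a given time. The helper names are hypothetical, and the
sample is the 01:06:59 interval copied from the log. With those counts,
1258 of 1353 samples (about 93%) waited 4096 usecs or longer.

```python
import re

# Matches rows such as "4096 -> 8191 : 350 |************ |".
ROW = re.compile(r"^\s*(\d+)\s*->\s*(\d+)\s*:\s*(\d+)\s*\|")


def parse_hist(text):
    """Return [(low_usecs, high_usecs, count), ...] for each histogram row."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append(tuple(int(g) for g in m.groups()))
    return rows


def share_at_or_above(rows, threshold_usecs):
    """Fraction of samples in buckets whose lower bound is >= threshold."""
    total = sum(count for _, _, count in rows)
    slow = sum(count for low, _, count in rows if low >= threshold_usecs)
    return slow / total if total else 0.0


SAMPLE = """\
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 3 | |
16 -> 31 : 10 | |
32 -> 63 : 3 | |
64 -> 127 : 2 | |
128 -> 255 : 0 | |
256 -> 511 : 5 | |
512 -> 1023 : 12 | |
1024 -> 2047 : 26 |* |
2048 -> 4095 : 34 |* |
4096 -> 8191 : 350 |************ |
8192 -> 16383 : 625 |******************************|
16384 -> 32767 : 244 |********* |
32768 -> 65535 : 39 |* |
"""

rows = parse_hist(SAMPLE)
print(f"samples: {sum(c for _, _, c in rows)}")
print(f"share waiting >= 4096 usecs: {share_at_or_above(rows, 4096):.3f}")
```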
And the bandwidth of a node is only 3100MB. When we used the patch and
enabled 6 polling tasks, the bandwidth reached 4000MB. It's a good
improvement.
rps.
unbound workqueue, and ran the test under the same prerequisites as
above, and compared the results of the unbound workqueue and the polling
mode task. And I got good performance for the unbound workqueue. For
unbound workqueue TCP we got 3850M/node, which is almost equal to the
polling task. We also tested nvmet-rdma: we got 5100M/node for unbound
workqueue RDMA versus 5600M for the polling task, so the diff seems very
small. Anyway, your advice is good.
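
For reference, the relative gaps between the per-node figures quoted in
this thread can be worked out with a few lines of Python. This is only
arithmetic on the posted numbers (units as posted), and the variable
names are made up for the sketch. The polling task comes out roughly 29%
above the RPS plus bound workqueue result for TCP, while the unbound
workqueue is just under 4% below the polling task for TCP and roughly 9%
below it for RDMA.

```python
def relative_change(baseline, value):
    """Return (value - baseline) / baseline as a fraction."""
    return (value - baseline) / baseline


# Per-node figures quoted in the thread (units as posted).
RPS_BOUND_TCP = 3100   # RPS to 6 cores, bound workqueue
POLLING_TCP = 4000     # 6 polling tasks
UNBOUND_TCP = 3850     # unbound workqueue
POLLING_RDMA = 5600    # polling task
UNBOUND_RDMA = 5100    # unbound workqueue

comparisons = [
    ("tcp: polling task vs RPS + bound workqueue", RPS_BOUND_TCP, POLLING_TCP),
    ("tcp: unbound workqueue vs polling task", POLLING_TCP, UNBOUND_TCP),
    ("rdma: unbound workqueue vs polling task", POLLING_RDMA, UNBOUND_RDMA),
]

for label, baseline, value in comparisons:
    print(f"{label}: {relative_change(baseline, value) * 100:+.2f}%")
```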
what is the root-cause of this difference. If indeed the load is high,
the overhead of the workqueue mgmt should be negligible. I'm assuming
you used IB_POLL_UNBOUND_WORKQUEUE ?
3% CPU usage of unbound workqueue versus 6% of polling task.
Do you think we should submit the unbound workqueue patches for
nvmet-tcp and nvmet-rdma to upstream nvmet?

For nvmet-tcp, I think there is merit to split socket processing from
napi context. For nvmet-rdma I think the only difference is if you have
multiple CQs assigned with the same comp_vector.
How many queues do you have in your test?

We used 24 IO queues to nvmet-rdma target. I think this may also be
related to workqueue's wait latency. We still see some several ms wait
latency for unbound workqueue of RDMA. You can see below trace log.