Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism

From: Sagi Grimberg
Date: Fri Sep 20 2019 - 13:15:02 EST



Hey Ming,

Ok, so the real problem is per-cpu bounded tasks.

I share Thomas' opinion about a NAPI-like approach.

We already have that, it's irq_poll, but it seems that for this
use-case we get lower performance for some reason. I'm not entirely
sure why that is; maybe it's because we need to mask interrupts
because we don't have an "arm" register in nvme like network devices
have?

Long observed that IOPS drops a lot too when switching to threaded irq.
If ksoftirqd is woken up to handle the softirq, the performance shouldn't
be better than with threaded irq.

It's true that it shouldn't be any faster, but what irq_poll already has, and we
don't need to reinvent, is a proper budgeting mechanism that kicks in
when multiple devices map irq vectors to the same cpu core.

irq_poll already maintains a percpu list and dispatches each ->poll with a budget
that the backend enforces, while irq_poll multiplexes between the registered
instances. Reinventing this mechanism in irq (hard or threaded) context sounds
a bit unnecessary.
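
(For reference, the driver-facing side of that budgeting looks roughly like the
sketch below; the my_* names are hypothetical stand-ins, only the irq_poll_*()
calls are the actual <linux/irq_poll.h> API.)

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

/* hypothetical queue structure and helpers, standing in for real driver code */
struct my_queue {
	struct irq_poll iop;
	/* ... completion-queue state ... */
};
static int my_reap_cqes(struct my_queue *q, int max);
static void my_mask_irq(struct my_queue *q);
static void my_unmask_irq(struct my_queue *q);

static int my_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done = my_reap_cqes(q, budget);	/* reap at most 'budget' completions */

	if (done < budget) {
		/* queue drained: stop polling and unmask the queue's interrupt */
		irq_poll_complete(iop);
		my_unmask_irq(q);
	}
	return done;	/* the core subtracts this from the per-cpu budget */
}

static irqreturn_t my_irq(int irq, void *data)
{
	struct my_queue *q = data;

	my_mask_irq(q);			/* no "arm" register, so mask instead */
	irq_poll_sched(&q->iop);	/* defer the actual reaping to irq_poll */
	return IRQ_HANDLED;
}

/* at queue init time: irq_poll_init(&q->iop, 64, my_poll);  (64 = example weight) */

The irq_poll core then runs all scheduled instances from its softirq,
round-robining between them against a shared per-cpu budget, which is the
multiplexing described above.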

It seems like we're attempting to stay in irq context for as long as we can
instead of scheduling to softirq/thread context once we have more than a
minimal amount of work to do. Without at least understanding why
softirq/thread degrades us so much, this code seems like the wrong approach
to me. Interrupt context will always be faster, but that alone is not a sufficient
reason to spend as much time as possible there, is it?

We should also keep in mind that the networking stack has been doing this
for years; I would try to understand why this cannot work for nvme before
dismissing it.

In particular, Long found that context
switches increased a lot after applying your irq poll patch.

http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Oh, I didn't see that one, wonder why... thanks!

5% improvement, I guess we can buy that for other users as is :)

If we suffer from lots of context switches while the CPU is flooded with
interrupts, then I would argue that we're re-raising softirq too much.
In this use-case, my assumption is that the cpu cannot keep up with the
interrupts, not that it doesn't reap enough (we also reap the first batch in
interrupt context...).

Perhaps making irq_poll continue until it must resched would improve things
further? Although this is a latency vs. efficiency tradeoff, it looks like
MAX_SOFTIRQ_TIME is set to 2ms:

"
* The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
* certain cases, such as stop_machine(), jiffies may cease to
* increment and so we need the MAX_SOFTIRQ_RESTART limit as
* well to make sure we eventually return from this method.
*
* These limits have been established via experimentation.
* The two things to balance is latency against fairness -
* we want to handle softirqs as soon as possible, but they
* should not be able to lock up the box.
"

Long, does this patch make any difference?

Sagi,

Sorry it took a while to bring my system back online.

With the patch, IOPS shows about the same drop as with the 1st patch. I think the excessive context switches are causing the drop in IOPS.

The following was captured by "perf sched record" for 30 seconds during the test.

"perf sched latency"
With patch:
fio:(82) | 937632.706 ms | 1782255 | avg: 0.209 ms | max: 63.123 ms | max at: 768.274023 s

without patch:
fio:(82) |2348323.432 ms | 18848 | avg: 0.295 ms | max: 28.446 ms | max at: 6447.310255 s

Does "without patch" mean with the proposed hard-irq patch?

If we are context switching too much, it means the soft-irq operation is
not efficient, not necessarily that the completion path is
running in soft-irq.

Is your kernel compiled with full preemption or voluntary preemption?

Looking closer at each CPU, we can see ksoftirqd is competing for the CPU with fio (and effectively throttling other fio processes).
(Captured in /sys/kernel/debug/tracing, echo sched:* > set_event.)

On CPU1 with patch (note that the prev_state for fio is "R"; it was preempted):
<...>-4077 [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
<...>-17 [001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120
<...>-4077 [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120
<...>-17 [001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120

On CPU1 without patch (the prev_state for fio is "S"; it scheduled out voluntarily):
<idle>-0 [001] d... 6725.392308: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
fio-14342 [001] d... 6725.392332: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
<idle>-0 [001] d... 6725.392356: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=fio next_pid=14342 next_prio=120
fio-14342 [001] d... 6725.392425: sched_switch: prev_comm=fio prev_pid=14342 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=12