[PATCH 0/5] genirq: threadable IRQ support

From: Paolo Abeni
Date: Wed Jun 15 2016 - 09:42:53 EST


This patch series adds a new genirq interface that allows user space to change
the IRQ mode at runtime, switching to and from threaded mode.

The configuration is performed on a per-irqaction basis, writing into the
newly added procfs entry /proc/irq/<nr>/<irq action name>/threaded. Such entry
is created at IRQ request time, only if CONFIG_IRQ_FORCED_THREADING
is defined.
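
As a usage illustration only (the IRQ number and the action name below are
made-up example values, and writing "1"/"0" is an assumption on the interface),
a small user-space program could flip one irqaction into threaded mode along
these lines:

/* Sketch: switch one irqaction to threaded mode from user space.
 * "42" and "eth0-rx-0" are made-up example values; the real path is
 * /proc/irq/<nr>/<irq action name>/threaded.  Writing "1" to enable
 * and "0" to disable threaded mode is an assumption.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/irq/42/eth0-rx-0/threaded", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("1", f);
	fclose(f);
	return 0;
}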

Upon IRQ creation, the device handling that IRQ may optionally provide, via
the newly added API irq_set_mode_notifier(), an additional callback to be
notified about IRQ mode changes.
The device can use such callback to configure its internal state to behave
differently in threaded mode and in normal mode, if required.
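
For illustration, a driver hooking into this could look roughly as follows;
the prototype of irq_set_mode_notifier() and all the foo_* names are
assumptions, the real API is defined in patch 1/5:

/* Sketch only: assumed prototype, see patch 1/5 for the real one:
 *   int irq_set_mode_notifier(unsigned int irq, void *data,
 *                             void (*mode_set)(void *data, bool threaded));
 */
#include <linux/interrupt.h>

struct foo_priv {
	int irq;
	bool irq_is_threaded;
};

static irqreturn_t foo_isr(int irq, void *data)
{
	return IRQ_HANDLED;
}

static void foo_irq_mode_set(void *data, bool threaded)
{
	struct foo_priv *priv = data;

	/* adapt the device internal state to the new IRQ mode */
	priv->irq_is_threaded = threaded;
}

static int foo_setup_irq(struct foo_priv *priv)
{
	int err;

	err = request_irq(priv->irq, foo_isr, 0, "foo", priv);
	if (err)
		return err;

	/* optional: get notified when user space toggles .../threaded */
	return irq_set_mode_notifier(priv->irq, priv, foo_irq_mode_set);
}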

Additional IRQ flags are added to let the device specify some default
aspects of the IRQ thread: the device can request the SCHED_NORMAL scheduling
policy and opt out of the affinity setting for the IRQ thread. Both
options are beneficial for the first threadable IRQ user.
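
The flag names below are placeholders (the real ones are introduced by patch
2/5); the intent is only to show where they would be used, i.e. at
request_irq() time in the same setup path sketched above:

/* Sketch only: IRQF_THREAD_SCHED_NORMAL and IRQF_THREAD_NO_AFFINITY are
 * hypothetical names standing in for the flags added by patch 2/5.
 */
static int foo_setup_irq_threaded_defaults(struct foo_priv *priv)
{
	return request_irq(priv->irq, foo_isr,
			   IRQF_THREAD_SCHED_NORMAL |	/* IRQ thread uses SCHED_NORMAL */
			   IRQF_THREAD_NO_AFFINITY,	/* do not force the thread affinity */
			   "foo", priv);
}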

The initial user of this feature is the networking subsystem; some
infrastructure is added to the network core for this purpose. A new napi field
storing an IRQ thread reference is used to mark a NAPI instance as threaded,
and __napi_schedule is modified to invoke the poll loop directly, instead of
raising a softirq, when the related NAPI instance is in threaded mode.
Additionally, an IRQ mode-set callback is provided to notify the NAPI instance
of IRQ mode changes.
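
A rough sketch of the idea (not the actual patch; the napi->thread field and
the napi_thread_poll() helper name are assumptions) is:

/* Sketch of the described behaviour: a new napi->thread field, storing
 * the IRQ thread reference, marks the instance as threaded; the helper
 * name napi_thread_poll() is an assumption.
 */
void __napi_schedule(struct napi_struct *n)
{
	unsigned long flags;

	if (n->thread) {
		/* threaded mode: we already run in the IRQ thread
		 * context, invoke the poll loop directly instead of
		 * raising NET_RX_SOFTIRQ
		 */
		napi_thread_poll(n);
		return;
	}

	local_irq_save(flags);
	____napi_schedule(this_cpu_ptr(&softnet_data), n);
	local_irq_restore(flags);
}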

Each network device driver must be migrated explicitly to leverage the new
infrastructure. In this patch series, the Intel ixgbe driver is updated to
invoke irq_set_mode_notifier(), only when using MSI-X IRQs.
This prevents other IRQ events from being delayed indefinitely while the rx IRQ
is processed in threaded mode. The default behavior after the driver migration
is unchanged.
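
The hook point would roughly be the per-vector request_irq() call in
ixgbe_request_msix_irqs(); napi_irq_mode_set below is an assumed name for the
callback provided by the net core:

/* Sketch only, in ixgbe_request_msix_irqs(): register the notifier
 * right after the per-vector request_irq(); napi_irq_mode_set is an
 * assumed name for the net-core callback.
 */
err = request_irq(entry->vector, &ixgbe_msix_clean_rings, 0,
		  q_vector->name, q_vector);
if (err)
	goto free_queue_irqs;

irq_set_mode_notifier(entry->vector, &q_vector->napi,
		      napi_irq_mode_set);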

Running the rx packet processing inside a conventional kthread is beneficial
for different workloads, since it allows the process scheduler to make better
use of the available resources. With multiqueue NICs, the ksoftirqd design does
not allow any running process to use 100% of a single CPU under relevant
network load, because the softirq poll loop will be scheduled on each CPU.

The above can be experienced in a hypervisor/VM scenario, when the guest is
under UDP flood. If the hypervisor's NIC has enough rx queues, the guest will
compete with ksoftirqd on each CPU. Moreover, since the ksoftirqd CPU
utilization changes with the ingress traffic, the scheduler tries to migrate
the guest processes towards the CPUs with the highest available capacity,
further impacting the guest's ability to process rx packets.

Running the hypervisor rx packet processing inside a migratable kthread allows
the process scheduler to let the guest process[es] fully use a single core
each, migrating some rx threads as required.

The raw numbers, obtained with the netperf UDP_STREAM test, using a tun
device with a noqueue qdisc in the hypervisor, and using random IP addresses
as source in case of multiple flows, are as follows:

             vanilla    threaded
size/flow       kpps    kpps/delta
1/1              824    843/+2%
1/25             736    906/+23%
1/50             752    906/+20%
1/100            772    906/+17%
1/200            741    976/+31%
64/1             829    840/+1%
64/25            711    932/+31%
64/50            780    894/+14%
64/100           754    946/+25%
64/200           714    945/+32%
256/1            702    510/-27%
256/25           724    894/+23%
256/50           739    889/+20%
256/100          798    873/+9%
256/200          812    907/+11%
1400/1           720    727/+1%
1400/25          826    826/0
1400/50          827    833/0
1400/100         820    820/0
1400/200         796    799/0

The guest runs 2 vCPUs, so it's not prone to the user-space livelock issue
recently exposed here: http://thread.gmane.org/gmane.linux.kernel/2218719

There are relevant improvements in all CPU-bound scenarios with multiple flows,
and a significant regression with medium-sized packets and a single flow. The
latter is due to the increased 'burstiness' of packet processing, which causes
the single socket in the guest to overflow more easily, if the receiver
application is scheduled on the same CPU that processes the incoming packets.

The kthread approach should give several new advantages over the softirq-based
approach:

* moving towards a more dpdk-like busy-poll packet processing direction:
  we can even use busy polling without the need of a connected UDP or TCP
  socket and can leverage busy polling for forwarding setups. This could
  very well improve latency and packet throughput without hurting other
  processes, if the networking stack gets more and more preemptive in the
  future.

* possibility to acquire mutexes in the networking processing path: e.g.
  we would need that to configure hw_breakpoints, if we want to add
  watchpoints in memory based on some rules in the kernel

* more and better tooling to adjust the weight of the networking
  kthreads, preferring certain networking cards or setting the CPU affinity
  of packet processing threads. Using deadline scheduling or other
  scheduler features might also be worthwhile.

* scheduler statistics can be used to observe network packet processing



Paolo Abeni (5):
genirq: implement support for runtime switch to threaded irqs
genirq: add flags for controlling the default threaded irq behavior
sched/preempt: cond_resched_softirq() must check for softirq
netdev: implement infrastructure for threadable napi irq
ixgbe: add support for threadable rx irq

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  14 +-
 include/linux/interrupt.h                     |  21 +++
 include/linux/netdevice.h                     |   4 +
 kernel/irq/internals.h                        |   3 +
 kernel/irq/manage.c                           | 212 ++++++++++++++++++++++++--
 kernel/irq/proc.c                             |  51 +++++++
 kernel/sched/core.c                           |   3 +-
 net/core/dev.c                                |  59 +++++++
 8 files changed, 355 insertions(+), 12 deletions(-)

--
1.8.3.1