[RFC 00/12] Hyper-V guests use Linux IRQs for channel interrupts

From: mhkelley58
Date: Tue Jun 04 2024 - 01:11:00 EST


From: Michael Kelley <mhklinux@xxxxxxxxxxx>

This patch set makes significant changes in how Linux guests running on Hyper-V
handle interrupts from VMBus devices. It's "RFC" because of the level of change,
and because I'm interested in macro-level feedback on the whole idea. The usual
detailed-level code review feedback is also welcome.

Background
----------
Linux guests running on Hyper-V are presented with a range of synthetic
devices, including storage, networking, mouse, keyboard, etc. Guests are also
presented with management-related pseudo-devices, including time
synchronization, heartbeat, and host-initiated shutdown. These devices use the
VMBus connection between the host and the guest, and each device has one or
more VMBus channels for passing messages between the host and guest.
Separately, a control path exists for creating and tearing down VMBus channels
when needed. Hyper-V VMBus and its synthetic devices play a role similar to
virtio and virtio devices.

The host interrupts the guest when it sends a new message to the guest in an
otherwise empty channel, and when it sends a new control message to the guest.
All interrupts are multiplexed onto a single per-CPU architectural interrupt as
defined by the processor architecture. On arm64, this architectural interrupt
is a PPI, and is represented in Linux as a Linux per-CPU IRQ. The x86/x64
architecture does not provide per-CPU interrupts, so Linux reserves a fixed x86
interrupt vector (HYPERVISOR_CALLBACK_VECTOR) on all CPUs, with a hard-coded
interrupt handler. No Linux IRQ is assigned on x86/x64.

The interrupt handler for the architectural interrupt is the VMBus interrupt
handler, and it runs whenever the host generates an interrupt for a channel or
a control message. The VMBus interrupt handler consults data structures
provided by the per-CPU Hyper-V synthetic interrupt controller (synic) to
determine which channel or channels have a pending interrupt, or if there is a
pending control message. When doing this demultiplexing, the VMBus interrupt
handler directly invokes the per-device handlers, and processes any control
messages.

In this scheme, the individual VMBus devices and their channels are not
modelled as Linux IRQs. /proc/interrupts shows counts for the top-level
architectural interrupt, but not for the individual VMBus devices. The VMBus
driver has implemented separate reporting of per-device data under /sys.
Similarly, /proc/irq/<nn> does not show affinity information for individual
VMBus devices, and the affinity cannot be changed via /proc/irq. Again, a
separate interface under /sys is provided for changing the vCPU that a VMBus
channel should interrupt.

What's New
----------
Linux kernel community members have commented that it would be better for the
Hyper-V synic and the handling of VMBus channel interrupts to be modelled as
Linux IRQs. Then they would participate in existing IRQ handling mechanisms,
and not need something separate under /sys. This patch set provides such an
implementation.

With this patch set, the output of /proc/interrupts looks like this in a Linux
guest on a Hyper-V x86/x64 Generation 2 VM with 8 vCPUs (the columns for
CPUs 2 thru 5 are omitted due to text width considerations):

CPU0 CPU1 CPU6 CPU7
8: 0 0 0 0 IO-APIC 8-edge rtc0
9: 2 0 0 0 IO-APIC 9-fasteoi acpi
24: 16841 0 0 0 VMBus 7 heartbeat
25: 1 0 0 0 VMBus 9 shutdown
26: 6752 0 0 0 VMBus 10 timesync
27: 1 0 0 0 VMBus 11 vss
28: 22095 0 0 0 VMBus 14 pri@storvsc1
29: 0 85 0 0 VMBus 15 pri@storvsc0
30: 3 0 0 0 VMBus 1 balloon
31: 106 0 0 0 VMBus 3 mouse
32: 73 0 0 0 VMBus 4 keyboard
33: 6 0 0 0 VMBus 5 framebuffer
34: 0 0 0 0 VMBus 13 netvsc
35: 1 0 0 0 VMBus 8 kvp
36: 0 0 0 0 VMBus 16 netvsc
37: 0 0 0 0 VMBus 17 netvsc
38: 0 0 0 0 VMBus 18 netvsc
39: 0 0 2375 0 VMBus 19 netvsc
40: 0 0 0 1178 VMBus 20 netvsc
41: 1003 0 0 0 VMBus 21 netvsc
42: 0 4337 0 0 VMBus 22 netvsc
43: 0 0 0 0 VMBus 23 sub@storvsc1
44: 0 0 0 0 VMBus 24 sub@storvsc0
NMI: 0 0 0 0 Non-maskable interrupts
LOC: 0 0 0 0 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 0 0 0 0 Performance monitoring interrupts
IWI: 1 0 0 2 IRQ work interrupts
RTR: 0 0 0 0 APIC ICR read retries
RES: 411 233 349 318 Rescheduling interrupts
CAL: 135236 29211 99526 36424 Function call interrupts
TLB: 0 0 0 0 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 109 109 109 109 Machine check polls
HYP: 71 0 0 0 Hypervisor callback interrupts
HRE: 0 0 0 0 Hyper-V reenlightenment interrupts
HVS: 183391 109695 181539 339239 Hyper-V stimer0 interrupts
ERR: 0
MIS: 0
PIN: 0 0 0 0 Posted-interrupt notification event
NPI: 0 0 0 0 Nested posted-interrupt event
PIW: 0 0 0 0 Posted-interrupt wakeup event

IRQs 24 thru 44 are the VMBus channel IRQs that didn't previously exist. Some
devices, such as storvsc and netvsc, have multiple channels, each of which has
a separate IRQ. The HYP line now shows counts only for control messages instead
of for all VMBus interrupts. The HVS line continues to show Hyper-V synthetic
timer (stimer0) interrupts. (In older versions of Hyper-V where stimer0
interrupts are delivered as control messages, those interrupts are accounted
for on the HYP line. Since this is legacy behavior, it seemed unnecessary to
create a separate Linux IRQ to model these messages.)

The basic approach is to update the VMBus channel create/open/close/delete
infrastructure, and the VMBus interrupt handler, to create and process the
additional IRQs. The goal is that individual VMBus drivers require no code
changes to work with the new IRQs. However, a driver may optionally annotate
the IRQ using knowledge that only the driver has. For example, in the above
/proc/interrupts output, the storvsc driver for synthetic disks has been
updated to distinguish the primary channel from the subchannels, and to
distinguish controller 0 from controller 1. Since storvsc models a SCSI
controller, the numeric designations are set to match the "host0" and "host1"
SCSI controllers that could be seen with "lsscsi -H -v". Similarly annotating
the "netvsc" entries is a "to do" item.

In furtherance of the "no changes to existing VMBus drivers" goal, all VMBus
IRQs use the same interrupt handler. Each channel has a callback function
associated with it, and the interrupt handler invokes this callback in the same
way as before VMBus IRQs were introduced.

VMBus code creates a standalone linear IRQ domain for the VMBus IRQs. The hwirq
#'s are the VMBus channel relid's, which are integers in the range 1 to 2047.
Each relid uniquely identifies a VMBus channel, and a channel interrupt from
the host provides the relid. A mapping from IRQ to hwirq (relid) is created
when a new channel is created, and the mapping is deleted when the channel is
deleted, so adding and removing VMBus devices in a running VM is fully
functional.

Getting the IRQ counting correct requires a new IRQ flow handler in
/kernel/irq/chip.c. See Patch 6. The existing flow handlers don't handle this
case, and creating a custom flow handler outside of chip.c isn't feasible
because the needed components aren't available to code outside of /kernel/irq.

When a guest CPU is taken offline, Linux automatically changes the affinity of
IRQs assigned to that CPU so that the IRQ is handled on another CPU that is
still online. Unfortunately, VMBus channel IRQs are not able to take advantage
of that mechanism. At the point the mechanism runs, it isn't possible to know
when Hyper-V started interrupting the new CPU; hence pending interrupts may be
stranded on the old offline CPU. Existing VMBus code in Linux prevents a CPU
from going offline when a VMBus channel interrupt is affined to it, and that
code must be retained. Of course, VMBus channel IRQs can manually be affined to
a different CPU, which could then allow a CPU to be taken offline. I have an
idea for a somewhat convoluted way to allow the automatic Linux mechanism to
work, but that will be another patch set.

Having VMBus channel interrupts appear in /proc/irq means that user-space
irqbalance can change the affinity of the interrupts, where previously it could
not. For example, this gives irqbalance the ability to change (and perhaps
degrade) the default affinity configuration that was designed to maximize
network throughput. This is yet another reason to disable irqbalance in VM
images.

The custom VMBus entries under /sys are retained for backwards compatibility,
but they are integrated. For example, changing the affinity via /proc/irq is
reflected in /sys/bus/vmbus/devices/<guid>/channels/<nn>/cpu, and vice versa.

Patches
-------
The patches have been tested on x86/x64 Generation 1 and Generation 2 VMs, and
on arm64 VMs.

* Patches 1 and 2 fixes some error handling that is needed by subsequent
patches. They could be applied independent of this patch set.

* Patches 3,4, and 5 add support for IRQ names, and add annotations in the
storvsc and the Hyper-V vPCI driver so that descriptions in /proc/interrupts
convey useful information.

* Patch 6 adds a new IRQ flow handler needed to properly account for IRQs as
VMBus IRQs or as the top-level architectural interrupt.

* Patch 7 creates the new IRQ domain for VMBus channel IRQs.

* Patch 8 allocates the VMBus channel IRQs, and creates the mapping between
VMBus relid (used as the hwirq) and the Linux IRQ number. That mapping is then
used for lookups.

* Patch 9 does request_irq() for the new VMBus channel IRQs, and converts the
top-level VMBus architectural interrupt handler to use the new VMBus channel
IRQs for handling channel interrupts. With this patch, /proc/interrupts shows
counts for the VMBus channel IRQs.

* Patch 10 implements the irq_set_affinity() function for VMBus channel IRQs,
and integrates it with the existing Hyper-V-specific sysfs mechanism for
changing the CPU that a VMBus channel will interrupt.

* Patch 11 updates hv_synic_cleanup() during CPU offlining to handle the lack
of waiting for the MODIFYCHANNEL message in vmbus_irq_set_affinity() in Patch
10.

* Patch 12 fixes a race in VMBus channel interrupt assignments that existed
prior to this patch set when taking a CPU offline.

Open Topics
-----------
* The performance impact of the additional IRQ processing in the interrupt path
appears to be minimal, based on looking at the code. Measurements are still to
be taken to confirm this.

* Does the netvsc driver have sufficient information to annotate what appears
in /proc/interrupts, particularly when multiple synthetic NICs are present? At
first glance, it appears that the identifying info isn't generated until after
vmbus_open() is called, and by then it's too late to do the IRQ annotation.
Need some input from a netvsc expert.

* The VMBus irq domain and irq chip data structures are placed in the
vmbus_connection structure. Starting with VMBus protocol version 5.0, the VMBus
host side supports multiple connections from a single guest instance (see
commit ae20b254306a6). While there's no upstream kernel code that creates
multiple VMBus connections, it's unclear whether the IRQ domain should be
global across all VMBus connection, or per connection. Are relid's global (to
the VM) or per connection?

* I have not tried or tested any hv_sock channels. Is there a handy test
program that would be convenient for opening multiple hv_sock connections to
the guest, have them persist long enough to see what /proc/interrupts shows,
and try changing the IRQ affinity?

* I have not tried or tested Linux guest hibernation paths.

The patches build and run against 6.10-rc2 and next-20240603.

Michael Kelley (12):
Drivers: hv: vmbus: Drop unsupported VMBus devices earlier
Drivers: hv: vmbus: Fix error path that deletes non-existent sysfs
group
Drivers: hv: vmbus: Add an IRQ name to VMBus channels
PCI: hv: Annotate the VMBus channel IRQ name
scsi: storvsc: Annotate the VMBus channel IRQ name
genirq: Add per-cpu flow handler with conditional IRQ stats
Drivers: hv: vmbus: Set up irqdomain and irqchip for the VMBus
connection
Drivers: hv: vmbus: Allocate an IRQ per channel and use for relid
mapping
Drivers: hv: vmbus: Use Linux IRQs to handle VMBus channel interrupts
Drivers: hv: vmbus: Implement vmbus_irq_set_affinity
Drivers: hv: vmbus: Wait for MODIFYCHANNEL to finish when offlining
CPUs
Drivers: hv: vmbus: Ensure IRQ affinity isn't set to a CPU going
offline

arch/x86/kernel/cpu/mshyperv.c | 9 +-
drivers/hv/channel.c | 125 ++++------
drivers/hv/channel_mgmt.c | 139 ++++++++----
drivers/hv/connection.c | 48 ++--
drivers/hv/hv.c | 90 +++++++-
drivers/hv/hv_common.c | 2 +-
drivers/hv/hyperv_vmbus.h | 22 +-
drivers/hv/vmbus_drv.c | 338 +++++++++++++++++++---------
drivers/pci/controller/pci-hyperv.c | 5 +
drivers/scsi/storvsc_drv.c | 9 +
include/asm-generic/mshyperv.h | 9 +-
include/linux/hyperv.h | 9 +
include/linux/irq.h | 1 +
kernel/irq/chip.c | 29 +++
14 files changed, 567 insertions(+), 268 deletions(-)

--
2.25.1