Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI
From: Qian Cai
Date: Fri Sep 25 2020 - 11:29:34 EST
On Wed, 2020-08-26 at 13:16 +0200, Thomas Gleixner wrote:
> This is the second version of providing a base to support device MSI (non
> PCI based) and on top of that support for IMS (Interrupt Message Storm)
> based devices in a halfways architecture independent way.
>
> The first version can be found here:
>
> https://lore.kernel.org/r/20200821002424.119492231@xxxxxxxxxxxxx
>
> It's still a mixed bag of bug fixes, cleanups and general improvements
> which are worthwhile independent of device MSI.
Reverting part of this patchset on top of today's linux-next fixed a
boot issue on an HPE ProLiant DL560 Gen10, i.e.,
$ git revert --no-edit 13b90cadfc29..bc95fd0d7c42
.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config
The crashes appear to happen in the interrupt remapping code, and the kernel is
only able to generate partial call traces.
[ 1.912386][ T0] ACPI: X2APIC_NMI (uid[0xf5] high level [...]
[...]9983][ T0] ... MAX_LOCK_DEPTH: 48
[ 7.914876][ T0] ... MAX_LOCKDEP_KEYS: 8192
[ 7.919942][ T0] ... CLASSHASH_SIZE: 4096
[ 7.925009][ T0] ... MAX_LOCKDEP_ENTRIES: 32768
[ 7.930163][ T0] ... MAX_LOCKDEP_CHAINS: 65536
[ 7.935318][ T0] ... CHAINHASH_SIZE: 32768
[ 7.940473][ T0] memory used by lock dependency info: 6301 kB
[ 7.946586][ T0] memory used for stack traces: 4224 kB
[ 7.952088][ T0] per task-struct memory footprint: 1920 bytes
[ 7.968312][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 7.980281][ T0] ACPI: Core revision 20200717
[ 7.993343][ T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
[ 8.003270][ T0] APIC: Switch to symmetric I/O mode setup
[ 8.008951][ T0] DMAR: Host address width 46
[ 8.013512][ T0] DMAR: DRHD base: 0x000000e5ffc000 flags: 0x0
[ 8.019680][ T0] DMAR: dmar0: reg_base_addr e5ffc000 ver 1:0 cap 8d2078c106f0466 [...]
[...][ T0] DMAR-IR: IOAPIC id 15 under DRHD base 0xe5ffc000 IOMMU 0
[ 8.420990][ T0] DMAR-IR: IOAPIC id 8 under DRHD base 0xddffc000 IOMMU 15
[ 8.428166][ T0] DMAR-IR: IOAPIC id 9 under DRHD base 0xddffc000 IOMMU 15
[ 8.435341][ T0] DMAR-IR: HPET id 0 under DRHD base 0xddffc000
[ 8.441456][ T0] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 8.457911][ T0] DMAR-IR: Enabled IRQ remapping in x2apic mode
[ 8.466614][ T0] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 8.474295][ T0] #PF: supervisor instruction fetch in kernel mode
[ 8.480669][ T0] #PF: error_code(0x0010) - not-present page
[ 8.486518][ T0] PGD 0 P4D 0
[ 8.489757][ T0] Oops: 0010 [#1] SMP KASAN PTI
[ 8.494476][ T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G I 5.9.0-rc6-next-20200925 #2
[ 8.503987][ T0] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[ 8.513238][ T0] RIP: 0010:0x0
[ 8.516562][ T0] Code: Bad RIP v
or
[ 2.906744][ T0] ACPI: X2API[...]
[...]32, address 0xfec68000, GSI 128-135
[ 2.907063][ T0] IOAPIC[15]: apic_id 29, version 32, address 0xfec70000, GSI 136-143
[ 2.907071][ T0] IOAPIC[16]: apic_id 30, version 32, address 0xfec78000, GSI 144-151
[ 2.907079][ T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[ 2.907084][ T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[ 2.907100][ T0] Using ACPI (MADT) for SMP configuration information
[ 2.907105][ T0] ACPI: HPET id: 0x8086a701 base: 0xfed00000
[ 2.907116][ T0] ACPI: SPCR: console: uart,mmio,0x0,115200
[ 2.907121][ T0] TSC deadline timer available
[ 2.907126][ T0] smpboot: Allowing 144 CPUs, 0 hotplug CPUs
[ 2.907163][ T0] [mem 0xd0000000-0xfdffffff] available for PCI devices
[ 2.907175][ T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[ 2.914541][ T0] setup_percpu: NR_CPUS:256 nr_cpumask_bits:144 nr_cpu_ids:144 nr_node_ids:4
[ 2.926109][ [...]
[...]466 ecap f020df
[ 9.134709][ T0] DMAR: DRHD base: 0x000000f5ffc000 flags: 0x0
[ 9.140867][ T0] DMAR: dmar8: reg_base_addr f5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.149610][ T0] DMAR: DRHD base: 0x000000f7ffc000 flags: 0x0
[ 9.155762][ T0] DMAR: dmar9: reg_base_addr f7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.164491][ T0] DMAR: DRHD base: 0x000000f9ffc000 flags: 0x0
[ 9.170645][ T0] DMAR: dmar10: reg_base_addr f9ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.179476][ T0] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 9.185626][ T0] DMAR: dmar11: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.194442][ T0] DMAR: DRHD base: 0x000000dfffc000 flags: 0x0
[ 9.200587][ T0] DMAR: dmar12: reg_base_addr dfffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.209418][ T0] DMAR: DRHD base: 0x000000e1ffc000 flags: 0x0
[ 9.215551][ T0] DMAR: dmar13: reg_base_addr e1ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 9.224367][ T0] DMAR: DRHD base: 0x000000e3ffc[...]
[...]83][ T0] msi_domain_alloc+0x8e/0x280
[ 9.615015][ T0] __irq_domain_a[...]
[...]8992cd
[ 9.711906][ T0] R10: ffffffff85407d78 R11: fffffbfff18992cc R12: ffffffff8546ffc0
[ 9.719761][ T0] R13: 0000000000000098 R14: ffff888106e63a40 R15: 0000000000000001
[ 9.727617][ T0] FS: 0000000000000000(0000) GS:ffff8887df800000(0000) knlGS:0000000000000000
[ 9.736431][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.742892][ T0] CR2: ffffffffffffffd6 CR3: 0000001ba7814001 CR4: 00000000000606b0
[ 9.750747][ T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9.758601][ T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9.766456][ T0] Kernel panic - not syncing: Fatal exception
[ 9.772547][ T0] ---[ end Kernel panic - not syncing: Fatal exception ]---
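
For anyone reading the first oops: the "#PF: error_code(0x0010)" line can be
decoded bit by bit. Below is a small Python sketch of my own (the helper name
is made up, it is not part of any kernel tool) that uses the architectural x86
page-fault error code bits. Together with "RIP: 0010:0x0" the result is
consistent with a call through a NULL function pointer, though the truncated
traces do not prove which callback it was.

```python
# Architectural x86 page-fault error code bits.
# Each entry: (mask, meaning when set, meaning when clear or None to skip).
PF_BITS = [
    (0x01, "protection violation", "not-present page"),
    (0x02, "write access", "non-write access"),
    (0x04, "user mode", "supervisor mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
    (0x20, "protection-key violation", None),
]


def decode_pf_error_code(code):
    """Return a list of human-readable properties for a #PF error code."""
    out = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            out.append(if_set)
        elif if_clear is not None:
            out.append(if_clear)
    return out


if __name__ == "__main__":
    # Value taken from the first oops above.
    print(decode_pf_error_code(0x0010))
```

For 0x0010 this yields not-present page, non-write access, supervisor mode and
instruction fetch, which matches the two "#PF:" lines the kernel printed.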
The working boot (without those patches) looks like this:
[ 1.913963][ T0] ACPI: X2APIC_NMI (uid[0xf4] high level lint[0x1])
[ 1.913967][ T0] ACPI: X2APIC_NMI (uid[0xf5] high level lint[0x1])
[ 1.913970][ T0] ACPI: X2APIC_NMI (uid[0xf6] high level lint[0x1])
[ 1.913974][ T0] ACPI: X2APIC_NMI (uid[0xf7] high level lint[0x1])
[ 1.914017][ T0] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
[ 1.914032][ T0] IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31
[ 1.914039][ T0] IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39
[ 1.914047][ T0] IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47
[ 1.914054][ T0] IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55
[ 1.914062][ T0] IOAPIC[5]: apic_id 15, version 32, address 0xfec20000, GSI 56-63
[ 1.[...]
[ 7.994567][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 8.006541][ T0] ACPI: Core revision 20200717
[ 8.019713][ T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
[ 8.029672][ T0] APIC: Switch to symmetric I/O mode setup
[ 8.035354][ T0] DMAR: Host address width 46
[ 8.039915][ T0] DMAR: DRHD base: 0x000000e5ffc000 flags: 0x0
[ 8.046095][ T0] DMAR: dmar0: reg_base_addr e5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 8.054840][ T0] DMAR: DRHD base: 0x000000e7ffc000 flags: 0x0
[ 8.060997][ T0] DMAR: dmar1: reg_base_addr e7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 8.069740][ T0] DMAR: DRHD base: 0x000000e9ffc000 flags: 0x0
[ 8.075872][ T0] DMAR: dmar2: reg_base_addr e9ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[ 8.084615][ T0] DMAR: DRHD base: 0x000000ebffc000 flags: 0x0
[ 8.090761][ T0] DMAR: dmar3: reg_base_addr ebffc000 ver 1:0 cap 8d2078c106f0466 ecap f[...]
[...]MAR-IR: Enabled IRQ remapping in x2apic mode
[ 8.513491][ T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[ 8.568289][ T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b3e459bf4c, max_idle_ns: 440795289890 ns
[ 8.579576][ T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 6000.00 BogoMIPS (lpj=30000000)
[ 8.589574][ T0] pid_max: default: 147456 minimum: 1152
[ 8.714025][ T0] efi: memattr: Entry attributes invalid: RO and XP bits both cleared
[ 8.719577][ T0] efi: memattr: ! 0x0000a057a000-0x0000a05b4fff [Runtime Code |RUN| | | | | | | | | | | | ]
[ 8.775355][ T0] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, vmalloc)
[ 8.798868][ T0] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, vmalloc)
[ 8.811550][ T0] Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[ 8.820076][ T0] Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[ 8.879327][ T0] mce: CPU0: Thermal mo[...]
[ 8.996916][ T1] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[ 8.999591][ T1] ... version: 4
[ 9.004310][ T1] ... bit width: 48
[ 9.009118][ T1] ... generic registers: 4
[ 9.009574][ T1] ... value mask: 0000ffffffffffff
[ 9.015601][ T1] ... max period: 00007fffffffffff
[ 9.019574][ T1] ... fixed-purpose events: 3
[ 9.024294][ T1] ... event mask: 000000070000000f
[ 9.034357][ T1] rcu: Hierarchical SRCU implementation.
[ 9.062516][ T5] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
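
Since the timestamps differ between the two boots, a plain diff of the logs is
noisy. A minimal Python sketch for comparing them, assuming the
"[ seconds][ Tn]" prefix format seen above; the function names are my own and
the regex is only checked against the lines in this mail:

```python
import difflib
import re

# Matches prefixes such as "[ 8.466614][ T0] " from a printk-caller enabled log.
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\[\s*[TC]\d+\]\s*")


def strip_prefix(line):
    """Drop the timestamp and caller id so identical messages compare equal."""
    return PREFIX.sub("", line.rstrip("\n"))


def diff_boot_logs(bad_lines, good_lines):
    """Return unified-diff lines (good -> bad) without surrounding context."""
    bad = [strip_prefix(l) for l in bad_lines]
    good = [strip_prefix(l) for l in good_lines]
    return list(difflib.unified_diff(good, bad, "good", "bad", lineterm="", n=0))


if __name__ == "__main__":
    # A few lines copied from the logs in this mail.
    bad = [
        "[ 8.003270][ T0] APIC: Switch to symmetric I/O mode setup",
        "[ 8.466614][ T0] BUG: kernel NULL pointer dereference, address: 0000000000000000",
    ]
    good = [
        "[ 8.029672][ T0] APIC: Switch to symmetric I/O mode setup",
        "[ 8.513491][ T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1",
    ]
    for line in diff_boot_logs(bad, good):
        print(line)
```

The common "APIC: Switch to symmetric I/O mode setup" line drops out and only
the point where the two boots diverge remains.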
>
> There are quite a bunch of issues to solve:
>
> - X86 does not use the device::msi_domain pointer for historical reasons
> and due to XEN, which makes it impossible to create an architecture
> agnostic device MSI infrastructure.
>
> - X86 has its own msi_alloc_info data type which is pointlessly
> different from the generic version and does not allow to share code.
>
> - The logic of composing MSI messages in a hierarchy is busted at the
> core level and of course some (x86) drivers depend on that.
>
> - A few minor shortcomings as usual
>
> This series addresses that in several steps:
>
> 1) Accidental bug fixes
>
> iommu/amd: Prevent NULL pointer dereference
>
> 2) Janitoring
>
> x86/init: Remove unused init ops
> PCI: vmd: Dont abuse vector irqomain as parent
> x86/msi: Remove pointless vcpu_affinity callback
>
> 3) Sanitizing the composition of MSI messages in a hierarchy
>
> genirq/chip: Use the first chip in irq_chip_compose_msi_msg()
> x86/msi: Move compose message callback where it belongs
>
> 4) Simplification of the x86 specific interrupt allocation mechanism
>
> x86/irq: Rename X86_IRQ_ALLOC_TYPE_MSI* to reflect PCI dependency
> x86/irq: Add allocation type for parent domain retrieval
> iommu/vt-d: Consolidate irq domain getter
> iommu/amd: Consolidate irq domain getter
> iommu/irq_remapping: Consolidate irq domain lookup
>
> 5) Consolidation of the X86 specific interrupt allocation mechanism to be
> as close as possible to the generic MSI allocation mechanism which allows
> to get rid of quite a bunch of x86'isms which are pointless
>
> x86/irq: Prepare consolidation of irq_alloc_info
> x86/msi: Consolidate HPET allocation
> x86/ioapic: Consolidate IOAPIC allocation
> x86/irq: Consolidate DMAR irq allocation
> x86/irq: Consolidate UV domain allocation
> PCI/MSI: Rework pci_msi_domain_calc_hwirq()
> x86/msi: Consolidate MSI allocation
> x86/msi: Use generic MSI domain ops
>
> 6) x86 specific cleanups to remove the dependency on arch_*_msi_irqs()
>
> x86/irq: Move apic_post_init() invocation to one place
> x86/pci: Reducde #ifdeffery in PCI init code
> x86/irq: Initialize PCI/MSI domain at PCI init time
> irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
> PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI
> PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
> x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
> x86/xen: Rework MSI teardown
> x86/xen: Consolidate XEN-MSI init
> irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
> x86/xen: Wrap XEN MSI management into irqdomain
> iommm/vt-d: Store irq domain in struct device
> iommm/amd: Store irq domain in struct device
> x86/pci: Set default irq domain in pcibios_add_device()
> PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
> x86/irq: Cleanup the arch_*_msi_irqs() leftovers
> x86/irq: Make most MSI ops XEN private
> iommu/vt-d: Remove domain search for PCI/MSI[X]
> iommu/amd: Remove domain search for PCI/MSI
>
> 7) X86 specific preparation for device MSI
>
> x86/irq: Add DEV_MSI allocation type
> x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
>
> 8) Generic device MSI infrastructure
>
> platform-msi: Provide default irq_chip:: Ack
> genirq/proc: Take buslock on affinity write
> genirq/msi: Provide and use msi_domain_set_default_info_flags()
> platform-msi: Add device MSI infrastructure
> irqdomain/msi: Provide msi_alloc/free_store() callbacks
>
> 9) POC of IMS (Interrupt Message Storm) irq domain and irqchip
> implementations for both device array and queue storage.
>
> irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING
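
On step 3 above ("genirq/chip: Use the first chip in
irq_chip_compose_msi_msg()"): going by the patch title, the idea is to walk the
interrupt hierarchy from the given level towards its parents and let the first
chip that has a compose callback build the message. The toy Python model below
only illustrates that lookup order. The class and function names are
hypothetical stand-ins, not the kernel's C structures, and the three-level
hierarchy is an assumed example.

```python
class IrqChip:
    """Stand-in for an irq chip; 'compose' plays the role of the compose callback."""

    def __init__(self, name, compose=None):
        self.name = name
        self.compose = compose


class IrqData:
    """Stand-in for one level of the hierarchy; 'parent' points one level up."""

    def __init__(self, chip, parent=None):
        self.chip = chip
        self.parent = parent


def compose_msi_msg(data):
    """Use the first chip, starting at 'data', that provides a compose callback."""
    while data is not None:
        if data.chip is not None and data.chip.compose is not None:
            return data.chip.compose(data)
        data = data.parent
    raise LookupError("no chip in the hierarchy can compose an MSI message")


if __name__ == "__main__":
    # Assumed example hierarchy: MSI chip -> remapping chip -> vector chip.
    vector = IrqData(IrqChip("vector", compose=lambda d: "vector"))
    remap = IrqData(IrqChip("remap", compose=lambda d: "remap"), parent=vector)
    msi = IrqData(IrqChip("msi"), parent=remap)
    print(compose_msi_msg(msi))
```

In this model the remapping level wins because it is the first level above the
MSI chip with a callback; a lookup that kept walking to the last chip would
pick the vector level instead.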
>
> Changes vs. V1:
>
> - Addressed various review comments and addressed the 0day fallout.
> - Corrected the XEN logic (Jürgen)
> - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)
>
> - Fixed the compose MSI message inconsistency
>
> - Ensure that the necessary flags are set for device MSI
>
> - Make the irq bus logic work for affinity setting to prepare
> support for IMS storage in queue memory. It turned out to be
> less scary than I feared.
>
> - Remove leftovers in iommu/intel|amd
>
> - Reworked the IMS POC driver to cover queue storage so Jason can have a
> look whether that fits the needs of MLX devices.
>
> The whole lot is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi
>
> This has been tested on Intel/AMD/KVM but lacks testing on:
>
> - HYPERV (-ENODEV)
> - VMD enabled systems (-ENODEV)
> - XEN (-ENOCLUE)
> - IMS (-ENODEV)
>
> - Any non-X86 code which might depend on the broken compose MSI message
> logic. Marc expects not much fallout, but agrees that we need to fix
> it anyway.
>
> #1 - #3 should be applied unconditionally for obvious reasons
> #4 - #6 are worthwhile cleanups which should be done independent of device MSI
>
> #7 - #8 look promising to cleanup the platform MSI implementation
> independent of #8, but I neither had cycles nor the stomach to
> tackle that.
>
> #9 is obviously just for the folks interested in IMS
>
> Thanks,
>
> tglx