[PATCH] x86/smpboot: Fix uncore_pci_remove() indexing bug when hot-removing a physical CPU

From: Ingo Molnar
Date: Tue Feb 13 2018 - 06:49:20 EST



* Masayoshi Mizuma <msys.mizuma@xxxxxxxxx> wrote:

> From: Masayoshi Mizuma <m.mizuma@xxxxxxxxxxxxxx>
>
> When a physical CPU is hot-removed, the following warning message
> is shown while the uncore device is being removed in uncore_pci_remove():
>
> WARNING: CPU: 120 PID: 5 at arch/x86/events/intel/uncore.c:988
> uncore_pci_remove+0xf1/0x110
> ...
> CPU: 120 PID: 5 Comm: kworker/u1024:0 Not tainted 4.15.0-rc8 #1
> Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> ...
> Call Trace:
> pci_device_remove+0x36/0xb0
> device_release_driver_internal+0x145/0x210
> pci_stop_bus_device+0x76/0xa0
> pci_stop_root_bus+0x44/0x60
> acpi_pci_root_remove+0x1f/0x80
> acpi_bus_trim+0x54/0x90
> acpi_bus_trim+0x2e/0x90
> acpi_device_hotplug+0x2bc/0x4b0
> acpi_hotplug_work_fn+0x1a/0x30
> process_one_work+0x141/0x340
> worker_thread+0x47/0x3e0
> kthread+0xf5/0x130
>
> When uncore_pci_remove() runs, it tries to get the package id in order
> to clear the matching entry of uncore_extra_pci_dev[].dev[] by using
> topology_phys_to_logical_pkg(). The warning message is
> shown because topology_phys_to_logical_pkg() returns -1.
>
> arch/x86/events/intel/uncore.c:
> static void uncore_pci_remove(struct pci_dev *pdev)
> {
> 	...
> 	phys_id = uncore_pcibus_to_physid(pdev->bus);
> 	...
> 	pkg = topology_phys_to_logical_pkg(phys_id); // returns -1
> 	for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
> 		if (uncore_extra_pci_dev[pkg].dev[i] == pdev) {
> 			uncore_extra_pci_dev[pkg].dev[i] = NULL;
> 			break;
> 		}
> 	}
> 	WARN_ON_ONCE(i >= UNCORE_EXTRA_PCI_DEV_MAX); // HERE!!
>
> topology_phys_to_logical_pkg() tries to find
> cpuinfo_x86->phys_proc_id that matches the phys_pkg argument.
>
> arch/x86/kernel/smpboot.c:
> int topology_phys_to_logical_pkg(unsigned int phys_pkg)
> {
> 	int cpu;
>
> 	for_each_possible_cpu(cpu) {
> 		struct cpuinfo_x86 *c = &cpu_data(cpu);
>
> 		if (c->initialized && c->phys_proc_id == phys_pkg)
> 			return c->logical_proc_id;
> 	}
> 	return -1;
> }
>
> However, phys_proc_id has already been set to 0 by remove_siblinginfo()
> when the CPU was offlined.
> So topology_phys_to_logical_pkg() cannot find the correct
> logical_proc_id and always returns -1.
> As a result, uncore_pci_remove() hits the WARN_ON_ONCE() and the warning
> message is shown.
>
> To avoid this, remove the clearing of phys_proc_id from
> remove_siblinginfo(). This has no side effects on hot-removal, because
> phys_proc_id is not used after the CPU is hot-removed and it is set
> again during hot-add.
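
For illustration only, the lookup failure described above can be modelled in plain userspace C. This is a sketch, not kernel code: the model_* names, the four-CPU table and the assumption that 'initialized' stays set after offlining are all hypothetical and exist only to show how zeroing phys_proc_id makes the package lookup miss.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4	/* hypothetical: two packages, two CPUs each */

/* Hypothetical stand-in for the three cpuinfo_x86 fields the lookup reads. */
struct model_cpuinfo {
	bool initialized;
	unsigned int phys_proc_id;
	int logical_proc_id;
};

static struct model_cpuinfo model_cpus[MODEL_NR_CPUS] = {
	{ true, 0, 0 }, { true, 0, 0 },	/* physical package 0 */
	{ true, 1, 1 }, { true, 1, 1 },	/* physical package 1 */
};

/* Same search as the quoted topology_phys_to_logical_pkg(), over the model table. */
static int model_phys_to_logical_pkg(unsigned int phys_pkg)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		const struct model_cpuinfo *c = &model_cpus[cpu];

		if (c->initialized && c->phys_proc_id == phys_pkg)
			return c->logical_proc_id;
	}
	return -1;
}

/* Models only the one side effect of offlining described in the changelog. */
static void model_offline_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	model_cpus[cpu].phys_proc_id = 0;
}
```

In this model, looking up physical package 1 succeeds while its CPUs still carry their id, and returns -1 once model_offline_cpu() has been called on CPUs 2 and 3.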

So I think this fix goes beyond fixing a 'warning', if we get -1 for 'pkg':

> 	pkg = topology_phys_to_logical_pkg(phys_id); // returns -1
> 	for (i = 0; i < UNCORE_EXTRA_PCI_DEV_MAX; i++) {
> 		if (uncore_extra_pci_dev[pkg].dev[i] == pdev) {
> 			uncore_extra_pci_dev[pkg].dev[i] = NULL;

... then that creates two _real_ bugs AFAICS:

1) we dereference uncore_extra_pci_dev[] with a negative index

2) we fail to clean up a stale pointer in uncore_extra_pci_dev[][]
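
Both problems can be seen in a small userspace model. This is only a hedged sketch: the model_* names and table sizes are hypothetical, and the range check is not what the patch does (the patch fixes the lookup itself). It just shows that a pkg of -1 must never be used as an index, and that the slot stays populated when the lookup fails.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PKGS		2	/* hypothetical table sizes */
#define MODEL_EXTRA_DEV_MAX	4

/* Hypothetical stand-in for one uncore_extra_pci_dev[] entry. */
struct model_extra_devs {
	const void *dev[MODEL_EXTRA_DEV_MAX];
};

static struct model_extra_devs model_extra[MODEL_MAX_PKGS];

/*
 * Clear the slot that holds pdev for the given logical package.
 * Returns true if a slot was cleared. A pkg outside the table
 * (such as -1 from a failed lookup) is rejected before indexing.
 */
static bool model_clear_extra_dev(int pkg, const void *pdev)
{
	if (pkg < 0 || pkg >= MODEL_MAX_PKGS)
		return false;

	for (int i = 0; i < MODEL_EXTRA_DEV_MAX; i++) {
		if (model_extra[pkg].dev[i] == pdev) {
			model_extra[pkg].dev[i] = NULL;
			return true;
		}
	}
	return false;
}
```

Calling model_clear_extra_dev(-1, pdev) returns false and leaves the pointer in place, which is the stale entry from 2); without the range check the same call would read outside the array, which is 1).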

So I've rewritten your changelog accordingly - see the attached patch.

I have also added a Cc: stable tag.

Thanks,

Ingo

===================>