Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline

From: Thomas Gleixner
Date: Mon May 13 2024 - 08:44:35 EST

Next message: Andreas Hindborg: "Re: [PATCH 1/3] rust: block: introduce `kernel::block::mq` module"
Previous message: Dietmar Eggemann: "Re: [PATCH v3] sched: Consolidate cpufreq updates"
In reply to: Dave Hansen: "Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline"
Next in thread: Dongli Zhang: "Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline"

On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
> The absence of IRQD_MOVE_PCNTXT prevents immediate effectiveness of
> interrupt affinity reconfiguration via procfs. Instead, the change is
> deferred until the next instance of the interrupt being triggered on the
> original CPU.
>
> When the interrupt next triggers on the original CPU, the new affinity is
> enforced within __irq_move_irq(). A vector is allocated from the new CPU,
> but if the old vector on the original CPU remains online, it is not
> immediately reclaimed. Instead, apicd->move_in_progress is flagged, and the
> reclaiming process is delayed until the next trigger of the interrupt on
> the new CPU.
>
> Upon the subsequent triggering of the interrupt on the new CPU,
> irq_complete_move() adds a task to the old CPU's vector_cleanup list if it
> remains online. Subsequently, the timer on the old CPU iterates over its
> vector_cleanup list, reclaiming vectors.
>
> However, if the old CPU is offline before the interrupt triggers again on
> the new CPU, irq_complete_move() simply resets both apicd->move_in_progress
> and apicd->prev_vector to 0. Consequently, the vector remains unreclaimed
> in vector_matrix, resulting in a CPU vector leak.

I doubt that.

Any interrupt which is affine to an outgoing CPU is migrated and
eventually pending moves are enforced:

  cpu_down()
    ...
    cpu_disable_common()
      fixup_irqs()
        irq_migrate_all_off_this_cpu()
          migrate_one_irq()
            irq_force_complete_move()
              free_moved_vector();

No?
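
The effect of that call chain can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions, not kernel code: every toy_* name is hypothetical, the per-CPU bitmap merely stands in for the vector matrix, and the offline path is reduced to "finish every pending move away from the outgoing CPU".

```c
#include <stdbool.h>

#define TOY_NR_CPUS    2
#define TOY_NR_VECTORS 4

/* Hypothetical stand-in for the per-CPU part of the vector matrix. */
static bool toy_allocated[TOY_NR_CPUS][TOY_NR_VECTORS];

/* Hypothetical stand-in for the per-interrupt APIC data. */
struct toy_irq {
	int cpu, vector;
	int prev_cpu, prev_vector;
	bool move_in_progress;
};

/* Take the first free vector on @cpu, or return -1 if none is left. */
static int toy_alloc_vector(int cpu)
{
	for (int v = 0; v < TOY_NR_VECTORS; v++) {
		if (!toy_allocated[cpu][v]) {
			toy_allocated[cpu][v] = true;
			return v;
		}
	}
	return -1;
}

/* Count the vectors currently allocated on @cpu. */
static int toy_count_allocated(int cpu)
{
	int n = 0;

	for (int v = 0; v < TOY_NR_VECTORS; v++)
		n += toy_allocated[cpu][v];
	return n;
}

/* Start a move: allocate on the new CPU, keep the old vector for now. */
static void toy_start_move(struct toy_irq *irq, int new_cpu)
{
	irq->prev_cpu = irq->cpu;
	irq->prev_vector = irq->vector;
	irq->cpu = new_cpu;
	irq->vector = toy_alloc_vector(new_cpu);
	irq->move_in_progress = true;
}

/* Models the effect of free_moved_vector(): release the old vector now. */
static void toy_force_complete_move(struct toy_irq *irq)
{
	if (!irq->move_in_progress)
		return;
	toy_allocated[irq->prev_cpu][irq->prev_vector] = false;
	irq->prev_vector = -1;
	irq->move_in_progress = false;
}

/* Models the offline path: complete every pending move away from @cpu. */
static void toy_cpu_offline(int cpu, struct toy_irq *irqs, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (irqs[i].move_in_progress && irqs[i].prev_cpu == cpu)
			toy_force_complete_move(&irqs[i]);
	}
}

/* Move one interrupt from CPU 0 to CPU 1, then take CPU 0 offline. */
int toy_demo_offline(void)
{
	struct toy_irq irq = { .cpu = 0, .prev_cpu = -1, .prev_vector = -1 };

	irq.vector = toy_alloc_vector(0);
	toy_start_move(&irq, 1);
	toy_cpu_offline(0, &irq, 1);

	/* Vectors still held on the outgoing CPU; 0 means nothing leaked. */
	return toy_count_allocated(0);
}
```

In this model toy_demo_offline() returns 0: the old vector on CPU 0 is released in the offline path itself, so no later interrupt on the new CPU is needed to reclaim it.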

In fact irq_complete_move() should never see apicd->move_in_progress
with apicd->prev_cpu pointing to an offline CPU.

The CPU offline case in __vector_schedule_cleanup() should not even
exist or at least just emit a warning.
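
One way to read "just emit a warning" is sketched below as a user-space model. It is an assumption-laden illustration, not the kernel implementation: all sketch_* names are hypothetical, the warning is a plain fprintf(), and the model deliberately leaves the pending state untouched when the old CPU is offline so that the broken invariant stays visible.

```c
#include <stdbool.h>
#include <stdio.h>

#define SKETCH_NR_CPUS 2

/* Hypothetical model state; none of these names exist in the kernel. */
static bool sketch_cpu_online[SKETCH_NR_CPUS] = { true, true };
static int sketch_cleanup_queued[SKETCH_NR_CPUS];
static int sketch_warnings;

/* Hypothetical stand-in for the per-interrupt APIC data. */
struct sketch_apic_data {
	int prev_cpu;
	int prev_vector;
	bool move_in_progress;
};

/*
 * Returns true if cleanup was queued on the old CPU. An offline old CPU
 * is treated as a broken invariant and reported instead of being
 * silently accepted.
 */
static bool sketch_schedule_cleanup(struct sketch_apic_data *d)
{
	if (!d->move_in_progress)
		return false;

	if (!sketch_cpu_online[d->prev_cpu]) {
		sketch_warnings++;
		fprintf(stderr, "warning: pending move references offline CPU %d\n",
			d->prev_cpu);
		return false;
	}

	sketch_cleanup_queued[d->prev_cpu]++;
	d->move_in_progress = false;
	return true;
}

/* Construct the case that should never happen and count the warnings. */
int sketch_demo_offline_prev_cpu(void)
{
	struct sketch_apic_data d = {
		.prev_cpu = 0, .prev_vector = 3, .move_in_progress = true,
	};

	sketch_cpu_online[0] = false;
	sketch_schedule_cleanup(&d);
	return sketch_warnings;
}
```

If the hotplug migration path always completes pending moves before the CPU goes away, the warning branch in this sketch is unreachable; hitting it would point at the migration code rather than at the cleanup scheduling.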

If you can trigger that case, then there is something fundamentally
wrong with the CPU hotplug interrupt migration code and that needs to be
investigated and fixed.

Thanks,

tglx
