Re: [PATCH 1/1] x86/vector: Fix vector leak during CPU offline
From: Dongli Zhang
Date: Wed May 15 2024 - 15:51:57 EST
On 5/13/24 3:46 PM, Thomas Gleixner wrote:
> On Mon, May 13 2024 at 10:43, Dongli Zhang wrote:
>> On 5/13/24 5:44 AM, Thomas Gleixner wrote:
>>> On Fri, May 10 2024 at 12:06, Dongli Zhang wrote:
>>> Any interrupt which is affine to an outgoing CPU is migrated and
>>> eventually pending moves are enforced:
>>>
>>> cpu_down()
>>> ...
>>> cpu_disable_common()
>>> fixup_irqs()
>>> irq_migrate_all_off_this_cpu()
>>> migrate_one_irq()
>>> irq_force_complete_move()
>>> free_moved_vector();
>>>
>>> No?
>>
>> I noticed this and finally abandoned the solution to fix at migrate_one_irq(),
>> because:
>>
>> 1. The objective of migrate_one_irq()-->irq_force_complete_move() looks to
>> cleanup before irq_do_set_affinity().
>>
>> 2. The irq_needs_fixup() may return false so that irq_force_complete_move() does
>> not get the chance to trigger.
>>
>> 3. Even irq_force_complete_move() is triggered, it exits early if
>> apicd->prev_vector==0.
>
> But that's not the case, really.
>
>> The apicd->prev_vector can be cleared by __vector_schedule_cleanup() because
>> cpu_disable_common() releases the vector_lock after CPU is flagged offline.
>
> Nothing can schedule vector cleanup at that point because _all_ other
> CPUs spin in stop_machine() with interrupts disabled and therefore
> cannot handle interrupts which might invoke it.
Thank you very much for the explanation! Now I see why to drop the vector_lock
is not an issue.
[snip]
>
>> diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
>> index 1ed2b1739363..5ecd072a34fe 100644
>> --- a/kernel/irq/cpuhotplug.c
>> +++ b/kernel/irq/cpuhotplug.c
>> @@ -69,6 +69,14 @@ static bool migrate_one_irq(struct irq_desc *desc)
>> return false;
>> }
>>
>> + /*
>> + * Complete an eventually pending irq move cleanup. If this
>> + * interrupt was moved in hard irq context, then the vectors need
>> + * to be cleaned up. It can't wait until this interrupt actually
>> + * happens and this CPU was involved.
>> + */
>> + irq_force_complete_move(desc);
>
> You cannot do that here because it is only valid when the interrupt is
> affine to the outgoing CPU.
>
> In the problem case the interrupt was affine to the outgoing CPU, but
> the core code does not know that it has not been cleaned up yet. It does
> not even know that the interrupt was affine to the outgoing CPU before.
>
> So in principle we could just go and do:
>
> } else {
> - apicd->prev_vector = 0;
> + free_moved_vector(apicd);
> }
> raw_spin_unlock(&vector_lock);
>
> but that would not give enough information and would depend on the
> interrupt to actually arrive.
>
> irq_force_complete_move() already describes this case, but obviously it
> does not work because the core code does not invoke it in that
> situation.
>
> So yes, moving the invocation of irq_force_complete_move() before the
> irq_needs_fixup() call makes sense, but it wants this to actually work
> correctly:
>
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -1036,7 +1036,8 @@ static void __vector_schedule_cleanup(st
> add_timer_on(&cl->timer, cpu);
> }
> } else {
> - apicd->prev_vector = 0;
> + pr_warn("IRQ %u schedule cleanup for offline CPU %u\n", apicd->irq, cpu);
> + free_moved_vector(apicd);
> }
> raw_spin_unlock(&vector_lock);
> }
> @@ -1097,10 +1098,11 @@ void irq_force_complete_move(struct irq_
> goto unlock;
>
> /*
> - * If prev_vector is empty, no action required.
> + * If prev_vector is empty or the descriptor was previously
> + * not on the outgoing CPU no action required.
> */
> vector = apicd->prev_vector;
> - if (!vector)
> + if (!vector || apicd->prev_cpu != smp_processor_id())
> goto unlock;
>
The above may not work. migrate_one_irq() relies on irq_force_complete_move() to
always reclaim the apicd->prev_vector. Otherwise, the call of
irq_do_set_affinity() later may return -EBUSY.
801 /*
802 * Careful here. @apicd might either have move_in_progress set or
803 * be enqueued for cleanup. Assigning a new vector would either
804 * leave a stale vector on some CPU around or in case of a pending
805 * cleanup corrupt the hlist.
806 */
807 if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist))
808 return -EBUSY;
Or maybe add a flag to control whether to enforce a cleanup in any conditions?
-void irq_force_complete_move(struct irq_desc *desc)
+void irq_force_complete_move(struct irq_desc *desc, bool always_force)
{
struct apic_chip_data *apicd;
struct irq_data *irqd;
@@ -1102,6 +1103,9 @@ void irq_force_complete_move(struct irq_desc *desc)
if (!vector)
goto unlock;
+ if (!always_force && apicd->prev_cpu != smp_processor_id())
+ goto unlock;
+
/*
* This is tricky. If the cleanup of the old vector has not been
* done yet, then the following setaffinity call will fail with
.. and call irq_force_complete_move() at different places?
@@ -79,6 +79,11 @@ static bool migrate_one_irq(struct irq_desc *desc)
* interrupt.
*/
if (irqd_is_per_cpu(d) || !irqd_is_started(d) || !irq_needs_fixup(d)) {
+ /*
+ * Complete an eventually pending irq move cleanup if this
+ * interrupt was affine to the outgoing CPU.
+ */
+ irq_force_complete_move(desc, false);
/*
* If an irq move is pending, abort it if the dying CPU is
* the sole target.
@@ -93,7 +98,7 @@ static bool migrate_one_irq(struct irq_desc *desc)
* to be cleaned up. It can't wait until this interrupt actually
* happens and this CPU was involved.
*/
- irq_force_complete_move(desc);
+ irq_force_complete_move(desc, true);
/*
* If there is a setaffinity pending, then try to reuse the pending
Thank you very much!
Dongli Zhang