Re: [RESEND] [PATCH v2] [BUGFIX] x86/x86_64: fix CPU offliningtriggered "inactive" device IRQ interrruption

From: Gary Hade
Date: Thu Jun 04 2009 - 12:39:18 EST


On Wed, Jun 03, 2009 at 04:55:26AM -0700, Eric W. Biederman wrote:
> Gary Hade <garyhade@xxxxxxxxxx> writes:
>
> > Impact: Eliminates a race that can leave the system in an
> > unusable state
> >
> > During rapid offlining of multiple CPUs there is a chance
> > that an IRQ affinity move destination CPU will be offlined
> > before the IRQ affinity move initiated during the offlining
> > of a previous CPU completes. This can happen when the device
> > is not very active and thus fails to generate the IRQ that is
> > needed to complete the IRQ affinity move before the move
> > destination CPU is offlined. When this happens there is an
> > -EBUSY return from __assign_irq_vector() during the offlining
> > of the IRQ move destination CPU which prevents initiation of
> > a new IRQ affinity move operation to an online CPU. This
> > leaves the IRQ affinity set to an offlined CPU.
> >
> > I have been able to reproduce the problem on some of our
> > systems using the following script. When the system is idle
> > the problem often reproduces during the first CPU offlining
> > sequence.
>
> Nacked-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>
> fixup_irqs() is broken for allowing such a thing.

When fixup_irqs() calls the set_affinity function:
...
if (desc->chip->set_affinity)
desc->chip->set_affinity(irq, affinity);
...
it receives no feedback so it obviously expects the set_affinity
function or it's called functions to do the right thing by preventing
or correctly handling any problems that should arise. In the case of
this bug there is obviously a problem happening during the set_affinity
function call that needs to be resolved and/or properly handled.

When you made your "x86_64 irq: Safely cleanup an irq after moving it."
changes (re: http://lkml.org/lkml/2007/2/23/92) you added the check
to __assign_irq_vector() that causes it to return -EBUSY if the
migration of the IRQ is still in progress:
+ if ((cfg->move_in_progress) || cfg->move_cleanup_count)
+ return -EBUSY;
+
However, you did not add any code to other functions on the
call stack to properly deal with this error. When doing this
you may have assumed (as I may have also assumed) that the underlying
code was solid enough that the handling was not needed. Unfortunately,
you apparently did not anticipate the case where an idle or relatively
idle device may not generate the IRQ needed to complete the move
before the CPU that is still handling that IRQ is offlined.

My fix only addresses the issue that caused the -EBUSY return
and subsequent mess. It does not address the omitted handling
for this error condition. If we were to add the handling to
fixup_irq() and the arch and non-arch specific functions above
it on the call stack as you may be suggesting, it would be quite
involved because of all the things that would need to be undone.

I am not certain that my fix plugs the very last hole that could
cause the -EBUSY return from __assign_irq_vector() so maybe we
should at least add a warning or BUG_ON to make the unhandled
error more obvious in the future. I would be happy to provide
this via a separate patch.

Thanks,
Gary

--
Gary Hade
System x Enablement
IBM Linux Technology Center
503-578-4503 IBM T/L: 775-4503
garyhade@xxxxxxxxxx
http://www.ibm.com/linux/ltc

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/