Re: Device hang when offlining a CPU due to IRQ misrouting

From: Eric W. Biederman
Date: Sat Jun 23 2007 - 20:45:59 EST


Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sun, 24 Jun 2007 01:54:52 +0200 "Rafael J. Wysocki" <rjw@xxxxxxx> wrote:
>
>> On Wednesday, 20 June 2007 00:08, Siddha, Suresh B wrote:
>> > On Tue, Jun 19, 2007 at 01:49:30PM -0700, Darrick J. Wong wrote:
>> > >
>> > > This fixes the problem! Hurrah!
>> >
>> > Great! Andrew, please include the appended patch in -mm.
>> >
>> > ----
>> > Subject: [patch] x86_64, irq: use mask/unmask and proper locking in fixup_irqs
>> > From: Suresh Siddha <suresh.b.siddha@xxxxxxxxx>
>> >
>> > Force irq migration path during cpu offline, is not using proper
>> > locks and irq_chip mask/unmask routines. This will result in
>> > some races(especially the device generating the interrupt can see
>> > some inconsistent state, resulting in issues like stuck irq,..).
>> >
>> > Appended patch fixes the issue by taking proper lock and
>> > encapsulating irq_chip set_affinity() with a mask() before and an
>> > unmask() after.
>> >
>> > This fixes a MSI irq stuck issue reported by Darrick Wong.
>> >
>> > There are several more general bugs in this area(irq migration in the
>> > process context). For example,
>> >
>> > 1. Possibility of missing edge triggered irq.
>> > 2. Reliable method of migrating level triggered irq in the process context.
>> >
>> > We plan to look and close these in the near future.
>>
>> This patch breaks hibernation on my Turion 64 X2 - based testbox (HPC nx6325).
>>
>> _cpu_down() just hangs as though there were a deadlock in there, 100% of the
>> time.
>>
>
> Thanks, I dropped it.

Hmm. It looks like Siddha sent the wrong version of the patch.
The working tested version had an additional test to ensure
the mask and unmask methods were implemented.

i.e.
+	if (irq_desc[irq].chip->mask)
+		irq_desc[irq].chip->mask(irq);
and

+	if (irq_desc[irq].chip->unmask)
+		irq_desc[irq].chip->unmask(irq);
+
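
To illustrate why the two checks matter, here is a minimal userspace sketch,
not the kernel patch itself. It assumes a chip whose mask and unmask methods
may be left NULL; calling through such a pointer unguarded would crash or
hang. The names mock_irq_chip, mock_move_irq and the call counters are
hypothetical and exist only for this example.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct irq_chip. */
struct mock_irq_chip {
	void (*mask)(unsigned int irq);		/* may be NULL */
	void (*unmask)(unsigned int irq);	/* may be NULL */
	void (*set_affinity)(unsigned int irq, unsigned long cpumask);
};

/* Counters so a caller can observe which methods actually ran. */
static int mask_calls, unmask_calls, affinity_calls;

static void mock_mask(unsigned int irq) { (void)irq; mask_calls++; }
static void mock_unmask(unsigned int irq) { (void)irq; unmask_calls++; }
static void mock_set_affinity(unsigned int irq, unsigned long cpumask)
{
	(void)irq;
	(void)cpumask;
	affinity_calls++;
}

/*
 * Retarget one irq: mask it if the chip can, change the affinity,
 * then unmask it if the chip can. Returns 1 if the affinity was
 * changed, 0 if the chip has no set_affinity method.
 */
static int mock_move_irq(const struct mock_irq_chip *chip,
			 unsigned int irq, unsigned long cpumask)
{
	if (!chip->set_affinity)
		return 0;

	if (chip->mask)			/* guard: method is optional */
		chip->mask(irq);

	chip->set_affinity(irq, cpumask);

	if (chip->unmask)		/* guard: method is optional */
		chip->unmask(irq);

	return 1;
}
```

A chip set up as { NULL, NULL, mock_set_affinity } still gets its affinity
changed; the mask and unmask steps are simply skipped. The real code would
also need to hold the irq descriptor lock around this sequence, which the
sketch leaves out.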

Siddha, do you think you can resend the correct version?

Rafael, do you think you can add those two ifs and see if your test box
works?

I'm still not convinced that we can make fixup_irqs work in general,
but if we aren't going to yank it, we should at least make it
consistent with the rest of the code.

Eric
