Re: [PATCH 6/7] ppc64: EEH Avoid racing reports of errors

From: linas
Date: Fri Oct 07 2005 - 10:23:30 EST


On Wed, Oct 05, 2005 at 09:23:11PM +1000, Paul Mackerras was heard to remark:
> Linas writes:
>
> > 06-eeh-report-race.patch
>
> Shouldn't you pass in pe_dn->child here, or
> alternatively rearrange __eeh_mark_slot to do the node you give it
> plus its children (recursively)?

Yes; that's right; this gets fixed in a later patch in the series.
I guess this one snuck by while I was trying to sync up all the
different patches I was carrying :-/
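
For illustration, the second alternative (mark the node you are given
plus its children, recursively) might look roughly like the sketch
below. This is a stand-alone example, not the code from the patch:
struct node, its child/sibling/mode fields, and mark_slot() are
made-up stand-ins for the real device-tree structures and for
__eeh_mark_slot.

```c
#include <stddef.h>

/* Hypothetical stand-in for a device-tree node (not the kernel's struct). */
struct node {
	struct node *child;    /* first child, or NULL */
	struct node *sibling;  /* next node at the same level, or NULL */
	int mode;              /* per-node flag bits */
};

/* Set mode_flag on dn itself and on every node below it. */
static void mark_slot(struct node *dn, int mode_flag)
{
	struct node *c;

	if (dn == NULL)
		return;

	dn->mode |= mode_flag;

	/* Walk the children only; siblings of dn itself are left alone. */
	for (c = dn->child; c != NULL; c = c->sibling)
		mark_slot(c, mode_flag);
}
```

Written this way, the caller passes in the top node of the slot and
does not need to know whether to hand over the node or its child.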

> Two other comments about __eeh_mark_slot: (1) despite the comment, the
> function doesn't do anything to any pci_dev or pci_driver

The comment is also a "back port" of a function that shows up in a later
patch, and so is indeed inappropriate for this patch. Again, my excuse
is that I got sloppy while juggling all of these patchlets. Sorry.

> (not that it
> should be touching any pci_driver),

One problem I was seeing was that after getting an EEH error,
some device drivers would start spinning in their interrupt handlers.
I tried to break out of this spin-loop by adding a call to a
function that asked "am I the victim of an EEH event"?
Unfortunately, the first implementation of this call was not
interrupt safe (pci_device_to_OF_node calls traverse_pci_devices).
While scratching my head on how best to fix this, I decided that
the best thing to do would be to mark up the pci driver with a flag;
that way, the driver can look up the EEH state without any further ado.
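
A rough sketch of the idea, as a stand-alone illustration only: the
error path sets a flag, and a polling loop tests it on each pass
instead of walking the device tree. Every name here (struct
fake_driver, eeh_failed, fake_wait_ready and friends) is hypothetical,
and C11 atomics stand in for whatever primitive the kernel code would
actually use.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-driver state; not the real struct pci_driver. */
struct fake_driver {
	atomic_bool eeh_failed;   /* set once an EEH error has been reported */
};

/* Called from the (hypothetical) error-reporting path. */
static void fake_mark_failed(struct fake_driver *drv)
{
	atomic_store(&drv->eeh_failed, true);
}

/* Cheap check: a single load, no tree traversal. */
static bool fake_is_failed(struct fake_driver *drv)
{
	return atomic_load(&drv->eeh_failed);
}

/*
 * Poll a status word until ready_bit is set.
 * Returns 0 when ready, -1 if the EEH flag is seen, -2 after max_polls tries.
 */
static int fake_wait_ready(struct fake_driver *drv,
			   const volatile unsigned *status,
			   unsigned ready_bit, int max_polls)
{
	int i;

	for (i = 0; i < max_polls; i++) {
		if (fake_is_failed(drv))
			return -1;        /* break out of the spin */
		if (*status & ready_bit)
			return 0;
	}
	return -2;
}
```

The bound on the number of polls is an addition of this sketch; the
point is only that the flag test is a single memory read.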

One might be able to get rid of this state in pci_driver,
although it seemed generically useful to have. For example,
later on, I futzed with a version that disabled the irq line
for that adapter "as soon as possible", and that seems to also
work, at least on an SMP machine. On a non-SMP machine, there
is still the danger that the device driver is spinning with
interrupts disabled, waiting on a status register to change,
that will never change. (And because of the deadlock, the
code to disable a given irq line never runs). It all
depends on how the device driver was written.

> and (2) a recursive function can't
> really be inline

Well, no, but at least the first level call can be inlined; I assumed
that gcc would do at least that, but didn't check.
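
One way to get that effect without depending on what gcc decides is
to keep the recursion in an ordinary function and make only the entry
point inline. A minimal sketch, with made-up names (struct tnode,
tree_mark, tree_mark_children) rather than the ones in the patch:

```c
#include <stddef.h>

/* Hypothetical tree node, for illustration only. */
struct tnode {
	struct tnode *child;    /* first child, or NULL */
	struct tnode *sibling;  /* next node at the same level, or NULL */
	int mode;               /* per-node flag bits */
};

/* Out-of-line recursive part: marks every descendant of dn. */
static void tree_mark_children(struct tnode *dn, int flag)
{
	struct tnode *c;

	for (c = dn->child; c != NULL; c = c->sibling) {
		c->mode |= flag;
		tree_mark_children(c, flag);
	}
}

/* Inline entry point: handles the top node, then hands off the recursion. */
static inline void tree_mark(struct tnode *dn, int flag)
{
	if (dn == NULL)
		return;
	dn->mode |= flag;
	tree_mark_children(dn, flag);
}
```

Note that "inline" is still only a hint; the compiler may ignore it
even for the non-recursive wrapper.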

--linas
