RE: Problems with EDAC coexisting with BIOS
From: Gross, Mark
Date: Wed May 03 2006 - 16:48:58 EST
>-----Original Message-----
>From: Tim Small [mailto:tim@xxxxxxxxxxxxxxxx]
>Sent: Wednesday, May 03, 2006 1:25 PM
>To: Alan Cox
>Cc: Ong, Soo Keong; Gross, Mark; bluesmoke-devel@xxxxxxxxxxxxxxxxxxxxx;
>LKML; Carbonari, Steven; Wang, Zhenyu Z
>Subject: Re: Problems with EDAC coexisting with BIOS
>
>Alan Cox wrote:
>
>>On Llu, 2006-04-24 at 22:15 +0800, Ong, Soo Keong wrote:
>>
>>
>>>To me, periodical is not a good design for error handling, it wastes
>>>transaction bandwidth that should be used for other more productive
>>>purposes.
>>>
>>>
>>
>>The periodical choice is mostly down to the brain damaged choice of
NMI
>>as the viable alternative, which is as good as 'not usable'
>>
>>
>Hi,
>
>As I believe that the majority of the bluesmoke/EDAC developers are
>(were) operating under the assumption that it would be possible to do
>something with NMI-signalled errors, I was wondering what the problems
>with using NMI-signalled ECC errors were?
Soo Keong or Steve, can you answer this one?
>
>Are there some systems states in which an incoming NMI throws a spanner
>in to the works in an unrecoverable way? If this is the case, is it so
>on all x86/x86-64 systems, or just a subset, and is there no way to
>implement some sort of top half / bottom half style NMI handler
>cleanly? As I am certainly not an x86 architecture expert, I would
>appreciate any input from the resident gurus ;o).
I don't think NMI's are so much the problem as those blasted SMI's. And
SMI's are not for sharing :(
These problems are BIOS specific, Your Mileage Will Vary from one bios
to the next.
>
>Quickly returning to the original problem - I know this isn't a proper
>API by any stretch of the imagination, and would require changes to
>existing BIOSs, but the EDAC module could reprogram the chipset
>error-signalling registers, so that an ECC error no longer triggers an
>SMI. The BIOS SMI handler could then read the signalling registers,
and
>leave the ECC registers well alone if ECC errors are not set to
generate
>an SMI.
It's not the SMI from ECC events that are the problem. It's the BIOS
assuming no one else would want to use those registers on Dev0:Fun1, and
the fact that you still end up with a race between the OS and the BIOS
SMI to handle the events.
>
>Cheers,
>
>Tim.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/