Re: [RFC 3/6] x86, NMI, Rename memory parity error to PCI SERR error

From: huang ying
Date: Thu Sep 23 2010 - 01:37:53 EST


Hi, Maciej,

On Wed, Sep 22, 2010 at 7:04 AM, Maciej W. Rozycki <macro@xxxxxxxxxxxxxx> wrote:
> On Fri, 10 Sep 2010, Huang Ying wrote:
>
>> The memory parity error is only valid for the IBM PC/AT; newer machines use
>> bit 7 (0x80) of port 0x61 for PCI SERR, while memory errors are usually
>> reported via MCE. So the corresponding function name and kernel log string
>> are changed.
>
>  Things perhaps changed over the last few years while I have not been
> watching, but for many years the bit #7 of the NMI status port
> (implemented by the southbridge at 0x61 in the port I/O space) was still
> used for memory parity or ECC errors even after the original IBM PC/AT.
> The usual arrangement was in the event of a memory error the memory
> controller in the northbridge would assert the chip's PCI SERR output
> line, which in turn would be trapped by the southbridge and converted to
> an NMI event while setting the said bit in the NMI status port.  See e.g.
> the 82439HX System Controller datasheet (Intel document number 290551).

Thanks for the information. So the EDAC function call in the NMI handler
should be kept? But as you pointed out, the corresponding handler should
be named after PCI SERR rather than memory parity; SERR just happens to
be used to report memory errors on some systems. I think we can rename
the function and log string to PCI SERR and add a comment on the EDAC
function call explaining that it checks for memory errors.
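
To make the naming concrete, here is a rough userspace sketch (not the
actual patch) of how a value read from port 0x61 could be classified.
The enum and function names are made up for illustration. Bit 7 as SERR
is from the discussion above; bit 6 as I/O channel check follows the
usual PC/AT convention and is not something this thread covers.

```c
#include <stdint.h>

/* Bits of the NMI status port (0x61 in port I/O space). */
#define NMI_REASON_SERR   0x80  /* bit 7: PCI SERR (historically memory parity) */
#define NMI_REASON_IOCHK  0x40  /* bit 6: I/O channel check (assumed, PC/AT convention) */

/* Hypothetical classification used only in this sketch. */
enum nmi_cause { NMI_CAUSE_UNKNOWN, NMI_CAUSE_SERR, NMI_CAUSE_IOCHK };

/* Classify a value that was already read from port 0x61; SERR is checked first. */
static enum nmi_cause classify_nmi_reason(uint8_t reason)
{
    if (reason & NMI_REASON_SERR)
        return NMI_CAUSE_SERR;
    if (reason & NMI_REASON_IOCHK)
        return NMI_CAUSE_IOCHK;
    return NMI_CAUSE_UNKNOWN;
}
```

In this sketch a SERR result would be the place to log "PCI SERR" and
then let EDAC check whether a memory error was the real cause.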

>  So the name of the error reported is not that unjustified except, of
> course, to be precise the handler would have to scan the state of the SERR
> output reported by all the PCI devices in the PCI configuration space to
> find the originator and then interpret the event accordingly.  Which
> obviously means the only piece of code that could exactly know what the
> reason was is the respective device driver as causes of SERR are
> device-specific and may require processing of device-specific registers to
> determine the cause and/or recover (a device reset may be required in some
> cases).

In addition to PCI SERR, I think modern systems rely more on PCIe AER,
which can report more information about an error. There is already
recovery support for PCIe AER in the kernel. Do we need a similar
mechanism for PCI SERR? Because PCIe AER is becoming more and more
common on server platforms, I think a minimal check, such as scanning
the devices' SERR/PERR bits, should be sufficient.
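
A minimal sketch of what such a scan could look like, assuming the
status registers have already been read into an array. The struct and
function names are hypothetical; only the bit values are the standard
PCI Status register (config offset 0x06) definitions. A real handler
would read each device's config space instead of taking a snapshot.

```c
#include <stddef.h>
#include <stdint.h>

/* Error bits of the standard PCI Status register (config offset 0x06). */
#define PCI_STATUS_PARITY            0x0100 /* master data parity error */
#define PCI_STATUS_SIG_SYSTEM_ERROR  0x4000 /* device signaled SERR */
#define PCI_STATUS_DETECTED_PARITY   0x8000 /* device detected a parity error */

/* Hypothetical snapshot of one device's status register. */
struct pci_status_sample {
    uint16_t bdf;     /* bus << 8 | devfn */
    uint16_t status;  /* value read from config offset 0x06 */
};

/*
 * Store in 'out' the indices of devices whose status shows SERR or a
 * parity error. Returns the number of matching devices; at most
 * 'out_len' indices are stored.
 */
static size_t find_error_sources(const struct pci_status_sample *devs,
                                 size_t n, size_t *out, size_t out_len)
{
    const uint16_t mask = PCI_STATUS_SIG_SYSTEM_ERROR |
                          PCI_STATUS_DETECTED_PARITY |
                          PCI_STATUS_PARITY;
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        if (!(devs[i].status & mask))
            continue;
        if (found < out_len)
            out[found] = i;
        found++;
    }
    return found;
}
```

This only identifies candidate originators; interpreting the cause would
still be up to the device driver, as you said.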

>  Of course using the MCE seems natural and better, especially if the
> exception can be raised synchronously and stop the failing memory load CPU
> instruction from completion -- this is important for parity and MBE ECC
> errors, where in some cases the handler may be able to retry the failing
> operation having refreshed RAM from the backing store or otherwise the
> affected process must be killed (unless a kernel memory location is
> involved that is, where the whole system has to be brought down).
>
>  OTOH, for CPU stores and DMA transactions the event will always be
> asynchronous and an NMI might be a better option, as in the case of parity
> and MBE ECC errors the whole system will probably have to be brought down,
> and with SBE ECC errors scrubbing can be done at any time and otherwise
> (except from logging and/or marking the physical page bad, as required) no
> action is needed.

In fact, MCE is a special exception; it can be used for asynchronous
events too, such as memory errors detected by patrol scrubbing. Please
take a look at the latest Intel 64 and IA-32 Architectures Software
Developer's Manual, Vol. 3A, section 15.9.3: Architecturally Defined UCR
Errors.
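
For illustration, the UCR classes in that chapter can be told apart
from the S and AR flags in IA32_MCi_STATUS. Below is a rough sketch,
assuming the CPU supports software error recovery (otherwise S and AR
carry no meaning). The enum and function names are made up for this
example; the bit positions follow the SDM's MCi_STATUS layout.

```c
#include <stdint.h>

/* IA32_MCi_STATUS flag bits used to classify UCR errors. */
#define MCI_STATUS_VAL (1ULL << 63) /* register contents valid */
#define MCI_STATUS_UC  (1ULL << 61) /* uncorrected error */
#define MCI_STATUS_PCC (1ULL << 57) /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56) /* signaled via machine check exception */
#define MCI_STATUS_AR  (1ULL << 55) /* recovery action required */

/* Hypothetical names for the three UCR classes. */
enum ucr_class { UCR_NONE, UCR_UCNA, UCR_SRAO, UCR_SRAR };

/* UCR errors are valid, uncorrected, and leave processor context intact. */
static enum ucr_class classify_ucr(uint64_t status)
{
    if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC) ||
        (status & MCI_STATUS_PCC))
        return UCR_NONE;

    if (!(status & MCI_STATUS_S))
        /* S=0 with AR=1 is not a defined combination. */
        return (status & MCI_STATUS_AR) ? UCR_NONE : UCR_UCNA;

    /* S=1: action required (SRAR) or action optional (SRAO). */
    return (status & MCI_STATUS_AR) ? UCR_SRAR : UCR_SRAO;
}
```

An error found by patrol scrubbing would fall in the SRAO class here:
signaled by MCE, but asynchronous to the instruction stream.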

Best Regards,
Huang Ying