Re: EDAC (was: Re: 2.6.14-rc5-mm1)

From: Doug Thompson
Date: Wed Oct 26 2005 - 15:22:10 EST




--- Sander <sander@xxxxxxxxxxx> wrote:

> Doug Thompson wrote (ao):
> > --- Sander <sander@xxxxxxxxxxx> wrote:
> > > Via Epia MII 10000, kernel 2.6.14-rc4-mm1:
>
> > The EDAC scanning code first scans the STATUS
> register
> > of all the PCI devices in the system. This status
> > register reflects operations on the main bus.
> > Second, the code scans the SECONDARY STATUS
> register
> > of all bridge devices, which reflects operations
> on
> > the sub-bus.
> >
> > This instance (0000:00:01.0) of output shows me
> the
> > VIA VT8633 is generating the parity bit. The
> default
> > poll interval if 1000 ms and the above output
> shows
> > this. This bridge is either having a parity error
> on
> > the main bus OR more likely is generating false
> > positives. How to determine which? More
> investigation
> > is needed.
>
> Anything I can do?

To help? Keep an eye for other devices which post
parity errors.

To overcome this on your own system? If you don't want
so many message for the moment, turn off EDAC. Later
when the blacklist is avaliable, put this device in
the blacklist.

> And will blacklisting make EDAC useless?

No, just less closure, less complete. If we were SURE
all devices followed the rules, then a parity event is
a BAD thing we could then count on. Since it is an
imperfect world, we gather the "blacklist" of cards
that don't follow the PCI spec, send them a blasting
letter, buy alternatives that do work and continue to
scan for parity errors.

This scanning of parity errrors allowed my company to
isolate data corruption between an interconnect in
nodes on a cluster. The fault? The riser card had
failings. With this parity scanner, we isolated the
borderline risers and replaced them. Saved alot of
time. Luckily, the card we had did NOT generate false
positives. We have another high speed interconnect
which does generate false positives, we told them
about it. Helped them reproduce the reporting via
script using 'setpci'. They finally ack'd they had a
firmware problem and will rev the FW in january -
yeah!

> If so, does it make more sense not to configure
EDAC?

depends on your requirements.

we have been living with systems with PCI devices for
a decade now. how many times have events occurred that
had no explaination and are simply dismissed? There
were no detectors.

We assume many things, even today. How many desktops
with gigs of memory have no ECC? I have learned my
lesson while refactoring bluesmoke/edac that ECC is
very important. ECC always in my machines for now on,
for me anyway.

For PCI devices, if you want to "know" data is being
transmitted correctly, then there needs to be
"detector" and "reporter" and "handler" agents of this
bad events to properly notice, report and process
them.

doug thompson


>
> --
> Humilis IT Services and Solutions
> http://www.humilis.net
>



"If you think Education is expensive, just try Ignorance"

"Don't tell people HOW to do things, tell them WHAT you
want and they will surprise you with their ingenuity."
Gen George Patton

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/