Re: [PATCH 3/4] RAS: Add a Corrected Errors Collector

From: Luck, Tony
Date: Thu Mar 23 2017 - 13:20:36 EST


On Thu, Mar 23, 2017 at 04:22:28PM +0100, Borislav Petkov wrote:
> On Wed, Mar 22, 2017 at 07:03:39PM +0100, Borislav Petkov wrote:
> > Lemme try to write a small script exercising exactly that scenario to
> > see whether I'm actually not talking crap here :-)
>
> Ok, here's a snapshot from the CEC after letting it run for a couple of
> hours in a guest with a script running twice in parallel and injecting
> random PFNs. We have 0 offlined pages because a PFN number doesn't
> repeat frequently enough to cause an overflow.
>
> When I force the occurrence of a single PFN for 1023 and more times and
> do that more than once, this happens:
>
> [ 6629.091239] RAS: Soft-offlining pfn: 0x7fff
> [ 6629.093036] __get_any_page: 0x7fff free buddy page
> [ 6653.259476] RAS: Soft-offlining pfn: 0x7fff
> [ 6653.260100] soft offline: 0x7fff page already poisoned
>
> ...
>
> Stats:
> CEs: 32614
> offlined pages: 2
> ^^^^^^^^^^^^^^^^^
>
> Flags: 0x0
> Timer interval: 86400 seconds
> Decays: 254
> Action threshold: 1023
>
> The "already poisoned" thing shouldn't happen in real life because once
> the page frame is poisoned, it shouldn't generate MCEs.

It can happen if Linux didn't actually take the page offline
(because it was a kernel page). The CEC code only knows that
it queued this page to be taken offline ... and has no way
to know if that succeeded or not.

Some people have grumbled about mcelog(8) doing the same thing.

So is it worth keeping track of the page numbers that we
tried to offline? If they show up again we shouldn't add
them back into the array.
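
Roughly what I have in mind, as a userspace sketch and not a patch
against the CEC code. All the names here (tried_pfns, tried_record,
should_count_ce, TRIED_MAX) are made up for illustration, and the
fixed-size table with a linear scan is just an assumption to keep
the example short:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TRIED_MAX 64	/* hypothetical capacity, not taken from the patch */

/* PFNs that were already queued for soft-offlining. */
struct tried_pfns {
	uint64_t pfn[TRIED_MAX];
	size_t n;	/* number of valid entries */
	size_t next;	/* slot to overwrite once the table is full */
};

/* Return true if @pfn was already queued for offlining. */
static bool tried_lookup(const struct tried_pfns *t, uint64_t pfn)
{
	for (size_t i = 0; i < t->n; i++)
		if (t->pfn[i] == pfn)
			return true;
	return false;
}

/* Remember @pfn; the oldest entry is overwritten when the table is full. */
static void tried_record(struct tried_pfns *t, uint64_t pfn)
{
	assert(t->next < TRIED_MAX);

	if (tried_lookup(t, pfn))
		return;

	t->pfn[t->next] = pfn;
	t->next = (t->next + 1) % TRIED_MAX;
	if (t->n < TRIED_MAX)
		t->n++;
}

/*
 * Should a corrected error on @pfn be added to the collector array?
 * Not if we already asked for that page to be taken offline.
 */
static bool should_count_ce(const struct tried_pfns *t, uint64_t pfn)
{
	return !tried_lookup(t, pfn);
}
```

The idea would be to call tried_record() at the point where the page
is queued for offlining and to check should_count_ce() before
inserting a PFN into the array. It still can't tell us whether the
offline succeeded; it only stops the same PFN from being counted and
queued over and over.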

-Tony