Re: [PATCH v8 3/3] x86, mce: Add __mcsafe_copy()

From: Tony Luck
Date: Sat Jan 09 2016 - 20:41:02 EST


On Sat, Jan 9, 2016 at 4:23 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Sat, Jan 9, 2016 at 2:33 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> Shouldn't that logic live in the mcsafe_copy routine itself rather
>> than being delegated to callers?
>>
>
> Yes, please.

Yes - we should have some of that fancy self-patching code that
redirects to the optimal routine for the cpu model we are running
on.

BUT ... it's all going to be very messy. We don't have any CPUID
capability bits to say whether we support recovery, or which instructions
are good/bad choices for recovery. You might think that MCG_CAP{24}
(MCG_SER_P), which is described as "software error recovery" (or some
such), would be a good clue, but you'd be wrong. The bit got a little
overloaded and there are cpus that set it, but won't recover.
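
To make the point concrete, here is a rough sketch of why testing that bit
alone is not enough. MCG_SER_P is bit 24 of the MCG_CAP MSR; the helper
names and the model_known_to_recover flag are made up for illustration, and
treating the bit as a necessary-but-not-sufficient hint is an assumption,
not something the hardware documents:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 24 of IA32_MCG_CAP: "software error recovery" capability flag. */
#define MCG_SER_P (1ULL << 24)

/* Hypothetical helper: true if the raw MCG_CAP value has bit 24 set. */
bool mcg_cap_has_ser_p(uint64_t mcg_cap)
{
	return (mcg_cap & MCG_SER_P) != 0;
}

/*
 * Hypothetical policy check. The bit is overloaded, so it is only
 * treated as a hint here (an assumption); the caller still has to
 * supply separate knowledge that this cpu model really recovers.
 */
bool may_try_recovery(uint64_t mcg_cap, bool model_known_to_recover)
{
	return mcg_cap_has_ser_p(mcg_cap) && model_known_to_recover;
}
```

The second argument is the hard part: where that knowledge would come from
is exactly the open question.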

Only Intel(R) Xeon(R) branded cpus can recover, and not all of them. The story so far:

Nehalem, Westmere: E7 models support SRAO recovery (patrol scrub,
cache eviction). Not relevant for this e-mail thread.

Sandy Bridge: Some "advanced RAS" skus will recover from poison reads
(these have E5 model names; there was no E7 in this generation).

Ivy Bridge: Xeon E5-* models do not recover. E7-* models do recover.
Note E5 and E7 have the same CPUID model number.

Haswell: Same as Ivy Bridge

Broadwell/Skylake: Xeon not released yet ... can't talk about them.
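
Written as a lookup, the list above might look something like the sketch
below. All of the enum and function names are hypothetical. Since E5 and E7
share a CPUID model number, the product line has to be passed in as a
separate input; how the kernel would actually learn it is not addressed
here. Treating Nehalem/Westmere as "no" for poison reads is my reading of
"SRAO only", so take that as an assumption too:

```c
/* Hypothetical names; none of these exist in the kernel. */
enum cpu_gen {
	GEN_NEHALEM, GEN_WESTMERE, GEN_SANDY_BRIDGE,
	GEN_IVY_BRIDGE, GEN_HASWELL, GEN_BROADWELL, GEN_SKYLAKE
};

enum xeon_line { NOT_XEON, XEON_E5, XEON_E5_ADV_RAS, XEON_E7 };

enum recovery { RECOVERY_NO, RECOVERY_YES, RECOVERY_UNKNOWN };

/* Can this part recover from a poison read, per the list above? */
enum recovery poison_read_recovery(enum cpu_gen gen, enum xeon_line line)
{
	if (line == NOT_XEON)
		return RECOVERY_NO;	/* only Xeon branded cpus recover */

	switch (gen) {
	case GEN_NEHALEM:
	case GEN_WESTMERE:
		/* E7 does SRAO only; assumed "no" for poison reads */
		return RECOVERY_NO;
	case GEN_SANDY_BRIDGE:
		/* some "advanced RAS" E5 skus; no E7 in this generation */
		return line == XEON_E5_ADV_RAS ? RECOVERY_YES : RECOVERY_NO;
	case GEN_IVY_BRIDGE:
	case GEN_HASWELL:
		/* E7 recovers, E5 does not, same CPUID model number */
		return line == XEON_E7 ? RECOVERY_YES : RECOVERY_NO;
	default:
		/* Broadwell/Skylake Xeon: not released, nothing known */
		return RECOVERY_UNKNOWN;
	}
}
```

Note that nothing in this table can be keyed off CPUID alone, which is the
messy part.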

Linux code recently got some recovery bits for AMD cpus ... I don't
know what the story is on which models support this.

-Tony