Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages
From: James Morse
Date: Fri Apr 20 2018 - 03:27:18 EST
Hi Alex,
On 04/16/2018 10:59 PM, Alex G. wrote:
> On 04/13/2018 11:38 AM, James Morse wrote:
>> This assumes a cache-invalidate will clear the error, which I don't
think we're
>> guaranteed on arm.
>> It also destroys any adjacent data, "everyone's happy" includes the
thread that
>> got a chunk of someone-else's stack frame, I don't think it will be
happy for
>> very long!
>
> Hmm, no cache-line (or page) invalidation on arm64? How does
> dma_map/unmap_*() work then? You may not guarantee to fix the error, but
There are cache-invalidate instructions, but I don't think 'solving' a
RAS error with them is the right thing to do.
> I don't buy into the "let's crash without trying" argument.
Our 'cache writeback granule' may be as large as 2K, so we may have to
invalidate up to 2K of data to convince the hardware this address is
okay again.
All we've done here is differently-corrupt the data so that it no longer
generates a RAS fault, it just gives you the wrong data instead.
Cache-invalidation is destructive.
I don't think there is a one-size-fits-all solution here.
>> (this is a side issue for AER though)
>
> Somebody muddled up AER with these tables, so we now have to worry about
> it. :)
Eh? I see there is a v2, maybe I'll understand this comment once I read it.
>>> How does FFS handle race conditions that can occur when accessing HW
>>> concurrently with the OS? I'm told it's the main reasons why BIOS
>>> doesn't release unused cores from SMM early.
>>
>> This is firmware's problem, it depends on whether there is any
hardware that is
>> shared with the OS. Some hardware can be marked 'secure' in which
case only
>> firmware can access it, alternatively firmware can trap or just
disable the OS's
>> access to the shared hardware.
>
> It's everyone's problem. It's the firmware's responsibility.
It depends on the SoC design. If there is no hardware that the OS and
firmware both need to access to handle an error then I don't think
firmware needs to do this.
>> For example, with the v8.2 RAS Extensions, there are some per-cpu error
>> registers. Firmware can disable these for the OS, so that it always
reads 0 from
>> them. Instead firmware takes the error via FF, reads the registers from
>> firmware, and dumps CPER records into the OS's memory.
>>
>> If there is a shared hardware resource that both the OS and firmware
may be
>> accessing, yes firmware needs to pull the other CPUs in, but this
depends on the
>> SoC design, it doesn't necessarily happen.
>
> The problem with shared resources is just a problem. I've seen systems
> where all 100 cores are held up for 300+ ms. In latency-critical
> applications reliability drops exponentially. Am I correct in assuming
> your answer would be to "hide" more stuff from the OS?
No, I'm not a fan of firmware cycle stealing. If you can design the SoC or
firmware so that the 'all CPUs' stuff doesn't need to happen, then you
won't get
these issues. (I don't design these things, I'm sure they're much more
complicated
than I think!)
Because the firmware is SoC-specific, so it only needs to do exactly
what is necessary.
>>> I think the idea of firmware-first is broken. But it's there, it's
>>> shipping in FW, so we have to accommodate it in SW.
>>
>> Part of our different-views here is firmware-first is taking
something away from
>> you, whereas for me its giving me information that would otherwise be in
>> secret-soc-specific registers.
>
> Under this interpretation, FFS is a band-aid to the problem of "secret"
> registers. "Secret" hardware doesn't really fit well into the idea of an
> OS [1].
Sorry, I'm being sloppy with my terminology, by secret-soc-specific I
mean either Linux can't access them (firmware privilege-level only) or
Linux can't reasonably know where these registers are, as they're
soc-specific and vary by manufacture.
>>> And linux can handle a wide subset of MCEs just fine, so the
>>> ghes_is_deferrable() logic would, under my argument, agree to pass
>>> execution to the actual handlers.
>>
>> For some classes of error we can't safely get there.
>
> Optimize for the common case.
At the expense of reliability?
Thanks,
James