Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages
From: Alex G.
Date: Fri Apr 20 2018 - 18:04:58 EST
On 04/20/2018 02:27 AM, James Morse wrote:
> Hi Alex,
>
> On 04/16/2018 10:59 PM, Alex G. wrote:
>> On 04/13/2018 11:38 AM, James Morse wrote:
>>> This assumes a cache-invalidate will clear the error, which I don't
>>> think we're guaranteed on arm.
>>> It also destroys any adjacent data, "everyone's happy" includes the
>>> thread that got a chunk of someone-else's stack frame, I don't think
>>> it will be happy for very long!
>>
>> Hmm, no cache-line (or page) invalidation on arm64? How does
>> dma_map/unmap_*() work then? You may not guarantee to fix the error, but
>
> There are cache-invalidate instructions, but I don't think 'solving' a
> RAS error with them is the right thing to do.
You seem to be putting RAS on a pedestal on a very cloudy and foggy day.
I admit that I fail to see what makes RAS errors special in comparison
to other errors.
>> I don't buy into the "let's crash without trying" argument.
>
> Our 'cache writeback granule' may be as large as 2K, so we may have to
> invalidate up to 2K of data to convince the hardware this address is
> okay again.
Eureka! The OS can invalidate the entire page. That maps 1:1 onto the
memory management data structures, which already track memory at page
granularity.
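
To sketch what I mean (invalidate_phys_range() below is a hypothetical
arch hook, not an existing kernel API; the point is that a page always
covers the 2K writeback granule, and a page is the unit mm already
knows how to retire):

#include <linux/mm.h>
#include <linux/pfn.h>

/* Hypothetical arch hook: write back and invalidate a physical range. */
void invalidate_phys_range(phys_addr_t start, size_t size);

static void retire_corrupt_page(phys_addr_t fault_pa)
{
	unsigned long pfn = PHYS_PFN(fault_pa);

	/* Drop every stale cache line in the containing page. */
	invalidate_phys_range(PFN_PHYS(pfn), PAGE_SIZE);

	/* The page contents are garbage now; poison and unmap it. */
	memory_failure(pfn, 0);
}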
> All we've done here is differently-corrupt the data so that it no longer
> generates a RAS fault, it just gives you the wrong data instead.
> Cache-invalidation is destructive.
>
> I don't think there is a one-size-fits-all solution here.
Of course there isn't. That's not the issue.
A cache corruption is a special case of a memory access issue, and we
already know how to handle those. Triple-fault and cpu-on-fire concerns
apply to returning to the context which triggered the problem, and we've
already figured that out.
There is a lot of opportunity here for using well-tested code paths and
not crashing on the first try. Why let firmware make this a problem
again?
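
For memory errors the well-tested path already exists. A minimal
sketch, close in spirit to what ghes_handle_memory_failure() does
today (the severity check here is a simplification, not the exact
kernel logic):

#include <linux/cper.h>
#include <linux/mm.h>
#include <linux/pfn.h>
#include <acpi/ghes.h>

static void ghes_mem_error(struct cper_sec_mem_err *mem_err, int sev)
{
	unsigned long pfn;

	/* Without a valid physical address there is nothing to act on. */
	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
		return;

	pfn = PHYS_PFN(mem_err->physical_addr);

	/* Queue for process context; don't panic from the NMI path. */
	if (sev != GHES_SEV_PANIC)
		memory_failure_queue(pfn, 0);
}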
>>> (this is a side issue for AER though)
>>
>> Somebody muddled up AER with these tables, so we now have to worry about
>> it. :)
>
> Eh? I see there is a v2, maybe I'll understand this comment once I read it.
I meant that somebody (the spec writers) decided to put relatively
benign errors (PCIe) on the same severity scale as "cpu is on fire"
errors.
>>>> How does FFS handle race conditions that can occur when accessing HW
>>>> concurrently with the OS? I'm told it's the main reason why BIOS
>>>> doesn't release unused cores from SMM early.
>>>
>>> This is firmware's problem; it depends on whether there is any
>>> hardware that is shared with the OS. Some hardware can be marked
>>> 'secure', in which case only firmware can access it; alternatively,
>>> firmware can trap or just disable the OS's access to the shared
>>> hardware.
>>
>> It's everyone's problem. It's the firmware's responsibility.
>
> It depends on the SoC design. If there is no hardware that the OS and
> firmware both need to access to handle an error then I don't think
> firmware needs to do this.
>
>
>>> For example, with the v8.2 RAS Extensions, there are some per-cpu
>>> error registers. Firmware can disable these for the OS, so that it
>>> always reads 0 from them. Instead firmware takes the error via FF,
>>> reads the registers from firmware, and dumps CPER records into the
>>> OS's memory.
>>>
>>> If there is a shared hardware resource that both the OS and firmware
>>> may be accessing, yes firmware needs to pull the other CPUs in, but
>>> this depends on the SoC design; it doesn't necessarily happen.
>>
>> Shared resources are simply a problem. I've seen systems where all
>> 100 cores are held up for 300+ ms. In latency-critical applications,
>> reliability drops exponentially with such stalls. Am I correct in
>> assuming your answer would be to "hide" more stuff from the OS?
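
(To put illustrative numbers on the latency claim: with a 10 ms
per-request deadline, a single 300 ms stall misses 30 consecutive
deadlines on every core at once. If each request independently meets
its deadline with probability p, a run of n requests all succeed with
probability p^n, so reliability decays exponentially with the length
of the stall.)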
>
> No, I'm not a fan of firmware cycle stealing. If you can design the
> SoC or firmware so that the 'all CPUs' stuff doesn't need to happen,
> then you won't get these issues. (I don't design these things, I'm
> sure they're much more complicated than I think!)
>
> The firmware is SoC-specific, so it only needs to do exactly what is
> necessary.
Irrespective of the hardware design, there are devicetree, ACPI
methods, and a few other ways to inform the OS of non-standard bits.
They don't have the resource-sharing problem. I'm confused as to why
FFS is used instead of these race-free alternatives when there are
concerns about resource conflicts.
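
For instance, nothing stops an SoC vendor from exposing its
error-status block through a plain devicetree node and letting a
trivial platform driver read it. A hedged sketch, with a made-up
compatible string and register layout:

#include <linux/err.h>
#include <linux/io.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/platform_device.h>

static int soc_ras_probe(struct platform_device *pdev)
{
	struct resource *res;
	void __iomem *base;

	/* Register block location comes from the DT "reg" property. */
	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
	base = devm_ioremap_resource(&pdev->dev, res);
	if (IS_ERR(base))
		return PTR_ERR(base);

	/* Hypothetical layout: error status register at offset 0. */
	dev_info(&pdev->dev, "error status: %#x\n", readl(base));

	return 0;
}

static const struct of_device_id soc_ras_of_match[] = {
	{ .compatible = "vendor,soc-ras" },	/* made-up binding */
	{ }
};
MODULE_DEVICE_TABLE(of, soc_ras_of_match);

static struct platform_driver soc_ras_driver = {
	.probe = soc_ras_probe,
	.driver = {
		.name = "soc-ras",
		.of_match_table = soc_ras_of_match,
	},
};
module_platform_driver(soc_ras_driver);
MODULE_LICENSE("GPL");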
>>>> I think the idea of firmware-first is broken. But it's there, it's
>>>> shipping in FW, so we have to accommodate it in SW.
>>>
>>> Part of our different views here is that firmware-first is taking
>>> something away from you, whereas for me it's giving me information
>>> that would otherwise be in secret-soc-specific registers.
>>
>> Under this interpretation, FFS is a band-aid to the problem of "secret"
>> registers. "Secret" hardware doesn't really fit well into the idea of an
>> OS [1].
>
> Sorry, I'm being sloppy with my terminology, by secret-soc-specific I
> mean either Linux can't access them (firmware privilege-level only) or
> Linux can't reasonably know where these registers are, as they're
> soc-specific and vary by manufacturer.
This is still a software problem. I'm assuming register access can be
granted to the OS, and I'm also assuming that there exists a non-FFS way
to describe the registers to the OS.
>>>> And Linux can handle a wide subset of MCEs just fine, so the
>>>> ghes_is_deferrable() logic would, under my argument, agree to pass
>>>> execution to the actual handlers.
>>>
>>> For some classes of error we can't safely get there.
>>
>> Optimize for the common case.
>
> At the expense of reliability?
Who suggested sacrificing reliability?
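
For the record, the gate I have in mind with ghes_is_deferrable() is
nothing exotic. Roughly (ghes_do_defer() is a hypothetical helper, not
code from this series):

#include <linux/kernel.h>
#include <acpi/ghes.h>

/* Hypothetical helper: schedule handling outside NMI context. */
void ghes_do_defer(struct ghes *ghes);

static void ghes_dispatch(struct ghes *ghes, int sev)
{
	/* Only "cpu is on fire" class errors justify a panic. */
	if (sev == GHES_SEV_PANIC) {
		panic("Fatal hardware error");
	} else {
		/*
		 * Everything else goes to the existing, well-tested
		 * handlers (memory_failure, AER, ...) from a safe
		 * context.
		 */
		ghes_do_defer(ghes);
	}
}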
Alex