Re: [Qemu-devel] [PATCH v3 1/4] ACPI: Add APEI GHES Table Generation support
From: Laszlo Ersek
Date: Mon May 29 2017 - 12:03:43 EST
Hi,
did you remove me from the To: / Cc: list intentionally, or was that an
oversight? I caught your message in my list folders only by luck.
Some followup below:
On 05/29/17 17:27, gengdongjiu wrote:
>> (46) What is "physical_addr" good for? Below I can only see an
>> assignment to it, in ghes_update_guest(). Where is the field read?
> this "physical_addr" address is the physical error address in the
> CPER. such as the physical address that happen hwpoison, this address
> is delivered by the KVM and QEMU transfer this address to physical.
I understand that in the ghes_update_guest() function, you accept a
parameter called "physical_address", and you pass it on to
ghes_generate_cper_record(). That makes sense, yes.
However, you also assign the same value to "ges.physical_addr". And that
structure field is never read. So my point is that the
"GhesErrorState.physical_addr" field is superfluous and should be removed.
I checked the other three patches in the series and they don't seem to
read that structure member either. Correct me if I'm wrong.
>> (55) What happens if you run out of the preallocated memory?
> if it run out of the preallocated memory. it will overwrite other
> error source. every block's size is fixed. so it does not easy
> dynamically extend the size if it is overflow. Anyway I will add a
> error report if it happens overwrite.
I understand (and agree) that dynamic allocation is not possible here.
But that doesn't justify overwriting the error status data block that
belongs to a different data source. (Worse, if this happens with the
last error status data block, for error source 10, you could overwrite
memory that belongs to the OS.)
If an error status data block becomes full, then we should either wrap
back to the start of the same data block, or else stop forwarding errors
for that error source.
Does the ACPI spec say anything about this? I.e., about the case when
the system runs out of the memory that was reserved for recording
hardware errors?
>>>> +
>>>> + mem_err = (struct cper_sec_mem_err *) (gdata + 1);
>>>> +
>>>> + /* In order to simplify simulation, hardcode the CPER section to memory
>>>> + * section.
>>>> + */
>>>> + mem_err->validation_bits |= CPER_MEM_VALID_ERROR_TYPE;
>>>> + mem_err->error_type = 3;
>>
>> (58) Is this supposed to stand for "Multi-bit ECC" (from "N.2.5 Memory
>> Error Section" in UEFI 2.6)? Should we have a macro for that?
> Yes, it is. What do you mean a macro?
A #define for the integer value 3.
> For all the errors that happen in the guest OS, in order to simulate
> easy, I abstract all the error section to memory section, even though
> the error section is processor or other section.
Why is that a valid thing to do? (I'm not doubting it is valid, I'm just
asking.) Will that not confuse the ACPI subsystem of the guest OS?
> I do not know whether do you have some suggestion for that.
Well I would have thought (without any expertise on the subject) that
hardware errors from the host side should be mapped to the guest more or
less "type correct". IOW, it looks strange that, say, a CPU error is
reported as a memory error. But this is just an uneducated guess.
>>>> + mem_err->validation_bits |= CPER_MEM_VALID_CARD | CPER_MEM_VALID_MODULE |
>>>> + CPER_MEM_VALID_BANK | CPER_MEM_VALID_ROW |
>>>> + CPER_MEM_VALID_COLUMN | CPER_MEM_VALID_BIT_POSITION;
>>>> + mem_err->card = 1;
>>>> + mem_err->module = 2;
>>>> + mem_err->bank = 3;
>>>> + mem_err->row = 1;
>>>> + mem_err->column = 2;
>>>> + mem_err->bit_pos = 5;
>>
>> (60) I have no idea where these values come from.
> For all the errors that happen in the guest OS, in order to simulate
> easy, I abstract all the error section to memory section, and hard
> code the memory section error value as above.
Sure, but why is that safe? Will the guest OS not want to do something
about these error details? If we are feeding the guest OS invalid error
details, will that not lead to confusion?
>> (64) What does "reqr" stand for?
> It stand for the request size.
Can you please call it "req_size" or something similar? The English
expression
request size
contains only one "r" letter, so it's hard to understand where the
second "r" in "reqr" comes from.
Thanks,
Laszlo