RE: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor specific HW errors

From: Shiju Jose
Date: Wed Apr 08 2020 - 05:21:24 EST


Hi Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:bp@xxxxxxxxx]
>Sent: 31 March 2020 10:09
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-acpi@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-
>kernel@xxxxxxxxxxxxxxx; rjw@xxxxxxxxxxxxx; helgaas@xxxxxxxxxx;
>lenb@xxxxxxxxxx; james.morse@xxxxxxx; tony.luck@xxxxxxxxx;
>gregkh@xxxxxxxxxxxxxxxxxxx; zhangliguang@xxxxxxxxxxxxxxxxx;
>tglx@xxxxxxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>; Jonathan Cameron
><jonathan.cameron@xxxxxxxxxx>; tanxiaofei <tanxiaofei@xxxxxxxxxx>;
>yangyicong <yangyicong@xxxxxxxxxx>
>Subject: Re: [PATCH v6 1/2] ACPI / APEI: Add support to notify the vendor
>specific HW errors
>
>On Mon, Mar 30, 2020 at 03:44:29PM +0000, Shiju Jose wrote:
>> 1. rasdaemon need not print the vendor error data reported by the firmware if the
>> kernel driver already prints that information. In this case rasdaemon will only need to store
>> the decoded vendor error data to the SQL database.
>
>Well, there's a problem with this:
>
>rasdaemon printing != kernel driver printing
>
>Because printing in dmesg would need people to go grep dmesg.
>
>Printing through rasdaemon or any userspace agent, OTOH, is a lot more
>flexible wrt analyzing and collecting those error records. Especially if you are a
>data center admin and you want to collect all your error
>records: grepping dmesg simply doesn't scale versus all the rasdaemon
>agents reporting to a centralized location.
Ok.
I have posted v7 of this series:
"[v7 PATCH 0/6] ACPI / APEI: Add support to notify non-fatal HW errors"

>
>> 2. If the vendor kernel driver wants to report extra error information through
>> the vendor specific data (though presently we do not have any such use case) for rasdaemon to log,
>> I think the error handled status is useful to indicate that the kernel driver has filled in the extra information and
>> that rasdaemon should decode and log it after an extra-data-specific validity check.
>
>The kernel driver can report that extra information without the kernel saying
>that the error was handled.
>
>So I still see no sense for the kernel to tell userspace explicitly that it handled
>the error. There might be a valid reason, though, which I cannot think of
>right now.
Ok.

>
>Thx.
>
>--
>Regards/Gruss,
> Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju