RE: [PATCH v4 1/2] ACPI: APEI: Add support to notify the vendor specific HW errors
From: Shiju Jose
Date: Thu Mar 12 2020 - 08:10:29 EST
Hi James,
Thanks for reviewing the code.
>-----Original Message-----
>From: linux-pci-owner@xxxxxxxxxxxxxxx [mailto:linux-pci-
>owner@xxxxxxxxxxxxxxx] On Behalf Of James Morse
>Sent: 11 March 2020 17:30
>To: Shiju Jose <shiju.jose@xxxxxxxxxx>
>Cc: linux-acpi@xxxxxxxxxxxxxxx; linux-pci@xxxxxxxxxxxxxxx; linux-
>kernel@xxxxxxxxxxxxxxx; rjw@xxxxxxxxxxxxx; helgaas@xxxxxxxxxx;
>lenb@xxxxxxxxxx; bp@xxxxxxxxx; tony.luck@xxxxxxxxx;
>gregkh@xxxxxxxxxxxxxxxxxxx; zhangliguang@xxxxxxxxxxxxxxxxx;
>tglx@xxxxxxxxxxxxx; Linuxarm <linuxarm@xxxxxxxxxx>; Jonathan Cameron
><jonathan.cameron@xxxxxxxxxx>; tanxiaofei <tanxiaofei@xxxxxxxxxx>;
>yangyicong <yangyicong@xxxxxxxxxx>
>Subject: Re: [PATCH v4 1/2] ACPI: APEI: Add support to notify the vendor
>specific HW errors
>
>Hi Shiju,
>
>On 07/02/2020 10:31, Shiju Jose wrote:
>> Presently APEI does not support reporting the vendor specific HW
>> errors, received in the vendor defined table entries, to the vendor
>> drivers for any recovery.
>>
>> This patch adds the support to register and unregister the error
>> handling function for the vendor specific HW errors and notify the
>> registered kernel driver.
>
>Is it possible to use the kernel's existing atomic_notifier_chain_register() API for
>this?
>
>The one thing that can't be done in the same way is the GUID filtering in ghes.c.
>Each driver would need to check if the call matched a GUID they knew about,
>and return NOTIFY_DONE if they "don't care".
>
>I think this patch would be a lot smaller if it was tweaked to be able to use the
>existing API. If there is a reason not to use it, it would be good to know what it
>is.
I think using atomic_notifier_chain_register() has the following limitations:
1. All registered error handlers would be called, even for errors that are not related to them.
This could also lead to mishandling of the error information if a handler does not
implement GUID checking.
2. atomic_notifier_chain_register() (notifier_chain_register()) does not appear to support
passing the handler's private data at registration time, which is supposed to be
passed later to the handler in the callback function, notifier_fn_t(..., void *data).
3. It is also difficult to pass the GHES error data (acpi_hest_generic_data) and the GUID
of the received error to the handler through the notifier chain callback interface.
Sorry if I did not understand your suggestion correctly.
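For reference, here is a rough user-space sketch of how I understand the notifier-style variant would look. It is only an illustration under assumptions: the types below are simplified stand-ins, not the kernel definitions, and struct vendor_err_event, struct demo_dev, demo_notify(), chain_register() and chain_call() are hypothetical names that are not part of the patch. It assumes the handler embeds a notifier_block in its own state and that the GUID travels in a wrapper struct. It also shows the concern in point 1: every handler on the chain is invoked and has to filter by GUID itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-ins for kernel types; NOT the real kernel definitions. */
typedef struct { unsigned char b[16]; } guid_t;

#define NOTIFY_DONE 0x0000	/* handler does not care about this event */
#define NOTIFY_OK   0x0001	/* handler processed the event */

struct notifier_block {
	int (*notifier_call)(struct notifier_block *nb,
			     unsigned long action, void *data);
	struct notifier_block *next;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical wrapper carrying the section GUID and the error payload. */
struct vendor_err_event {
	const guid_t *sec_type;
	const void *payload;
};

/* Hypothetical per-device state with the notifier block embedded in it. */
struct demo_dev {
	struct notifier_block nb;
	guid_t guid;		/* GUID this device's handler cares about */
	int handled;		/* number of events this device claimed */
};

/* Called for every event on the chain; filters by GUID itself. */
static int demo_notify(struct notifier_block *nb, unsigned long sev, void *data)
{
	/* Private data is recovered from the embedded notifier block. */
	struct demo_dev *dev = container_of(nb, struct demo_dev, nb);
	struct vendor_err_event *ev = data;

	(void)sev;
	if (memcmp(&dev->guid, ev->sec_type, sizeof(guid_t)) != 0)
		return NOTIFY_DONE;

	dev->handled++;
	return NOTIFY_OK;
}

/* Simplified, unlocked registration: push onto a singly linked chain. */
static void chain_register(struct notifier_block **head,
			   struct notifier_block *nb)
{
	assert(nb != NULL && nb->notifier_call != NULL);
	nb->next = *head;
	*head = nb;
}

/* Calls every handler; returns how many of them returned NOTIFY_OK. */
static int chain_call(struct notifier_block *head, unsigned long sev,
		      void *data)
{
	int claimed = 0;

	for (struct notifier_block *nb = head; nb != NULL; nb = nb->next)
		if (nb->notifier_call(nb, sev, data) == NOTIFY_OK)
			claimed++;
	return claimed;
}
```

With two demo_dev instances registered on one chain, an event whose GUID matches only the first is still delivered to both handlers; the second one returns NOTIFY_DONE.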
>
>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 103acbb..69e18d7 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -490,6 +490,109 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
>
>> +/**
>> + * ghes_unregister_event_handler - unregister the previously
>> + * registered event handling function.
>> + * @sec_type: sec_type of the corresponding CPER.
>> + * @data: driver specific data to distinguish devices.
>> + */
>> +void ghes_unregister_event_handler(guid_t sec_type, void *data) {
>> + struct ghes_event_notify *event_notify;
>> + bool found = false;
>> +
>> + mutex_lock(&ghes_event_notify_mutex);
>> + rcu_read_lock();
>> + list_for_each_entry_rcu(event_notify,
>> + &ghes_event_handler_list, list) {
>> + if (guid_equal(&event_notify->sec_type, &sec_type)) {
>
>> + if (data != event_notify->data)
>
>It looks like you need multiple drivers to handle the same GUID because of
>multiple root ports. Can't the handler lookup the right device?
This check is there because the GUID is shared among multiple devices handled by one
driver, as in the B2889FC9 driver (pcie-hisi-error.c).
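To illustrate why the data pointer is compared as well, here is a small user-space sketch, not the patch code. The fixed-size table and the names demo_register(), demo_unregister() and demo_count() are hypothetical, and guid_t is a simplified stand-in. Two devices register against the same GUID, so the GUID alone cannot identify which entry to remove.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space stand-in for the kernel's guid_t; NOT the real definition. */
typedef struct { unsigned char b[16]; } guid_t;

#define MAX_HANDLERS 8

/* Hypothetical registration record: one per (GUID, device) pair. */
struct handler_entry {
	guid_t sec_type;	/* CPER section type the handler wants */
	void *data;		/* driver private data, one per device */
	int in_use;
};

static struct handler_entry handlers[MAX_HANDLERS];

/* Stores the (GUID, data) pair in the first free slot; -1 if full. */
static int demo_register(const guid_t *sec_type, void *data)
{
	assert(sec_type != NULL);
	for (size_t i = 0; i < MAX_HANDLERS; i++) {
		if (handlers[i].in_use)
			continue;
		handlers[i].sec_type = *sec_type;
		handlers[i].data = data;
		handlers[i].in_use = 1;
		return 0;
	}
	return -1;
}

/* Removes the entry matching BOTH the GUID and the data pointer. */
static int demo_unregister(const guid_t *sec_type, void *data)
{
	for (size_t i = 0; i < MAX_HANDLERS; i++) {
		if (!handlers[i].in_use)
			continue;
		if (memcmp(&handlers[i].sec_type, sec_type, sizeof(guid_t)) != 0)
			continue;
		if (handlers[i].data != data)
			continue;	/* same GUID, different device */
		handlers[i].in_use = 0;
		return 0;
	}
	return -1;	/* nothing registered for this (GUID, data) pair */
}

/* Counts the entries currently registered for a GUID. */
static int demo_count(const guid_t *sec_type)
{
	int n = 0;

	for (size_t i = 0; i < MAX_HANDLERS; i++)
		if (handlers[i].in_use &&
		    memcmp(&handlers[i].sec_type, sec_type, sizeof(guid_t)) == 0)
			n++;
	return n;
}
```

Unregistering one device leaves the other device's entry for the same GUID in place.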
>
>
>> + continue;
>> + list_del_rcu(&event_notify->list);
>> + found = true;
>> + break;
>> + }
>> + }
>> + rcu_read_unlock();
>> + mutex_unlock(&ghes_event_notify_mutex);
>> +
>> + if (!found) {
>> + pr_err("Tried to unregister a GHES event handler that has not been registered\n");
>> + return;
>> + }
>> +
>> + synchronize_rcu();
>> + kfree(event_notify);
>> +}
>> +EXPORT_SYMBOL_GPL(ghes_unregister_event_handler);
>
>> @@ -525,11 +628,14 @@ static void ghes_do_proc(struct ghes *ghes,
>>
>> log_arm_hw_error(err);
>> } else {
>> - void *err = acpi_hest_get_payload(gdata);
>> -
>> - log_non_standard_event(sec_type, fru_id, fru_text,
>> - sec_sev, err,
>> - gdata->error_data_length);
>> + if (!ghes_handle_non_standard_event(sec_type, gdata,
>> + sev)) {
>> + void *err = acpi_hest_get_payload(gdata);
>> +
>> + log_non_standard_event(sec_type, fru_id,
>> + fru_text, sec_sev, err,
>> + gdata->error_data_length);
>> + }
>
>So, a side effect of the kernel handling these is they no longer get logged out of
>trace points?
>
>I guess the driver that claims this logs some more accurate information. Are
>there expected to be any user-space programs doing something useful with
>B2889FC9... today?
The B2889FC9 driver does not expect any corresponding user-space programs.
The driver is mainly for error recovery, basic error decoding and logging.
Previously we added error logging for B2889FC9 to rasdaemon.
>
>
>Thanks,
>
>James
Thanks,
Shiju