Re: [PATCH v5] trace: ras: add ARM processor error information trace event

From: Xie XiuQi
Date: Tue Jun 27 2017 - 02:59:05 EST


Hi Boris,

Thanks for your comments.

On 2017/6/26 22:06, Borislav Petkov wrote:
> On Sat, Jun 24, 2017 at 11:38:23AM +0800, Xie XiuQi wrote:
>> Add a new trace event for ARM processor error information, so that
>> the user will know what error occurred. With this information the
>> user may take appropriate action.
>>
>> These trace events are consistent with the ARM processor error
>> information table which is defined in UEFI 2.6 spec section N.2.4.4.1.
>>
>> ---
>> v5: add back the trace enabled condition which was lost in v4
>> put flag after the type to keep multiple_error on a 2 byte boundary
>>
>> v4: use __print_flags instead of __print_symbolic, because ARM_PROC_ERR_FLAGS
>> might have more than one bit set.
>> setting up default values for __entry to avoid a lot of else branches.
>> set flags to 0 by default instead of ~0.
>> fix a typo
>> rename arm_proc_err to arm_err_info_event
>> remove "ARM Processor Error: " prefix
>> rebase on Tyler's patchset v17 "Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64"
>>
>> https://patchwork.kernel.org/patch/9806267/
>>
>> v3: no change
>>
>> v2: add trace enabled condition as Steven's suggestion.
>> fix a typo.
>>
>> https://patchwork.kernel.org/patch/9653767/
>> ---
>>
>> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
>> Cc: Tyler Baicar <tbaicar@xxxxxxxxxxxxxx>
>> Signed-off-by: Xie XiuQi <xiexiuqi@xxxxxxxxxx>
>> ---
>> drivers/ras/ras.c | 11 +++++++
>> include/linux/cper.h | 5 ++++
>> include/ras/ras_event.h | 79 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 95 insertions(+)
>>
>> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
>> index 39701a5..f76ab0f 100644
>> --- a/drivers/ras/ras.c
>> +++ b/drivers/ras/ras.c
>> @@ -22,7 +22,17 @@ void log_non_standard_event(const uuid_le *sec_type, const uuid_le *fru_id,
>>
>> void log_arm_hw_error(struct cper_sec_proc_arm *err)
>> {
>> + int i;
>> + struct cper_arm_err_info *err_info;
>> +
>> trace_arm_event(err);
>> +
>> + if (!trace_arm_err_info_event_enabled())
>> + return;
>
> If we're going to check whether the tracepoint is enabled, you need
> to do that for arm_event TP too. Because from looking at the spec,
> arm_event dumps
>
> Table 260. ARM Processor Error Section
>
> and you're dumping
>
> Table 261. ARM Processor Error Information Structure
>
> which is embedded in the previous table.
>
> So this is basically a single error event and the error info structures
> can describe different incarnations to that error event.
>
> And you need to mirror exactly that behavior.
>
> Then, when you do that, you need to document somewhere so that userspace
> knows to open *both* TPs in order to get the full error information.
>
> Alternatively, you can extend arm_event to get issued with *each*
> cper_arm_err_info but that would mean a lot of redundant information
> being shuffled out to userspace.

How about we report the full information via arm_err_info_event, which is only for
those who want the detailed information, and leave arm_event closed? Anyone who does
not care about the error details could open just arm_event.

The output might look like this for each err_info in a section:

arm_err_info_event: affinity level: 1; MPIDR: 0000001; MIDR: 0000001; running state: 0; PSCI state: 1;
type: TLB error; count: 65535; flags: First error captured|Last error captured|Propagated|Overflow;
error info: 0000000005244678; virtual address: 0000000000013579; physical address: 0000000000024680
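
For reference, the flags field in the sample above is a set of names joined
with "|", one name per set bit, which is what __print_flags produces. A minimal
userspace sketch of that decoding follows. The bit positions (bit 0 to bit 3),
the table err_flag_names and the helper format_err_flags are assumptions made
for illustration and are not taken from the patch; unknown leftover bits are
simply ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit layout, for illustration only; check it against the spec. */
struct flag_name {
	unsigned int mask;
	const char *name;
};

static const struct flag_name err_flag_names[] = {
	{ 1u << 0, "First error captured" },
	{ 1u << 1, "Last error captured" },
	{ 1u << 2, "Propagated" },
	{ 1u << 3, "Overflow" },
};

/*
 * Hypothetical userspace helper that mimics the basic behaviour of
 * __print_flags() with "|" as the delimiter: every set bit appends its
 * name to buf. Returns the length of the resulting string.
 */
static size_t format_err_flags(unsigned int flags, char *buf, size_t len)
{
	size_t used = 0;
	size_t i;

	assert(buf && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(err_flag_names) / sizeof(err_flag_names[0]); i++) {
		int n;

		if (!(flags & err_flag_names[i].mask))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", err_flag_names[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			/* Name did not fit: drop the partial text and stop. */
			buf[used] = '\0';
			break;
		}
		used += (size_t)n;
	}
	return used;
}
```

With all four assumed bits set (0xF), the helper fills the buffer with the
same string as the flags field in the sample line above.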

One problem is that this may report some redundant information if there is more than one err_info in a section.
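
To make the redundancy concrete, here is a rough userspace sketch of the
"one event per err_info" idea. The structures demo_err_info and demo_section
and the function report_section are hypothetical, trimmed-down stand-ins, not
the real CPER definitions or the tracepoint itself. The point is only that the
section-level fields are printed again on every line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, trimmed-down stand-ins for the CPER structures. */
struct demo_err_info {
	unsigned int type;
	unsigned int count;
};

struct demo_section {
	unsigned long long mpidr;	/* section-level field */
	unsigned long long midr;	/* section-level field */
	size_t err_info_num;
	const struct demo_err_info *err_info;
};

/*
 * Emit one line per err_info. MPIDR and MIDR belong to the section but
 * are repeated on every line, which is the redundancy in question.
 * Returns the number of lines written.
 */
static size_t report_section(FILE *out, const struct demo_section *sec)
{
	size_t i;

	assert(out && sec);

	for (i = 0; i < sec->err_info_num; i++)
		fprintf(out,
			"arm_err_info_event: MPIDR: %016llx; MIDR: %016llx; type: %u; count: %u\n",
			sec->mpidr, sec->midr,
			sec->err_info[i].type, sec->err_info[i].count);

	return i;
}
```

A section carrying two err_info entries yields two lines that differ only in
the type and count fields.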

Does Tyler have any good ideas?

>
> So I guess that's ARM folks' call.
>

--
Thanks,
Xie XiuQi