Re: [BUG] Re: [PATCH v10 1/3] aerdrv: Trace Event for AER

From: Ethan Zhao
Date: Wed Dec 04 2013 - 10:28:19 EST


Rui,
Agreed. There are many confusing error type definitions that
need to be standardized or unified; some of them are
ambiguous or inconsistent, and some of them violate the ACPI/PCI specs.

According to the ACPI spec, 'FATAL' is in fact a sub-category
of 'UNCORRECTABLE'; the other sub-category is 'NON-FATAL'. So how should we
understand the 'UNCORRECTABLE's below? Do they mean uncorrectable-non-fatal?
That is only a guess, which is confusing.

acpi/actbl1.h

#define ACPI_EINJ_PROCESSOR_CORRECTABLE (1)
#define ACPI_EINJ_PROCESSOR_UNCORRECTABLE (1<<1)
#define ACPI_EINJ_PROCESSOR_FATAL (1<<2)
#define ACPI_EINJ_MEMORY_CORRECTABLE (1<<3)
#define ACPI_EINJ_MEMORY_UNCORRECTABLE (1<<4)
#define ACPI_EINJ_MEMORY_FATAL (1<<5)
#define ACPI_EINJ_PCIX_CORRECTABLE (1<<6)
#define ACPI_EINJ_PCIX_UNCORRECTABLE (1<<7)
#define ACPI_EINJ_PCIX_FATAL (1<<8)
#define ACPI_EINJ_PLATFORM_CORRECTABLE (1<<9)
#define ACPI_EINJ_PLATFORM_UNCORRECTABLE (1<<10)
#define ACPI_EINJ_PLATFORM_FATAL (1<<11)

edac.h

enum hw_event_mc_err_type {
HW_EVENT_ERR_CORRECTED,
HW_EVENT_ERR_UNCORRECTED,
HW_EVENT_ERR_FATAL,
HW_EVENT_ERR_INFO,
};

ghes.h
enum {
GHES_SEV_NO = 0x0,
GHES_SEV_CORRECTED = 0x1,
GHES_SEV_RECOVERABLE = 0x2,
GHES_SEV_PANIC = 0x3,
};

What is the meaning of GHES_SEV_PANIC? Why not 'FATAL', just as
described in ACPI spec section 18.3.2.6.1?
"
Error Severity (byte length 4, byte offset 16): Identifies the error
severity of the reported error:
0 - Recoverable
1 - Fatal
2 - Corrected
3 - None
"
If there was some other intention, it is still translated into 'FATAL' later:

case GHES_SEV_PANIC:
type = HW_EVENT_ERR_FATAL;
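
To show how the three naming schemes chain together, here is a minimal
standalone sketch (plain C, not kernel code). The ACPI_SEV_* names and both
helper functions are hypothetical, made up for illustration. Only the
GHES_SEV_PANIC -> HW_EVENT_ERR_FATAL step is taken from the snippet above;
the other mappings are my assumption of the obvious pairing.

```c
/* Severity values from the ACPI table quoted above.
 * The ACPI_SEV_* names are hypothetical, chosen for this sketch. */
enum acpi_err_sev {
	ACPI_SEV_RECOVERABLE = 0,
	ACPI_SEV_FATAL       = 1,
	ACPI_SEV_CORRECTED   = 2,
	ACPI_SEV_NONE        = 3,
};

/* Copied from ghes.h as quoted in this mail. */
enum {
	GHES_SEV_NO          = 0x0,
	GHES_SEV_CORRECTED   = 0x1,
	GHES_SEV_RECOVERABLE = 0x2,
	GHES_SEV_PANIC       = 0x3,
};

/* Copied from edac.h as quoted in this mail. */
enum hw_event_mc_err_type {
	HW_EVENT_ERR_CORRECTED,
	HW_EVENT_ERR_UNCORRECTED,
	HW_EVENT_ERR_FATAL,
	HW_EVENT_ERR_INFO,
};

/* Hypothetical helper: ACPI severity -> GHES severity (assumed mapping). */
int acpi_to_ghes_sev(enum acpi_err_sev sev)
{
	switch (sev) {
	case ACPI_SEV_CORRECTED:   return GHES_SEV_CORRECTED;
	case ACPI_SEV_RECOVERABLE: return GHES_SEV_RECOVERABLE;
	case ACPI_SEV_FATAL:       return GHES_SEV_PANIC;
	default:                   return GHES_SEV_NO;
	}
}

/* Hypothetical helper: GHES severity -> EDAC event type.
 * Only the PANIC -> FATAL case comes from the snippet above;
 * the remaining cases are assumptions. */
enum hw_event_mc_err_type ghes_to_hw_event(int ghes_sev)
{
	switch (ghes_sev) {
	case GHES_SEV_CORRECTED:   return HW_EVENT_ERR_CORRECTED;
	case GHES_SEV_RECOVERABLE: return HW_EVENT_ERR_UNCORRECTED;
	case GHES_SEV_PANIC:       return HW_EVENT_ERR_FATAL;
	default:                   return HW_EVENT_ERR_INFO;
	}
}
```

Under these assumptions 'Fatal' (1) in the spec becomes GHES_SEV_PANIC (3)
and then HW_EVENT_ERR_FATAL (2): one concept, three names and three values.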

And these look reasonable:
aer.h

#define AER_NONFATAL 0
#define AER_FATAL 1
#define AER_CORRECTABLE 2
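
To make the mismatch Rui reported concrete, here is a minimal standalone
sketch (plain userspace C, not kernel code). The enum, the defines and the
string table are copied from the headers quoted in this thread; the three
helper functions are hypothetical names introduced only for this example.
The last one is just one possible way to compare against the AER_* values,
not a tested patch.

```c
/* Copied from edac.h as quoted in this thread. */
enum hw_event_mc_err_type {
	HW_EVENT_ERR_CORRECTED,   /* 0 */
	HW_EVENT_ERR_UNCORRECTED, /* 1 */
	HW_EVENT_ERR_FATAL,       /* 2 */
	HW_EVENT_ERR_INFO,        /* 3 */
};

/* Copied from aer.h as quoted in this thread. */
#define AER_NONFATAL	0
#define AER_FATAL	1
#define AER_CORRECTABLE	2

/* The table used by aer_print_error(), indexed by the AER_* values. */
static const char *aer_error_severity_string[] = {
	"Uncorrected (Non-Fatal)",
	"Uncorrected (Fatal)",
	"Corrected"
};

/* Hypothetical helper: what dmesg prints for a given AER_* severity. */
const char *dmesg_severity(unsigned char severity)
{
	return aer_error_severity_string[severity];
}

/* Hypothetical helper mirroring the ternary in the patch's TP_printk(),
 * which compares an AER_* value against HW_EVENT_ERR_* constants. */
const char *trace_severity_as_posted(unsigned char severity)
{
	return severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
	       severity == HW_EVENT_ERR_FATAL ? "Fatal" : "Uncorrected";
}

/* Hypothetical alternative: the same ternary using the AER_* values. */
const char *trace_severity_aer(unsigned char severity)
{
	return severity == AER_CORRECTABLE ? "Corrected" :
	       severity == AER_FATAL ? "Fatal" : "Uncorrected";
}
```

With severity = AER_CORRECTABLE (2), dmesg_severity() gives "Corrected"
while trace_severity_as_posted() gives "Fatal", because 2 is also
HW_EVENT_ERR_FATAL. All three AER values are mislabelled the same way:
AER_NONFATAL (0) reads as "Corrected" and AER_FATAL (1) as "Uncorrected".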


Thanks,
Ethan

On Wed, Dec 4, 2013 at 11:10 AM, rui wang <ruiv.wang@xxxxxxxxx> wrote:
> Resending adding Mauro's new Email address...
>
>
> On 1/17/13, Lance Ortiz <lance.ortiz@xxxxxx> wrote:
>> This header file will define a new trace event that will be triggered when
>> an AER event occurs. The following data will be provided to the trace
>> event.
>>
>> char * dev_name - The name of the slot where the device resides
>> ([domain:]bus:device.function).
>>
>> u32 status - Either the correctable or uncorrectable register
>> indicating what error or errors have been seen.
>>
>> u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
>>
>> The trace event will also provide a trace string that may look like:
>>
>> "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned
>> TLP"
>>
>> v1-v2 Move header from include/ras/aer_event.h to
>> include/trace/events/ras.h
>> v3-v4 Cleaned up comments and commit header
>> v4-v5 More cleanup remove () from if statement in print.
>> Renamed string define to be more specific.
>> v5-v6 change TRACE_SYSTEM define to be ras and not aer.
>>
>> Signed-off-by: Lance Ortiz <lance.ortiz@xxxxxx>
>> Acked-by: Mauro Carvalho Chehab <mchehab@xxxxxxxxxx>
>> Acked-by: Tony Luck <tony.luck@xxxxxxxxx>
>> ---
>>
>> include/trace/events/ras.h | 77
>> ++++++++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 77 insertions(+), 0 deletions(-)
>> create mode 100644 include/trace/events/ras.h
>>
>> diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h
>> new file mode 100644
>> index 0000000..88b8783
>> --- /dev/null
>> +++ b/include/trace/events/ras.h
>> @@ -0,0 +1,77 @@
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM ras
>> +
>> +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_AER_H
>> +
>> +#include <linux/tracepoint.h>
>> +#include <linux/edac.h>
>> +
>> +
>> +/*
>> + * PCIe AER Trace event
>> + *
>> + * These events are generated when hardware detects a corrected or
>> + * uncorrected event on a PCIe device. The event report has
>> + * the following structure:
>> + *
>> + * char * dev_name - The name of the slot where the device resides
>> + * ([domain:]bus:device.function).
>> + * u32 status - Either the correctable or uncorrectable register
>> + * indicating what error or errors have been seen
>> + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED
>> + */
>> +
>> +#define aer_correctable_errors \
>> + {BIT(0), "Receiver Error"}, \
>> + {BIT(6), "Bad TLP"}, \
>> + {BIT(7), "Bad DLLP"}, \
>> + {BIT(8), "RELAY_NUM Rollover"}, \
>> + {BIT(12), "Replay Timer Timeout"}, \
>> + {BIT(13), "Advisory Non-Fatal"}
>> +
>> +#define aer_uncorrectable_errors \
>> + {BIT(4), "Data Link Protocol"}, \
>> + {BIT(12), "Poisoned TLP"}, \
>> + {BIT(13), "Flow Control Protocol"}, \
>> + {BIT(14), "Completion Timeout"}, \
>> + {BIT(15), "Completer Abort"}, \
>> + {BIT(16), "Unexpected Completion"}, \
>> + {BIT(17), "Receiver Overflow"}, \
>> + {BIT(18), "Malformed TLP"}, \
>> + {BIT(19), "ECRC"}, \
>> + {BIT(20), "Unsupported Request"}
>> +
>> +TRACE_EVENT(aer_event,
>> + TP_PROTO(const char *dev_name,
>> + const u32 status,
>> + const u8 severity),
>> +
>> + TP_ARGS(dev_name, status, severity),
>> +
>> + TP_STRUCT__entry(
>> + __string( dev_name, dev_name )
>> + __field( u32, status )
>> + __field( u8, severity )
>> + ),
>> +
>> + TP_fast_assign(
>> + __assign_str(dev_name, dev_name);
>> + __entry->status = status;
>> + __entry->severity = severity;
>> + ),
>> +
>> + TP_printk("%s PCIe Bus Error: severity=%s, %s\n",
>> + __get_str(dev_name),
>> + __entry->severity == HW_EVENT_ERR_CORRECTED ? "Corrected" :
>> + __entry->severity == HW_EVENT_ERR_FATAL ?
>> + "Fatal" : "Uncorrected",
>> + __entry->severity == HW_EVENT_ERR_CORRECTED ?
>> + __print_flags(__entry->status, "|", aer_correctable_errors) :
>> + __print_flags(__entry->status, "|", aer_uncorrectable_errors))
>> +);
>
> Here's a bug causing inconsistency between dmesg and the trace event output.
> When dmesg says "severity=Corrected", the trace event says
> "severity=Fatal". What happens is that HW_EVENT_ERR_CORRECTED is
> defined in edac.h:
>
> enum hw_event_mc_err_type {
> HW_EVENT_ERR_CORRECTED,
> HW_EVENT_ERR_UNCORRECTED,
> HW_EVENT_ERR_FATAL,
> HW_EVENT_ERR_INFO,
> };
>
> while aer_print_error() uses aer_error_severity_string[] defined as:
>
> static const char *aer_error_severity_string[] = {
> "Uncorrected (Non-Fatal)",
> "Uncorrected (Fatal)",
> "Corrected"
> };
>
> In this case dmesg is correct because info->severity is assigned in
> aer_isr_one_error() using the definitions in include/linux/aer.h:
> #define AER_NONFATAL 0
> #define AER_FATAL 1
> #define AER_CORRECTABLE 2
>
> So which one is the standard? Is there a plan to unify all these names?
>
> Thanks
> Rui Wang
>
>> +
>> +#endif /* _TRACE_AER_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html