Re: [PATCH] trace: skip hwasan

From: James Morse
Date: Thu Feb 21 2019 - 09:19:25 EST


Hi!

On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]

Decode is a strong term!

This stuff is printed by Cavium's secure-world software. All I'm doing is spotting the
bits that vary between the outputs we've seen!


> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@xxxxxx> wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
>>>> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@xxxxxx> wrote:
>>>>>
>>>>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>>>>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>>>>> because there is a burst of too much pointer access, and then KASAN will
>>>>> dereference each byte of the shadow address for the tag checking which
>>>>> will kill all the CPUs.
>>>>
>>>> Could you please elaborate what exactly happens and who/why kills
>>>> CPUs? Number of memory accesses should not make any difference.
>>>> With hardware support (MTE) it won't be possible to disable
>>>> instrumentation (loads and stores check tags themselves), so it would
>>>> be useful to keep track of exact reasons we disable instrumentation to
>>>> know how to deal with them with hardware support.
>>>> It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).

> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.

Not quite like this. I've had one choke on some PCIe transaction[0].

This looks like corruption detected in a cache associated with a CPU. 'Write Back' and
'Physical Address' suggest it's the data cache:


>>> Node 0 NBU 0 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 1 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 2 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff80
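
For comparing several of these reports, a rough sketch along the following lines could
tally which core/thread each one names and how far apart the reported addresses are. The
field names are the ones in the output quoted above; the function name and the regular
expression are made up for illustration, and it assumes the report is pasted as text,
with or without the '>' quote prefixes.

```python
import re
from collections import Counter

# Matches the three fields of interest, tolerating leading '>' or '#' quote markers.
FIELD = re.compile(r"^[>#\s]*(Core ID|Thread ID|Physical Address)\s*:\s*(\S+)\s*$")


def summarise_reports(text):
    """Return (Counter of (core, thread) pairs, sorted list of address strides)."""
    pairs = []
    addrs = []
    core = None
    for line in text.splitlines():
        m = FIELD.match(line)
        if not m:
            continue
        key, value = m.groups()
        if key == "Physical Address":
            addrs.append(int(value, 16))   # int() accepts the 0x prefix with base 16
        elif key == "Core ID":
            core = int(value)
        elif key == "Thread ID":
            # Assumes 'Thread ID' follows its 'Core ID', as in the quoted output.
            pairs.append((core, int(value)))
    # Differences between consecutive reported addresses, de-duplicated.
    strides = sorted({b - a for a, b in zip(addrs, addrs[1:])})
    return Counter(pairs), strides
```

Fed the report quoted above, this counts Core 21 / Thread 1 twice and gives a single
stride of 0x40 (64 bytes) between the three addresses.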

If you can reproduce it, and it always affects Core:21,Thread:1, I'd suggest offlining
all the threads/CPUs in that core. It may be that one cache is close to some threshold,
in which case you can work around it by offlining the core it's part of.
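
A minimal sketch of doing that through the CPU hotplug files in sysfs follows. It assumes
the 'Core ID' printed by the firmware corresponds to the kernel's topology/core_id, which
is not confirmed anywhere in this thread and should be checked first. The function names
and the sysfs_root parameter are made up for illustration; writing to the 'online' files
needs root.

```python
import glob
import os

SYSFS_CPU = "/sys/devices/system/cpu"


def _read_int(path):
    with open(path) as f:
        return int(f.read().strip())


def cpus_in_core(core_id, package_id=0, sysfs_root=SYSFS_CPU):
    """List logical CPU numbers whose topology matches the given core and package."""
    cpus = []
    # 'cpu[0-9]*' skips sibling directories such as cpufreq and cpuidle.
    for topo in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*", "topology")):
        if (_read_int(os.path.join(topo, "core_id")) == core_id and
                _read_int(os.path.join(topo, "physical_package_id")) == package_id):
            cpu_dir = os.path.basename(os.path.dirname(topo))
            cpus.append(int(cpu_dir[len("cpu"):]))
    return sorted(cpus)


def offline_cpus(cpus, sysfs_root=SYSFS_CPU):
    """Write 0 to each CPU's 'online' file, taking it offline."""
    for cpu in cpus:
        with open(os.path.join(sysfs_root, "cpu%d" % cpu, "online"), "w") as f:
            f.write("0")
```

For the report above that would be something like offline_cpus(cpus_in_core(21)), after
printing the list and checking it looks sane. CPUs that are already offline may not show
up, as their topology files can be missing.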


Thanks,

James


[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
# Agent info : IO
# : PCIE0
# Requ: type : 2 : Read