Re: [PATCH] trace: skip hwasan

From: James Morse
Date: Thu Feb 21 2019 - 09:19:25 EST

Next message: Kirill A. Shutemov: "Re: [PATCH v2 02/13] x86/mm: Add p?d_large() definitions"
Previous message: Pankaj Bansal: "RE: [PATCH v2] drivers: mux: Add Generic regmap bitfield-based multiplexer in mmio-mux"
In reply to: Will Deacon: "Re: [PATCH] trace: skip hwasan"
Next in thread: Will Deacon: "Re: [PATCH] trace: skip hwasan"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi!

On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]

Decode is a strong term!

This stuff is printed by Cavium's secure-world software. All I'm doing is spotting the
bits that vary between the outputs we've seen!

> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <cai@xxxxxx> wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
>>>> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <cai@xxxxxx> wrote:
>>>>>
>>>>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>>>>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>>>>> because there is a burst of too much pointer access, and then KASAN will
>>>>> dereference each byte of the shadow address for the tag checking which
>>>>> will kill all the CPUs.
>>>>
>>>> Could you please elaborate what exactly happens and who/why kills
>>>> CPUs? Number of memory accesses should not make any difference.
>>>> With hardware support (MTE) it won't be possible to disable
>>>> instrumentation (loads and stores check tags themselves), so it would
>>>> be useful to keep track of exact reasons we disable instrumentation to
>>>> know how to deal with them with hardware support.
>>>> It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).

> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.

Not quite like this. I've had one choke on some PCIe transaction[0].

This looks like corruption detected in a cache associated with a CPU. 'Write Back' and
'Physical Address' suggest it's the data cache:

>>> Node 0 NBU 0 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 1 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 2 Error report :
>>> NBU BAR Error
[..]
>>> Physical Address : 0x40011ff80
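
The "spot the bits that vary" comparison can be done mechanically. Below is a minimal
Python sketch, assuming the report keeps the field labels seen in the excerpt above
('Node N NBU M Error report', 'Physical Address', 'Core ID', 'Thread ID'). The function
names are hypothetical, and the firmware's output format is not documented here, so treat
the regular expressions as guesses fitted to this one sample.

```python
import re
from collections import Counter

# Sample taken from the excerpt above (elided lines left out, last report truncated).
REPORT = """\
Node 0 NBU 0 Error report :
Physical Address : 0x40011ff00
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
Node 0 NBU 1 Error report :
Physical Address : 0x40011ff40
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
Node 0 NBU 2 Error report :
Physical Address : 0x40011ff80
"""

# Patterns fitted to the sample only; the real format may differ (assumption).
NODE_RE = re.compile(r"Node (\d+) NBU (\d+) Error report")
ADDR_RE = re.compile(r"Physical Address\s*:\s*(0x[0-9a-fA-F]+)")
CORE_RE = re.compile(r"Core ID\s*:\s*(\d+)")
THREAD_RE = re.compile(r"Thread ID\s*:\s*(\d+)")


def parse_reports(text):
    """Return one dict per 'Error report' header, filled with the fields that follow it."""
    reports = []
    for line in text.splitlines():
        m = NODE_RE.search(line)  # search() tolerates leading mail quote markers
        if m:
            reports.append({"node": int(m.group(1)), "nbu": int(m.group(2))})
            continue
        if not reports:
            continue  # ignore anything before the first header
        cur = reports[-1]
        if (m := ADDR_RE.search(line)):
            cur["addr"] = int(m.group(1), 16)
        elif (m := CORE_RE.search(line)):
            cur["core"] = int(m.group(1))
        elif (m := THREAD_RE.search(line)):
            cur["thread"] = int(m.group(1))
    return reports


def summarize(reports):
    """Count reports per (core, thread) and list the gaps between sorted addresses."""
    agents = Counter(
        (r["core"], r["thread"]) for r in reports if "core" in r and "thread" in r
    )
    addrs = sorted(r["addr"] for r in reports if "addr" in r)
    strides = [b - a for a, b in zip(addrs, addrs[1:])]
    return agents, strides


if __name__ == "__main__":
    agents, strides = summarize(parse_reports(REPORT))
    print("reports per (core, thread):", dict(agents))
    print("address strides:", [hex(s) for s in strides])
```

On this sample, every report that names an agent names the same core and thread, and the
three addresses are 0x40 bytes apart. That spacing would be consistent with adjacent
cache lines if the line size is 64 bytes, which is an assumption here.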

If you can reproduce it, and it always affects Core 21, Thread 1, I'd suggest offlining
all the threads/CPUs in that core. It may be that one cache is close to some threshold, in
which case you can work around the problem by offlining the core that it's part of.
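
For anyone wanting to try the offlining, here is a rough Python sketch using the standard
Linux CPU hotplug files in sysfs (`topology/thread_siblings_list` and `online`). The
function names and the `dry_run` switch are hypothetical. It also assumes you already know
which Linux CPU number corresponds to the firmware's 'Core ID : 21'; that mapping is not
given anywhere in this thread, so the CPU number in the usage line is only a placeholder.

```python
import os


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '21,77' or '84-87' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def offline_core(cpu, sysfs_root="/sys/devices/system/cpu", dry_run=True):
    """Offline every hardware thread sharing a core with `cpu`; return the thread list.

    The sibling list is read once, before anything is taken offline.
    With dry_run=True nothing is written; the intended writes are only printed.
    """
    siblings_path = os.path.join(
        sysfs_root, f"cpu{cpu}", "topology", "thread_siblings_list"
    )
    with open(siblings_path) as f:
        siblings = parse_cpu_list(f.read())

    for sib in siblings:
        online_path = os.path.join(sysfs_root, f"cpu{sib}", "online")
        if dry_run:
            print(f"would write 0 to {online_path}")
        else:
            # Needs root. Some CPUs (often cpu0) may not expose an 'online' file.
            with open(online_path, "w") as f:
                f.write("0")
    return siblings


if __name__ == "__main__":
    # Placeholder CPU number: substitute whichever Linux CPU maps to the suspect core.
    offline_core(21, dry_run=True)
```

Running it with `dry_run=True` first shows which CPUs would be taken offline before
anything is changed.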

Thanks,

James

[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
# Agent info : IO
# : PCIE0
# Requ: type : 2 : Read
