Re: Crash on armv7-a using KASAN
From: Mark Rutland
Date: Tue Oct 15 2024 - 10:00:49 EST
On Tue, Oct 15, 2024 at 03:51:02PM +0200, Linus Walleij wrote:
> On Tue, Oct 15, 2024 at 12:28 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > On Mon, Oct 14, 2024 at 03:19:49PM +0200, Clement LE GOFFIC wrote:
>
> > I think what's happening here is that when switching from prev to next
> > in the scheduler, we switch to next's mm before we actually switch to
> > next's register state, and there's a transient window where prev is
> > executed using next's mm. AFAICT we don't map prev's KASAN stack shadow
> > into next's mm anywhere, and so inlined KASAN_STACK checks recursively
> > fault on this until we switch to the overflow stack.
>
> Oh my, that's pretty advanced. Well spotted!
> So it has nothing to do with Ards commit, correlation does not
> imply causation.
>
> > More details on that below.
> >
> > Linus, are you able to look into this?
>
> Of course, I'm trying to reproduce the bug.
>
> > > __dabt_svc from do_translation_fault+0x30/0x2b0
> > > do_translation_fault from do_DataAbort+0x74/0x1dc
> > > do_DataAbort from __dabt_svc+0x4c/0x80
> > > Exception stack(0xac003ad8 to 0xac003b20)
> > > 3ac0: ac003bc8
> > > 00000005
> > > 3ae0: ac003b88 74800779 7480078f ac003b88 7480078f ac003b88 00000005
> > > 82412640
> > > 3b00: ac003d20 ac003d54 00000051 ac003b28 80125c14 80125920 200f0193
> > > ffffffff
> > > __dabt_svc from do_translation_fault+0x30/0x2b0
> > > do_translation_fault from do_DataAbort+0x74/0x1dc
> > > do_DataAbort from __dabt_svc+0x4c/0x80
> > > Exception stack(0xac003b88 to 0xac003bd0)
> > > 3b80: ac003c78 00000805 ac003c38 7480078f 74800798
> > > ac003c38
> > > 3ba0: 74800798 ac003c38 00000805 82412640 ac003d20 ac003d54 00000051
> > > ac003bd8
> > > 3bc0: 80125c14 80125920 200f0193 ffffffff
> > > __dabt_svc from do_translation_fault+0x30/0x2b0
> > > do_translation_fault from do_DataAbort+0x74/0x1dc
> > > do_DataAbort from __dabt_svc+0x4c/0x80
> >
> > The above frames are the same; whatever the kernel is accessing at
> > do_translation_fault+0x30 is causing this to go recursive...
> >
> > I can reproduce this, pretty easily, with a similar enough trace, though
> > faddr2line isn't happy to give me a line number.
>
> Did you reproduce it the same way with a few find /?
>
> I am trying to reproduce it and failing :/
> (Using Torvald's HEAD)
>
> This is my config:
>
> CONFIG_HAVE_ARCH_KASAN=y
> CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
> CONFIG_CC_HAS_KASAN_GENERIC=y
> CONFIG_CC_HAS_WORKING_NOSANITIZE_ADDRESS=y
> CONFIG_KASAN=y
> CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX=y
> CONFIG_KASAN_GENERIC=y
> CONFIG_KASAN_OUTLINE=y
> # CONFIG_KASAN_INLINE is not set
> # CONFIG_KASAN_STACK is not set
> # CONFIG_KASAN_VMALLOC is not set
> # CONFIG_KASAN_EXTRA_INFO is not set
>
> Do you use more KASAN?
I used a config per Clement's instructions, i.e.
| [mark@lakrids:~/src/linux]% git checkout v6.12-rc3
| HEAD is now at 8e929cb546ee4 Linux 6.12-rc3
| [mark@lakrids:~/src/linux]% git clean -qfdx
| [mark@lakrids:~/src/linux]% echo 'CONFIG_KASAN=y' > arch/arm/configs/fragment-kasan.config
| [mark@lakrids:~/src/linux]% usekorg 14.2.0 make ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- vexpress_defconfig fragment-kasan.config
| HOSTCC scripts/basic/fixdep
| HOSTCC scripts/kconfig/conf.o
| HOSTCC scripts/kconfig/confdata.o
| HOSTCC scripts/kconfig/expr.o
| LEX scripts/kconfig/lexer.lex.c
| YACC scripts/kconfig/parser.tab.[ch]
| HOSTCC scripts/kconfig/lexer.lex.o
| HOSTCC scripts/kconfig/menu.o
| HOSTCC scripts/kconfig/parser.tab.o
| HOSTCC scripts/kconfig/preprocess.o
| HOSTCC scripts/kconfig/symbol.o
| HOSTCC scripts/kconfig/util.o
| HOSTLD scripts/kconfig/conf
| #
| # configuration written to .config
| #
| Using .config as base
| Merging ./arch/arm/configs/fragment-kasan.config
| Value of CONFIG_KASAN is redefined by fragment ./arch/arm/configs/fragment-kasan.config:
| Previous value: # CONFIG_KASAN is not set
| New value: CONFIG_KASAN=y
|
| #
| # merged configuration written to .config (needs make)
| #
| #
| # configuration written to .config
| #
| [mark@lakrids:~/src/linux]% grep KASAN .config
| CONFIG_KASAN_SHADOW_OFFSET=0x5f000000
| CONFIG_HAVE_ARCH_KASAN=y
| CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
| CONFIG_CC_HAS_KASAN_GENERIC=y
| CONFIG_KASAN=y
| CONFIG_CC_HAS_KASAN_MEMINTRINSIC_PREFIX=y
| CONFIG_KASAN_GENERIC=y
| # CONFIG_KASAN_OUTLINE is not set
| CONFIG_KASAN_INLINE=y
| CONFIG_KASAN_STACK=y
| CONFIG_KASAN_VMALLOC=y
| # CONFIG_KASAN_MODULE_TEST is not set
| # CONFIG_KASAN_EXTRA_INFO is not set
IIUC you'll need CONFIG_KASAN_INLINE=y, CONFIG_KASAN_STACK=y, and
CONFIG_KASAN_VMALLOC=y.
With that, just booting and tapping a few keys was sufficient to trigger
a crash using Debian the initrd Clement linked.
[...]
> > The relevant asm is:
> (...)
> > ... so we're using the new task's mm, but still executing in the context of the
> > old task (and using its stack). I suspect the new task's mm doesn't have the
> > old task's stack shadow mapped in, and AFAICT we don't map that in explicitly
> > anywhere before we switch to the new mm.
> >
> > Linus, can you look into that?
>
> Yeah it looks like a spot-on identification of the problem, I can try to
> think about how we could fix this if I can reproduce it, I keep trying
> to provoke the crash :/
It's a bit grotty -- AFAICT you'd either need to prefault in the
specific part of the vmalloc space when switching tasks, or we'd need to
preallocate all the shared vmalloc tables at the start of time so that
they're always up-to-date.
While we could disable KASAN_STACK, that's only going to mask the
problem until this happens for any other vmalloc shadow...
Mark.