Re: [PATCH] arm64: Print provenance information for addresses in registers

From: Stefan Wiehler
Date: Wed Mar 20 2024 - 10:15:11 EST


Hi Mark,

> Regardless of what 32-bit ARM does, I don't think it makes sense to dump this
> for all the GPRs. It's incredibly verbose and not all that helpful, e.g.
> testing with the LKDTM WRITE_RO test I get:
>
> | # echo WRITE_RO > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry WRITE_RO
> | lkdtm: attempting bad rodata write at ffffadee958c1bf0
> | Unable to handle kernel write to read-only memory at virtual address ffffadee958c1bf0
> | Mem abort info:
> | ESR = 0x000000009600004e
> | EC = 0x25: DABT (current EL), IL = 32 bits
> | SET = 0, FnV = 0
> | EA = 0, S1PTW = 0
> | FSC = 0x0e: level 2 permission fault
> | Data abort info:
> | ISV = 0, ISS = 0x0000004e, ISS2 = 0x00000000
> | CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> | GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> | swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041d6d000
> | [ffffadee958c1bf0] pgd=100000043ffff803, p4d=100000043ffff803, pud=100000043fffe803, pmd=0060000041600f81
> | Internal error: Oops: 000000009600004e [#1] PREEMPT SMP
> | Modules linked in:
> | CPU: 4 PID: 147 Comm: sh Not tainted 6.8.0-rc3-00001-gc66454d88811 #1
> | Hardware name: linux,dummy-virt (DT)
> | pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> | pc : lkdtm_WRITE_RO+0x3c/0x54
> | lr : lkdtm_WRITE_RO+0x24/0x54
> | sp : ffff800080913c80
> | x29: ffff800080913c80 x28: ffff0000c09ca200 x27: 0000000000000000
> | x26: 0000000000000000 x25: 0000000000000000 x24: 0000aaaaee418930
> | x23: ffff0000c0645780 x22: ffff800080913df0 x21: ffffadee96c957a0
> | x20: 0000000000000009 x19: ffffadee958c1000 x18: 0000000000000006
> | x17: 0000000000000000 x16: 0000000000000000 x15: ffff800080913720
> | x14: 0000000000000000 x13: ffffadee969310b0 x12: 0000000000000432
> | x11: 0000000000000166 x10: ffffadee969890b0 x9 : ffffadee969310b0
> | x8 : 00000000ffffefff x7 : ffffadee969890b0 x6 : 80000000fffff000
> | x5 : ffff0003fdfb5cc8 x4 : 0000000000000000 x3 : 0000000000000000
> | x2 : 00000000abcd1234 x1 : 000000000198b861 x0 : ffffadee95e334a0
> | Call trace:
> | lkdtm_WRITE_RO+0x3c/0x54
> | lkdtm_do_action+0x1c/0x30
> | direct_entry+0xbc/0x144
> | full_proxy_write+0x60/0xb4
> | vfs_write+0xd0/0x35c
> | ksys_write+0x70/0x104
> | __arm64_sys_write+0x1c/0x28
> | invoke_syscall+0x48/0x110
> | el0_svc_common.constprop.0+0x40/0xe0
> | do_el0_svc+0x1c/0x28
> | el0_svc+0x34/0xb4
> | el0t_64_sync_handler+0x120/0x12c
> | el0t_64_sync+0x190/0x194
> | Register x0 information: 0-page vmalloc region starting at 0xffffadee95470000 allocated at paging_init+0x148/0x6f4
> | Register x1 information: non-paged memory
> | Register x2 information: non-paged memory
> | Register x3 information: NULL pointer
> | Register x4 information: NULL pointer
> | Register x5 information: non-slab/vmalloc memory
> | Register x6 information: non-paged memory
> | Register x7 information: 0-page vmalloc region starting at 0xffffadee96910000 allocated at paging_init+0x1d8/0x6f4
> | Register x8 information: non-paged memory
> | Register x9 information: 0-page vmalloc region starting at 0xffffadee96910000 allocated at paging_init+0x1d8/0x6f4
> | Register x10 information: 0-page vmalloc region starting at 0xffffadee96910000 allocated at paging_init+0x1d8/0x6f4
> | Register x11 information: non-paged memory
> | Register x12 information: non-paged memory
> | Register x13 information: 0-page vmalloc region starting at 0xffffadee96910000 allocated at paging_init+0x1d8/0x6f4
> | Register x14 information: NULL pointer
> | Register x15 information: 4-page vmalloc region starting at 0xffff800080910000 allocated at kernel_clone+0x64/0x360
> | Register x16 information: NULL pointer
> | Register x17 information: NULL pointer
> | Register x18 information: non-paged memory
> | Register x19 information: 0-page vmalloc region starting at 0xffffadee95470000 allocated at paging_init+0x148/0x6f4
> | Register x20 information: non-paged memory
> | Register x21 information: 0-page vmalloc region starting at 0xffffadee96910000 allocated at paging_init+0x1d8/0x6f4
> | Register x22 information: 4-page vmalloc region starting at 0xffff800080910000 allocated at kernel_clone+0x64/0x360
> | Register x23 information: slab dentry start ffff0000c0645780 pointer offset 0 size 192
> | Register x24 information: non-paged memory
> | Register x25 information: NULL pointer
> | Register x26 information: NULL pointer
> | Register x27 information: NULL pointer
> | Register x28 information: slab task_struct start ffff0000c09ca200 pointer offset 0 size 4352
> | Register x29 information: 4-page vmalloc region starting at 0xffff800080910000 allocated at kernel_clone+0x64/0x360
> | Code: f2b579a2 900089c0 ca020021 91128000 (f905fa61)
> | ---[ end trace 0000000000000000 ]---
>
> Looking at the dump, the vast majority of those "Register X?? information:"
> lines are irrelevant, and it's difficult to see the wood for the trees.
>
> Why do we need this for (almost) all registers, rather than just the faulting
> address?

We often get bug reports of rarely reproducing issues on customer sites
where we would like to dump as much helpful information as possible.

Let me show you an example of an ARM32 oops where register provenance
information proved to be quite useful (abbreviated):

Unable to handle kernel NULL pointer dereference at virtual address 0000001c
pgd = ea66fa29
[0000001c] *pgd=8461ab003, *pmd=8f7d9c003
Internal error: Oops: 207 [#1] PREEMPT SMP ARM
...
PC is at do_open_execat+0x180/0x228
LR is at 0x0
pc : [<c03f9640>] lr : [<00000000>] psr: 40070013
sp : c7c27ee8 ip : 00001020 fp : 0002fde4
r10: 0000000b r9 : 00000001 r8 : 00000000
r7 : 00000100 r6 : c66aa0c0 r5 : c15f8000 r4 : 00000000
r3 : 000a801d r2 : 00001020 r1 : 00020020 r0 : c417f990
Flags: nZcv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 30c5387d Table: 46572940 DAC: 55555555
Register r0 information: slab dentry start c417f990 pointer offset 0 size 36
Register r1 information: non-paged memory
Register r2 information: non-paged memory
Register r3 information: non-paged memory
Register r4 information: NULL pointer
Register r5 information: slab names_cache start c15f8000 pointer offset 0 size 4096
Register r6 information: slab filp start c66aa0c0 pointer offset 0
Register r7 information: non-paged memory
Register r8 information: NULL pointer
Register r9 information: non-paged memory
Register r10 information: non-paged memory
Register r11 information: non-paged memory
Register r12 information: non-paged memory
Process ddal (pid: 14765, stack limit = 0xdb737995)
Stack: (0xc7c27ee8 to 0xc7c28000)
...
[<c03f9640>] (do_open_execat) from [<c03f987c>] (bprm_execve+0x150/0x6b4)
[<c03f987c>] (bprm_execve) from [<c03f9f8c>] (do_execveat_common+0x1ac/0x208)
[<c03f9f8c>] (do_execveat_common) from [<c03fae08>] (sys_execve+0x38/0x40)
[<c03fae08>] (sys_execve) from [<c0200140>] (ret_fast_syscall+0x0/0x40)
Exception stack(0xc7c27fa8 to 0xc7c27ff0)
...

>From the disassembly we can see that the null pointer comes from r6:

(gdb) disassemble do_open_execat+0x180
...
0xc03f9638 <+376>: ldr r0, [r6, #12]
0xc03f963c <+380>: ldr lr, [r0, #40] ; 0x28
0xc03f9640 <+384>: ldr r3, [lr, #28]
...

Thanks to register provenance information we know now that r6 points
into the filp SLAB cache, and by following the offsets added we can find
the null pointer entry.

> In die_kernel_fault() we already print some information regarding the psecific
> faulting address:
>
> | static void die_kernel_fault(const char *msg, unsigned long addr,
> | unsigned long esr, struct pt_regs *regs)
> | {
> | bust_spinlocks(1);
> |
> | pr_alert("Unable to handle kernel %s at virtual address %016lx\n", msg,
> | addr);
> |
> | kasan_non_canonical_hook(addr);
> |
> | mem_abort_decode(esr);
> |
> | show_pte(addr);
> | die("Oops", regs, esr);
> | bust_spinlocks(0);
> | make_task_dead(SIGKILL);
> | }
>
> ... and IMO it'd make more sense to call mem_dump_obj() there, e.g.
>
> | static void die_kernel_fault(const char *msg, unsigned long addr,
> | unsigned long esr, struct pt_regs *regs)
> | {
> | bust_spinlocks(1);
> |
> | pr_alert("Unable to handle kernel %s at virtual address %016lx\n", msg,
> | addr);
> |
> | pr_alert("Fauling address:");
> | mem_dump_obj((void *)addr);
> | kasan_non_canonical_hook(addr);
> |
> | mem_abort_decode(esr);
> |
> | show_pte(addr);
> | die("Oops", regs, esr);
> | bust_spinlocks(0);
> | make_task_dead(SIGKILL);
> | }
>
> ... which produces significantly more legible output, e.g.
>
> | # echo WRITE_RO > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry WRITE_RO
> | lkdtm: attempting bad rodata write at ffffa66f5e2c1bf0
> | Unable to handle kernel write to read-only memory at virtual address ffffa66f5e2c1bf0
> | Fauling address: 0-page vmalloc region starting at 0xffffa66f5de70000 allocated at paging_init+0x148/0x6f4
> | Mem abort info:
> | ESR = 0x000000009600004e
> | EC = 0x25: DABT (current EL), IL = 32 bits
> | SET = 0, FnV = 0
> | EA = 0, S1PTW = 0
> | FSC = 0x0e: level 2 permission fault
> | Data abort info:
> | ISV = 0, ISS = 0x0000004e, ISS2 = 0x00000000
> | CM = 0, WnR = 1, TnD = 0, TagAccess = 0
> | GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> | swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000041d6d000
> | [ffffa66f5e2c1bf0] pgd=100000043ffff803, p4d=100000043ffff803, pud=100000043fffe803, pmd=0060000041600f81
> | Internal error: Oops: 000000009600004e [#1] PREEMPT SMP
> | Modules linked in:
> | CPU: 3 PID: 148 Comm: sh Not tainted 6.8.0-rc3-00002-gf95e3bd06270-dirty #2
> | Hardware name: linux,dummy-virt (DT)
> | pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> | pc : lkdtm_WRITE_RO+0x3c/0x54
> | lr : lkdtm_WRITE_RO+0x24/0x54
> | sp : ffff80008098bc80
> | x29: ffff80008098bc80 x28: ffff0000c144a200 x27: 0000000000000000
> | x26: 0000000000000000 x25: 0000000000000000 x24: 0000aaab235e21d0
> | x23: ffff0000c06459c0 x22: ffff80008098bdf0 x21: ffffa66f5f6957a0
> | x20: 0000000000000009 x19: ffffa66f5e2c1000 x18: 0000000000000006
> | x17: 0000000000000000 x16: 0000000000000000 x15: ffff80008098b720
> | x14: 0000000000000000 x13: ffffa66f5f3310b0 x12: 0000000000000432
> | x11: 0000000000000166 x10: ffffa66f5f3890b0 x9 : ffffa66f5f3310b0
> | x8 : 00000000ffffefff x7 : ffffa66f5f3890b0 x6 : 80000000fffff000
> | x5 : ffff0003fdf9fcc8 x4 : 0000000000000000 x3 : 0000000000000000
> | x2 : 00000000abcd1234 x1 : 000000000198b861 x0 : ffffa66f5e8334b0
> | Call trace:
> | lkdtm_WRITE_RO+0x3c/0x54
> | lkdtm_do_action+0x1c/0x30
> | direct_entry+0xbc/0x144
> | full_proxy_write+0x60/0xb4
> | vfs_write+0xd0/0x35c
> | ksys_write+0x70/0x104
> | __arm64_sys_write+0x1c/0x28
> | invoke_syscall+0x48/0x110
> | el0_svc_common.constprop.0+0x40/0xe0
> | do_el0_svc+0x1c/0x28
> | el0_svc+0x34/0xb4
> | el0t_64_sync_handler+0x120/0x12c
> | el0t_64_sync+0x190/0x194
> | Code: f2b579a2 900089c0 ca020021 9112c000 (f905fa61)
> | ---[ end trace 0000000000000000 ]---
>
> ... and likewise even when we hit addresses without KASAN shadow:
>
> | # echo ACCESS_NULL > /sys/kernel/debug/provoke-crash/DIRECT
> | lkdtm: Performing direct entry ACCESS_NULL
> | lkdtm: attempting bad read at 0000000000000000
> | Unable to handle kernel paging request at virtual address dfff800000000000
> | Fauling address: non-paged memory
> | KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> | Mem abort info:
> | ESR = 0x0000000096000005
> | EC = 0x25: DABT (current EL), IL = 32 bits
> | SET = 0, FnV = 0
> | EA = 0, S1PTW = 0
> | FSC = 0x05: level 1 translation fault
> | Data abort info:
> | ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
> | CM = 0, WnR = 0, TnD = 0, TagAccess = 0
> | GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
> | [dfff800000000000] address between user and kernel address ranges
> | Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
> | Modules linked in:
> | CPU: 2 PID: 149 Comm: sh Not tainted 6.8.0-rc3-00002-gf95e3bd06270-dirty #3
> | Hardware name: linux,dummy-virt (DT)
> | pstate: 61400009 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> | pc : lkdtm_ACCESS_NULL+0x2c/0xa4
> | lr : lkdtm_ACCESS_NULL+0x24/0xa4
> | sp : ffff800080d87ab0
> | x29: ffff800080d87ab0 x28: ffff0000c5296f14 x27: 1fffe00018a52df5
> | x26: ffff0000c5296fa0 x25: 1fffe00018a52df4 x24: ffff0000c8fcda00
> | x23: 0000aaaadb0a7040 x22: ffff800080d87d70 x21: ffffcb98b050dae0
> | x20: 000000000000000c x19: ffffcb98acd53f40 x18: ffff0003fdff56c0
> | x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> | x14: 0000000000000000 x13: 3030303030303030 x12: ffff7000101b0ee1
> | x11: 1ffff000101b0ee0 x10: ffff7000101b0ee0 x9 : dfff800000000000
> | x8 : 00008fffefe4f120 x7 : ffff800080d87707 x6 : 0000000000000001
> | x5 : ffff800080d87700 x4 : 1fffe000191f9b41 x3 : 0000000000000000
> | x2 : 0000000000000000 x1 : ffff0000c8fcda00 x0 : dfff800000000000
> | Call trace:
> | lkdtm_ACCESS_NULL+0x2c/0xa4
> | lkdtm_do_action+0x3c/0x68
> | direct_entry+0x130/0x228
> | full_proxy_write+0xd4/0x140
> | vfs_write+0x1e4/0x8ec
> | ksys_write+0xf4/0x1e0
> | __arm64_sys_write+0x6c/0x9c
> | invoke_syscall+0x6c/0x254
> | el0_svc_common.constprop.0+0xac/0x230
> | do_el0_svc+0x40/0x58
> | el0_svc+0x38/0x70
> | el0t_64_sync_handler+0x120/0x12c
> | el0t_64_sync+0x190/0x194
> | Code: aa1303e0 97adbc11 d2d00000 f2fbffe0 (39c00000)
> | ---[ end trace 0000000000000000 ]---

I understand that the proposal is currently too verbose, but printing
provenance information only for the faulting address would not have
helped in above example. What do you think about printing only registers
that hold information useful for debugging, i.e. by excluding null
pointers and non-paged areas?

> I would also note that X30 can be used as a scratch register in the body of
> functions and might hold an interesting address, whereas x2 *should* always
> point to a frame record on the stack. So I don't understand why we'd both
> dumping x29 but skip x30.

Thanks for pointing this out, I must have misunderstood ARM64 register
usage.

Kind regards,

Stefan