Re: [PATCH -v5 00/17] Rewrite x86/ftrace to use text_poke (and more)

From: Masami Hiramatsu
Date: Sun Nov 24 2019 - 22:55:45 EST


Hi Peter,

I got a BUG in poke_int3_handler() on 8 threads qemu x86-64
while testing multi-probe testcase.
I couldn't reproduce it because maybe it is a timing bug.

(Sorry, since the terminal line-wrap is disabled, the last chars are overwritten)

BUG: kernel NULL pointer dereference, add8
[ 1475.147390] #PF: supervisor read access in kernel mode
[ 1475.148380] #PF: error_code(0x0000) - not-present page
[ 1475.149362] PGD 0 P4D 0
[ 1475.149970] Oops: 0000 [#1] PREEMPT SMP PTI
[ 1475.150824] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 5.4.0-rc6+ #19
[ 1475.151996] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a4
[ 1475.154078] RIP: 0010:poke_int3_handler+0x33/0x70
[ 1475.155008] Code: d2 75 04 31 c0 5b c3 f6 87 88 00 00 00 03 75 f3 48 8b 87 80 00 00 00 48 89 fb b
[ 1475.158308] RSP: 0018:ffffc900001a8c10 EFLAGS: 00010046
[ 1475.159345] RAX: 0000000000000000 RBX: ffffc900001a8c38 RCX: ffffffff81800b57
[ 1475.160653] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff81026ec5
[ 1475.162032] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1475.163448] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1475.164828] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1475.166180] FS: 0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000
[ 1475.167876] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1475.169066] CR2: 0000000000000008 CR3: 000000000201e000 CR4: 00000000000006a0
[ 1475.170431] Call Trace:
[ 1475.171064] <IRQ>
[ 1475.171609] do_int3+0xd/0xf0
[ 1475.172302] int3+0x42/0x50
[ 1475.173014] RIP: 0010:sched_clock+0x6/0x10
[ 1475.174061] Code: d3 ea f6 c1 40 48 0f 45 c2 4c 01 c0 65 ff 0d 99 ee fe 7e 74 02 5d c3 e8 f8 ad f
[ 1475.178461] RSP: 0018:ffffc900001a8d10 EFLAGS: 00000016
[ 1475.179475] RAX: 000001577e58ce4d RBX: ffff88807d261c00 RCX: 0000000000000000
[ 1475.180685] RDX: 0000000000000014 RSI: 0000000000000014 RDI: ffffffff82556180
[ 1475.181907] RBP: ffff88807d018600 R08: 000002ac502de60d R09: 0000000000000002
[ 1475.183213] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88807d261ce0
[ 1475.184365] R13: ffff88807d261d20 R14: 00000000000003e8 R15: 0000000000000002
[ 1475.185515] ? sched_clock+0x6/0x10
[ 1475.186173] trace_clock_local+0xc/0x20
[ 1475.186939] ring_buffer_lock_reserve+0x10d/0x410
[ 1475.187797] trace_event_buffer_lock_reserve+0x4a/0xf0
[ 1475.188740] kprobe_trace_func+0x10e/0x370
[ 1475.189489] ? sched_clock+0x6/0x10
[ 1475.190141] kprobe_dispatcher+0x39/0x60
[ 1475.190911] aggr_pre_handler+0x4c/0x90
[ 1475.191602] ? sched_clock+0x5/0x10
[ 1475.192317] kprobe_int3_handler+0x101/0x150
[ 1475.193133] do_int3+0x36/0xf0
[ 1475.193697] int3+0x42/0x50
[ 1475.194241] RIP: 0010:sched_clock+0x6/0x10
[ 1475.194947] Code: d3 ea f6 c1 40 48 0f 45 c2 4c 01 c0 65 ff 0d 99 ee fe 7e 74 02 5d c3 e8 f8 ad f
[ 1475.197612] RSP: 0018:ffffc900001a8fc8 EFLAGS: 00000012
[ 1475.198466] RAX: 000001577e58cc7f RBX: 0000000000000000 RCX: 0000000000000000
[ 1475.199511] RDX: 0000000000000014 RSI: 0000000000000014 RDI: ffffffff82556180
[ 1475.200552] RBP: 0000000000000000 R08: 000002ac502de274 R09: 0000000000000001
[ 1475.201641] R10: 0000000000000000 R11: ffff88807d1a9e58 R12: 0000000000000000
[ 1475.202804] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1475.203932] ? sched_clock+0x6/0x10
[ 1475.204557] ? sched_clock+0x5/0x10
[ 1475.205176] sched_clock_cpu+0xe/0xd0
[ 1475.205830] irq_exit+0xb3/0xc0
[ 1475.206512] call_function_interrupt+0xf/0x20
[ 1475.207252] </IRQ>
[ 1475.207714] RIP: 0010:default_idle+0x23/0x180
[ 1475.208449] Code: ff 90 90 90 90 90 90 41 55 41 54 55 53 e8 35 29 c7 ff 0f 1f 44 00 00 e8 8b af 5
[ 1475.211721] RSP: 0018:ffffc9000008feb8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff03
[ 1475.213312] RAX: ffff88807d1a95c0 RBX: 0000000000000006 RCX: 0000000000000000
[ 1475.214687] RDX: 0000000000000046 RSI: 0000000000000006 RDI: ffff88807d1a95c0
[ 1475.216027] RBP: ffffffff8212da80 R08: 0000000000000001 R09: 0000000000000000
[ 1475.217399] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1475.218867] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1475.220144] ? default_idle+0x15/0x180
[ 1475.220929] default_idle_call+0x26/0x30
[ 1475.221809] do_idle+0x1fe/0x220
[ 1475.222695] ? trace_hardirqs_on+0x2c/0xf0
[ 1475.223550] cpu_startup_entry+0x14/0x20
[ 1475.224390] start_secondary+0x152/0x180
[ 1475.225157] secondary_startup_64+0xa4/0xb0
[ 1475.226011] Modules linked in: [last unloaded: trace_printk]
[ 1475.227028] CR2: 0000000000000008
[ 1475.227636] ---[ end trace 5d3d01238a02c9e8 ]---
[ 1475.228410] RIP: 0010:poke_int3_handler+0x33/0x70
[ 1475.229187] Code: d2 75 04 31 c0 5b c3 f6 87 88 00 00 00 03 75 f3 48 8b 87 80 00 00 00 48 89 fb b
[ 1475.232231] RSP: 0018:ffffc900001a8c10 EFLAGS: 00010046
[ 1475.233180] RAX: 0000000000000000 RBX: ffffc900001a8c38 RCX: ffffffff81800b57
[ 1475.234374] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff81026ec5
[ 1475.236003] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1475.237818] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1475.240007] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1475.242493] FS: 0000000000000000(0000) GS:ffff88807d980000(0000) knlGS:0000000000000000
[ 1475.245272] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1475.247136] CR2: 0000000000000008 CR3: 000000000201e000 CR4: 00000000000006a0
[ 1475.249053] Kernel panic - not syncing: Fatal exception
[ 1475.250593] Kernel Offset: disabled
[ 1475.251459] ---[ end Kernel panic - not syncing: Fatal exception ]---

$ eu-addr2line -e ./vmlinux poke_int3_handler+0x33
arch/x86/kernel/alternative.c:996:6

-----
int poke_int3_handler(struct pt_regs *regs)
{
struct text_poke_loc *tp;
unsigned char int3 = 0xcc;
void *ip;

/*
* Having observed our INT3 instruction, we now must observe
* bp_patching.nr_entries.
*
* nr_entries != 0 INT3
* WMB RMB
* write INT3 if (nr_entries)
*
* Idem for other elements in bp_patching.
*/
smp_rmb();

if (likely(!bp_patching.nr_entries))
return 0;

if (user_mode(regs))
return 0;

/*
* Discount the sizeof(int3). See text_poke_bp_batch().
*/
ip = (void *) regs->ip - sizeof(int3);

/*
* Skip the binary search if there is a single member in the vector.
*/
if (unlikely(bp_patching.nr_entries > 1)) {
tp = bsearch(ip, bp_patching.vec, bp_patching.nr_entries,
sizeof(struct text_poke_loc),
patch_cmp);
if (!tp)
return 0;
} else {
tp = bp_patching.vec;
if (tp->addr != ip) /* !!!! <-- HERE !!!!! */
return 0;
}

/* set up the specified breakpoint detour */
regs->ip = (unsigned long) tp->detour;

return 1;
}
-----

This seems like caused by bp_patching.vec is NULL but
bp_patching.nr_entries != 0. I think there is a small chance to do
this, because we have no smp_wmb after cleaning bp_patching.nr_entries.

-----
void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries)
...
on_each_cpu(do_sync_core, NULL, 1);
/*
* sync_core() implies an smp_mb() and orders this store against
* the writing of the new instruction.
*/
bp_patching.vec = NULL;
bp_patching.nr_entries = 0;
}
-----

I think the "on_each_cpu(do_sync_core, NULL, 1);" can sync the pipeline
but doesn't ensure all ongoing int3 handling is done. Thus, we may need a
bigger wait in between bp_patching.nr_entries = 0 and bp_patching.vec = NULL;
e.g.

bp_patching.nr_entries = 0;
synchronize_rcu();
bp_patching.vec = NULL;

What would you think about ?

Thank you,

On Mon, 11 Nov 2019 14:12:52 +0100
Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> Ftrace is one of the last W^X violators (after this only KLP is left). These
> patches move it over to the generic text_poke() interface and thereby get rid
> of this oddity.
>
> The first 14 patches are the same as in the -v4 posting. The last 3 patches are
> new.
>
> Will, patch 13, arm/ftrace, is unchanged. This is because this way it preserves
> behaviour, but if you can provide me a tested-by for the simpler variant I can
> drop that in.
>
> Patch 15 reworks ftrace's event_create_dir(), which ran module code before the
> module was finished loading (before we even applied jump_labels and all that).
>
> Patch 16 and 17 address minor review feedback.
>
> Ingo, Alexei wants patch #1 for some BPF stuff, can he get that in a topic branch?
>


--
Masami Hiramatsu <mhiramat@xxxxxxxxxx>