Re: KRETPROBES are broken since kernel 5.8

From: Masami Hiramatsu
Date: Thu Dec 10 2020 - 08:11:00 EST


Hi Adam,

On Thu, 10 Dec 2020 08:17:42 +0100
Adam Zabrocki <pi3@xxxxxxxxxx> wrote:

> Hi,
>
> On Thu, Dec 10, 2020 at 10:25:07AM +0900, Masami Hiramatsu wrote:
> > Hi Adam,
> >
> > Thank you for reporting and debugging, actually we had investigated this
> > issue in Aug. Please read this thread.
> >
> > https://lkml.kernel.org/r/8816bdbbc55c4d2397e0b02aad2825d3@xxxxxxxxxxxxxx
> >
>
> Thanks for the link. I've read the entire history of proposed fix - very
> informative :)
>
> > We finally fixed this issue by commit e03b4a084ea6 ("kprobes: Remove NMI
> > context check") and commit 645f224e7ba2 ("kprobes: Tell lockdep about
> > kprobe nesting"). Yeah, it will be in the v5.10.
> >
> > Could you try to test whether these commits can solve the issue?
>
> I've applied these commits on the top of kernel 5.9.7 and verified that it
> works well. Non-optimized KRETPROBES are back to life.

Good!

>
> However, there might be another issue which I wanted to brought / discuss -
> problem with optimizer. Until kernel 5.9 KRETPROBE on e.g.
> 'ftrace_enable_sysctl' function was correctly optimized without any problems.

Did you check it on other functions? Did you see it only on the "ftrace_enable_sysctl"?

> Since 5.9 it can't be optimized anynmore. I didn't see any changes in the
> sources regarding the optimizer, neither function itself.
> When I looked at the generated vmlinux binary, I can see that GCC generated
> padding at the end of this function using INT3 opcode:
>
> ...
> ffffffff8130528b: 41 bd f0 ff ff ff mov $0xfffffff0,%r13d
> ffffffff81305291: e9 fe fe ff ff jmpq ffffffff81305194 <ftrace_enable_sysctl+0x114>
> ffffffff81305296: cc int3
> ffffffff81305297: cc int3
> ffffffff81305298: cc int3
> ffffffff81305299: cc int3
> ffffffff8130529a: cc int3
> ffffffff8130529b: cc int3
> ffffffff8130529c: cc int3
> ffffffff8130529d: cc int3
> ffffffff8130529e: cc int3
> ffffffff8130529f: cc int3

So these int3 is generated by GCC for padding, right?

> Such padding didn't exist in this function in generated images for older
> kernels. Nevertheless, such padding is not unusual and it's pretty common.
>
> Optimizer logic fails here:
>
> try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
> -> arch_prepare_optimized_kprobe() -> can_optimize():
>
> /* Decode instructions */
> addr = paddr - offset;
> while (addr < paddr - offset + size) { /* Decode until function end */
> unsigned long recovered_insn;
> if (search_exception_tables(addr))
> /*
> * Since some fixup code will jumps into this function,
> * we can't optimize kprobe in this function.
> */
> return 0;
> recovered_insn = recover_probed_instruction(buf, addr);
> if (!recovered_insn)
> return 0;
> kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
> insn_get_length(&insn);
> /* Another subsystem puts a breakpoint */
> if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
> return 0;
> /* Recover address */
> insn.kaddr = (void *)addr;
> insn.next_byte = (void *)(addr + insn.length);
> /* Check any instructions don't jump into target */
> if (insn_is_indirect_jump(&insn) ||
> insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
> DISP32_SIZE))
> return 0;
> addr += insn.length;
> }
>
> One of the check tries to protect from the situation when another subsystem
> puts a breakpoint there as well:
>
> /* Another subsystem puts a breakpoint */
> if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
> return 0;

Right.

>
> However, that's not the case here. INT3_INSN_OPCODE is placed at the end of
> the function as padding (and protect from NOP-padding problems).
>
> I wonder, if optimizer should take this special case into account? Is it worth
> to still optimize such functions when they are padded with INT3?

Indeed. I expected int3 is used from other subsystems (e.g. kgdb) and,
in that case the optimization can confuse them.
But if the gcc uses int3 to pad the room between functions, it should be
reconsidered.

Thank you,

>
> > If it is OK, we should backport those to stable tree.
>
> Agreed.
>
> Thanks,
> Adam
>
> > Thank you,
> >
> > On Wed, 9 Dec 2020 22:50:01 +0100
> > Adam Zabrocki <pi3@xxxxxxxxxx> wrote:
> >
> > > Hello,
> > >
> > > Starting from kernel 5.8 all non-optimized kretprobes don't work. Until 5.8,
> > > when #DB exception was raised, entry to the NMI was not fully performed. Among
> > > others, the following logic was executed:
> > > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589
> > >
> > > if (!user_mode(regs)) {
> > > rcu_nmi_enter();
> > > preempt_disable();
> > > }
> > >
> > > In some older kernels function ist_enter() was called instead. Inside this
> > > function we can see the following logic:
> > > https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91
> > >
> > > if (user_mode(regs)) {
> > > RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > > } else {
> > > /*
> > > * We might have interrupted pretty much anything. In
> > > * fact, if we're a machine check, we can even interrupt
> > > * NMI processing. We don't want in_nmi() to return true,
> > > * but we need to notify RCU.
> > > */
> > > rcu_nmi_enter();
> > > }
> > >
> > > preempt_disable();
> > >
> > > As the comment says "We don't want in_nmi() to return true, but we need to
> > > notify RCU.". However, since kernel 5.8 the logic of how interrupts are handled
> > > was modified and currently we have this (function "exc_int3"):
> > > https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630
> > >
> > > /*
> > > * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
> > > * can trigger INT3, hence poke_int3_handler() must be done
> > > * before. If the entry came from kernel mode, then use nmi_enter()
> > > * because the INT3 could have been hit in any context including
> > > * NMI.
> > > */
> > > if (user_mode(regs)) {
> > > idtentry_enter_user(regs);
> > > instrumentation_begin();
> > > do_int3_user(regs);
> > > instrumentation_end();
> > > idtentry_exit_user(regs);
> > > } else {
> > > nmi_enter();
> > > instrumentation_begin();
> > > trace_hardirqs_off_finish();
> > > if (!do_int3(regs))
> > > die("int3", regs, 0);
> > > if (regs->flags & X86_EFLAGS_IF)
> > > trace_hardirqs_on_prepare();
> > > instrumentation_end();
> > > nmi_exit();
> > > }
> > >
> > > The root of unlucky change comes from this commit:
> > >
> > > https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5
> > >
> > > which later was modified by this commit:
> > >
> > > https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2
> > >
> > > and this is what we currently have in all kernels since 5.8. Essentially,
> > > KRETPROBES are not working since these commits. We have the following logic:
> > >
> > > asm_exc_int3() -> exc_int3():
> > > |
> > > ----------------|
> > > |
> > > v
> > > ...
> > > nmi_enter();
> > > ...
> > > if (!do_int3(regs))
> > > |
> > > -----|
> > > |
> > > v
> > > do_int3() -> kprobe_int3_handler():
> > > |
> > > ----------------|
> > > |
> > > v
> > > ...
> > > if (!p->pre_handler || !p->pre_handler(p, regs))
> > > |
> > > -------------------------|
> > > |
> > > v
> > > ...
> > > pre_handler_kretprobe():
> > > ...
> > > if (unlikely(in_nmi())) {
> > > rp->nmissed++;
> > > return 0;
> > > }
> > >
> > > Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before
> > > invoking any registered kprobe verifies if it is not in NMI via in_nmi() call.
> > >
> > > This bug was masked by OPTIMIZER which was turning most of the KPROBES to be
> > > FTRACE so essentially if someone registered KRETPROBE, buggy code was not
> > > invoked (FTRACE code was executed instead). However, the optimizer was changed
> > > and can't optimize as many functions anymore (probably another bug which might
> > > be worth to investigate) and this bug is more visible.
> > >
> > > Thanks,
> > > Adam
> > >
> > > --
> > > pi3 (pi3ki31ny) - pi3 (at) itsec pl
> > > http://pi3.com.pl
> > >
> >
> >
> > --
> > Masami Hiramatsu <mhiramat@xxxxxxxxxx>
>
> --
> pi3 (pi3ki31ny) - pi3 (at) itsec pl
> http://pi3.com.pl
>


--
Masami Hiramatsu <mhiramat@xxxxxxxxxx>