RE: [PATCH] arm64: kprobe: Enable OPTPROBE for arm64

From: Song Bao Hua (Barry Song)
Date: Sat Jul 31 2021 - 08:21:55 EST




> -----Original Message-----
> From: Masami Hiramatsu [mailto:mhiramat@xxxxxxxxxx]
> Sent: Saturday, July 31, 2021 1:16 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: liuqi (BA) <liuqi115@xxxxxxxxxx>; catalin.marinas@xxxxxxx;
> will@xxxxxxxxxx; naveen.n.rao@xxxxxxxxxxxxx; anil.s.keshavamurthy@xxxxxxxxx;
> davem@xxxxxxxxxxxxx; linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; Zengtao (B)
> <prime.zeng@xxxxxxxxxxxxx>; robin.murphy@xxxxxxx; Linuxarm
> <linuxarm@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] arm64: kprobe: Enable OPTPROBE for arm64
>
> On Fri, 30 Jul 2021 10:04:06 +0000
> "Song Bao Hua (Barry Song)" <song.bao.hua@xxxxxxxxxxxxx> wrote:
>
> > > > > >
> > > > > > Hi Qi,
> > > > > >
> > > > > > Thanks for your effort!
> > > > > >
> > > > > > On Mon, 19 Jul 2021 20:24:17 +0800
> > > > > > Qi Liu <liuqi115@xxxxxxxxxx> wrote:
> > > > > >
> > > > > > > This patch introduces optprobe for ARM64. In optprobe, the probed
> > > > > > > instruction is replaced by a branch instruction to a detour
> > > > > > > buffer. The detour buffer contains trampoline code and a call to
> > > > > > > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > > > > > > to execute the kprobe handler.
> > > > > >
> > > > > > OK so this will replace only one instruction.
> > > > > >
> > > > > > >
> > > > > > > Limitations:
> > > > > > > - We only support the !CONFIG_RANDOMIZE_MODULE_REGION_FULL case, to
> > > > > > >   guarantee the offset between the probe point and the kprobe
> > > > > > >   pre_handler is not larger than 128MiB.
> > > > > >
> > > > > > Hmm, shouldn't we depend on !CONFIG_ARM64_MODULE_PLTS? Or
> > > > > > allocate an intermediate trampoline area, as arm optprobe
> > > > > > does?
> > > > >
> > > > > Depending on !CONFIG_ARM64_MODULE_PLTS will totally disable
> > > > > RANDOMIZE_BASE according to arch/arm64/Kconfig:
> > > > >
> > > > > config RANDOMIZE_BASE
> > > > >         bool "Randomize the address of the kernel image"
> > > > >         select ARM64_MODULE_PLTS if MODULES
> > > > >         select RELOCATABLE
> > > >
> > > > Yes, but why is it required for "RANDOMIZE_BASE"?
> > > > Does that imply the module call might need to use the PLT in
> > > > some cases?
> > > >
> > > > >
> > > > > Depending on !RANDOMIZE_MODULE_REGION_FULL seems to still allow
> > > > > RANDOMIZE_BASE while avoiding long jumps, according to
> > > > > arch/arm64/Kconfig:
> > > > >
> > > > > config RANDOMIZE_MODULE_REGION_FULL
> > > > >         bool "Randomize the module region over a 4 GB range"
> > > > >         depends on RANDOMIZE_BASE
> > > > >         default y
> > > > >         help
> > > > >           Randomizes the location of the module region inside a 4 GB window
> > > > >           covering the core kernel. This way, it is less likely for modules
> > > > >           to leak information about the location of core kernel data structures
> > > > >           but it does imply that function calls between modules and the core
> > > > >           kernel will need to be resolved via veneers in the module PLT.
> > > > >
> > > > >           When this option is not set, the module region will be randomized over
> > > > >           a limited range that contains the [_stext, _etext] interval of the
> > > > >           core kernel, so branch relocations are always in range.
> > > >
> > > > Hmm, this dependency looks strange. If branches are always in
> > > > range, why would we need the PLT for modules?
> > > >
> > > > Catalin, would you know why?
> > > > Maybe it's a KASLR Kconfig issue?
> > >
> > > I actually didn't see any problem after making this change:
> > >
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index e07e7de9ac49..6440671b72e0 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -1781,7 +1781,6 @@ config RELOCATABLE
> > >
> > >  config RANDOMIZE_BASE
> > >          bool "Randomize the address of the kernel image"
> > > -        select ARM64_MODULE_PLTS if MODULES
> > >          select RELOCATABLE
> > >          help
> > >            Randomizes the virtual address at which the kernel image is
> > > @@ -1801,6 +1800,7 @@ config RANDOMIZE_BASE
> > >  config RANDOMIZE_MODULE_REGION_FULL
> > >          bool "Randomize the module region over a 4 GB range"
> > >          depends on RANDOMIZE_BASE
> > > +        select ARM64_MODULE_PLTS if MODULES
> > >          default y
> > >          help
> > >            Randomizes the location of the module region inside a 4 GB window
> > >
> > > and having this config:
> > > # zcat /proc/config.gz | grep RANDOMIZE_BASE
> > > CONFIG_RANDOMIZE_BASE=y
> > >
> > > # zcat /proc/config.gz | grep RANDOMIZE_MODULE_REGION_FULL
> > > # CONFIG_RANDOMIZE_MODULE_REGION_FULL is not set
> > >
> > > # zcat /proc/config.gz | grep ARM64_MODULE_PLTS
> > > # CONFIG_ARM64_MODULE_PLTS is not set
> > >
> > > Modules work all good:
> > > # lsmod
> > > Module Size Used by
> > > btrfs 1355776 0
> > > blake2b_generic 20480 0
> > > libcrc32c 16384 1 btrfs
> > > xor 20480 1 btrfs
> > > xor_neon 16384 1 xor
> > > zstd_compress 163840 1 btrfs
> > > raid6_pq 110592 1 btrfs
> > > ctr 16384 0
> > > md5 16384 0
> > > ip_tunnel 32768 0
> > > ipv6 442368 28
> > >
> > >
> > > I am not quite sure if there is a corner case. If not,
> > > I would think the Kconfig might be improper.
> >
> > The corner case is that, even when CONFIG_RANDOMIZE_MODULE_REGION_FULL
> > is not enabled, if CONFIG_ARM64_MODULE_PLTS is enabled and we can't get
> > memory from the 128MB area because the area is exhausted, module_alloc()
> > will fall back to a 2GB area as long as either of the below two
> > conditions is met:
> >
> > 1. KASAN is not enabled;
> > 2. KASAN is enabled and CONFIG_KASAN_VMALLOC is also enabled.
> >
> > void *module_alloc(unsigned long size)
> > {
> >         u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
> >         gfp_t gfp_mask = GFP_KERNEL;
> >         void *p;
> >
> >         /* Silence the initial allocation */
> >         if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> >                 gfp_mask |= __GFP_NOWARN;
> >
> >         if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> >             IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> >                 /* don't exceed the static module region - see below */
> >                 module_alloc_end = MODULES_END;
> >
> >         p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> >                                  module_alloc_end, gfp_mask, PAGE_KERNEL, 0,
> >                                  NUMA_NO_NODE, __builtin_return_address(0));
> >
> >         if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> >             (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
> >              (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> >               !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
> >                 /*
> >                  * KASAN without KASAN_VMALLOC can only deal with module
> >                  * allocations being served from the reserved module region,
> >                  * since the remainder of the vmalloc region is already
> >                  * backed by zero shadow pages, and punching holes into it
> >                  * is non-trivial. Since the module region is not randomized
> >                  * when KASAN is enabled without KASAN_VMALLOC, it is even
> >                  * less likely that the module region gets exhausted, so we
> >                  * can simply omit this fallback in that case.
> >                  */
> >                 p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> >                                          module_alloc_base + SZ_2G, GFP_KERNEL,
> >                                          PAGE_KERNEL, 0, NUMA_NO_NODE,
> >                                          __builtin_return_address(0));
> >
> >         if (p && (kasan_module_alloc(p, size) < 0)) {
> >                 vfree(p);
> >                 return NULL;
> >         }
> >
> >         return p;
> > }
> >
> > This should happen quite rarely. But maybe arm64's documentation
> > needs some minor fixes; otherwise, it is quite confusing.
>
> OK, so with CONFIG_KASAN_VMALLOC=y and CONFIG_ARM64_MODULE_PLTS=y,
> module_alloc() basically returns memory in the 128MB region, but can
> return memory in the 2GB region. (This is OK because optprobe can
> filter it out.)
> But with CONFIG_RANDOMIZE_MODULE_REGION_FULL=y, there is almost no
> chance of getting memory in the 128MB region.
>
> Hmm, for optprobes in the kernel text, maybe we can define
> 'optinsn_alloc_start' as 'module_alloc_base - (SZ_2G - MODULES_VADDR)'
> and use __vmalloc_node_range() to avoid this issue. But that only works
> for the kernel. For the modules, we may always be out of the 128MB
> region.
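
For the kernel text case, I guess your idea would look roughly like the
untested sketch below. alloc_optinsn_page() is the weak hook kprobes
exposes for the detour buffer, and the window arithmetic is copied
straight from your mail, so treat it as a shape rather than a tested
patch:

void *alloc_optinsn_page(void)
{
        /*
         * Untested sketch for probes in the core kernel only. The end
         * bound below is a guess; anything that keeps the window within
         * branch range of the kernel text would do.
         */
        u64 optinsn_alloc_start = module_alloc_base - (SZ_2G - MODULES_VADDR);

        return __vmalloc_node_range(PAGE_SIZE, 1, optinsn_alloc_start,
                                    module_alloc_base + MODULES_VSIZE,
                                    GFP_KERNEL, PAGE_KERNEL_ROX, 0,
                                    NUMA_NO_NODE,
                                    __builtin_return_address(0));
}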

If we can have some separate PLT entries in each module for optprobe,
we should be able to short-jump to the PLT entry, and the PLT entry
will then long-jump to the detour buffer outside the 128MB range. That
is exactly the duty of a PLT.
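
For reference, the arm64 PLT veneer is tiny. From memory,
arch/arm64/include/asm/module.h defines it roughly as below (please
double-check against the tree), and get_plt_entry() already knows how
to fill one in for an arbitrary target:

/*
 * A three-instruction veneer that, per AAPCS64, is allowed to clobber
 * only x16/x17. The adrp gives it a +/-4GB reach from the veneer
 * itself, plenty to hit the detour buffer wherever module_alloc()
 * placed it.
 */
struct plt_entry {
        __le32  adrp;   /* adrp x16, <target page>          */
        __le32  add;    /* add  x16, x16, #<target offset>  */
        __le32  br;     /* br   x16                         */
};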

Right now, arm64 supports dynamic ftrace by adding a section in each
module for the ftrace PLT.
arch/arm64/include/asm/module.lds.h:
SECTIONS {
#ifdef CONFIG_ARM64_MODULE_PLTS
        .plt 0 (NOLOAD) : { BYTE(0) }
        .init.plt 0 (NOLOAD) : { BYTE(0) }
        .text.ftrace_trampoline 0 (NOLOAD) : { BYTE(0) }
#endif
        ...
}

arch/arm64/kernel/module.c will initialize some PLT entries
for ftrace:

static int module_init_ftrace_plt(const Elf_Ehdr *hdr,
                                  const Elf_Shdr *sechdrs,
                                  struct module *mod)
{
#if defined(CONFIG_ARM64_MODULE_PLTS) && defined(CONFIG_DYNAMIC_FTRACE)
        const Elf_Shdr *s;
        struct plt_entry *plts;

        s = find_section(hdr, sechdrs, ".text.ftrace_trampoline");
        if (!s)
                return -ENOEXEC;

        plts = (void *)s->sh_addr;

        __init_plt(&plts[FTRACE_PLT_IDX], FTRACE_ADDR);

        if (IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
                __init_plt(&plts[FTRACE_REGS_PLT_IDX], FTRACE_REGS_ADDR);

        mod->arch.ftrace_trampolines = plts;
#endif
        return 0;
}

Ftrace will then use those PLT entries in arch/arm64/kernel/ftrace.c:
static struct plt_entry *get_ftrace_plt(struct module *mod, unsigned long addr)
{
#ifdef CONFIG_ARM64_MODULE_PLTS
        struct plt_entry *plt = mod->arch.ftrace_trampolines;

        if (addr == FTRACE_ADDR)
                return &plt[FTRACE_PLT_IDX];
        if (addr == FTRACE_REGS_ADDR &&
            IS_ENABLED(CONFIG_DYNAMIC_FTRACE_WITH_REGS))
                return &plt[FTRACE_REGS_PLT_IDX];
#endif
        return NULL;
}

/*
 * Turn on the call to ftrace_caller() in instrumented function
 */
int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
{
        unsigned long pc = rec->ip;
        u32 old, new;
        long offset = (long)pc - (long)addr;

        if (offset < -SZ_128M || offset >= SZ_128M) {
                struct module *mod;
                struct plt_entry *plt;

                if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
                        return -EINVAL;

                /*
                 * On kernels that support module PLTs, the offset between the
                 * branch instruction and its target may legally exceed the
                 * range of an ordinary relative 'bl' opcode. In this case, we
                 * need to branch via a trampoline in the module.
                 *
                 * NOTE: __module_text_address() must be called with preemption
                 * disabled, but we can rely on ftrace_lock to ensure that 'mod'
                 * retains its validity throughout the remainder of this code.
                 */
                preempt_disable();
                mod = __module_text_address(pc);
                preempt_enable();

                if (WARN_ON(!mod))
                        return -EINVAL;

                plt = get_ftrace_plt(mod, addr);
                if (!plt) {
                        pr_err("ftrace: no module PLT for %ps\n", (void *)addr);
                        return -EINVAL;
                }

                addr = (unsigned long)plt;
        }

        old = aarch64_insn_gen_nop();
        new = aarch64_insn_gen_branch_imm(pc, addr, AARCH64_INSN_BRANCH_LINK);

        return ftrace_modify_code(pc, old, new, true);
}
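
If optprobe went the same way, the arm64 optprobe code might end up
with a range check shaped like the hypothetical sketch below. None of
the optprobe-side names exist today; the mod->arch.kprobe_trampolines
field and the PLT entry behind it are made up to mirror the ftrace
names above:

/*
 * Hypothetical sketch only, mirroring ftrace_make_call() above. It
 * assumes the module loader were taught to reserve one PLT entry per
 * module, exposed as mod->arch.kprobe_trampolines.
 */
static unsigned long optprobe_branch_target(unsigned long pc,
                                            unsigned long detour)
{
        long offset = (long)pc - (long)detour;
        struct plt_entry *plt;
        struct module *mod;

        if (offset >= -SZ_128M && offset < SZ_128M)
                return detour;          /* a direct branch is in range */

        if (!IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
                return 0;               /* fall back to a normal kprobe */

        preempt_disable();
        mod = __module_text_address(pc);
        preempt_enable();

        if (!mod)
                return 0;

        plt = mod->arch.kprobe_trampolines;     /* made-up field */
        *plt = get_plt_entry((u64)detour, plt);
        /* a real version would also need I-cache maintenance here */

        return (unsigned long)plt;
}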

This might be the direction to go later. Anyway, "Rome wasn't built
in a day"; at this stage, we might focus on optprobe for the case of
a non-randomized module region :-)

BTW, @liuqi, if users set "nokaslr" in bootargs, will your optprobe
always work and never fall back to a normal kprobe, even if we remove
the dependency on RANDOMIZE_MODULE_REGION_FULL?

>
> Thank you,
>
> --
> Masami Hiramatsu <mhiramat@xxxxxxxxxx>

Thanks
Barry