Re: [PATCH] uprobes: fix scratch register selection for rip-relative fixups

From: Jim Keniston
Date: Thu May 01 2014 - 20:48:36 EST


On Thu, 2014-05-01 at 19:09 +0200, Denys Vlasenko wrote:
> Before this patch, instructions such as div, mul,
> shifts with count in CL, cmpxchg are mishandled.
>
> This patch adds vex prefix handling. In particular,
> it avoids colliding with register operand encoded
> in vex.vvvv field.
>
> Since we need to avoid two possible register operands,
> the selection of scratch register needs to be from at least
> three registers.
>
> After looking through a lot of CPU docs, it looks like
> the safest choice is SI,DI,BX. Selecting BX needs care
> to not collide with implicit use of BX by cmpxchg8b.

I skipped reviewing some of this because I don't know about those parts
of the x86 architecture. What I saw and understood looks like it'll
work, although I have a couple of suggestions there. See below.

Thanks for your diligence and creativity on this.

Jim

>
> Signed-off-by: Denys Vlasenko <dvlasenk@xxxxxxxxxx>
> CC: Jim Keniston <jkenisto@xxxxxxxxxx>
> CC: Masami Hiramatsu <masami.hiramatsu.pt@xxxxxxxxxxx>
> CC: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> CC: Ingo Molnar <mingo@xxxxxxxxxx>
> CC: Oleg Nesterov <oleg@xxxxxxxxxx>
> ---
> arch/x86/kernel/uprobes.c | 154 ++++++++++++++++++++++++++++++++++------------
> 1 file changed, 116 insertions(+), 38 deletions(-)
>
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index c41bb4b..6cb53d1 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -41,8 +41,10 @@
> /* Instruction will modify TF, don't change it */
> #define UPROBE_FIX_SETF 0x04
>
> -#define UPROBE_FIX_RIP_AX 0x08
> -#define UPROBE_FIX_RIP_CX 0x10
> +#define UPROBE_FIX_RIP_SI 0x08
> +#define UPROBE_FIX_RIP_DI 0x10
> +#define UPROBE_FIX_RIP_BX 0x20
> +#define UPROBE_FIX_RIP_ALL 0x38

I agree with Oleg's suggestion about how to define UPROBE_FIX_RIP_ALL.

>
> #define UPROBE_TRAP_NR UINT_MAX
>
> @@ -51,6 +53,8 @@
> #define OPCODE2(insn) ((insn)->opcode.bytes[1])
> #define OPCODE3(insn) ((insn)->opcode.bytes[2])
> #define MODRM_REG(insn) X86_MODRM_REG((insn)->modrm.value)
> +#define VEX2_VVVV(insn) X86_VEX_V((insn)->vex_prefix.bytes[1])
> +#define VEX3_VVVV(insn) X86_VEX_V((insn)->vex_prefix.bytes[2])

I disclaim any knowledge about VEX* stuff.

>
> #define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
> (((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) | \
> @@ -274,63 +278,137 @@ static inline bool is_64bit_mm(struct mm_struct *mm)
> static void riprel_analyze(struct arch_uprobe *auprobe, struct insn *insn)
> {
> u8 *cursor;
> + u8 can_use_regs;
> u8 reg;
> + int reg2;
>
> if (!insn_rip_relative(insn))
> return;
>
> /*
> - * insn_rip_relative() would have decoded rex_prefix, modrm.
> + * insn_rip_relative() would have decoded rex_prefix,
> + * vex_prefix, modrm.
> * Clear REX.b bit (extension of MODRM.rm field):
> - * we want to encode rax/rcx, not r8/r9.
> + * we want to encode low numbered reg, not r8+.
> */
> if (insn->rex_prefix.nbytes) {
> cursor = auprobe->insn + insn_offset_rex_prefix(insn);
> - *cursor &= 0xfe; /* Clearing REX.B bit */
> + /* REX byte has 0100wrxb layout, clearing REX.b bit */
> + *cursor &= 0xfe;
> + }

skipped this next part...

> + /* Similar treatment for VEX3 prefix */
> + /* TODO: add XOP/EVEX treatment when insn decoder supports them */
> + if (insn->vex_prefix.nbytes == 3) {
> + /*
> + * vex2: c5 rvvvvLpp (has no b bit)
> + * vex3/xop: c4/8f rxbmmmmm wvvvvLpp
> + * evex: 62 rxbR00mm.wvvvv1pp.zllBVaaa
> + * (evex will need setting of both b and x since
> + * in non-sib encoding evex.x is 4th bit of MODRM.rm)
> + * Setting VEX3.b (setting because it has inverted meaning):
> + */
> + cursor = auprobe->insn + insn_offset_vex_prefix(insn) + 1;
> + *cursor |= 0x20;
> }
>

... resumed reviewing here.

> /*
> + * Convert from rip-relative addressing
> + * to register-relative addressing via a scratch register.
> + *
> + * This is tricky since there are insns with modrm byte
> + * which also use registers not encoded in modrm byte:
> + * [i]div/[i]mul: implicitly use dx:ax
> + * shift ops: implicitly use cx
> + * cmpxchg: implicitly uses ax
> + * cmpxchg8/16b: implicitly uses dx:ax and bx:cx
> + * Encoding: 0f c7/1 modrm
> + * The code below thinks that reg=1 (cx), chooses si as scratch.

I verified the above in the AMD manual, but cannot comment on this next
section...

> + * mulx: implicitly uses dx: mulx r/m,r1,r2: r1:r2 = dx * r/m
> + * First appeared in Haswell (BMI2 insn). It is vex-encoded.
> + * Example where none of bx,cx,dx can be used as scratch reg:
> + * c4 e2 63 f6 0d disp32 mulx disp32(%rip),%ebx,%ecx
> + * [v]pcmpistri: implicitly uses cx, xmm0
> + * [v]pcmpistrm: implicitly uses xmm0
> + * [v]pcmpestri: implicitly uses ax, dx, cx, xmm0
> + * [v]pcmpestrm: implicitly uses ax, dx, xmm0
> + * Evil SSE4.2 string comparison ops from hell.
> + * maskmovq/[v]maskmovdqu: implicitly uses (ds:rdi) as destination.
> + * Encoding: 0f f7 modrm, 66 0f f7 modrm, vex-encoded: c5 f9 f7 modrm.
> + * Store op1, byte-masked by op2 msb's in each byte, to (ds:rdi).
> + * AMD says it has no 3-operand form (vex.vvvv must be 1111)
> + * and that it can have only register operands, not mem
> + * (its modrm byte must have mode=11).
> + * If these restrictions will ever be lifted,
> + * we'll need code to prevent selection of di as scratch reg!

... resumed reviewing here.

> + *
> + * Summary: I don't know any insns with modrm byte which
> + * use SI register implicitly. DI register is used only
> + * by one insn (maskmovq) and BX register is used
> + * only by one too (cmpxchg8b).
> + * BP is stack-segment based (may be a problem?).
> + * AX, DX, CX are off-limits (many implicit users).
> + * SP is unusable (it's stack pointer - think about "pop mem";
> + * also, rsp+disp32 needs sib encoding -> insn length change).
> + */
> + reg = MODRM_REG(insn);
> + reg2 = -1;

VEX again. Covering my ears and humming...

> + if (insn->vex_prefix.nbytes == 2) {
> + reg2 = VEX2_VVVV(insn) ^ 0xf;
> + }
> + if (insn->vex_prefix.nbytes == 3) {
> + reg2 = VEX3_VVVV(insn) ^ 0xf;
> + }
> + /* TODO: add XOP, EXEV vvvv reading */
> +

... and I'm back.

> + /* Register numbering is ax,cx,dx,bx, sp,bp,si,di, r8..r15 */
> + can_use_regs = UPROBE_FIX_RIP_SI | UPROBE_FIX_RIP_DI | UPROBE_FIX_RIP_BX;
> +#if 0
> + /* In any case, one bit will remain.
> + * Clearing bx bit is pointless. Its state does not affect
> + * scratch reg selection: we always prefer si/di to bx.
> + */
> + if (reg == 3 || reg2 == 3)
> + can_use_regs &= ~UPROBE_FIX_RIP_BX;
> +#endif
> + if (reg == 6 || reg2 == 6)
> + can_use_regs &= ~UPROBE_FIX_RIP_SI;
> + if (reg == 7 || reg2 == 7)
> + can_use_regs &= ~UPROBE_FIX_RIP_DI;
> + /* TODO: force maskmovq to not use di */
> + auprobe->def.fixups = can_use_regs;
> +
> + /*
> + * Choose scratch reg. Order is important:
> + * must not select bx if we can use si (cmpxchg8b case!)
> + */
> + reg2 = 3; /* BX */
> + if (can_use_regs & UPROBE_FIX_RIP_DI)
> + reg2 = 7;
> + if (can_use_regs & UPROBE_FIX_RIP_SI)
> + reg2 = 6;

It seems more natural to code this as:
if (can_use_regs & UPROBE_FIX_RIP_SI)
reg2 = 6;
else if (can_use_regs & UPROBE_FIX_RIP_DI)
reg2 = 7;
else
reg2 = 3; /* BX */
... which is pretty much how you do it in scratch_reg().

> + /*
> * Point cursor at the modrm byte. The next 4 bytes are the
> * displacement. Beyond the displacement, for some instructions,
> * is the immediate operand.
> */
> cursor = auprobe->insn + insn_offset_modrm(insn);
> /*
> - * Convert from rip-relative addressing
> - * to register-relative addressing via a scratch register.
> + * Change modrm from "00 reg 101" to "10 reg reg2". Example:
> + * 89 05 disp32 mov %eax,disp32(%rip) becomes
> + * 89 86 disp32 mov %eax,disp32(%rsi)
> */
> - reg = MODRM_REG(insn);
> - if (reg == 0) {
> - /*
> - * The register operand (if any) is either the A register
> - * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
> - * REX prefix) %r8. In any case, we know the C register
> - * is NOT the register operand, so we use %rcx (register
> - * #1) for the scratch register.
> - */
> - auprobe->def.fixups |= UPROBE_FIX_RIP_CX;
> - /*
> - * Change modrm from "00 000 101" to "10 000 001". Example:
> - * 89 05 disp32 mov %eax,disp32(%rip) becomes
> - * 89 81 disp32 mov %eax,disp32(%rcx)
> - */
> - *cursor = 0x81;
> - } else {
> - /* Use %rax (register #0) for the scratch register. */
> - auprobe->def.fixups |= UPROBE_FIX_RIP_AX;
> - /*
> - * Change modrm from "00 reg 101" to "10 reg 000". Example:
> - * 89 1d disp32 mov %edx,disp32(%rip) becomes
> - * 89 98 disp32 mov %edx,disp32(%rax)
> - */
> - *cursor = (reg << 3) | 0x80;
> - }
> + *cursor = (reg << 3) | 0x80 | reg2;
> }
>
> static inline unsigned long *
> scratch_reg(struct arch_uprobe *auprobe, struct pt_regs *regs)
> {
> - return (auprobe->def.fixups & UPROBE_FIX_RIP_AX) ? &regs->ax : &regs->cx;
> + /* Order is important - more than one bit can be set! */
> + if (auprobe->def.fixups & UPROBE_FIX_RIP_SI)
> + return &regs->si;
> + if (auprobe->def.fixups & UPROBE_FIX_RIP_DI)
> + return &regs->di;
> + return &regs->bx;
> }
>
> /*
> @@ -339,7 +417,7 @@ scratch_reg(struct arch_uprobe *auprobe, struct pt_regs *regs)
> */
> static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> {
> - if (auprobe->def.fixups & (UPROBE_FIX_RIP_AX | UPROBE_FIX_RIP_CX)) {
> + if (auprobe->def.fixups & UPROBE_FIX_RIP_ALL) {
> struct uprobe_task *utask = current->utask;
> unsigned long *sr = scratch_reg(auprobe, regs);
>
> @@ -350,7 +428,7 @@ static void riprel_pre_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
>
> static void riprel_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> {
> - if (auprobe->def.fixups & (UPROBE_FIX_RIP_AX | UPROBE_FIX_RIP_CX)) {
> + if (auprobe->def.fixups & UPROBE_FIX_RIP_ALL) {
> struct uprobe_task *utask = current->utask;
> unsigned long *sr = scratch_reg(auprobe, regs);
>
> @@ -724,9 +802,9 @@ bool arch_uprobe_xol_was_trapped(struct task_struct *t)
> *
> * If the original instruction was a rip-relative instruction such as
> * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
> - * instruction using a scratch register -- e.g., "movl %edx,0xnnnn(%rax)".
> + * instruction using a scratch register -- e.g., "movl %edx,0xnnnn(%rsi)".
> * We need to restore the contents of the scratch register
> - * (FIX_RIP_AX or FIX_RIP_CX).
> + * (FIX_RIP_reg).
> */
> int arch_uprobe_post_xol(struct arch_uprobe *auprobe, struct pt_regs *regs)
> {


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/