Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions
From: Andy Lutomirski
Date: Tue Jun 11 2019 - 11:58:57 EST
> On Jun 11, 2019, at 1:03 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> On Fri, Jun 07, 2019 at 11:10:19AM -0700, Andy Lutomirski wrote:
>>> On Jun 7, 2019, at 10:34 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>> On Sat, Jun 08, 2019 at 12:47:08AM +0900, Masami Hiramatsu wrote:
>>>>> This fits almost all text_poke_bp() users, except
>>>>> arch_unoptimize_kprobe() which restores random text, and for that site
>>>>> we have to build an explicit emulate instruction.
>>>> Hm, actually it doesn't restore random text, since the first byte
>>>> must always be int3. As the function name means, it just unoptimizes
>>>> (jump based optprobe -> int3 based kprobe).
>>>> Anyway, that is not an issue. With this patch, optprobe must still work.
>>> I thought it basically restored 5 bytes of original text (with no
>>> guarantee it is a single instruction, or even a complete instruction),
>>> with the first byte replaced with INT3.
>> I am surely missing some kprobe context, but is it really safe to use
>> this mechanism to replace more than one instruction?
> I'm not entirely up-to-scratch here, so Masami, please correct me if I'm
> wrong.
> So what happens is that arch_prepare_optimized_kprobe() <-
> copy_optimized_instructions() copies however much of the instruction
> stream is required such that we can overwrite the instruction at @addr
> with a 5 byte jump.
> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
> instruction @addr with int3, copies the rel jump address and overwrites
> the int3 with jmp.
> And I'm thinking the problem is with something like:
> @addr: nop nop nop nop nop
> We copy out the nops into the trampoline, overwrite the first nop with
> an INT3, overwrite the remaining nops with the rel addr, but oops,
> another CPU can still be executing one of those NOPs, right?
> I'm thinking we could fix this by first writing INT3 into all relevant
> instructions, which is going to be messy, given the current code base.
How does that help? If RIP == x+2 and you want to put a 5-byte jump at address x, no amount of 0xcc is going to change the fact that RIP is in the middle of the jump.
Live patching can handle this by detecting this condition on each CPU, but performance won't be great. Maybe some synchronize_sched trickery could help.
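
To make the condition concrete, here is a rough userspace sketch of the per-CPU check, not kernel code. It assumes we somehow have a saved instruction pointer for each CPU; the helper names ip_inside_patch() and cpus_inside_patch() and the saved_ip array are made up for illustration and do not exist in the tree. An IP equal to x is fine (that CPU will hit the INT3 or the finished jump); anything in (x, x+5) lands in the middle of the new instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define JMP32_INSN_SIZE 5	/* 0xe9 opcode + 32-bit relative offset */

/*
 * Hypothetical helper: true if @ip points strictly inside the bytes
 * that the new instruction at @addr will occupy. @ip == @addr is not
 * a problem, since that CPU executes the new instruction from its start.
 */
static bool ip_inside_patch(uintptr_t ip, uintptr_t addr, size_t len)
{
	assert(len > 0);
	return ip > addr && ip < addr + len;
}

/*
 * Hypothetical per-CPU scan: count the CPUs whose saved IP sits in the
 * middle of the range being patched. How the saved IPs are obtained is
 * deliberately left out; that is the expensive part.
 */
static size_t cpus_inside_patch(const uintptr_t *saved_ip, size_t ncpus,
				uintptr_t addr, size_t len)
{
	size_t i, n = 0;

	for (i = 0; i < ncpus; i++)
		if (ip_inside_patch(saved_ip[i], addr, len))
			n++;
	return n;
}
```

With x = 0x1000 and saved IPs of x, x+2, x+5 and 0x2000, only the x+2 CPU is flagged, and that is exactly the one that writing 0xcc at x does nothing for. A non-zero count would mean waiting and rescanning before writing the tail bytes, which is where the cost comes from.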