Re: [RFC][PROTO][PATCH -tip 0/7] kprobes: support jump optimization on x86

From: Masami Hiramatsu
Date: Tue Apr 07 2009 - 21:52:06 EST


Hi Frederic,

Frederic Weisbecker wrote:
> On Mon, Apr 06, 2009 at 05:41:22PM -0400, Masami Hiramatsu wrote:
>> Hi,
>>
>> Here, I'd like to show you another x86 insn decoder user.
>> This is a prototype patchset of the kprobes jump optimization
>> (a.k.a. Djprobe, which I developed two years ago). Finally,
>> I rewrote it as the jump optimized probe. These patches are still
>> under development; they support neither temporary disabling nor
>> a debugfs interface. However, the basic functions (register/
>> unregister/optimizing/safety check) are implemented.
>>
>> These patches can be applied on the -tip tree + the following patches:
>> - kprobes patches on -mm tree (attached to this mail)
>> And the patches below, which I sent last week.
>> - x86: instruction decoder API
>> - x86: kprobes checks safeness of insertion address.
>>
>> So, this is another example of x86 instruction decoder.
>>
>> (Andrew, I ported some of -mm patches to -tip tree just for
>> preventing source code forking. This should be done on -tip,
>> because x86-instruction decoder has been discussed on -tip)
>>
>>
>> Jump Optimized Kprobes
>> ======================
>> o What is jump optimization?
>> Kprobes uses the int3 breakpoint instruction on x86 to insert probes
>> into a running kernel. Jump optimization allows kprobes to replace the
>> breakpoint with a jump instruction, which drastically reduces probing
>> overhead.
>>
>>
>> o Advantage and Disadvantage
>> The advantage is processing time. Usually, a kprobe hit takes
>> 0.5 to 1.0 microseconds to process. On the other hand, a jump optimized
>> probe hit takes less than 0.1 microseconds (the actual number depends on
>> the processor). Here are some sample overheads.
>>
>> Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (running at 2GHz)
>>
>>                       x86-32  x86-64
>> kprobe:               1.00us  1.05us
>> kprobe+booster:       0.45us  0.50us
>> kprobe+optimized:     0.05us  0.07us
>>
>> kretprobe:            1.77us  1.45us
>> kretprobe+booster:    1.30us  0.90us
>> kretprobe+optimized:  1.02us  0.40us
>
>
> Nice!

Thanks :)


>> However, there is a disadvantage (the law of equivalent exchange :)) too,
>> which is memory consumption. Jump optimization requires an optimized_kprobe
>> data structure and an instruction buffer bigger than a kprobe's,
>> which contains exception emulating code (push/pop registers), copied
>> instructions, and a jump. This data consumes 145 bytes (x86-32) of
>> memory per probe.
>
>
>
> But can we consider it a small problem, assuming that kprobes are
> rarely intended for massive use at once? I guess that usually, not a
> lot of functions are probed simultaneously.

Hm, yes and no. Systemtap may use a massive number of kprobes, because it
supports "wildcard" probes. However, optimizing by default may be acceptable.



>> Briefly speaking, an optimized kprobe is 5 times faster and 3 times bigger
>> than a kprobe.
>>
>> Anyway, you can choose whether to optimize your kprobes by setting
>> KPROBE_FLAG_OPTIMIZE in the kp->flags field.
>>
>> o How to use it?
>> All you need to do to optimize your *probe is add KPROBE_FLAG_OPTIMIZE
>> to kp->flags before registering.
>>
>> E.g.
>> (setup handler/addr/symbol...)
>> kp->flags |= KPROBE_FLAG_OPTIMIZE;
>> (register kp)
>>
>> That's all. :-)
>
>
>
> Maybe it's better to set this flag as default-enabled. Hm?

Yeah, this flag is just for the case without the last patch.
(In that case, the user has to ensure that the kprobe can be optimized.)

>> kprobes decodes the probed function and checks whether the target instructions
>> can be optimized (replaced with a jump) safely. If they can't, kprobes clears
>> KPROBE_FLAG_OPTIMIZE from kp->flags. So, you can check it after registering.
>>
>>
>> o How does it work?
>> kprobe jump optimization looks like an aggregated kprobe.
>>
>> Before preparing optimization, kprobe inserts the original (user-defined)
>> kprobe at the specified address. So, even if the kprobe cannot be
>> optimized, it just falls back to a normal kprobe.
>>
>> - Safety check
>> First, kprobe decodes the whole body of the probed function and checks
>> that there is NO indirect jump, and NO near jump which jumps into the
>> region which will be replaced by a jump instruction (except its 1st
>> byte), because a jump into the middle of another instruction
>> causes unpredictable results.
>> Kprobe also measures the length of the instructions which will be replaced
>> by the jump instruction. Because a jump instruction is longer than 1 byte,
>> it may replace multiple instructions, so kprobe checks whether those
>> instructions can be executed out-of-line.
>>
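The safety check described above can be sketched in user space roughly as
follows. This is only an illustration, not the code in the patches: struct
insn_info and check_optimizable() are hypothetical names, and the decoded
instruction list is assumed to come from the x86 instruction decoder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RELJMP_SIZE 5	/* x86 "jmp rel32": 1 opcode byte + 4 displacement bytes */

/* Hypothetical, simplified result of decoding one instruction. */
struct insn_info {
	uint32_t offset;	/* offset from the start of the function */
	uint8_t length;		/* instruction length in bytes */
	bool indirect_jump;	/* e.g. jmp *%eax: target unknown statically */
	bool near_jump;		/* direct jump whose target is in 'target' */
	uint32_t target;	/* jump target as a function offset */
	bool out_of_line_ok;	/* may be executed from a copy elsewhere */
};

/*
 * Returns the number of bytes that must be copied to the detour buffer
 * (>= RELJMP_SIZE), or 0 if the probe at probe_off cannot be optimized.
 */
size_t check_optimizable(const struct insn_info *insns, size_t n,
			 uint32_t probe_off)
{
	size_t i, copied = 0;
	bool in_region = false;

	assert(insns != NULL || n == 0);

	/* Pass 1: scan the whole function body for dangerous jumps. */
	for (i = 0; i < n; i++) {
		if (insns[i].indirect_jump)
			return 0;
		/* A jump to the 1st byte of the region is still fine. */
		if (insns[i].near_jump &&
		    insns[i].target > probe_off &&
		    insns[i].target < probe_off + RELJMP_SIZE)
			return 0;
	}

	/* Pass 2: measure the instructions the jump will overwrite. */
	for (i = 0; i < n; i++) {
		if (insns[i].offset == probe_off)
			in_region = true;
		if (!in_region)
			continue;
		if (!insns[i].out_of_line_ok)
			return 0;
		copied += insns[i].length;
		if (copied >= RELJMP_SIZE)
			return copied;
	}
	return 0;	/* function ends before 5 bytes are available */
}
```

For a function that starts with 1-, 2- and 3-byte instructions, a probe at
offset 0 would need 6 bytes copied, since the 5-byte jump ends in the middle
of the third instruction.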
>> - Preparing detour code
>> Next, kprobe prepares a "detour" buffer, which contains exception emulating
>> code (push/pop registers, call handler), copied instructions (kprobes copies
>> the instructions which will be replaced by a jump to the detour buffer), and
>> a jump which jumps back to the original execution path.
>>
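As a rough user-space sketch of the tail of such a detour buffer (the copied
instructions followed by the jump back), assuming the usual x86 encoding of
"jmp rel32" (opcode 0xe9, displacement relative to the end of the jump).
encode_reljmp() and build_detour_tail() are hypothetical names, and the
exception emulating code in front of the copied instructions is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RELJMP_OPCODE	0xe9
#define RELJMP_SIZE	5

/*
 * Encode "jmp rel32" into buf. 'from' is the address the jump will live
 * at, 'to' is its destination. Returns 0, or -1 if out of rel32 range.
 */
int encode_reljmp(uint8_t *buf, uint64_t from, uint64_t to)
{
	int64_t diff = (int64_t)(to - (from + RELJMP_SIZE));
	uint32_t rel;
	int i;

	if (diff < INT32_MIN || diff > INT32_MAX)
		return -1;
	rel = (uint32_t)(int32_t)diff;
	buf[0] = RELJMP_OPCODE;
	for (i = 0; i < 4; i++)		/* little-endian displacement */
		buf[1 + i] = (uint8_t)(rel >> (8 * i));
	return 0;
}

/*
 * Copy the instructions that the probe's jump will overwrite, then append
 * a jump back to the first instruction after them in the original code.
 * Returns the number of bytes written, or 0 on failure.
 */
size_t build_detour_tail(uint8_t *detour, uint64_t detour_addr,
			 const uint8_t *orig, size_t copied_len,
			 uint64_t probe_addr)
{
	assert(copied_len >= RELJMP_SIZE);
	memcpy(detour, orig, copied_len);
	if (encode_reljmp(detour + copied_len, detour_addr + copied_len,
			  probe_addr + copied_len) < 0)
		return 0;
	return copied_len + RELJMP_SIZE;
}
```

With 6 copied bytes, a probe at 0x1000 and this part of the buffer at 0x2000,
the jump sits at 0x2006 and targets 0x1006, so the displacement is
0x1006 - 0x200b = -0x1005.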
>> - Pre-optimization
>> After preparing the detour code, kprobe kicks the kprobe-optimizer workqueue
>> to optimize the kprobe. The kprobe optimizer delays its work in order to
>> wait for other optimized_kprobes.
>> When the optimized_kprobe is hit before optimization, its handler
>> changes IP(instruction pointer) to detour code and exits. So, the
>> instructions which were copied to detour buffer are not executed.
>
>
> I have some trouble understanding these last three lines.
> The detour code has been set at this time, so if we jump to it, its
> instructions (saved original code overwritten by jump, and jump to the rest)
> will be executed. No?

Oh, yes, sorry for the confusion. It should be "the original instructions which
will be replaced by a jump are not executed; instead, the copied
instructions are executed."

>> - Optimization
>> Kprobe-optimizer doesn't start replacing instructions immediately; it waits
>> for synchronize_sched() for safety, because some processors may have been
>> interrupted on the instructions which will be replaced by a jump instruction.
>> As you know, synchronize_sched() can ensure that all interrupts which were
>> running when synchronize_sched() was called are done, only if CONFIG_PREEMPT=n.
>> So, this version supports only kernels with CONFIG_PREEMPT=n.(*)
>> After that, kprobe-optimizer replaces the 4 bytes right after the int3 breakpoint
>> with the relative-jump destination, and synchronizes caches on all processors.
>> Next, it replaces int3 with the relative-jump opcode, and synchronizes caches again.
>>
>>
>> (*)This optimization-safety checking may be replaced with the stop-machine
>> method which ksplice uses, in order to support CONFIG_PREEMPT=y kernels.
>>
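The order of the two writes is the important point here. A minimal sketch on
a plain byte buffer, assuming the usual x86 encodings (int3 = 0xcc,
jmp rel32 = 0xe9); replace_int3_with_jump() is a hypothetical name and
sync_all_cpus() is only a counting stand-in for the real cache
synchronization on all processors:

```c
#include <assert.h>
#include <stdint.h>

#define BREAKPOINT_INSN	0xcc	/* int3 */
#define RELJMP_OPCODE	0xe9	/* jmp rel32 */

/* Stand-in for synchronizing caches on all processors; only counts calls. */
int sync_count;

void sync_all_cpus(void)
{
	sync_count++;
}

/*
 * text points at the int3 of an already-inserted kprobe; rel is the
 * displacement from the end of the 5-byte jump to the detour buffer.
 */
void replace_int3_with_jump(uint8_t *text, int32_t rel)
{
	uint32_t u = (uint32_t)rel;
	int i;

	assert(text[0] == BREAKPOINT_INSN);

	/*
	 * Step 1: write the 4 displacement bytes behind the int3. A CPU
	 * that executes text[0] now still hits the breakpoint, so these
	 * bytes are not decoded as the start of an instruction.
	 */
	for (i = 0; i < 4; i++)
		text[1 + i] = (uint8_t)(u >> (8 * i));
	sync_all_cpus();

	/* Step 2: only now turn the breakpoint into the jump opcode. */
	text[0] = RELJMP_OPCODE;
	sync_all_cpus();
}
```

For a probe at 0x1000 and a detour buffer at 0x2000, rel would be
0x2000 - 0x1005 = 0xffb. Writing the opcode first would leave a window in
which a processor could execute a jump with a stale displacement.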
>
>
>
> I have to look at this series :-)

Thank you!

>
> Thanks,
> Frederic.
>

--
Masami Hiramatsu

Software Engineer
Hitachi Computer Products (America) Inc.
Software Solutions Division

e-mail: mhiramat@xxxxxxxxxx
