Re: [PATCH -v5 12/17] x86/kprobes: Fix ordering

From: Mathieu Desnoyers
Date: Thu Nov 14 2019 - 10:22:29 EST


----- On Nov 14, 2019, at 10:13 AM, paulmck paulmck@xxxxxxxxxx wrote:

> On Thu, Nov 14, 2019 at 10:06:17AM -0500, Mathieu Desnoyers wrote:
>> ----- On Nov 14, 2019, at 8:53 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>>
>> > On Wed, Nov 13, 2019 at 10:42:32AM -0500, Mathieu Desnoyers wrote:
>> >> ----- On Nov 11, 2019, at 8:13 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>> >>
>> >> > Kprobes does something like:
>> >> >
>> >> > register:
>> >> > arch_arm_kprobe()
>> >> > text_poke(INT3)
>> >> > /* guarantees nothing, INT3 will become visible at some point, maybe */
>> >> >
>> >> > kprobe_optimizer()
>> >> > /* guarantees the bytes after INT3 are unused */
>> >> > syncrhonize_rcu_tasks();
>> >>
>> >> syncrhonize -> synchronize
>> >
>> > Fixed.
>> >
>> >> > text_poke_bp(JMP32);
>> >> > /* implies IPI-sync, kprobe really is enabled */
>> >> >
>> >> >
>> >> > unregister:
>> >> > __disarm_kprobe()
>> >> > unoptimize_kprobe()
>> >> > text_poke_bp(INT3 + tail);
>> >> > /* implies IPI-sync, so tail is guaranteed visible */
>> >> > arch_disarm_kprobe()
>> >> > text_poke(old);
>> >> > /* guarantees nothing, old will maybe become visible */
>> >> >
>> >> > synchronize_rcu()
>> >> >
>> >> > free-stuff
>> >> >
>> >> > Now the problem is that on register, the synchronize_rcu_tasks() does
>> >> > not imply sufficient to guarantee all CPUs have already observed INT3
>> >> > (although in practise this is exceedingly unlikely not to have
>> >>
>> >> practise -> practice
>> >
>> > And fixed.
>> >
>> >> > happened) (similar to how MEMBARRIER_CMD_PRIVATE_EXPEDITED does not
>> >> > imply MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
>> >> >
>> >> > Worse, even if it did, we'd have to do 2 synchronize calls to provide
>> >> > the guarantee we're looking for, the first to ensure INT3 is visible,
>> >> > the second to guarantee nobody is then still using the instruction
>> >> > bytes after INT3.
>> >>
>> >> I'm not entirely convinced about this last statement though. AFAIU:
>> >>
>> >> - Swapping between some instruction and INT3 is OK,
>> >> - Swapping back between that INT3 and _original_ instruction is OK,
>> >> - Anything else needs to have explicit core serialization.
>> >
>> > So far, agreed.
>> >
>> >> So I understand the part about requiring the synchronize call to guarantee
>> >> that nobody is then still using the instruction bytes following INT3, but not
>> >> the rationale for the first synchronization. What operation would theoretically
>> >> follow this first synchronize call ?
>> >
>> > I'm not completely sure what you're asking, so I'm going to explain too
>> > much and hope I answered your question along the way. If not, please be
>> > a little more specific.
>> >
>> > First:
>> >
>> > So what can happen with optimized kprobes is that the original
>> > instruction is <5 bytes and we want to write a JMP.d32 (5 bytes).
>> > Something like:
>> >
>> > 83e: 48 89 e5 mov %rsp,%rbp
>> > 841: 48 85 c0 test %rax,%rax
>> >
>> > And we put a kprobe on the MOV. Then we poke INT3 at 0x83e and IPI-sync.
>> > At that point the kprobe is active:
>> >
>> > 83e: cc 89 e5 int3 ; 2 byte garbage
>> > 841: 48 85 c0 test %rax,%rax
>> >
>> > Now we want to optimize that thing. But imagine a task being preempted
>> > at 0x841. If we poke in the JMP.d32 the above turns into gibberish
>> >
>> > 83e: e9 12 34 56 78 jmp +0x12345678
>> >
>> > Then our task resumes, loads the instruction at 0x841, which then reads:
>> >
>> > 841: 56 78 c0
>> >
>> > And goes *bang*.
>>
>> Thanks for the reminder. I somehow forgot that optimized kprobes covered
>> instructions smaller than 5 bytes.
>>
>> >
>> > So what we do, after enabling the regular kprobe, is call
>> > synchronize_rcu_tasks() to wait for each task to have passed through
>> > schedule(). That guarantees no task is preempted inside the kprobe
>> > shadow (when it triggers it ensures it resumes execution at an
>> > instruction boundary further than 5 bytes away).
>>
>> Indeed, given that synchronize_rcu_tasks() awaits for voluntary context
>> switches (or user-space execution), it guarantees that no task was preempted
>> within the kprobe shadow.
>>
>> Considering that synchronize_rcu_tasks() is meant only for code rewriting,
>> I wonder if it would make sense to include the core serializing guarantees
>> within this RCU API ?
>
> As in have synchronize_rcu_tasks() do the IPI-sync love before doing
> the current wait-for-voluntary-context-switch work?

This is what I have in mind, yes, based on the assumption that the only
intended use-case for synchronize_rcu_tasks() is code patching.

> I don't know of
> any objection to that approach. Certainly there is no possible
> argument based on latency. ;-)

Indeed! :)

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com