Re: [PATCH -v5 12/17] x86/kprobes: Fix ordering

From: Paul E. McKenney
Date: Thu Nov 14 2019 - 10:13:26 EST


On Thu, Nov 14, 2019 at 10:06:17AM -0500, Mathieu Desnoyers wrote:
> ----- On Nov 14, 2019, at 8:53 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>
> > On Wed, Nov 13, 2019 at 10:42:32AM -0500, Mathieu Desnoyers wrote:
> >> ----- On Nov 11, 2019, at 8:13 AM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
> >>
> >> > Kprobes does something like:
> >> >
> >> > register:
> >> > arch_arm_kprobe()
> >> > text_poke(INT3)
> >> > /* guarantees nothing, INT3 will become visible at some point, maybe */
> >> >
> >> > kprobe_optimizer()
> >> > /* guarantees the bytes after INT3 are unused */
> >> > syncrhonize_rcu_tasks();
> >>
> >> syncrhonize -> synchronize
> >
> > Fixed.
> >
> >> > text_poke_bp(JMP32);
> >> > /* implies IPI-sync, kprobe really is enabled */
> >> >
> >> >
> >> > unregister:
> >> > __disarm_kprobe()
> >> > unoptimize_kprobe()
> >> > text_poke_bp(INT3 + tail);
> >> > /* implies IPI-sync, so tail is guaranteed visible */
> >> > arch_disarm_kprobe()
> >> > text_poke(old);
> >> > /* guarantees nothing, old will maybe become visible */
> >> >
> >> > synchronize_rcu()
> >> >
> >> > free-stuff
> >> >
> >> > Now the problem is that on register, the synchronize_rcu_tasks() does
> >> > not imply sufficient to guarantee all CPUs have already observed INT3
> >> > (although in practise this is exceedingly unlikely not to have
> >>
> >> practise -> practice
> >
> > And fixed.
> >
> >> > happened) (similar to how MEMBARRIER_CMD_PRIVATE_EXPEDITED does not
> >> > imply MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).
> >> >
> >> > Worse, even if it did, we'd have to do 2 synchronize calls to provide
> >> > the guarantee we're looking for, the first to ensure INT3 is visible,
> >> > the second to guarantee nobody is then still using the instruction
> >> > bytes after INT3.
> >>
> >> I'm not entirely convinced about this last statement though. AFAIU:
> >>
> >> - Swapping between some instruction and INT3 is OK,
> >> - Swapping back between that INT3 and _original_ instruction is OK,
> >> - Anything else needs to have explicit core serialization.
> >
> > So far, agreed.
> >
> >> So I understand the part about requiring the synchronize call to guarantee
> >> that nobody is then still using the instruction bytes following INT3, but not
> >> the rationale for the first synchronization. What operation would theoretically
> >> follow this first synchronize call ?
> >
> > I'm not completely sure what you're asking, so I'm going to explain too
> > much and hope I answered your question along the way. If not, please be
> > a little more specific.
> >
> > First:
> >
> > So what can happen with optimized kprobes is that the original
> > instruction is <5 bytes and we want to write a JMP.d32 (5 bytes).
> > Something like:
> >
> > 83e: 48 89 e5 mov %rsp,%rbp
> > 841: 48 85 c0 test %rax,%rax
> >
> > And we put a kprobe on the MOV. Then we poke INT3 at 0x83e and IPI-sync.
> > At that point the kprobe is active:
> >
> > 83e: cc 89 e5 int3 ; 2 byte garbage
> > 841: 48 85 c0 test %rax,%rax
> >
> > Now we want to optimize that thing. But imagine a task being preempted
> > at 0x841. If we poke in the JMP.d32 the above turns into gibberish
> >
> > 83e: e9 12 34 56 78 jmp +0x12345678
> >
> > Then our task resumes, loads the instruction at 0x841, which then reads:
> >
> > 841: 56 78 c0
> >
> > And goes *bang*.
>
> Thanks for the reminder. I somehow forgot that optimized kprobes covered
> instructions smaller than 5 bytes.
>
> >
> > So what we do, after enabling the regular kprobe, is call
> > synchronize_rcu_tasks() to wait for each task to have passed through
> > schedule(). That guarantees no task is preempted inside the kprobe
> > shadow (when it triggers it ensures it resumes execution at an
> > instruction boundary further than 5 bytes away).
>
> Indeed, given that synchronize_rcu_tasks() awaits for voluntary context
> switches (or user-space execution), it guarantees that no task was preempted
> within the kprobe shadow.
>
> Considering that synchronize_rcu_tasks() is meant only for code rewriting,
> I wonder if it would make sense to include the core serializing guarantees
> within this RCU API ?

As in have synchronize_rcu_tasks() do the IPI-sync love before doing
the current wait-for-voluntary-context-switch work? I don't know of
any objection to that approach. Certainly there is no possible
argument based on latency. ;-)

Thanx, Paul