Re: [PATCH v2 3/6] riscv: ftrace: prepare ftrace for atomic code patching
From: Björn Töpel
Date: Wed Sep 11 2024 - 13:16:56 EST
Tao Chiu <andybnac@xxxxxxxxx> writes:
> Björn Töpel <bjorn@xxxxxxxxxx> 於 2024年9月11日 週三 下午10:37寫道:
>
>>
>> Andy Chiu <andybnac@xxxxxxxxx> writes:
>>
>> > On Wed, Aug 14, 2024 at 02:57:52PM +0200, Björn Töpel wrote:
>> >> Björn Töpel <bjorn@xxxxxxxxxx> writes:
>> >>
>> >> > Andy Chiu <andy.chiu@xxxxxxxxxx> writes:
>> >> >
>> >> >> We use an AUIPC+JALR pair to jump into a ftrace trampoline. Since
>> >> >> instruction fetch can break down to 4 byte at a time, it is impossible
>> >> >> to update two instructions without a race. In order to mitigate it, we
>> >> >> initialize the patchable entry to AUIPC + NOP4. Then, the run-time code
>> >> >> patching can change NOP4 to JALR to eable/disable ftrcae from a
>> >> > enable ftrace
>> >> >
>> >> >> function. This limits the reach of each ftrace entry to +-2KB displacing
>> >> >> from ftrace_caller.
>> >> >>
>> >> >> Starting from the trampoline, we add a level of indirection for it to
>> >> >> reach ftrace caller target. Now, it loads the target address from a
>> >> >> memory location, then perform the jump. This enable the kernel to update
>> >> >> the target atomically.
>> >> >
>> >> > The +-2K limit is for direct calls, right?
>> >> >
>> >> > ...and this I would say breaks DIRECT_CALLS (which should be implemented
>> >> > using call_ops later)?
>> >>
>> >> Thinking a bit more, and re-reading the series.
>> >>
>> >> This series is good work, and it's a big improvement for DYNAMIC_FTRACE,
>> >> but
>> >>
>> >> +int ftrace_make_call(struct dyn_ftrace *rec, unsigned long addr)
>> >> +{
>> >> + unsigned long distance, orig_addr;
>> >> +
>> >> + orig_addr = (unsigned long)&ftrace_caller;
>> >> + distance = addr > orig_addr ? addr - orig_addr : orig_addr - addr;
>> >> + if (distance > JALR_RANGE)
>> >> + return -EINVAL;
>> >> +
>> >> + return __ftrace_modify_call(rec->ip, addr, false);
>> >> +}
>> >> +
>> >>
>> >> breaks WITH_DIRECT_CALLS. The direct trampoline will *never* be within
>> >> the JALR_RANGE.
>> >
>> >
>> > Yes, it is hardly possible that a direct trampoline will be within the
>> > range.
>> >
>> > Recently I have been thinking some solutions to address the issue. One
>> > solution is replaying AUIPC at function entries. The idea has two sides.
>> > First, if we are returning back to the second instruction at trap return,
>> > then do sepc -= 4 so it executes the up-to-date AUIPC. The other side is
>> > to fire synchronous IPI that does remote fence.i at right timings to
>> > prevent concurrent executing on a mix of old and new instructions.
>> >
>> > Consider replacing instructions at a function's patchable entry with the
>> > following sequence:
>> >
>> > Initial state:
>> > --------------
>> > 0: AUIPC
>> > 4: JALR
>> >
>> > Step1:
>> > write(0, "J +8")
>> > fence w,w
>> > send sync local+remote fence.i
>> > ------------------------
>> > 0: J +8
>> > 4: JALR
>> >
>> > Step2:
>> > write(4, "JALR'")
>> > fence w,w
>> > send sync local+remote fence.i
>> > ------------------------
>> > 0: J +8
>> > 4: JALR'
>> >
>> > Step3:
>> > write(0, "AUIPC'")
>> > fence w,w
>> > send sync local+remote fence.i (to activate the call)
>> > -----------------------
>> > 0: AUIPC'
>> > 4: JALR'
>> >
>> > The following execution sequences are acceptable:
>> > - AUIPC, JALR
>> > - J +8, (skipping {JALR | JALR'})
>> > - AUIPC', JALR'
>> >
>> > And here are sequences that we want to prevent:
>> > - AUIPC', JALR
>> > - AUIPC, JALR'
>> >
>> > The local core should never execute the forbidden sequence.
>> >
>> > By listing all possible combinations of executing sequence on a remote
>> > core, we can find that the dangerous seqence is impossible to happen:
>> >
>> > let f be the fence.i at step 1, 2, 3. And let numbers be the location of
>> > code being executed. Mathematically, here are all combinations at a site
>> > happening on a remote core:
>> >
>> > fff04 -- updated seq
>> > ff0f4 -- impossible, would be ff0f04, updated seq
>> > ff04f -- impossible, would be ff08f, safe seq
>> > f0ff4 -- impossible, would be f0ff04, updated seq
>> > f0f4f -- impossible, would be f0f08f (safe), or f0f0f04 (updated)
>> > f04ff -- impossible, would be f08ff, safe seq
>> > 0fff4 -- impossible, would be 0fff04, updated seq
>> > 0ff4f -- impossible, would be 0ff08f (safe), or 0ff0f04 (updated)
>> > 0f4ff -- impossible, would be 0f08ff (safe), 0f0f08f (safe), 0f0f0f04 (updated)
>> > 04fff -- old seq
>> >
>> > After the 1st 'fence.i', remote cores should observe (J +8, JALR) or (J +8, JALR')
>> > After the 2nd 'fence.i', remote cores should observe (J +8, JALR') or (AUIPC', JALR')
>> > After the 3rd 'fence.i', remote cores should observe (AUIPC', JALR')
>> >
>> > Remote cores should never execute (AUIPC',JALR) or (AUIPC,JALR')
>> >
>> > To correctly implement the solution, the trap return code must match JALR
>> > and adjust sepc only for patchable function entries. This is undocumently
>> > possible because we use t0 as source and destination registers for JALR
>> > at function entries. Compiler never generates JALR that uses the same
>> > register pattern.
>> >
>> > Another solution is inspired by zcmt, and perhaps we can optimize it if
>> > the hardware does support zcmt. First, we allocate a page and divide it
>> > into two halves. The first half of the page are 255 x 8B destination
>> > addresses. Then, starting from offset 2056, the second half of the page
>> > is composed by a series of 2 x 4 Byte instructions:
>> >
>> > 0: ftrace_tramp_1
>> > 8: ftrace_tramp_2
>> > ...
>> > 2040: ftrace_tramp_255
>> > 2048: ftrace_tramp_256 (not used when configured with 255 tramps)
>> > 2056:
>> > ld t1, -2048(t1)
>> > jr t1
>> > ld t1, -2048(t1)
>> > jr t1
>> > ...
>> > 4088:
>> > ld t1, -2048(t1)
>> > jr t1
>> > 4096:
>> >
>> > It is possible to expand to 511 trampolines by adding a page
>> > below, and making a load+jr sequence from +2040 offset.
>> >
>> > When the kernel boots, we direct AUIPCs at patchable entries to the page,
>> > and disable the call by setting the second instruction to NOP4. Then, we
>> > can effectively enable/disable/modify a call by setting only the
>> > instruction at JALR. It is possible to utilize most of the current patch
>> > set to achieve atomic patching. A missing part is to allocate and manage
>> > trampolines for ftrace users.
>>
>> (I will need to digest above in detail!)
>>
>> I don't think it's a good idea to try to handle direct calls w/o
>> call_ops. What I was trying to say is "add the call_ops patch to your
>> series, so that direct calls aren't broken". If direct calls depend on
>> call_ops -- sure, no worries. But don't try to get direct calls W/O
>> call_ops. That's a whole new bag of worms.
>>
>> Some more high-level thoughts: ...all this to workaround where we don't
>> want the call_ops overhead? Is there really a use-case with a platform
>> that doesn't handle the text overhead of call_ops?
>
> Sorry for making any confusions. I have no strong personal preference
> on what we should do. Just want to have a technical discussion on what
> is possible if we want to optimize code size.
Understood! I'm -- as you know -- really eager to have a text patching
mechanism that is not horrible. ;-) I'd rather wait with the text size
optimizations.
TL;DR -- your series is fine from my POV, but it's missing call_ops, so
that it doesn't break direct calls.
I will take your patch series, and Puranjay's call_ops series see how
they play together.
We can discuss it at LPC next week!
Cheers,
Björn