Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
From: Andy Chiu
Date: Wed Mar 20 2024 - 12:42:31 EST
Hi Björn,
On Fri, Mar 15, 2024 at 4:50 AM Björn Töpel <bjorn@kernelorg> wrote:
>
> Björn Töpel <bjorn@xxxxxxxxxx> writes:
>
> > Puranjay Mohan <puranjay12@xxxxxxxxx> writes:
> >
> >> Björn Töpel <bjorn@xxxxxxxxxx> writes:
> >>
> >>>
> >>> Hmm, depending on RISC-V's CMODX path, the pro/cons CALL_OPS vs dynamic
> >>> trampolines changes quite a bit.
> >>>
> >>> The more I look at the pains of patching two instruction ("split
> >>> immediates"), the better "patch data" + one insn patching look.
> >>
> >> I was looking at how dynamic trampolines would be implemented for RISC-V.
> >>
> >> With CALL-OPS we need to patch the auipc+jalr at function entry only, the
> >> ops pointer above the function can be patched atomically.
> >>
> >> With a dynamic trampoline we need a auipc+jalr pair at function entry to jump
> >> to the trampoline and then another auipc+jalr pair to jump from trampoline to
> >> ops->func. When the ops->func is modified, we would need to update the
> >> auipc+jalr at in the trampoline.
> >>
> >> So, I am not sure how to move forward here, CALL-OPS or Dynamic trampolines?
> >
> > Yeah. Honestly, we need to figure out the patching story prior
> > choosing the path, so let's start there.
> >
> > After reading Mark's reply, and discussing with OpenJDK folks (who does
> > the most crazy text patching on all platforms), having to patch multiple
> > instructions (where the address materialization is split over multiple
> > instructions) is a no-go. It's just a too big can of worms. So, if we
> > can only patch one insn, it's CALL_OPS.
> >
> > A couple of options (in addition to Andy's), and all require a
> > per-function landing address ala CALL_OPS) tweaking what Mark is doing
> > on Arm (given the poor branch range).
> >
> > ...and maybe we'll get RISC-V rainbows/unicorns in the future getting
> > better reach (full 64b! ;-)).
> >
> > A) Use auipc/jalr, only patch jalr to take us to a common
> > dispatcher/trampoline
> >
> > | <func_trace_target_data_8B> # probably on a data cache-line != func .text to avoid ping-pong
> > | ...
> > | func:
> > | ...make sure ra isn't messed up...
> > | aupic
> > | nop <=> jalr # Text patch point -> common_dispatch
> > | ACTUAL_FUNC
> > |
> > | common_dispatch:
> > | load <func_trace_target_data_8B> based on ra
> > | jalr
> > | ...
> >
> > The auipc is never touched, and will be overhead. Also, we need a mv to
> > store ra in a scratch register as well -- like Arm. We'll have two insn
> > per-caller overhead for a disabled caller.
> >
> > B) Use jal, which can only take us +/-1M, and requires multiple
> > dispatchers (and tracking which one to use, and properly distribute
> > them. Ick.)
> >
> > | <func_trace_target_data_8B> # probably on a data cache-line != func .text to avoid ping-pong
> > | ...
> > | func:
> > | ...make sure ra isn't messed up...
> > | nop <=> jal # Text patch point -> within_1M_to_func_dispatch
> > | ACTUAL_FUNC
> > |
> > | within_1M_to_func_dispatch:
> > | load <func_trace_target_data_8B> based on ra
> > | jalr
> >
> > C) Use jal, which can only take us +/-1M, and use a per-function
> > trampoline requires multiple dispatchers (and tracking which one to
> > use). Blows up text size A LOT.
> >
> > | <func_trace_target_data_8B> # somewhere, but probably on a different cacheline than the .text to avoid ping-ongs
> > | ...
> > | per_func_dispatch
> > | load <func_trace_target_data_8B> based on ra
> > | jalr
> > | func:
> > | ...make sure ra isn't messed up...
> > | nop <=> jal # Text patch point -> per_func_dispatch
> > | ACTUAL_FUNC
>
> Brendan proposed yet another option, "in-function dispatch":
>
> D)
> | <func_trace_target_data_8B_per_function> # idk somewhere
> | ...
> | func:
> | mv tmp1, ra
> | aupic tmp2, <func_trace_target_data_8B_per_function:upper>
> | mv tmp3, zero <=> ld tmp3, tmp2<func_trace_target_data_8B_per_function:lower>
> | nop <=> jalr ra, tmp3
> | ACTUAL_FUNC
My patch series takes a similar "in-function dispatch" approach. A
difference is that the <func_trace_target_data_8B_per_function> is
embedded within each function entry. I'd like to have it moved to a
run-time allocated array to reduce total text size.
Another difference is that my series changes the first instruction to
"j ACTUAL_FUNC" for the "ftrace disable" case. As long as the
architecture guarantees the atomicity of the first instruction, then
we are safe. For example, we are safe if the first instruction could
only be "mv tmp, ra" or "j ACTUAL_FUNC". And since the loaded address is
always valid, we can fix "mv + jalr" down so we don't have to
play with the exception handler trick. The guarantee from arch would
require ziccif (in RVA22) though, but I think it is the same for us
(unless with stop_machine). For ziccif, I would rather call that out
during boot than blindly assume.
However, one thing I am not very sure is: do we need a destination
address in a "per-function" manner? It seems like most of the time the
destination address can only be ftrace_call, or ftrace_regs_call. If
the number of destination addresses is very few, then we could
potentially reduce the size of
<func_trace_target_data_8B_per_function>.
>
> There are 4 CMODX possiblities:
> mv, nop: fully disabled, no problems
> mv, jalr: We will jump to zero. We would need to have the inst
> page/access fault handler take care of this case. Especially
> if we align the instructions so that they can be patched
> together, being interrupted in the middle and taking this
> path will be rare.
> ld, nop: no problems
> ld, jalr: fully enabled, no problems
>
> Patching is a 64b store/sd, and we only need a fence.i at the end, since
> we can handle all 4 possibilities.
>
> For the disabled case we'll have:
> A) mv, aupic, nop
> D) mv, aupic, mv, nop.
>
> Puranjay, I've flipped. Let's go Mark's CALL_OPS together with a new
> text patch mechanism w/o stop_machine().
>
>
> Björn
Cheers,
Andy