Re: [PATCH 3/3] x86/ftrace: Use text_poke()

From: Andy Lutomirski
Date: Wed Oct 23 2019 - 00:25:18 EST


On Tue, Oct 22, 2019 at 4:49 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Tue, Oct 22, 2019 at 03:45:26PM -0700, Andy Lutomirski wrote:
> >
> >
> > >> On Oct 22, 2019, at 2:58 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
> > >>
> > >> On Tue, Oct 22, 2019 at 05:04:30PM -0400, Steven Rostedt wrote:
> > >> I gave a solution for this. And that is to add another flag to allow
> > >> for just the minimum to change the ip. And we can even add another flag
> > >> to allow for changing the stack if needed (to emulate a call with the
> > >> same parameters).
> > >
> > > your solution is to reduce the overhead.
> > > > my solution is to remove it completely. See the difference?
> > >
> > >> By doing this work, live kernel patching will also benefit. Because it
> > >> is also dealing with the unnecessary overhead of saving regs.
> > >> And we could possibly even have kprobes benefit from this if a kprobe
> > >> doesn't need full regs.
> > >
> > > > Neither of the two statements is true. The per-function generated trampoline
> > > I'm talking about is bpf specific. For a function with two arguments it's just:
> > > push rbp
> > > mov rbp, rsp
> > > push rdi
> > > push rsi
> > > lea rdi,[rbp-0x10]
> > > call jited_bpf_prog
> > > pop rsi
> > > pop rdi
> > > leave
> > > ret
> >
> > Why are you saving rsi? You said upthread that you're saving the args, but rsi is already available in rsi.
>
> because rsi is caller saved. The above example is for probing something
> like tcp_set_state(struct sock *sk, int state) that everyone used to
> kprobe until we got a tracepoint there.
> The main bpf prog has only one argument R1 == rdi on x86,
> but it's allowed to clobber all caller saved regs.
> Just like an x86 function that accepts one argument in rdi can clobber rsi and others.
> So it's essential to save 'sk' and 'state' for tcp_set_state()
> to continue as nothing happened.

Oh, right, you're hijacking the very first instruction, so you know
that the rest of the arg regs as well as rax are unused.

But I find it hard to believe that this is a particularly meaningful
optimization compared to the version that saves all the C-clobbered
registers.

Also, Alexei, are you testing on a CONFIG_FRAME_POINTER=y kernel? The
ftrace code has a somewhat nasty special case to make
CONFIG_FRAME_POINTER=y work right, and your example trampoline does
not, but arguably should, have exactly the same fixup. For good
performance, you should be using CONFIG_FRAME_POINTER=n.

Steven, with your benchmark, could you easily make your actual ftrace
hook do nothing at all and get a perf report on the result (i.e. call
the traced function in a loop a bunch of times under perf record -e
cycles or similar)? It would be interesting to see exactly what
trampoline code you're generating and just how bad it is. ISTM it
should be possible to squeeze very good performance out of ftrace. I
suppose you could also have a fancier mode than just "IP" that
specifies that the caller knows exactly which registers are live and
what they are. Then you could generate code that's exactly as good as
Alexei's.
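
For concreteness, here is a rough sketch of the kind of loop I mean,
done from userspace rather than from a test module. Everything in it
is an assumption for illustration: the helper name hammer_syscall() is
made up, and getppid is just a cheap syscall picked so that whatever
kernel function you have hooked on that path gets hit over and over.
Substitute whatever entry point actually reaches your traced function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from this thread): issue a cheap syscall
 * `iterations` times so that a traced kernel function on that path is
 * hit repeatedly.  getppid is an assumption chosen for illustration.
 * Returns the number of calls that succeeded.
 */
static uint64_t hammer_syscall(uint64_t iterations)
{
	uint64_t ok = 0;

	assert(iterations > 0);
	for (uint64_t i = 0; i < iterations; i++) {
		/* Raw syscall() so libc cannot short-circuit the call. */
		if (syscall(SYS_getppid) >= 0)
			ok++;
	}
	return ok;
}
```

Call that from a trivial main() with a large iteration count, attach
the do-nothing ftrace hook to the function on that path, and run the
binary under something like "perf record -e cycles:k -a", then look at
perf report. With the hook doing nothing, most of the samples near the
traced function should be trampoline overhead.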