Re: [RFC PATCH v2 0/4] dynamic indirect call promotion

From: Nadav Amit
Date: Tue Feb 05 2019 - 03:50:14 EST

> On Feb 1, 2019, at 4:05 PM, Edward Cree <ecree@xxxxxxxxxxxxxx> wrote:
> This series introduces 'dynamic_calls', branch trees of static calls (updated
> at runtime using text patching), to avoid making indirect calls to common
> targets. The basic mechanism is
> if (func ==
> call_static_key_1(args);
> else if (func ==
> call_static_key_2(args);
> /* ... */
> else
> (*func)(args); /* typically leads to a retpoline nowadays */
> with some additional statistics-gathering to allow periodic relearning of
> branch targets. Creating and calling a dynamic call table are each a single
> line in the consuming code, although they expand to a nontrivial amount of
> data and text in the kernel image.
> This is essentially indirect branch prediction, performed in software because
> we can't trust hardware to get it right. While the processor may speculate
> into the function calls, this is *probably* OK since they are known to be
> functions that frequently get called in this path, and thus are much less
> likely to contain side-channel information leaks than a completely arbitrary
> target PC from a branch target buffer.

My rationale is that it is ok since I presume that even after Spectre v2 is
addressed in HW, speculation might be possible using valid BTB targets
(matching the source RIP). This is somewhat equivalent to having software

> Moreover, when the speculation is
> accurate we positively want to be able to speculate into the callee.
> The branch target statistics are collected with percpu variables, counting
> both 'hits' on the existing branch targets and 'misses', divided into counts
> for up to four specific targets (first-come-first-served) and a catch-all
> miss count used once that table is full.
> When the total number of specific misses on a cpu reaches 1000, work is
> triggered which adds up counts across all CPUs and chooses the two most-
> popular call targets to patch into the call path.
> If instead the catch-all miss count reaches 1000, the counts and specific
> targets for that cpu are discarded, since either the target is too
> unpredictable (lots of low-frequency callees rather than a few dominating
> ones) or the targets that populated the table were by chance unpopular ones.
> To ensure that the static key target does not change between the if () check
> and the call, the whole dynamic_call must take place in an RCU read-side
> critical section (which, since the callee does not know it is being called in
> this manner, then lasts at least until the callee returns), and the patching
> at re-learning time is done with the help of a static_key to switch callers
> off the dynamic_call path and RCU synchronisation to ensure none are still on
> it. In cases where RCU cannot be used (e.g. because some callees need to RCU
> synchronise), it might be possible to add a variant that uses
> synchronize_rcu_tasks() when updating, but this series does not attempt this.

I wonder why. This seems like an easy solution, and according to Josh, Steven
Rostedt and the documentation appears to be valid.

> The dynamic_calls created by this series are opt-in, partly because of the
> abovementioned rcu_read_lock requirement.
> My attempts to measure the performance impact of dynamic_calls have been
> inconclusive; the effects on an RX-side UDP packet rate test were within
> Â1.5% and nowhere near statistical significance (p around 0.2-0.3 with n=6
> in a Welch t-test). This could mean that dynamic_calls are ineffective,
> but it could also mean that many more sites need converting before any gain
> shows up, or it could just mean that my testing was insufficiently sensitive
> or measuring the wrong thing. Given these poor results, this series is
> clearly not 'ready', hence the RFC tags, but hopefully it will inform the
> discussion in this area.

So I wasted quite some time on this profiling/implementation, and the
results that you get are not surprising. I am afraid that Josh and you
repeat some of the things I did before, which I now consider to be

Like you, I initially used a bunch of C macros to hook into the call
locations (although yours are nicer than mine). Similarly to this patch-set,
I originally change calling locations semi-manually, through an endless
process of: recording indirect branches (using performance counters);
running a python script to create a Coccinelle script to change the callers
and function definitions; and then applying the patches and fixing whatever
got broken.

It took me a while to understand this is the wrong approach. The callers
IMHO should not be changed - programmers should not need to understand some
macro is actually a function call. The complaints that PeterZ and others
give regarding PV infrastructure are relevant here as well.

I presume that the reason you did not see a performance improvement is that
there are hundreds of call sites in the network stack that need to be
modified. Most of the function pointers are concentrated in all kind of
âopsâ structures, so it is feasible to annotate them. Yet, changing the
callers will make the code somewhat ugly.

Indeed, it is possible to optimize your code. For example using some variant
of this_cpu_add() (*) would be better than this_cpu_ptr() followed by an add
operation. Or avoiding the WARN_ON_ONCE() if debug is not enabled somehow.
Yet, I donât think it is the right approach.

As I stated before, I think that the best solution is to use a GCC plugin,
similar to the version that I have sent before. I have an updated version
that allows to use custom attributes to opt-in certain branches, and
potentially, if you insist on inlining multiple branches, provide the number
of branches (**). Such a solution will not enable the calling code to be
written in C and would require a plugin for each architecture. Nevertheless,
the code would be more readable.

Feel free to try my code and give me feedback. I did not get a feedback on my
last version. Is there a fundamental problem with my plugin? Did you try it
and got bad results, or perhaps it did not build? Why do you prefer an approach
which requires annotation of the callers, instead of something that is much
more transparent?

(*) a variant that does not yell since you didnât disable preemption, but
does not prevent preemption or disable IRQs.

(**) I didnât send the latest version yet, since Iâm still struggling to make
it compatible with the PV infrastructure. If you want, I'll send the code as
it is right now.