Re: [PATCH v2 0/4] Static calls

From: Nadav Amit
Date: Wed Dec 12 2018 - 01:00:13 EST


> On Dec 11, 2018, at 10:05 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Fri, Dec 07, 2018 at 04:06:32PM +0000, Edward Cree wrote:
>> Sorry if this has been pointed out before (it's a very long thread), but
>> in the out-of-line implementation, it appears that static_call_update()
>> never alters key->func. Am I right in thinking that this should be
>> fixed by adding 'WRITE_ONCE(key->func, func);' just after the call to
>> arch_static_call_transform() on line 159 of include/linux/static_call.h?
>
> Yes, you're right about both bugs in the out-of-line case: key->func
> needs to be written, and __static_call_update() needs to be called by
> static_call_update. I was so focused on getting the inline case working
> that I overlooked those.
>
>> Some background (why does key->func matter for the
>> CONFIG_HAVE_STATIC_CALL_OUTLINE case?): I am experimenting with
>> combining these static calls with the 'indirect call wrappers' notion
>> that Paolo Abeni has been working on [1], using runtime instrumentation
>> to determine a list of potential callees. (This allows us to cope with
>> cases where the callees are in modules, or where different workloads may
>> use different sets of callees for a given call site, neither of which is
>> handled by Paolo's approach).
>> The core of my design looks something like:
>>
>> static int dynamic_call_xyz(int (*func)(some_args), some_args)
>> {
>> if (func == dynamic_call_xyz_1.func)
>> return static_call(dynamic_call_xyz_1, some_args);
>> if (func == dynamic_call_xyz_2.func)
>> return static_call(dynamic_call_xyz_2, some_args);
>> return (*func)(some_args);
>> }
>>
>> albeit with a bunch of extra (and currently rather ugly) stuff to collect
>> the statistics needed to decide what to put in the static call keys, and
>> mechanisms (RCU in my current case) to ensure that the static call isn't
>> changed between checking its .func and actually calling it.
>>
>> -Ed
>>
>> PS: not on list, please keep me in CC.
>>
>> [1] https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flwn.net%2FArticles%2F773985%2F&amp;data=02%7C01%7Cnamit%40vmware.com%7C147894d6c56d4ce6c1fc08d65f933adf%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636801483286688490&amp;sdata=QOZcfWXjvPqoR0oujtf1QTQLenv%2BiEu6jUA5fiav6Mo%3D&amp;reserved=0
>
> Thanks, this sounds very interesting. Adding Nadav to CC, as he has
> been looking at a different approach to solving the same problem:
>
> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20181018005420.82993-1-namit%40vmware.com&amp;data=02%7C01%7Cnamit%40vmware.com%7C147894d6c56d4ce6c1fc08d65f933adf%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C636801483286688490&amp;sdata=bCI2N2xKUVyyyPAnWDH3mC3DsSk%2Bzy5nmI0DV%2F%2FaYYw%3D&amp;reserved=0

Thanks for ccâing me. (I didnât know about the other patch-sets.)

Allow me to share my experience, because I was studying this issue for some
time and have implemented more than I shared, since the code need more
cleanup. Some of the proposed approaches are things we either considered or
actually implemented (and later dropped). Eventually, our design was guided
by performance profiling and a grain of âwhatâs academicâ consideration.

I think that eventually you would want to go with one central mechanism for
the various situations:

1. Registered targets (static) and automatically learnt targets (dynamic).
Registration does not work in some cases (e.g., file-system function
pointers, there are too many of those). On the other hand, if you know your
target it is obviously simpler/better.

2. With/without retpoline fallback. Weâve have always had the retpoline as
fallback, but if you use a registration mechanism, itâs not necessary.

3. Single and multiple targets. For multiple targets we decided to use
outline block in order not to inflate the code for no reason. There were
over 10000 indirect branches in our kernel build, but in our workloads only
~500 were actually run.

If you got with the approach that Edward mentioned, you may want to
associate each âfunction" with identifier (think about file_operations
having an additional field that holds a unique ID, or using the struct
address). This would allow you to use a âbinary searchâ to find the right
target, which would be slightly more efficient. We actually used a
binary-search for a different reason - learning the most frequent syscalls
per process and calling them in this manner (we actually had an indirection
table to choose the right one).

3. Call-chains which are mostly fixed (next). You want to unroll those.

4. Per-process calls. The case that bothered us the most is seccomp. On our
setup, systemd installed 17(!) BPF seccomp programs on Redis server, causing
every syscall to go through 17 indirect branches to invoke them. But
similarly you have mmu-notifiers and others. We used a per-process
trampoline page for this matter.

Now there is of course the question of whether to go through automatic
inference of the indirect call sites (using asm macros/GCC plugin) or
manually marking them (using C macros). Initially we used C macros, which we
created using semi-automatically generated Coccinelle scripts. As I
remembered how I was crucified in the past over âuglificationâ of the code,
I thought a transparent modification of the code would be better, so we went
with asm macros for our prototype.

Finally, I should mention that the impact of most of these mechanisms should
not be significant (or even positive) if retpolines were not used. Having
said that, the automatically-learnt indirect branch promotion (with a single
target) showed up to roughly 2% performance improvement.

Please let me know how you want to proceed. I didnât know about your
patch-set, but I think that having two similar (yet different) separate
mechanisms is not great. If you want, Iâll finish addressing the issues
youâve raised and send another RFC.

Regards,
Nadav