Re: [PATCH v2 0/4] Static calls

From: Nadav Amit
Date: Wed Dec 12 2018 - 13:14:07 EST

> On Dec 12, 2018, at 9:11 AM, Edward Cree <ecree@xxxxxxxxxxxxxx> wrote:
> On 12/12/18 05:59, Nadav Amit wrote:
>> Thanks for cc'ing me. (I didn't know about the other patch-sets.)
> Well in my case, that's because I haven't posted any yet. (Will follow up
> shortly with what I currently have, though it's not pretty.)
> Looking at your patches, it seems you've got a much more developed learning
> mechanism. Mine on the other hand is brutally simple but runs continuously
> (i.e. after we patch we immediately enter the next 'relearning' phase);
> since it never does anything but prod a handful of percpu variables, this
> shouldn't be too costly.
> Also, you've got the macrology for making all indirect calls use this,
> whereas at present I just have an open-coded instance on a single call site
> (I went with deliver_skb in the networking stack).
> So I think where we probably want to go from here is:
> 1) get Josh's static_calls in. AIUI Linus seems to prefer the out-of-line
> approach; I'd say ditch the inline version (at least for now).
> 2) build a relpolines patch series that uses
> i) static_calls for the text-patching part
> ii) as much of Nadav's macrology as is applicable
> iii) either my or Nadav's learning mechanism; we can experiment with both,
> bikeshed it incessantly etc.
> Seem reasonable?

Mostly yes. I have a few reservations (and let's call them optpolines from
now on, since Josh disliked the previous name).

First, I still have to address the issues that Josh raised before, and try
to use a gcc plugin instead of (most of) the macros. Specifically, I need to
bring back (from my PoC code) the part that sets multiple targets.

Second, (2i) is not very intuitive for me. Using the out-of-line static
calls seems to me less performant than the inline ones (potentially, I didn't

Anyhow, the use of out-of-line static calls seems counter-intuitive to me.
I think (didn't measure) that it may add more overhead than it saves due to
the additional call, ret, and so on - at least if retpolines are not used.
For multiple targets it may be useful in saving some memory if the
out-of-line block is dynamically allocated (as I did in my yet unpublished
code). But that's not how it's done in Josh's code.

If we talk about the inline implementation, there is a different problem
that prevents me from using Josh's static-calls as-is. I tried to avoid
reading the compared target from memory and therefore used an immediate.
This should prevent data cache misses, and even when the data is available
it is faster by one cycle. But it requires both the "cmp %target-reg, imm"
and the "call rel-target" to be patched "atomically". So the static-calls
mechanism wouldn't be sufficient.
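
To make the constraint above concrete, here is a rough C sketch of what an
inline optpoline does semantically. It is only an illustration, not code
from either patch set: handler_a, handler_b, LEARNED_TARGET, optpoline_call
and slow_path_hits are made-up names, and a compile-time macro stands in
for the value that would really be patched into the instruction stream at
runtime.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*handler_t)(int);

/* Hypothetical handlers standing in for the possible indirect-call targets. */
static int handler_a(int x) { return x + 1; }
static int handler_b(int x) { return x * 2; }

/* Counts how often the fallback (indirect) path was taken. */
static unsigned long slow_path_hits;

/*
 * Assumption: the "learned" target is modeled as a compile-time constant.
 * In the scheme discussed here it would be an immediate in the cmp and a
 * relative displacement in the call, both rewritten by text patching.
 */
#define LEARNED_TARGET handler_a

static int optpoline_call(handler_t fn, int arg)
{
	assert(fn != NULL);

	/* Fast path: compare against a constant, then make a direct call. */
	if (fn == LEARNED_TARGET)
		return LEARNED_TARGET(arg);

	/* Slow path: ordinary indirect call (a retpoline in a hardened kernel). */
	slow_path_hits++;
	return fn(arg);
}
```

The compared constant and the direct call target both come from the same
LEARNED_TARGET here. If one could change without the other, a caller might
pass the comparison and still land in the wrong function, which is why the
two sites have to be updated together.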

Based on Josh's previous feedback, I thought of improving the learning using
some hysteresis. Anyhow, note that there are quite a few cases in which you
wouldn't want optpolines. The question is whether in general it would be an
opt-in or opt-out mechanism.

Let me know what you think.

BTW: When it comes to deliver_skb, you have packet_type as an identifier.
You can use it directly or through an indirection table to figure out the
target. Here's a chunk of assembly magic that I used in a similar case:

.macro _call_table val:req bit:req max:req val1:req bit1:req
# label that the jnz of the previous level jumps to
call_table_\val\()_\bit:
	test $(1 << \bit), %al
.if \val1 + (1 << \bit1) >= \max
	# last bit: jump straight to one of the two final targets
	jnz syscall_relpoline_\val1
	jmp syscall_relpoline_\val
.else
	jnz call_table_\val1\()_\bit1

	# fall-through to no carry, val unchanged, going to next bit
	call_table \val,\bit1,\max
	call_table \val1,\bit1,\max
.endif
.endm

.macro call_table val:req bit:req max:req
	_call_table \val,\bit,\max,%(\val + (1 << \bit)),%(\bit + 1)
.endm

	mov %esi, %eax
	call_table val=0 bit=0 max=16
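
For readers who do not want to expand the macros by hand, the following C
sketch shows roughly the decision tree they would generate for max=4
instead of 16. It is an illustration under assumptions: target0..target3
and dispatch4 are hypothetical names standing in for the
syscall_relpoline_N labels, and plain C calls stand in for the direct jumps.

```c
#include <assert.h>

/* Hypothetical final targets, one per identifier value 0..3. */
static int target0(int arg) { return arg; }
static int target1(int arg) { return arg + 10; }
static int target2(int arg) { return arg + 20; }
static int target3(int arg) { return arg + 30; }

/*
 * Dispatch on a small identifier by testing one bit at a time, lowest bit
 * first, so every path ends in a direct call rather than an indirect one.
 * Only the two low bits of id are examined.
 */
static int dispatch4(unsigned int id, int arg)
{
	assert(id < 4);

	if (id & 1) {			/* bit 0 set: values 1 and 3 */
		if (id & 2)
			return target3(arg);
		return target1(arg);
	}

	/* bit 0 clear: values 0 and 2 */
	if (id & 2)
		return target2(arg);
	return target0(arg);
}
```

Each lookup costs one conditional branch per identifier bit, so the
approach only pays off when the number of targets is small.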