Re: [PATCH v2 0/4] Static calls
From: Nadav Amit
Date: Wed Dec 12 2018 - 16:15:22 EST
> On Dec 12, 2018, at 10:33 AM, Edward Cree <ecree@xxxxxxxxxxxxxx> wrote:
>
> On 12/12/18 18:14, Nadav Amit wrote:
>> Second, (2i) is not very intuitive for me. Using the out-of-line static
>> calls seems to me as less performant than the inline (potentially, I didnât
>> check).
>>
>> Anyhow, the use of out-of-line static calls seems to me as
>> counter-intuitive. I think (didnât measure) that it may add more overhead
>> than it saves due to the additional call, ret, and so on
> AIUI the outline version uses a tail-call (i.e. jmpq *target) rather than an
> additional call and ret. So I wouldn't expect it to be too expensive.
> More to the point, it seems like it's easier to get right than the inline
> version, and if we get the inline version working later we can introduce it
> without any API change, much as Josh's existing patches have both versions
> behind a Kconfig switch.
I see. For my outlined blocks I used the opposite approach - a call followed
by jmp (instead of jmp followed by call that Josh did). Does the stack look
correct when you first do the jmp? It seems to me that you will not see the
calling function on the stack in this case. Can it have an effect on
live-patching, debugging?
>> I tried to avoid reading to
>> compared target from memory and therefore used an immediate. This should
>> prevent data cache misses and even when the data is available is faster by
>> one cycle. But it requires the patching of both the âcmp %target-reg, immâ
>> and âcall rel-targetâ to be patched âatomicallyâ. So the static-calls
>> mechanism wouldnât be sufficient.
> The approach I took to deal with that (since though I'm doing a read from
> memory, it's key->func in .data rather than the jmp immediate in .text) was
> to have another static_call (though a plain static_key could also be used)
> to 'skip' the fast-path while it's actually being patched. Then, since all
> my callers were under the rcu_read_lock, I just needed to synchronize_rcu()
> after switching off the fast-path to make sure no threads were still in it.
> I'm not sure how that would be generalised to all cases, though; we don't
> want to force every indirect call to take the rcu_read_lock as that means
> no callee can ever synchronize_rcu(). I guess we could have our own
> separate RCU read lock just for indirect call patching? (What does kgraft
> do?)
I used a similar approach to a certain extent. (Iâm going to describe the
implementation following the discussion with Andy Lutomirski): We use a
ârestartable sectionâ that if we need to be preempted in this block of code,
we restart the entire section. Then, we use synchronize_rcu() like you do
after patching.
>> Based on Joshâs previous feedback, I thought of improving the learning using
>> some hysteresis. Anyhow, note that there are quite a few cases in which you
>> wouldnât want optpolines. The question is whether in general it would be an
>> opt-in or opt-out mechanism.
> I was working on the assumption that it would be opt-in, wrapping a macro
> around indirect calls that are known to have a fairly small number of hot
> targets. There are plenty of indirect calls in the kernel that are only
> called once in a blue moon, e.g. in control-plane operations like ethtool;
> we don't really need to bulk up .text with trampolines for all of them.
On the other hand, Iâm not sure the static_call interface is so intuitive.
And extending it into âdynamic_callâ might be even worse. As I initially
used an opt-in approach, I can tell you that it was very exhausting.