Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
From: Nadav Amit
Date: Sat Dec 01 2018 - 02:08:41 EST
> On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Nov 28, 2018, at 6:06 PM, Nadav Amit <namit@xxxxxxxxxx> wrote:
>>>
>>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>>>>>> On Wed, Nov 28, 2018 at 07:34:52PM +0000, Nadav Amit wrote:
>>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>>>>> This RFC introduces indirect call promotion in runtime, which for the
>>>>>>>>> matter of simplification (and branding) will be called here "relpolines"
>>>>>>>>> (relative call + trampoline). Relpolines are mainly intended as a way
>>>>>>>>> of reducing retpoline overheads due to Spectre v2.
>>>>>>>>>
>>>>>>>>> Unlike indirect call promotion through profile guided optimization, the
>>>>>>>>> proposed approach does not require a profiling stage, works well with
>>>>>>>>> modules whose address is unknown and can adapt to changing workloads.
>>>>>>>>>
>>>>>>>>> The main idea is simple: for every indirect call, we inject a piece of
>>>>>>>>> code with fast- and slow-path calls. The fast path is used if the target
>>>>>>>>> matches the expected (hot) target. The slow-path uses a retpoline.
>>>>>>>>> During training, the slow-path is set to call a function that saves the
>>>>>>>>> call source and target in a hash-table and keep count for call
>>>>>>>>> frequency. The most common target is then patched into the hot path.
>>>>>>>>>
>>>>>>>>> The patching is done on-the-fly by patching the conditional branch
>>>>>>>>> (opcode and offset) that is used to compare the target to the hot
>>>>>>>>> target. This allows to direct all cores to the fast-path, while patching
>>>>>>>>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>>>>>>>>> patch a single byte when the code might be executed by any core. (2)
>>>>>>>>> When patching more than one byte, ensure that all cores do not run the
>>>>>>>>> to-be-patched-code by preventing this code from being preempted, and
>>>>>>>>> using synchronize_sched() after patching the branch that jumps over this
>>>>>>>>> code.
>>>>>>>>>
>>>>>>>>> Changing all the indirect calls to use relpolines is done using assembly
>>>>>>>>> macro magic. There are alternative solutions, but this one is
>>>>>>>>> relatively simple and transparent. There is also logic to retrain the
>>>>>>>>> software predictor, but the policy it uses may need to be refined.
>>>>>>>>>
>>>>>>>>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>>>>>>>>
>>>>>>>>>               base    relpoline
>>>>>>>>>               ----    ---------
>>>>>>>>> nginx         22898   25178 (+10%)
>>>>>>>>> redis-ycsb    24523   25486 (+4%)
>>>>>>>>> dbench        2144    2103 (+2%)
>>>>>>>>>
>>>>>>>>> When retpolines are disabled, and if retraining is off, performance
>>>>>>>>> benefits are up to 2% (nginx), but are much less impressive.
>>>>>>>>
>>>>>>>> Hi Nadav,
>>>>>>>>
>>>>>>>> Peter pointed me to these patches during a discussion about retpoline
>>>>>>>> profiling. Personally, I think this is brilliant. This could help
>>>>>>>> networking and filesystem intensive workloads a lot.
>>>>>>>
>>>>>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>>>>>>
>>>>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>>>>>
>>>>>>> I finished another version two weeks ago, and every day I think: "should it
>>>>>>> be RFCv2 or v1", ending up not sending it...
>>>>>>>
>>>>>>> There is one issue that I realized while working on the new version: I'm not
>>>>>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>>>>>> indirect branch promotion code can change rflags, which might cause
>>>>>>> correctness issues. In practice, using gcc, it is not a problem.
>>>>>>
>>>>>> Callees can clobber flags, so it seems fine to me.
>>>>>
>>>>> Just to check I understand your approach right: you made a macro
>>>>> called "call", and you're therefore causing all instances of "call" to
>>>>> become magic? This is... terrifying. It's even plausibly worse than
>>>>> "#define if" :) The scariest bit is that it will impact inline asm as
>>>>> well. Maybe a gcc plugin would be less alarming?
>>>>
>>>> It is likely to look less alarming. When I looked at the inline retpoline
>>>> implementation of gcc, it didn't look much better than what I did - it
>>>> basically just emits assembly instructions.
>>>
>>> To be clear, that wasn't a NAK. It was merely a "this is alarming."
>>
>> Although... how do you avoid matching on things that really don't want
>> this treatment? paravirt ops come to mind.
>
> Paravirt ops don't use retpolines because they're patched into direct
> calls during boot. So Nadav's patches won't touch them.
Actually, the way it's handled is slightly more complicated - yes, the CALL
macro should not be applied, as Josh said, but the question is how that is
achieved.
The basic idea is that the CALL macro should only be applied to C
source files, and not to assembly files or to macros.s, which holds the PV
call macros. I will recheck that it is done this way.
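For anyone joining the thread late, the fast-path/slow-path idea from the
quoted cover letter can be sketched in plain userspace C. This is only an
illustration, not code from the patches: every name in it (call_site,
record_target, promote_hottest, promoted_call, the sample handlers) is
hypothetical, the "hot" target is a data field rather than a patched
instruction, the slow path is an ordinary indirect call rather than a
retpoline, and a small fixed array stands in for the hash table.

```c
#include <stddef.h>

/* Hypothetical names throughout; none of this is taken from the patches. */
typedef int (*handler_t)(int);

#define MAX_TARGETS 4

struct call_site {
	handler_t hot;                   /* promoted target, NULL while training */
	handler_t seen[MAX_TARGETS];     /* targets observed on the slow path */
	unsigned long hits[MAX_TARGETS]; /* per-target call counts */
	unsigned long fast_calls;
	unsigned long slow_calls;
};

/* Training step: count how often each target is called from this site. */
static void record_target(struct call_site *site, handler_t fn)
{
	for (size_t i = 0; i < MAX_TARGETS; i++) {
		if (site->seen[i] == NULL)
			site->seen[i] = fn;  /* claim the first free slot */
		if (site->seen[i] == fn) {
			site->hits[i]++;
			return;
		}
	}
	/* Table full: this sample is simply dropped. */
}

/* Make the most frequently seen target the expected (hot) one. */
static void promote_hottest(struct call_site *site)
{
	size_t best = 0;

	for (size_t i = 1; i < MAX_TARGETS; i++)
		if (site->hits[i] > site->hits[best])
			best = i;
	if (site->hits[best] > 0)
		site->hot = site->seen[best];
}

/* The call site: compare against the hot target, else take the slow path. */
static int promoted_call(struct call_site *site, handler_t fn, int arg)
{
	if (fn == site->hot) {
		/* Fast path. In the real scheme this would be a direct call. */
		site->fast_calls++;
		return site->hot(arg);
	}
	/* Slow path. In the real scheme this would go through a retpoline. */
	site->slow_calls++;
	record_target(site, fn);
	return fn(arg);
}

/* Two sample targets for trying the sketch out. */
static int add_one(int x) { return x + 1; }
static int times_two(int x) { return x * 2; }
```

Calling promoted_call() a few times with times_two and once with add_one,
then calling promote_hottest(), should leave times_two on the fast path
while add_one keeps going through the slow path.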
Regards,
Nadav