Re: HalfSipHash Acceptable Usage

From: Andy Lutomirski
Date: Wed Dec 21 2016 - 21:01:42 EST


On Wed, Dec 21, 2016 at 9:25 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, Dec 21, 2016 at 7:55 AM, George Spelvin
> <linux@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> How much does kernel_fpu_begin()/kernel_fpu_end() cost?
>
> It's now better than it used to be, but it's absolutely disastrous
> still. We're talking easily many hundreds of cycles. Under some loads,
> thousands.
>
> And I warn you already: it will _benchmark_ a hell of a lot better
> than it will work in reality. In benchmarks, you'll hit all the
> optimizations ("oh, I've already saved away all the FP registers, no
> need to do it again").
>
> In contrast, in reality, especially with things like "do it once or
> twice per incoming packet", you'll easily hit the absolute worst
> cases, where not only does it take a few hundred cycles to save the FP
> state, you'll then return to user space in between packets, which
> triggers the slow-path return code and reloads the FP state, which is
> another few hundred cycles plus.

Hah, you're thinking that the x86 code works the way that Rik and I
want it to work, and you just made my day. :) What actually happens
is that the state is saved in kernel_fpu_begin() and restored in
kernel_fpu_end(), and it'll take a few hundred cycles best case. If
you do it a bunch of times in a loop, you *might* trigger a CPU
optimization that notices that the state being saved is the same state
that was just restored, but you're still going to pay the full restore
code each round trip no matter what.

The code is much clearer in 4.10 kernels now that I deleted the unused
"lazy" branches.

>
> Similarly, in benchmarks you'll hit the "modern CPU's power on the AVX
> unit and keep it powered up for a while afterwards", while in real
> life you would quite easily hit the "oh, AVX is powered down because
> we were idle, now it powers up at half speed which is another latency
> hit _and_ the AVX unit won't run full out anyway".

I *think* that was mostly fixed in Broadwell or thereabouts (in terms
of latency -- throughput and power consumption still suffers).