Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

From: Ingo Molnar
Date: Thu Sep 14 2017 - 05:29:06 EST



* Ingo Molnar <mingo@xxxxxxxxxx> wrote:

> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
> vpaddq (TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new
> constraint, and it costs a few extra cycles per loop iteration if the workload
> is limited by address-generation bandwidth at that point.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into the
> loop body, to the point where 'T1' is used for the last time - which is:
>
>
> mov a, T1 # T1 = a # MAJB
> and c, T1 # T1 = a&c # MAJB
>
> add y0, y2 # y2 = S1 + CH # --
> or T1, y3 # y3 = MAJ = ((a|c)&b)|(a&c) # MAJ
>
> + mov frame_TBL(%rsp), TBL
>
> add y1, h # h = k + w + h + S0 # --
>
> add y2, d # d = k + w + h + d + S1 + CH = d + t1 # --
>
> add y2, h # h = k + w + h + S0 + S1 + CH = t1 + S0# --
> add y3, h # h = t1 + S0 + MAJ # --
>
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st
iteration, via something like:

movq $4, frame_SRND(%rsp)

+ mov frame_TBL(%rsp), TBL

.align 16
loop1:
vpaddq (TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
FOUR_ROUNDS_AND_SCHED
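
For illustration only, here is a rough C model of the same loop restructuring.
It is a sketch, not the kernel code: 'struct frame', 'r12', and both function
names are hypothetical, and a C compiler is free to schedule the loads however
it likes. It assumes a single variable stands in for the register shared by
'TBL' and 'T1', and shows why the hoisted variant needs the extra load before
the first iteration:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the stack slot frame_TBL(%rsp). */
struct frame {
	const uint64_t *tbl;	/* spilled table pointer */
};

/* Reload at the top of each iteration: the load feeds the very next use. */
static uint64_t sum_reload_at_top(const uint64_t *table, size_t n)
{
	struct frame f = { .tbl = table };
	uintptr_t r12;		/* one "register" shared by TBL and T1 */
	uint64_t acc = 0;

	for (size_t i = 0; i < n; i++) {
		r12 = (uintptr_t)f.tbl;			/* reload TBL */
		acc += *(const uint64_t *)r12;		/* immediate use of TBL */
		f.tbl = (const uint64_t *)r12 + 1;	/* spill advanced pointer */
		r12 = (uintptr_t)(acc & 0xff);		/* register reused as T1 */
		acc ^= (uint64_t)r12;			/* last use of T1 */
	}
	return acc;
}

/* Reload moved up to just after T1's last use, plus a priming load. */
static uint64_t sum_reload_hoisted(const uint64_t *table, size_t n)
{
	struct frame f = { .tbl = table };
	uintptr_t r12;
	uint64_t acc = 0;

	r12 = (uintptr_t)f.tbl;		/* initialize TBL before the 1st iteration */
	for (size_t i = 0; i < n; i++) {
		acc += *(const uint64_t *)r12;		/* TBL is already loaded */
		f.tbl = (const uint64_t *)r12 + 1;	/* spill advanced pointer */
		r12 = (uintptr_t)(acc & 0xff);		/* register reused as T1 */
		acc ^= (uint64_t)r12;			/* last use of T1 */
		r12 = (uintptr_t)f.tbl;			/* early reload of TBL */
		/* the remaining independent work of the iteration goes here */
	}
	return acc;
}
```

Both functions compute the same value; the only difference is where the reload
sits relative to its first use, which is the point of the suggestion above.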

Thanks,

Ingo