RE: objtool clac/stac handling change..

From: David Laight
Date: Mon Jul 13 2020 - 05:32:41 EST


From: Linus Torvalds
> Sent: 10 July 2020 23:37
> On Tue, Jul 7, 2020 at 5:35 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> >
> > So separate copy and checksum passes should easily exceed 4 bytes/clock,
> > but I suspect that doing them together never does.
> > (Unless the buffer is too big for the L1 cache.)
>
> It's the "touch the caches twice" that is the problem.
>
> And it's not the "buffer is too big for L1", it's "the source, the
> destination and any incidentals are too big for L1" with the
> additional noise from replacement policies etc.

That's really what I meant.
L1D is actually (probably) only 32kB.
I guess that gives you 8k for the buffer.

It is a shame you can't use the AVX instructions in the kernel.
(Although saving the AVX registers probably costs more than the gain.)
Then you could use something based on:
10:     load   ymm,src+idx      // 32 bytes
        store  ymm,tgt+idx
        addq   sum0,ymm         // eight 32bit adds
        rotate ymm,16           // Pretty sure there is an instruction for this!
        addq   sum1,ymm
        add    idx,32
        jnz    10b
It is then possible to determine the correct result from sum0/sum1.
On a very recent Intel CPU that might even run at 1 iteration/clock!
(Probably needs an unroll and explicit interleave.)
At one iteration every 2 clocks it matches the AD[CO]X loop
but includes the write.
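
To show how the result can be recovered from sum0/sum1, here is a scalar C
model of a single 32-bit lane of the loop above. It is only a sketch, not
kernel code: the names copy_and_sum(), reference_sum(), fold16() and rot16()
are made up for illustration, and the recovery step is one possible way of
doing it. It assumes no more than 65536 words have been added per lane, so
that the sum of each set of 16-bit halves still fits in 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide sum down to 16 bits using end-around carry. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Swap the two 16-bit halves of a 32-bit word. */
static uint32_t rot16(uint32_t w)
{
    return (w << 16) | (w >> 16);
}

/*
 * Copy n 32-bit words and return the (uninverted) ones' complement sum of
 * their 16-bit halves.  sum0 and sum1 wrap modulo 2^32, like the per-lane
 * vector adds would.
 */
static uint16_t copy_and_sum(uint32_t *dst, const uint32_t *src, size_t n)
{
    uint32_t sum0 = 0, sum1 = 0;
    size_t i;

    assert(n <= 65536);     /* assumption: keeps each half-sum below 2^32 */

    for (i = 0; i < n; i++) {
        uint32_t w = src[i];

        dst[i] = w;         /* the copy */
        sum0 += w;          /* low half of sum0 = low bits of sum(lo) */
        sum1 += rot16(w);   /* low half of sum1 = low bits of sum(hi) */
    }

    /*
     * High half of sum0 is (low bits of sum(hi)) + (carries out of sum(lo)),
     * so subtracting the low half of sum1 leaves the lost carries.
     * The same holds with sum0 and sum1 swapped.
     */
    uint32_t lo_carry = ((sum0 >> 16) - (sum1 & 0xffff)) & 0xffff;
    uint32_t hi_carry = ((sum1 >> 16) - (sum0 & 0xffff)) & 0xffff;

    uint64_t lo_total = ((uint64_t)lo_carry << 16) | (sum0 & 0xffff);
    uint64_t hi_total = ((uint64_t)hi_carry << 16) | (sum1 & 0xffff);

    return fold16(lo_total + hi_total);
}

/* Straightforward reference: add both halves into a 64-bit accumulator. */
static uint16_t reference_sum(const uint32_t *src, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += (uint64_t)(src[i] & 0xffff) + (src[i] >> 16);
    return fold16(sum);
}
```

A vector version would keep eight such lanes and combine them at the end;
longer buffers would need the recovery step run every 65536 iterations.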

David
