RE: framebuffer corruption due to overlapping stp instructions on arm64

From: David Laight
Date: Mon Aug 06 2018 - 06:17:02 EST


From: Mikulas Patocka
> Sent: 05 August 2018 15:36
> To: David Laight
...
> There's an instruction movntdqa (and vmovntdqa) that can actually do
> prefetch on write-combining memory type. It's the only instruction that
> can do it.
>
> If this instruction is used on non-write-combining memory type, it behaves
> like movdqa.
>
...
> I benchmarked it on a processor with ERMS - for writes to the framebuffer,
> there's no difference between memcpy, 8-byte writes, rep stosb, rep stosq,
> mmx, sse, avx - all these methods achieve 16-17 GB/s

The combination of write-combining, posted writes and a fast PCIe slave
is probably why there is little difference.

> For reading from the framebuffer:
> 323 MB/s - memcpy (using avx2)
> 91 MB/s - explicit 8-byte reads
> 249 MB/s - rep movsq
> 307 MB/s - rep movsb

You must be getting the ERMS hardware optimised 'rep movsb'.

> 90 MB/s - mmx
> 176 MB/s - sse
> 4750 MB/s - sse movntdqa
> 330 MB/s - avx

avx512 is probably faster still.

> 5369 MB/s - avx vmovntdqa
>
> So - it may make sense to introduce a function memcpy_from_framebuffer()
> that uses movntdqa or vmovntdqa on CPUs that support it.

For kernel space it ought to be just memcpy_fromio().
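
To make the copy loop under discussion concrete, here is a rough user-space
sketch, not kernel code. copy_from_wc() and copy_streaming_sse41() are
made-up names, the GCC/clang target attribute and __builtin_cpu_supports()
are assumed to be available, and the source is assumed to be 16-byte aligned
with a length that is a multiple of 16. On ordinary cached memory the loads
behave like movdqa, as noted above, so this only demonstrates correctness,
not the speedup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <emmintrin.h>	/* _mm_storeu_si128 */
#include <smmintrin.h>	/* _mm_stream_load_si128, i.e. movntdqa */

/*
 * Hypothetical helper: copy len bytes, 16 at a time, using movntdqa loads.
 * src must be 16-byte aligned and len a multiple of 16.
 */
__attribute__((target("sse4.1")))
static void copy_streaming_sse41(void *dst, const void *src, size_t len)
{
	__m128i *s = (__m128i *)(uintptr_t)src;
	__m128i *d = (__m128i *)dst;
	size_t i;

	for (i = 0; i < len / 16; i++)
		_mm_storeu_si128(&d[i], _mm_stream_load_si128(&s[i]));
}
#endif

/* Hypothetical name, not an existing kernel or libc function. */
static void copy_from_wc(void *dst, const void *src, size_t len)
{
	/* movntdqa faults on a misaligned source address */
	assert(((uintptr_t)src & 15) == 0 && (len & 15) == 0);

#if defined(__x86_64__) || defined(__i386__)
	if (__builtin_cpu_supports("sse4.1")) {
		copy_streaming_sse41(dst, src, len);
		return;
	}
#endif
	memcpy(dst, src, len);	/* fallback: plain copy */
}
```

A caller would pass a 16-byte-aligned pointer into the write-combining
mapping as src and an ordinary buffer as dst. An avx2 variant would
presumably follow the same pattern with 32-byte loads.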

Can you easily repeat the tests using a non-write-combining map of the
same PCIe slave?

I can probably run the same measurements against our rather leisurely
FPGA based PCIe slave.
IIRC PCIe reads happen every 128 clocks of the card's 62.5MHz clock, so
increasing the size of the registers makes a significant difference.
I've not tried mapping write-combining and using (v)movntdqa.
I'm not sure what effect write-combining would have if the whole BAR
were mapped that way - so I'll either have to map the physical addresses
twice or add in another BAR.
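
As a back-of-envelope check on why register width matters, the sketch below
works out the read bandwidth. It assumes the figures recalled above are
right and that exactly one read completes per 128 clocks whatever its width.
read_bandwidth() is a made-up helper name.

```c
#include <assert.h>

/*
 * Bytes per second when one read of `width` bytes completes every
 * `clocks_per_read` cycles of a `clock_hz` clock.
 */
static double read_bandwidth(double clock_hz, unsigned int clocks_per_read,
			     unsigned int width)
{
	assert(clocks_per_read != 0);
	return clock_hz / clocks_per_read * width;
}
```

With 62.5MHz and 128 clocks that is 488281.25 reads per second, so roughly
1.95 MB/s for 4-byte reads, 3.9 MB/s for 8-byte, 7.8 MB/s for 16-byte and
15.6 MB/s for 32-byte reads.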

David
