RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight
Date: Fri Nov 23 2018 - 05:12:33 EST


From: David Laight
> Sent: 23 November 2018 09:35
> From: Linus Torvalds
> > Sent: 22 November 2018 18:58
> ...
> > Oh, and I just noticed that on x86 we expressly use our old "safe and
> > sane" functions: see __inline_memcpy(), and its use in
> > __memcpy_{from,to}io().
> >
> > So the "falls back to memcpy" was always a red herring. We don't
> > actually do that.
> >
> > Which explains why things work.
>
> It doesn't explain why I've seen single-byte PCIe TLPs generated
> by memcpy_to/fromio().
>
> I've had to change code to use readq/writeq loops because the
> byte accesses are so slow - even when PIO performance should
> be 'good enough'.
>
> It might have been changed since last time I tested it.
> But I don't remember seeing a commit go by.

I've just patched my driver and re-run the test on a 4.13 (Ubuntu) kernel.
Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
generates a lot of single-byte TLPs.

What the code normally does instead is 64-bit aligned PCIe reads,
with multiple writes and shifts on the final word to avoid writing
beyond the end of the kernel buffer for 'odd' length transfers.
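
For reference, a minimal sketch of that style of copy (helper name
hypothetical; assumes readq() is usable - 32-bit builds would need
linux/io-64-nonatomic-lo-hi.h - and that reading a few bytes past
'len' from the BAR is harmless):

#include <linux/io.h>
#include <linux/types.h>

/*
 * Hypothetical sketch: copy 'len' bytes from an ioremap()ed PCIe BAR
 * using only 64-bit reads, so each read TLP carries 8 bytes.  The
 * final partial word is still fetched with a full readq() (the device
 * has to tolerate the over-read), but it is stored byte by byte with
 * shifts so nothing is written past the end of the kernel buffer.
 */
static void pio_copy_fromio_64(void *dst, const void __iomem *src, size_t len)
{
	u64 *d = dst;			/* x86: unaligned stores are fine */

	for (; len >= 8; len -= 8, src += 8)
		*d++ = readq(src);

	if (len) {
		u64 last = readq(src);	/* one final 8-byte TLP */
		u8 *b = (u8 *)d;

		while (len--) {
			*b++ = last & 0xff;	/* little-endian tail */
			last >>= 8;
		}
	}
}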

Most of our PIO copies are actually direct to/from userspace.
While copy_to/from_user() will work on PCIe memory, it is 'rep movsb'.
We also mmap() the PCIe space into process memory - and have to be
careful not to use memcpy() in userspace.
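
(For reference, the usual way to get such a mapping is via the sysfs
resource files; the device address and BAR size below are made up:)

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define BAR0_SIZE 0x10000	/* hypothetical BAR0 size */

/*
 * Map BAR0 of a (hypothetical) device.  resource0 gives an uncached
 * mapping; resource0_wc would give a write-combining one.
 */
static volatile void *map_bar0(void)
{
	void *p;
	int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);

	if (fd < 0)
		return NULL;
	p = mmap(NULL, BAR0_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return p == MAP_FAILED ? NULL : p;
}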

On suitable systems userspace can use 256-bit AVX loads to get wide
reads. That is much harder and more expensive in the kernel, where the
SIMD state has to be saved and restored (kernel_fpu_begin()/
kernel_fpu_end()) around any AVX use.
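
A userspace sketch of the wide-read idea (function name hypothetical;
assumes a 32-byte aligned uncached mapping, and that the compiler
emits each load as a single 256-bit access):

#include <immintrin.h>
#include <stddef.h>

/*
 * Hypothetical sketch: copy from a mmap()ed PCIe BAR using 256-bit
 * AVX loads, so each read TLP can carry 32 bytes instead of 1.
 * 'src' must be 32-byte aligned and 'len' a multiple of 32 here.
 */
static void bar_copy_avx(void *dst, const void *src, size_t len)
{
	const __m256i *s = (const __m256i *)src;
	__m256i *d = dst;

	for (; len >= 32; len -= 32) {
		__m256i v = _mm256_load_si256(s++);	/* 32-byte read */
		_mm256_storeu_si256(d++, v);
	}
}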

In practice most of the bulk data transfers are requested by the PCIe slave.
But there are times when PIO ones are needed, and 64-bit transfers are
8 times faster than 8-bit ones.
This is all made more significant because it takes our FPGA about 500ns
to complete a single word PCIe read: at that latency byte reads top out
around 2MB/s, while 64-bit reads manage about 16MB/s.

David
