RE: [PATCH] x86: only use ERMS for user copies for larger sizes

From: David Laight
Date: Mon Nov 26 2018 - 05:01:33 EST


From: Linus Torvalds
> Sent: 23 November 2018 16:36
>
> On Fri, Nov 23, 2018 at 2:12 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> > Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> > generates a lot of single byte TLP.
>
> I just tested it too - it turns out that the __inline_memcpy() code
> never triggers, and "memcpy_toio()" just generates a memcpy.
>
> So that code seems entirely dead.
>
> And, in fact, the codebase I looked at was the historical one, because
> I had been going back and looking at the history. The modern tree
> *does* have the "__inline_memcpy()" function I pointed at, but it's
> not actually hooked up to anything!
>
> This actually has been broken for _ages_. The breakage goes back to
> 2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
> it seems nobody really ever noticed - or thought that it was ok.

It probably was ok in 2010 - that predates ERMS copy.

> That commit claims that iomem has no special significance on x86, but
> that really really isn't true, exactly because the access size does
> matter.

I suspect that memcpy_to/fromio() should only be used for IO space
that has 'memory-like' semantics.
So it shouldn't really matter what size accesses are done.
OTOH the 'io' side is likely to be slow so you want to limit the
number of io cycles (reads in particular).
With memory-like semantics it is ok to read full words at both ends
of the buffer to avoid extra transfers.
Indeed, on PCIe, the misaligned transfer for the last 8 bytes is
probably optimal.

> And as mentioned, the generic memory copy routines are not at all
> appropriate, and that has nothing to do with ERMS. Our "do it by hand"
> memory copy routine does things like this:
>
> .Lless_16bytes:
> cmpl $8, %edx
> jb .Lless_8bytes
> /*
> * Move data from 8 bytes to 15 bytes.
> */
> movq 0*8(%rsi), %r8
> movq -1*8(%rsi, %rdx), %r9
> movq %r8, 0*8(%rdi)
> movq %r9, -1*8(%rdi, %rdx)
> retq
>
> and note how for a 8-byte copy, it will do *two* reads of the same 8
> bytes, and *two* writes of the same 8 byte destination. That's
> perfectly ok for regular memory, and it means that the code can handle
> an arbitrary 8-15 byte copy without any conditionals or loop counts,
> but it is *not* ok for iomem.

I'd say it is ok for memcpy_to/fromio() since that should only really
be used for targets with memory-like semantics - and could be documented
as such.
But doing the same transfers twice is definitely sub-optimal.

> Of course, in practice it all just happens to work in almost all
> situations (because a lot of iomem devices simply won't care), and
> manual access to iomem is basically extremely rare these days anyway,
> but it's definitely entirely and utterly broken.
>
> End result: we *used* to do this right. For the last eight years our
> "memcpy_{to,from}io()" has been entirely broken, and apparently even
> the people who noticed oddities like David, never reported it as
> breakage but instead just worked around it in drivers.
>
> Ho humm.
>
> Let me write a generic routine in lib/iomap_copy.c (which already does
> the "user specifies chunk size" cases), and hook it up for x86.
>
> David, are you using a bus analyzer or something to do your testing?
> I'll have a trial patch for you asap.

We've a TLP monitor built into our fpga image so can look at the last
few TLPs (IIRC a 32kB buffer).

Testing patches is a bit harder.
The test system isn't one I build kernels on.

Is there a 'sensible' amd64 kernel config that contains most of the drivers
a modern system might need?
It is a PITA trying to build kernels that will load on all my test systems
(since I tend to move the disks between systems).
Rebuilding the ubuntu config just takes too long and generates a ramdisk
that doesn't fit in /boot.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)