Re: [PATCH] arm: Remove IO memcpy for Big-Endian

From: Julian Vetter
Date: Wed Dec 04 2024 - 02:31:06 EST




On 03.12.24 11:08, Arnd Bergmann wrote:
On Tue, Dec 3, 2024, at 09:38, Julian Vetter wrote:
From: Julian Vetter <julian@xxxxxxxxxxxxxxxx>

Recently a new IO memcpy was added in libs/iomem_copy.c. So, remove the
byte-wise IO memcpy operations used in ARM big endian builds and fall
back to the new generic implementation. It will be slightly faster,
because it uses machine word accesses if the memory is aligned and falls
back to byte-wise accesses if its not.

Signed-off-by: Julian Vetter <julian@xxxxxxxxxxxxxxxx>
---
arch/arm/include/asm/io.h | 11 ----------
arch/arm/kernel/io.c | 46 ---------------------------------------
2 files changed, 57 deletions(-)


Hello Arnd,
first, sorry, that I messed up my sender email while sending the patch.

I'm not sure if this is safe on all platforms. Big-endian arm
is extremely rare in practice, and in comes in multiple variants
that behave slightly differently:

- On modern ARMv7 the byte-invariant big-endian "BE8" mode
generally well-behaved and works as one would expect it to.

- There is one ARMv5 "BE32" based platform, the ixp4xx, which
works differently, and this in turn allows multiple configurations
for its buses where a byte-swap is performed in the PCI
controller.

When the little-endian I/O string operations got optimized to
calling the word-based helpers in commit 7ddfe625cbc1 ("ARM:
optimize memset_io()/memcpy_fromio()/memcpy_toio()"), Russell
intentionally left the big-endian versions alone, which I think
was done for the case of PCI on ixp4xx, but could have been
out of general caution.


Wow, thank you for the explanation!

Before we apply your patch, I think the minimum would be to
have Linus Walleij try it out on an an ixp4xx with a driver
that uses these functions. Maybe Russell remembers the exact
constraints that led to using byte access for big-endian
mmio string operations, and whether the new lib/iomem_copy.c
version causes problems.

I also looked at the little-endian arm32 version, and
it seems that here the generic code would work fine, but
the custom variant is likely much faster when both the
source and destination buffers are aligned, as it can
do larger MMIO transactions using ldm/stm instructions,
though the generic version would be a bit better if the
in-memory buffer is unaligned.
We could get the best of both by implementing optimized
arm32 versions __iowrite32_copy()/__ioread32_copy and
using those in the generic memcpy_{from,to}io.


ok, then I will wait for now, to see if it works with the current version. If not, I will have a look at using the __iowrite32_copy/__ioread32_copy functions.

Arnd