Re: [PATCH] arm: Remove IO memcpy for Big-Endian

From: Arnd Bergmann
Date: Tue Dec 03 2024 - 05:58:00 EST


On Tue, Dec 3, 2024, at 09:38, Julian Vetter wrote:
> From: Julian Vetter <julian@xxxxxxxxxxxxxxxx>
>
> Recently a new IO memcpy was added in libs/iomem_copy.c. So, remove the
> byte-wise IO memcpy operations used in ARM big endian builds and fall
> back to the new generic implementation. It will be slightly faster,
> because it uses machine word accesses if the memory is aligned and falls
> back to byte-wise accesses if its not.
>
> Signed-off-by: Julian Vetter <julian@xxxxxxxxxxxxxxxx>
> ---
> arch/arm/include/asm/io.h | 11 ----------
> arch/arm/kernel/io.c | 46 ---------------------------------------
> 2 files changed, 57 deletions(-)

I'm not sure if this is safe on all platforms. Big-endian arm
is extremely rare in practice, and in comes in multiple variants
that behave slightly differently:

- On modern ARMv7 the byte-invariant big-endian "BE8" mode
generally well-behaved and works as one would expect it to.

- There is one ARMv5 "BE32" based platform, the ixp4xx, which
works differently, and this in turn allows multiple configurations
for its buses where a byte-swap is performed in the PCI
controller.

When the little-endian I/O string operations got optimized to
calling the word-based helpers in commit 7ddfe625cbc1 ("ARM:
optimize memset_io()/memcpy_fromio()/memcpy_toio()"), Russell
intentionally left the big-endian versions alone, which I think
was done for the case of PCI on ixp4xx, but could have been
out of general caution.

Before we apply your patch, I think the minimum would be to
have Linus Walleij try it out on an an ixp4xx with a driver
that uses these functions. Maybe Russell remembers the exact
constraints that led to using byte access for big-endian
mmio string operations, and whether the new lib/iomem_copy.c
version causes problems.

I also looked at the little-endian arm32 version, and
it seems that here the generic code would work fine, but
the custom variant is likely much faster when both the
source and destination buffers are aligned, as it can
do larger MMIO transactions using ldm/stm instructions,
though the generic version would be a bit better if the
in-memory buffer is unaligned.
We could get the best of both by implementing optimized
arm32 versions __iowrite32_copy()/__ioread32_copy and
using those in the generic memcpy_{from,to}io.

Arnd