Re: [PATCH v2 03/13] sh: remove unaligned access for sh4a

From: Arnd Bergmann
Date: Sat May 15 2021 - 16:11:48 EST


On Sat, May 15, 2021 at 5:36 PM John Paul Adrian Glaubitz
<glaubitz@xxxxxxxxxxxxxxxxxxx> wrote:
> On 5/14/21 2:22 PM, Arnd Bergmann wrote:
> >> My Renesas SH4-Boards actually run an sh4a-Kernel, not an sh4-Kernel:
> >>
> >> root@tirpitz:~> uname -a
> >> Linux tirpitz 5.11.0-rc4-00012-g10c03c5bf422 #161 PREEMPT Mon Jan 18 21:10:17 CET 2021 sh4a GNU/Linux
> >> root@tirpitz:~>
> >>
> >> So, if this change reduces performance on sh4a, I would rather not merge it.
> >
> > It only makes a difference in very specific scenarios in which unaligned
> > accesses are done in a fast path, e.g. when forwarding network packet
> > at a high rate on a big-endian kernel (little-endian kernels wouldn't run into
> > this on IP headers). If you have a use case for this machine on which the
> > you can show a performance regression, I can add a patch on top to put
> > the optimized sh4a get_unaligned_le32() back. Dropping this patch
> > altogether would make the series much more complex because most of
> > the associated code gets removed in the end.
>
> Hmm, okay. But why does code which sits below arch/sh have to be removed anyway?
>
> I don't fully understand why it poses any maintenance burden/

What I'm removing is the part that lets architectures override the
generic version.

> > As I mentioned, supporting "movua" in the compiler likely has a much
> > larger impact on performance, as it would also help in user space, and
> > it should improve the networking case on little-endian kernels by replacing
> > the four separate byte loads/shift pairs with a movua plus a byteswap.
>
> The problem is that - at least in Debian - we use the sh4 baseline while the kernel
> supports both sh4 and sh4a, so we can't use any of these instructions in userland at
> the moment.

I tried building an sh7785lcr_defconfig with and without the patch,
and found that
the only affected files are:

- in-kernel nfs client
- crc32c/sha1/sha256 hash functions
- device probing for libata, scsi-core, scsi-disk, hid, r8168
(should not matter after boot)
- msdos partition parsing

Any nfs client performance difference is probably not even measurable even
at gigabit ethernet speed.
I see that the hash functions are notably different, but I don't know if the
output from the new generic code is actually better or worse than the
original. If you do think this is important, please try the version from

https://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic.git
unaligned-sh4a

against the version without the last change in that series. If you can find
a relevant test case that exercises it, you may want to add a custom
implementation of the hash functions as well.

Arnd