Re: [PATCH 0/1] riscv: better network performance with memcpy, uaccess
From: Akira Tsukamoto
Date: Sat Jun 05 2021 - 04:05:28 EST
On Sat, Jun 5, 2021 at 1:19 AM Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>
> On Fri, 04 Jun 2021 02:53:33 PDT (-0700), akira.tsukamoto@xxxxxxxxx wrote:
> > I am adding a cover letter to explain the history and details since
> > improvement is a combination with Gary's memcpy patch [1].
> >
> > A comparison of iperf3 benchmark results with Gary's memcpy patch and
> > my uaccess optimization patch applied. All results are from the same
> > base kernel, same rootfs and same BeagleV beta board.
> >
> > First (left) column : BeagleV 5.13-rc4 kernel [2]
> > Second column : Added Palmer's memcpy in C + my uaccess patch [3]
> > Third column : Added Gary's memcpy + my uaccess patch [4]
> >
> > --- TCP recv ---
> > 686 Mbits/sec | 700 Mbits/sec | 904 Mbits/sec
> > 683 Mbits/sec | 701 Mbits/sec | 898 Mbits/sec
> > 695 Mbits/sec | 702 Mbits/sec | 905 Mbits/sec
> >
> > --- TCP send ---
> > 383 Mbits/sec | 390 Mbits/sec | 393 Mbits/sec
> > 384 Mbits/sec | 393 Mbits/sec | 392 Mbits/sec
> >
> > --- UDP send ---
> > 307 Mbits/sec | 358 Mbits/sec | 402 Mbits/sec
> > 307 Mbits/sec | 359 Mbits/sec | 402 Mbits/sec
> >
> > --- UDP recv ---
> > 630 Mbits/sec | 799 Mbits/sec | 875 Mbits/sec
> > 730 Mbits/sec | 796 Mbits/sec | 873 Mbits/sec
> >
> >
> > The uaccess patch reduces read-after-write (RAW) pipeline stalls by
> > unrolling the loads and stores.
> > The main reason for keeping assembly inside uaccess.S is that the
> > page-fault handling in __asm_to/copy_from_user() must be done
> > manually inside the functions.
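> >
> > The shape of the idea in C (a minimal sketch only -- the actual code
> > is the assembly in uaccess.S, and copy_words_unrolled is a made-up
> > name for illustration):
> >
> > void copy_words_unrolled(unsigned long *dst, const unsigned long *src,
> >                          unsigned long words)
> > {
> >         while (words >= 4) {
> >                 /* issue several independent loads first ... */
> >                 unsigned long a = src[0];
> >                 unsigned long b = src[1];
> >                 unsigned long c = src[2];
> >                 unsigned long d = src[3];
> >
> >                 /* ... so no store stalls on the load right before it */
> >                 dst[0] = a;
> >                 dst[1] = b;
> >                 dst[2] = c;
> >                 dst[3] = d;
> >
> >                 src += 4;
> >                 dst += 4;
> >                 words -= 4;
> >         }
> >         while (words--)         /* remaining tail, one word at a time */
> >                 *dst++ = *src++;
> > }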
> >
> > The above result is the combination of Gary's memcpy, which speeds
> > things up by reducing the S-mode and M-mode switching, and my uaccess
> > patch, which reduces pipeline stalls when user space makes syscalls
> > with large data.
> >
> > We had a discussion with Palmer about improving network performance
> > on the BeagleV beta board.
> >
> > Palmer suggested using C-based string routines which check the
> > addresses for alignment, use an 8-byte aligned copy when both src and
> > dest are aligned, and otherwise fall back to the current copy
> > function, as sketched below.
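> >
> > Something like the following (my hedged reading of the suggestion in
> > C; memcpy_fastpath is a hypothetical name):
> >
> > #include <stdint.h>
> > #include <stddef.h>
> >
> > void *memcpy_fastpath(void *dest, const void *src, size_t n)
> > {
> >         unsigned char *d = dest;
> >         const unsigned char *s = src;
> >
> >         /* fast path only when both pointers are 64-bit aligned */
> >         if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(uint64_t) - 1)) == 0) {
> >                 while (n >= sizeof(uint64_t)) {
> >                         *(uint64_t *)d = *(const uint64_t *)s;
> >                         d += sizeof(uint64_t);
> >                         s += sizeof(uint64_t);
> >                         n -= sizeof(uint64_t);
> >                 }
> >         }
> >         while (n--)             /* unaligned case and the tail */
> >                 *d++ = *s++;
> >         return dest;
> > }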
> >
> > Gary's assembly version of memcpy improves performance by not using
> > unaligned accesses on 64-bit boundaries: it reads with aligned-access
> > offsets and shifts the data into place afterwards, because every
> > misaligned access is trapped and switches to OpenSBI in M-mode. The
> > main speedup comes from avoiding the S-mode (kernel) and M-mode
> > (OpenSBI) switching.
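> >
> > Roughly (a little-endian C sketch of the shift idea only -- the real
> > patch is RISC-V assembly, and copy_shifted is a made-up name):
> >
> > #include <stdint.h>
> > #include <stddef.h>
> >
> > /* Copy 'words' 64-bit words to an aligned dst from a misaligned src.
> >  * Assumes 0 < off < 8; the fully aligned case takes a separate path.
> >  * The aligned load containing src's first byte cannot cross a page
> >  * boundary, so rounding the pointer down is safe here. */
> > void copy_shifted(uint64_t *dst, const uint8_t *src, size_t words)
> > {
> >         size_t off = (uintptr_t)src & 7;        /* misalignment in bytes */
> >         const uint64_t *s = (const uint64_t *)(src - off);
> >         uint64_t lo = *s++;                     /* first aligned word */
> >
> >         while (words--) {
> >                 uint64_t hi = *s++;             /* next aligned word */
> >                 /* splice each output word from two aligned reads */
> >                 *dst++ = (lo >> (off * 8)) | (hi << (64 - off * 8));
> >                 lo = hi;
> >         }
> > }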
> >
> > Processing network packets requires many unaligned accesses for the
> > packet headers, and the header format is not something we can
> > redesign to be aligned.
> > In addition, user applications pass large packet data with send/recv()
> > and sendto/recvfrom() to reduce the number of function calls needed
> > to read and write the data.
>
> Makes sense. I'm still not opposed to moving to a C version, but it'd
> need to be a fairly complicated one. I think having a fast C memcpy
> would likely benefit a handful of architectures, as everything we're
> talking about is an algorithmic improvement that can be expressed in C.
>
> Given that the simple memcpy doesn't perform well for your workload, I'm
> fine taking the assembly version.
Thanks for merging them.
I agree that having a fast C memcpy would benefit many architectures.
I will make the patches for lib/string.c by extending your memcpy and send
them after I finish my other priorities. The current functions in lib/string.c
use a byte copy, while most Linux-capable CPUs have moved to 64 bits.
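For reference, the generic version is a plain byte loop (quoted from
memory, so please check the tree for the exact current form):

void *memcpy(void *dest, const void *src, size_t count)
{
        char *tmp = dest;
        const char *s = src;

        while (count--)
                *tmp++ = *s++;
        return dest;
}
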
Akira
>
> Thanks!
>
> >
> > Akira
> >
> > [1] https://lkml.org/lkml/2021/2/16/778
> > [2] https://github.com/mcd500/linux-jh7100/tree/starlight-sdimproved
> > [3] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-palmer-string
> > [4] https://github.com/mcd500/linux-jh7100/tree/starlight-sd-gary
> >
> > Akira Tsukamoto (1):
> > riscv: prevent pipeline stall in __asm_to/copy_from_user
> >
> > arch/riscv/lib/uaccess.S | 106 +++++++++++++++++++++++++++------------
> > 1 file changed, 73 insertions(+), 33 deletions(-)