Re: [PATCH 0/3] lib/string: optimized mem* functions

From: Matteo Croce
Date: Wed Jun 30 2021 - 20:30:15 EST


On Sun, 27 Jun 2021 02:19:59 +0200
Matteo Croce <mcroce@xxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, Jun 25, 2021 at 7:45 PM Nick Desaulniers
> <ndesaulniers@xxxxxxxxxx> wrote:
> >
> > On Thu, Jun 24, 2021 at 6:02 PM Matteo Croce
> > <mcroce@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > From: Matteo Croce <mcroce@xxxxxxxxxxxxx>
> > >
> > > Rewrite the generic mem{cpy,move,set} so that memory is accessed
> > > with the widest size possible, but without doing unaligned
> > > accesses.
> > >
> > > This was originally posted as C string functions for RISC-V[1],
> > > but as there was no specific RISC-V code, it was proposed for the
> > > generic lib/string.c implementation.
> > >
> > > Tested on RISC-V and on x86_64 by undefining
> > > __HAVE_ARCH_MEM{CPY,SET,MOVE} and HAVE_EFFICIENT_UNALIGNED_ACCESS.
> > >
> > > Further testing on big endian machines will be appreciated, as I
> > > don't have such hardware at the moment.
> >
> > Hi Matteo,
> > Neat patches. Do you have you any benchmark data showing the
> > claimed improvements? Is it worthwhile to define these only when
> > CC_OPTIMIZE_FOR_PERFORMANCE/CC_OPTIMIZE_FOR_PERFORMANCE_O3 are
> > defined, not CC_OPTIMIZE_FOR_SIZE? I'd be curious to know the delta
> > in ST_SIZE of these functions otherwise.
> >
>
> I compared the current versions with the new one with bloat-o-meter,
> the kernel grows by ~400 bytes on x86_64 and RISC-V
>
> x86_64
>
> $ scripts/bloat-o-meter vmlinux.orig vmlinux
> add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
> Function old new delta
> memcpy 29 351 +322
> memset 29 117 +88
> strlcat 68 78 +10
> strlcpy 50 57 +7
> memmove 56 50 -6
> Total: Before=8556964, After=8557385, chg +0.00%
>
> RISC-V
>
> $ scripts/bloat-o-meter vmlinux.orig vmlinux
> add/remove: 0/0 grow/shrink: 4/2 up/down: 432/-36 (396)
> Function old new delta
> memcpy 36 324 +288
> memset 32 148 +116
> strlcpy 116 132 +16
> strscpy_pad 84 96 +12
> strlcat 176 164 -12
> memmove 76 52 -24
> Total: Before=1225371, After=1225767, chg +0.03%
>
> I will post benchmarks made on a RISC-V machine which can't handle
> unaligned accesses, and it will be the first user of the new
> functions.
>
> > For big endian, you ought to be able to boot test in QEMU. I think
> > you'd find out pretty quickly if any of the above had issues.
> > (Enabling KASAN is probably also a good idea for a test, too). Check
> > out
> > https://github.com/ClangBuiltLinux/boot-utils
> > for ready made images and scripts for launching various
> > architectures and endiannesses.
> >
>
> Will do!
>

I finally made the benchmarks on RISC-V.

The current byte-at-time memcpy() always copy 74 Mbyte/sec, no matter
the alignment of the buffers, while my implementation do 114 Mb/s on
two aligned buffers and 107 Mb/s when need to shift and merge words.

For memset(), the current byte-at-time implementation always writes 140
Mb/s, my word-wise implementation always copies 241 Mb/s. Both
implementations have the same performance with aligned and unaligned
buffers.

The memcpy() test is a simple loop which calls memcpy with different
offsets on a 32MB buffer:

#define NUM_PAGES (1 << (MAX_ORDER + 2))
#define PG_SIZE (PAGE_SIZE * NUM_PAGES)


page1 = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0);
page2 = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0);

src = page_to_virt(page1);
dst = page_to_virt(page2);

for (i = 0; i < sizeof(void*); i++) {
for (j = 0; j < sizeof(void*); j++) {
t0 = ktime_get();
memcpy(dst + i, src + j, PG_SIZE - max(i, j));
t1 = ktime_get();
printk("Strings selftest: memcpy(dst+%d, src+%d), distance %lu: %llu Mb/s\n",
i, j, (j - i) % sizeof(long),
PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
}
printk("\n");
}

Similarly, the memset() one:

page = alloc_contig_pages(NUM_PAGES, GFP_KERNEL, NUMA_NO_NODE, 0);
dst = page_to_virt(page);

for (i = 0; i < sizeof(void*); i++) {
t0 = ktime_get();
memset(dst + i, 0, PG_SIZE - i);
t1 = ktime_get();
printk("Strings selftest: memset(dst+%d): %llu Mb/s\n",
i, PG_SIZE * (1000000000l / 1048576l) / (t1-t0));
}

The results for the current one are:

[ 27.893931] Strings selftest: memcpy testing with size: 32 Mb
[ 28.315272] Strings selftest: memcpy(dst+0, src+0), distance 0: 75 Mb/s
[ 28.736485] Strings selftest: memcpy(dst+0, src+1), distance 1: 75 Mb/s
[ 29.156826] Strings selftest: memcpy(dst+0, src+2), distance 2: 76 Mb/s
[ 29.576199] Strings selftest: memcpy(dst+0, src+3), distance 3: 76 Mb/s
[ 29.994360] Strings selftest: memcpy(dst+0, src+4), distance 4: 76 Mb/s
[ 30.411766] Strings selftest: memcpy(dst+0, src+5), distance 5: 76 Mb/s
[ 30.828124] Strings selftest: memcpy(dst+0, src+6), distance 6: 76 Mb/s
[ 31.243514] Strings selftest: memcpy(dst+0, src+7), distance 7: 76 Mb/s

[...]

[ 52.077251] Strings selftest: memcpy(dst+7, src+0), distance 1: 74 Mb/s
[ 52.508115] Strings selftest: memcpy(dst+7, src+1), distance 2: 74 Mb/s
[ 52.939309] Strings selftest: memcpy(dst+7, src+2), distance 3: 74 Mb/s
[ 53.370493] Strings selftest: memcpy(dst+7, src+3), distance 4: 74 Mb/s
[ 53.801865] Strings selftest: memcpy(dst+7, src+4), distance 5: 74 Mb/s
[ 54.233287] Strings selftest: memcpy(dst+7, src+5), distance 6: 74 Mb/s
[ 54.664990] Strings selftest: memcpy(dst+7, src+6), distance 7: 74 Mb/s
[ 55.086996] Strings selftest: memcpy(dst+7, src+7), distance 0: 75 Mb/s

[ 55.109680] Strings selftest: memset testing with size: 32 Mb
[ 55.337873] Strings selftest: memset(dst+0): 140 Mb/s
[ 55.565905] Strings selftest: memset(dst+1): 140 Mb/s
[ 55.793987] Strings selftest: memset(dst+2): 140 Mb/s
[ 56.022140] Strings selftest: memset(dst+3): 140 Mb/s
[ 56.250259] Strings selftest: memset(dst+4): 140 Mb/s
[ 56.478283] Strings selftest: memset(dst+5): 140 Mb/s
[ 56.706296] Strings selftest: memset(dst+6): 140 Mb/s
[ 56.934335] Strings selftest: memset(dst+7): 140 Mb/s

While for the proposed one:

[ 38.843970] Strings selftest: memcpy testing with size: 32 Mb
[ 39.124047] Strings selftest: memcpy(dst+0, src+0), distance 0: 114 Mb/s
[ 39.421848] Strings selftest: memcpy(dst+0, src+1), distance 1: 107 Mb/s
[ 39.719613] Strings selftest: memcpy(dst+0, src+2), distance 2: 107 Mb/s
[ 40.017310] Strings selftest: memcpy(dst+0, src+3), distance 3: 107 Mb/s
[ 40.314939] Strings selftest: memcpy(dst+0, src+4), distance 4: 107 Mb/s
[ 40.612485] Strings selftest: memcpy(dst+0, src+5), distance 5: 107 Mb/s
[ 40.910054] Strings selftest: memcpy(dst+0, src+6), distance 6: 107 Mb/s
[ 41.207577] Strings selftest: memcpy(dst+0, src+7), distance 7: 107 Mb/s

[...]

[ 55.682517] Strings selftest: memcpy(dst+7, src+0), distance 1: 107 Mb/s
[ 55.980262] Strings selftest: memcpy(dst+7, src+1), distance 2: 107 Mb/s
[ 56.277862] Strings selftest: memcpy(dst+7, src+2), distance 3: 107 Mb/s
[ 56.575514] Strings selftest: memcpy(dst+7, src+3), distance 4: 107 Mb/s
[ 56.873142] Strings selftest: memcpy(dst+7, src+4), distance 5: 107 Mb/s
[ 57.170840] Strings selftest: memcpy(dst+7, src+5), distance 6: 107 Mb/s
[ 57.468553] Strings selftest: memcpy(dst+7, src+6), distance 7: 107 Mb/s
[ 57.748231] Strings selftest: memcpy(dst+7, src+7), distance 0: 114 Mb/s

[ 57.772721] Strings selftest: memset testing with size: 32 Mb
[ 57.905358] Strings selftest: memset(dst+0): 241 Mb/s
[ 58.037974] Strings selftest: memset(dst+1): 241 Mb/s
[ 58.170619] Strings selftest: memset(dst+2): 241 Mb/s
[ 58.303228] Strings selftest: memset(dst+3): 241 Mb/s
[ 58.435808] Strings selftest: memset(dst+4): 241 Mb/s
[ 58.568373] Strings selftest: memset(dst+5): 241 Mb/s
[ 58.700968] Strings selftest: memset(dst+6): 241 Mb/s
[ 58.833499] Strings selftest: memset(dst+7): 241 Mb/s

Anyway, I have to submit a v2 because the current one fails to build on
compilers which can't understand that byte_long is constant:

lib/string.c:52:39: error: initializer element is not constant
static const unsigned int word_mask = bytes_long - 1;
^~~~~~~~~~

Regards,
--
per aspera ad upstream