Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement

From: Kees Cook
Date: Thu Jan 09 2025 - 16:12:52 EST


On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > >
> > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > Well that is unexpected. There should be no binary output difference
> > > with that patch. I will investigate...
> >
> > It looks like hiding the size value from GCC has the side-effect of
> > breaking memcpy inlining in many places. I would expect this to make
> > things _slower_, though. O_o

I think it's disabling value-range-based inlining, I'm trying to
construct some tests...

> This depends on what was emitted in place and what CPU is executing it.
>
> Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> not have FSRM and the size is low enough, then such code can indeed be
> slower than suffering a call to memcpy (which does not issue rep mov).
>
> I had seen gcc go to great pains to align a buffer for rep movsq even
> when it was guaranteed to not be necessary for example.
>
> Can you disasm an example affected spot?

I tried to find the most self-contained example I could, and I ended up
with:

static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
const struct in6_addr *daddr,
const void *post, unsigned char pfx)
{
memcpy(dst, daddr, pfx);
memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
}

Before 239d87327dcd ("fortify: Hide run-time copy size from value range
tracking"), the assembler is:

ffffffff8209f0e0 <ipv6_rpl_addr_decompress>:
ffffffff8209f0e0: e8 5b 62 fe fe call ffffffff81085340 <__fentry__>
ffffffff8209f0e5: 0f b6 c1 movzbl %cl,%eax
ffffffff8209f0e8: 49 89 d0 mov %rdx,%r8
ffffffff8209f0eb: 83 f8 08 cmp $0x8,%eax
ffffffff8209f0ee: 73 24 jae ffffffff8209f114 <ipv6_rpl_addr_decompress+0x34>
ffffffff8209f0f0: a8 04 test $0x4,%al
ffffffff8209f0f2: 75 64 jne ffffffff8209f158 <ipv6_rpl_addr_decompress+0x78>
ffffffff8209f0f4: 85 c0 test %eax,%eax
ffffffff8209f0f6: 74 09 je ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f0f8: 0f b6 16 movzbl (%rsi),%edx
ffffffff8209f0fb: 88 17 mov %dl,(%rdi)
ffffffff8209f0fd: a8 02 test $0x2,%al
ffffffff8209f0ff: 75 65 jne ffffffff8209f166 <ipv6_rpl_addr_decompress+0x86>
ffffffff8209f101: ba 10 00 00 00 mov $0x10,%edx
ffffffff8209f106: 48 01 c7 add %rax,%rdi
ffffffff8209f109: 4c 89 c6 mov %r8,%rsi
ffffffff8209f10c: 48 29 c2 sub %rax,%rdx
ffffffff8209f10f: e9 bc 33 21 00 jmp ffffffff822b24d0 <__memcpy>
ffffffff8209f114: 48 8b 16 mov (%rsi),%rdx
ffffffff8209f117: 4c 8d 4f 08 lea 0x8(%rdi),%r9
ffffffff8209f11b: 49 83 e1 f8 and $0xfffffffffffffff8,%r9
ffffffff8209f11f: 48 89 17 mov %rdx,(%rdi)
ffffffff8209f122: 48 8b 54 06 f8 mov -0x8(%rsi,%rax,1),%rdx
ffffffff8209f127: 48 89 54 07 f8 mov %rdx,-0x8(%rdi,%rax,1)
ffffffff8209f12c: 48 89 fa mov %rdi,%rdx
ffffffff8209f12f: 4c 29 ca sub %r9,%rdx
ffffffff8209f132: 48 29 d6 sub %rdx,%rsi
ffffffff8209f135: 01 c2 add %eax,%edx
ffffffff8209f137: 83 e2 f8 and $0xfffffff8,%edx
ffffffff8209f13a: 83 fa 08 cmp $0x8,%edx
ffffffff8209f13d: 72 c2 jb ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f13f: 83 e2 f8 and $0xfffffff8,%edx
ffffffff8209f142: 31 c9 xor %ecx,%ecx
ffffffff8209f144: 41 89 ca mov %ecx,%r10d
ffffffff8209f147: 83 c1 08 add $0x8,%ecx
ffffffff8209f14a: 4e 8b 1c 16 mov (%rsi,%r10,1),%r11
ffffffff8209f14e: 4f 89 1c 11 mov %r11,(%r9,%r10,1)
ffffffff8209f152: 39 d1 cmp %edx,%ecx
ffffffff8209f154: 72 ee jb ffffffff8209f144 <ipv6_rpl_addr_decompress+0x64>
ffffffff8209f156: eb a9 jmp ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f158: 8b 16 mov (%rsi),%edx
ffffffff8209f15a: 89 17 mov %edx,(%rdi)
ffffffff8209f15c: 8b 54 06 fc mov -0x4(%rsi,%rax,1),%edx
ffffffff8209f160: 89 54 07 fc mov %edx,-0x4(%rdi,%rax,1)
ffffffff8209f164: eb 9b jmp ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>
ffffffff8209f166: 0f b7 54 06 fe movzwl -0x2(%rsi,%rax,1),%edx
ffffffff8209f16b: 66 89 54 07 fe mov %dx,-0x2(%rdi,%rax,1)
ffffffff8209f170: eb 8f jmp ffffffff8209f101 <ipv6_rpl_addr_decompress+0x21>

With the size hidden, it becomes:

ffffffff82096260 <ipv6_rpl_addr_decompress>:
ffffffff82096260: e8 db e0 fe fe call ffffffff81084340 <__fentry__>
ffffffff82096265: 55 push %rbp
ffffffff82096266: 0f b6 e9 movzbl %cl,%ebp
ffffffff82096269: 53 push %rbx
ffffffff8209626a: 48 89 d3 mov %rdx,%rbx
ffffffff8209626d: 48 89 ea mov %rbp,%rdx
ffffffff82096270: e8 9b 0a 21 00 call ffffffff822a6d10 <__memcpy>
ffffffff82096275: ba 10 00 00 00 mov $0x10,%edx
ffffffff8209627a: 48 89 de mov %rbx,%rsi
ffffffff8209627d: 5b pop %rbx
ffffffff8209627e: 48 89 c7 mov %rax,%rdi
ffffffff82096281: 48 29 ea sub %rbp,%rdx
ffffffff82096284: 48 01 ef add %rbp,%rdi
ffffffff82096287: 5d pop %rbp
ffffffff82096288: e9 83 0a 21 00 jmp ffffffff822a6d10 <__memcpy>
ffffffff8209628d: 0f 1f 00 nopl (%rax)

In the former, it looks like it is calculating how many 8, 4, and single
byte copy loops to perform for the first memcpy, since it knows the value
range must be [0..255]. In both cases the second memcpy is tail-called.

> Gcc has a bunch of magic switches to tell it what to emit in line, the
> thing to do is to convince it to roll with a bunch of mov (not rep mov)
> for sizes small enough(tm). What constitutes small enough depends on the
> uarch.

I found -mmemcpy-strategy:
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mmemcpy-strategy_003dstrategy

But I don't see where to find the algs, and it seems like the above asm
is being produced when GCC thinks a value is in a certain range (rather
than compile-time known), which would make sense as far as what the
commit did: removed visibility into value ranges.

-Kees

--
Kees Cook