Re: [linus:master] [fortify] 239d87327d: vm-scalability.throughput 17.3% improvement
From: Mateusz Guzik
Date: Thu Jan 09 2025 - 17:02:09 EST
On Thu, Jan 9, 2025 at 10:12 PM Kees Cook <kees@xxxxxxxxxx> wrote:
>
> On Thu, Jan 09, 2025 at 09:52:31PM +0100, Mateusz Guzik wrote:
> > On Thu, Jan 09, 2025 at 12:38:04PM -0800, Kees Cook wrote:
> > > On Thu, Jan 09, 2025 at 08:51:44AM -0800, Kees Cook wrote:
> > > > On Thu, Jan 09, 2025 at 02:57:58PM +0800, kernel test robot wrote:
> > > > > kernel test robot noticed a 17.3% improvement of vm-scalability.throughput on:
> > > > >
> > > > > commit: 239d87327dcd361b0098038995f8908f3296864f ("fortify: Hide run-time copy size from value range tracking")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > Well that is unexpected. There should be no binary output difference
> > > > with that patch. I will investigate...
> > >
> > > It looks like hiding the size value from GCC has the side-effect of
> > > breaking memcpy inlining in many places. I would expect this to make
> > > things _slower_, though. O_o
>
> I think it's disabling value-range-based inlining, I'm trying to
> construct some tests...
>
> > This depends on what was emitted in place and what CPU is executing it.
> >
> > Notably if gcc elected to emit rep movs{q,b}, the CPU at hand does
> > not have FSRM and the size is low enough, then such code can indeed be
> > slower than suffering a call to memcpy (which does not issue rep mov).
> >
> > I had seen gcc go to great pains to align a buffer for rep movsq even
> > when it was guaranteed to not be necessary for example.
> >
> > Can you disasm an example affected spot?
>
> I tried to find the most self-contained example I could, and I ended up
> with:
>
> static void ipv6_rpl_addr_decompress(struct in6_addr *dst,
> const struct in6_addr *daddr,
> const void *post, unsigned char pfx)
> {
> memcpy(dst, daddr, pfx);
> memcpy(&dst->s6_addr[pfx], post, IPV6_PFXTAIL_LEN(pfx));
> }
>
Well I did what I should have from the get go and took the liberty of
looking at the profile.
%stddev %change %stddev
\ | \
[snip]
0.00 +6.5 6.54 ± 66%
perf-profile.calltrace.cycles-pp.memcpy_orig.copy_page_from_iter_atomic.generic_perform_write.shmem_file_write_iter.do_iter_readv_writev
Disassembling copy_page_from_iter_atomic *prior* to the change indeed
reveals rep movsq as I suspected (second to last instruction):
<+919>: mov (%rax),%rdx
<+922>: lea 0x8(%rsi),%rdi
<+926>: and $0xfffffffffffffff8,%rdi
<+930>: mov %rdx,(%rsi)
<+933>: mov %r8d,%edx
<+936>: mov -0x8(%rax,%rdx,1),%rcx
<+941>: mov %rcx,-0x8(%rsi,%rdx,1)
<+946>: sub %rdi,%rsi
<+949>: mov %rsi,%rdx
<+952>: sub %rsi,%rax
<+955>: lea (%r8,%rdx,1),%ecx
<+959>: mov %rax,%rsi
<+962>: shr $0x3,%ecx
<+965>: rep movsq %ds:(%rsi),%es:(%rdi)
<+968>: jmp 0xffffffff819157c5 <copy_page_from_iter_atomic+869>
With the reported patch this is a call to memcpy.
This is the guy:
static __always_inline
size_t memcpy_from_iter(void *iter_from, size_t progress,
size_t len, void *to, void *priv2)
{
memcpy(to + progress, iter_from, len);
return 0;
}
I don't know what the specific bench is doing, I'm assuming passed
values were low enough that the overhead of spinning up rep movsq took
over.
gcc should retain the ability to optimize this, except it needs to be
convinced to not emit rep movsq for variable sizes (and instead call
memcpy).
For user memory access there is a bunch of hackery to inline rep mov
for CPUs where it does not suck for small sizes (see
rep_movs_alternative). Someone(tm) should port it over to memcpy
handling as well.
The expected state would be that for sizes known at compilation time
it rolls with movs as needed (no rep), otherwise emits the magic rep
movs/memcpy invocation, except for when it would be tail-called.
In your ipv6_rpl_addr_decompress example gcc went a little crazy,
which I mentioned does happen. However, most of the time it is doing a
good job instead and a now generated call to memcpy should make things
slower. I presume these spots are merely not being benchmarked here.
Note that going from inline movs (no rep) to a call to memcpy which
does movs (again no rep) comes with a "mere" function call overhead,
which is a different beast than spinning up rep movs on CPUs without
FSRM.
That is to say, contrary to the report above, I believe the change is
in fact a regression which just so happened to make things faster for
a specific case. The unintended speed up can be achieved without
regressing anything else by taming the craziness.
Reading the commit log I don't know what the way out is, perhaps you
could rope in some gcc folk to ask? Screwing with optimization to not
see a warning is definitely not the best option.
--
Mateusz Guzik <mjguzik gmail.com>