Re: [PATCH 1/2] uapi: Refactor __GENMASK() for speed-up
From: Yury Norov
Date: Thu Feb 13 2025 - 14:54:37 EST
+ Andrew, Matthias
On Wed, Feb 12, 2025 at 10:01:12PM +0800, I Hsin Cheng wrote:
> On Tue, Feb 11, 2025 at 01:44:18PM -0500, Yury Norov wrote:
> > On Tue, Feb 11, 2025 at 01:39:34PM -0500, Yury Norov wrote:
> > > On Wed, Feb 12, 2025 at 12:24:11AM +0800, I Hsin Cheng wrote:
> > > > The calculation of "((~_UL(0)) - (_UL(1) << (l)) + 1)" is to generate a
> > > > bitmask with "l" trailing zeroes, which is equivalent to
> > > > "(~_UL(0) << (l))".
> > >
> > > I used to think that GENMASK() is a compile-time macro. __GENMASK() is
> > > not, but it has very limited usage through the kernel, all in the uapi.
> > >
> > > > Refactor the calculation so the number of arithmetic instruction will be
> > > > reduced from 3 to 1.
> > >
> > > I'd like to look at it! Please share disassembly.
> > >
> > > > Signed-off-by: I Hsin Cheng <richard120310@xxxxxxxxx>
> > > > ---
> > > > Test is done to show the speed-up we can get from reducing the number of
> > > > instruction. The test machine runs with 6.9.0-0-generic kernel on x86_64
> > > > architecture with processor AMD Ryzen 7 5700X3D 8-Core Processor.
> > >
> > > So you CC arm maintainers and provide tests against x86_64. Are your
> > > improvements consistent for arm, power and other arches?
> > >
> > > Can you run bloat-o-meter against your change?
> >
> > Ah, sorry, overlooked you bloat-o-meter results in cover letter.
> > Anyways, can you provide it for each patch individually?
>
> Oh ok, let me paste them here first, I'll attach them along with v2 as
> well.
>
> In the section below, vmlinux_old uses old version of GENMASK() and
> GENMASK_ULL(), vmlinux_p1 use new version of GENMASK() and old version
> of GENMASK_ULL(), vmlinux_p2 use new version of GENMASK() and new
> version of GENMASK_ULL().
>
> $ ./scripts/bloat-o-meter vmlinux_old vmlinux_p1
> add/remove: 0/2 grow/shrink: 46/505 up/down: 464/-1717 (-1253)
So, I ran the build myself and I confirm that reverting c32ee3d9abd284b4
(which this patch actually does) helps to save over 1k of .text. In my
case it's 1269 bytes on x86_64 + defconfig for GENMASK() only.
I looked at some functions affected by this revert, and I found that
they call for_each_*_cpu() macros. This eventually boils down to bitmap
functions like this:
static __always_inline
unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
unsigned long offset)
{
if (small_const_nbits(size)) {
unsigned long val;
if (unlikely(offset >= size))
return size;
val = *addr & GENMASK(size - 1, offset);
return val ? __ffs(val) : size;
}
return _find_next_bit(addr, size, offset);
}
The 'size' here is NR_CPUS, which on my machine is 64.
GENMASK is used with non-constant parameters, but it's OK because
from compiler's point of view, the GENMASK_INPUT_CHECK() which is
just:
BUILD_BUG_ON_ZERO(const_true((l) > (h)))
It is passed as the 'offset > size', and it's is indeed a constant
expression at that point because it's explicitly tested before.
Is this for_each_cpu() thing the only case - I don't know, but it's
enough to consider reverting back to the original version.
I Hsing,
At first, thank you for catching this up.
I think performance testing part is trivial here, so let's focus on
code generation and consistency.
Can you please run your build against few recent GCC and clang versions?
Can you also build the kernel for ARM64? You don't need to run it on
real hardware (but maybe on VM). I just need to make sure that the
result is consistent for all important arches.
Matthias didn't mention how did he build his kernel. Did clang emit
that warning against W=0, 1 or higher, and which code triggered the
warning? Maybe clang already fixed it?
Can you please check how would it work if NR_CPUS is set to be say 1024?
This way the small_const_nbits optimization will be disabled.
If you will find that clang still emits warnings at lower than W=2,
can you please resend this patches together with a patch that
suppresses clang warning?
Can you also please attach one or two examples of code generation for
real functions, not an artificial one as you did before. And maybe a
link to goldbolt?
For the next iteration can you please make sure that you refer your
series as a revert of Matthias's patch?
Thank you for discovering this. I realize that I'm asking you to do
some extra work, but we all need to be 100% sure that it's not a fluke
before reverting the c32ee3d9abd284b4 because it potentially leads to
suppressing another clang warning.
Thanks,
Yury