Re: [PATCH v2] bitmap: Switch from inline to __always_inline

From: Brian Norris
Date: Fri Jul 12 2024 - 16:42:34 EST


Hi Yury,

On Fri, Jul 12, 2024 at 10:02:55AM -0700, Yury Norov wrote:
> On Thu, Jul 11, 2024 at 09:37:11AM -0700, Brian Norris wrote:
> > Based on recent discussions, it seems like a good idea to inline the
> > bitmap functions which make up cpumask*() implementations, as well as a
> > few other bitmap users, to ensure the inlining doesn't break in the
> > future and produce the same problems, as well as to give the best chance
> > that the intended small_const_nbits() optimizations carry through.
>
> small_const_nbits() optimization always carries through. As far as I
> understand this things, the problem is that inliner may make a (wrong)
> no-go decision before unwinding the small_const_nbits() part.
>
> small_const_nbits() always leads to eliminating either inline or outline
> code, but inliner seemingly doesn't understand it. This leads to having
> those tiny functions uninlined.

I think we're more or less saying the same thing -- if for whatever
reason the compiler doesn't inline, then these will never be "small
consts", and therefore the machinery effectively "doesn't carry
through".

Per previous discussions, there are many reasons a compiler may ignore
the 'inline' hint.

> But I'm not sure about that and don't know how to check what happens
> under the compilers' hood. Can compiler gurus please clarify?
>
> > This change has been previously proposed in the past as:
> >
> > Subject: [PATCH 1/3] bitmap: switch from inline to __always_inline
> > https://lore.kernel.org/all/20221027043810.350460-2-yury.norov@xxxxxxxxx/
> >
> > In the end, this looks like a rebase of Yury's work, although
> > technically it's a rewrite.
>
> I like it more, when it's split onto several patches. We already have
> my bitmap.h rework and your cpumask.h one with their own reasonable
> motivations. Can you keep that patch structure please? Then, we can add

I'm not quite sure what you mean. I imitated the patch structure of your
patch 1 that I linked, which had the following file-diff list:

include/linux/bitmap.h | 46 ++++++-------
include/linux/cpumask.h | 144 +++++++++++++++++++--------------------
include/linux/find.h | 40 +++++------
include/linux/nodemask.h | 86 +++++++++++------------

Are you suggesting I should split that into 2 (one for cpumask and one
for the rest)?

> separate patches for find.h and nodemask.h. If rebasing the old bitmap.h
> patch isn't clean and takes some effort, feel free to add your
> co-developed-by.

I don't really care who gets authorship, but I did manually rewrite it,
since there were enough conflicts, and it's basically scriptable. Feel
free to suggest which Author/S-o-b/Co-developed combo makes sense for
that, and I can adopt it in v3 :)

> > According to bloat-o-meter, my vmlinux decreased in size by a total of
> > 1538 bytes (-0.01%) with this change.
> >
> > [0] CONFIG_HOTPLUG_PARALLEL=y ('select'ed for x86 as of [1]) and
> > CONFIG_GCOV_PROFILE_ALL.
> >
> > [1] commit 0c7ffa32dbd6 ("x86/smpboot/64: Implement
> > arch_cpuhp_init_parallel_bringup() and enable it")
> >
> > Cc: Yury Norov <yury.norov@xxxxxxxxx>
> > Signed-off-by: Brian Norris <briannorris@xxxxxxxxxxxx>
>
> So... This whole optimization is only worth the reviewer's time if we
> can prove it's sustainable across different compilers and configurations.
>
> Just saying that under some particular config it works, brings little
> value for those who are not interested in that config. Moreover, if
> one enables GCOV, he likely doesn't care much about the .text size.
> And for those using release configs, it only brings uncertainty.

I think I partially disagree. If it's neutral for one compiler but
helpful for the other, that's still a win. But otherwise, I agree we
should have some accounting of diversity -- not just a single test.

> Let's test it broader? We've got 2 main compilers - gcc and llvm, and
> at least two configurations that may be relevant: defconfig and your
> HOTPLUG_PARALLEL + GCOV_PROFILE_ALL. And it would be nice to prove
> that the optimization is sustainable across compiler versions.
>
> I have the following table in mind:
>
> defconfig your conf
> GCC 9 | -1900 | |
> CLANG 13 | | |
> GCC 13 | | |
> CLANG 18 | -100 | -1538 |
>
> The defconfig nubmers above are from my past experiments. And you can
> add more configs/compilers, of course.

Perhaps I wasn't too clear, but I was actually testing a few things:

* my ChromeOS-y build, with gcov and clang -- to ensure the section
mismatch warning/error disappears
* a pure mainline / semi-defaults build (which happened to use my host
distro's gcc 13), to do size comparisons. I also happened to enable
gcov in the currently-posted experiments.

But agreed, I can get a larger matrix with a bit more specifity in my
commit log. TBD on the ease of getting a Clang 13 or GCC 9 toolchain,
but I think I can convince my distro to provide them.

> > --- a/include/linux/bitmap.h
> > +++ b/include/linux/bitmap.h
> > @@ -203,7 +203,7 @@ unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
> > * the bit offset of all zero areas this function finds is multiples of that
> > * power of 2. A @align_mask of 0 means no alignment is required.
> > */
> > -static inline unsigned long
> > +static __always_inline unsigned long
> > bitmap_find_next_zero_area(unsigned long *map,
> > unsigned long size,
> > unsigned long start,
>
> Let's split them such that return type would be at the same line as
> the function name. It would help to distinguish function declaration
> from invocations in grep output:

Ack. I thought I was being consistent with existing cpumask.h style, but
I got this set wrong. Will fix.

Brian