Re: [PATCH v2] bitmap: Switch from inline to __always_inline

From: Yury Norov
Date: Fri Jul 12 2024 - 13:10:09 EST


Hi Brian,

Thanks for taking over this!

+ Kees Cook for GCC

On Thu, Jul 11, 2024 at 09:37:11AM -0700, Brian Norris wrote:
> On recent (v6.6+) builds with Clang (based on Clang 18.0.0) and certain
> configurations [0], I'm finding that (lack of) inlining decisions may
> lead to section mismatch warnings like the following:
>
> WARNING: modpost: vmlinux.o: section mismatch in reference:
> cpumask_andnot (section: .text) ->
> cpuhp_bringup_cpus_parallel.tmp_mask (section: .init.data) ERROR:
> modpost: Section mismatches detected.
>
> or more confusingly:
>
> WARNING: modpost: vmlinux: section mismatch in reference:
> cpumask_andnot+0x5f (section: .text) -> efi_systab_phys (section:
> .init.data)
>
> The first warning makes a little sense, because
> cpuhp_bringup_cpus_parallel() (an __init function) calls
> cpumask_andnot() on tmp_mask (an __initdata symbol). If the compiler
> doesn't inline cpumask_andnot(), this may appear like a mismatch.
>
> The second warning makes less sense, but might be because efi_systab_phys
> and cpuhp_bringup_cpus_parallel.tmp_mask are laid out near each other,
> and the latter isn't a proper C symbol definition.
>
> In any case, it seems a reasonable solution to suggest more strongly to
> the compiler that these cpumask macros *must* be inlined, as 'inline' is
> just a recommendation.
>
> Based on recent discussions, it seems like a good idea to inline the
> bitmap functions which make up cpumask*() implementations, as well as a
> few other bitmap users, to ensure the inlining doesn't break in the
> future and produce the same problems, as well as to give the best chance
> that the intended small_const_nbits() optimizations carry through.

small_const_nbits() optimization always carries through. As far as I
understand this things, the problem is that inliner may make a (wrong)
no-go decision before unwinding the small_const_nbits() part.

small_const_nbits() always leads to eliminating either inline or outline
code, but inliner seemingly doesn't understand it. This leads to having
those tiny functions uninlined.

But I'm not sure about that and don't know how to check what happens
under the compilers' hood. Can compiler gurus please clarify?

> This change has been previously proposed in the past as:
>
> Subject: [PATCH 1/3] bitmap: switch from inline to __always_inline
> https://lore.kernel.org/all/20221027043810.350460-2-yury.norov@xxxxxxxxx/
>
> In the end, this looks like a rebase of Yury's work, although
> technically it's a rewrite.

I like it more, when it's split onto several patches. We already have
my bitmap.h rework and your cpumask.h one with their own reasonable
motivations. Can you keep that patch structure please? Then, we can add
separate patches for find.h and nodemask.h. If rebasing the old bitmap.h
patch isn't clean and takes some effort, feel free to add your
co-developed-by.

> According to bloat-o-meter, my vmlinux decreased in size by a total of
> 1538 bytes (-0.01%) with this change.
>
> [0] CONFIG_HOTPLUG_PARALLEL=y ('select'ed for x86 as of [1]) and
> CONFIG_GCOV_PROFILE_ALL.
>
> [1] commit 0c7ffa32dbd6 ("x86/smpboot/64: Implement
> arch_cpuhp_init_parallel_bringup() and enable it")
>
> Cc: Yury Norov <yury.norov@xxxxxxxxx>
> Signed-off-by: Brian Norris <briannorris@xxxxxxxxxxxx>

So... This whole optimization is only worth the reviewer's time if we
can prove it's sustainable across different compilers and configurations.

Just saying that under some particular config it works, brings little
value for those who are not interested in that config. Moreover, if
one enables GCOV, he likely doesn't care much about the .text size.
And for those using release configs, it only brings uncertainty.

Let's test it broader? We've got 2 main compilers - gcc and llvm, and
at least two configurations that may be relevant: defconfig and your
HOTPLUG_PARALLEL + GCOV_PROFILE_ALL. And it would be nice to prove
that the optimization is sustainable across compiler versions.

I have the following table in mind:

defconfig your conf
GCC 9 | -1900 | |
CLANG 13 | | |
GCC 13 | | |
CLANG 18 | -100 | -1538 |

The defconfig nubmers above are from my past experiments. And you can
add more configs/compilers, of course.

> ---
>
> Changes in v2:
> - rebase, update cpumask.h based on recent additions and removals
> - also convert bitmap.h, find.h, nodemask.h
>
> include/linux/bitmap.h | 122 ++++++++++++----------
> include/linux/cpumask.h | 216 +++++++++++++++++++++------------------
> include/linux/find.h | 50 ++++-----
> include/linux/nodemask.h | 86 ++++++++--------
> 4 files changed, 249 insertions(+), 225 deletions(-)
>
> diff --git a/include/linux/bitmap.h b/include/linux/bitmap.h
> index 8c4768c44a01..7f44b4428d77 100644
> --- a/include/linux/bitmap.h
> +++ b/include/linux/bitmap.h
> @@ -203,7 +203,7 @@ unsigned long bitmap_find_next_zero_area_off(unsigned long *map,
> * the bit offset of all zero areas this function finds is multiples of that
> * power of 2. A @align_mask of 0 means no alignment is required.
> */
> -static inline unsigned long
> +static __always_inline unsigned long
> bitmap_find_next_zero_area(unsigned long *map,
> unsigned long size,
> unsigned long start,

Let's split them such that return type would be at the same line as
the function name. It would help to distinguish function declaration
from invocations in grep output:

static __always_inline
unsigned long bitmap_find_next_zero_area(unsigned long *map,
unsigned long size,
unsigned long start,

Thanks,
Yury