Re: [PATCH] lib/clz_ctz.c: Fix __clzdi2() and __ctzdi2() for 32-bit kernels

From: Nick Desaulniers
Date: Mon Aug 28 2023 - 16:09:21 EST


On Fri, Aug 25, 2023 at 6:08 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, 25 Aug 2023 at 17:52, Nick Desaulniers <ndesaulniers@xxxxxxxxxx> wrote:
> >
> > So 2 concerns where "I'll do it in inline asm" can pessimize codegen:
> > 1. You alluded to this, but what happens when one of these functions
> > is called with a constant?
>
> This is why our headers have a lot of __builtin_constant_p()'s in them..
>
> In this particular case, see the x86 asm/bitops.h code:
>
> #define ffs(x) (__builtin_constant_p(x) ? __builtin_ffs(x) :
> variable_ffs(x))
>
> but this is actually quite a common pattern, and it's often not about
> something like __builtin_ffs() at all.

I was a reviewer on commit fdb6649ab7c1 ("x86/asm/bitops: Use
__builtin_ctzl() to evaluate constant expressions"); I'm familiar with
the pattern.

>
> See all the other __builtin_constant_p()'s that we have in that same
> file because we basically just use different code sequences for
> constants.
>
> And that file isn't even unusual. We use it quite a lot when we care
> about code generation for some particular case.

More so my point was x86 bitops is missing
commit 2fcff790dcb4 ("powerpc: Use builtin functions for fls()/__fls()/fls64()")
treatment.

I've sent https://lore.kernel.org/llvm/20230828-x86_fls-v1-1-e6a31b9f79c3@xxxxxxxxxx/.

>
> > 2. by providing the definition of a symbol typically provided by libc
> > (and then not building with -ffreestanding) pessimizes libcall
> > optimization.
>
> .. and this is partly why we often avoid libgcc things, and do certain
> things by hand.

(Sorry if the following rant is prior knowledge, it's mostly for
reference for others cc'ed who might not know this)

Careful, `-ffreestanding` and libgcc are two orthogonal things (at
least in my mind).

-ffreestanding is to libc as --rtlib= is to the compiler runtime
(which is distinct from the libc)

`-ffreestanding` is more about "does the runtime environment somehow
provide libc symbols."

libgcc (or llvm's equivalent "compiler-rt") is not responsible for
providing symbols from libc. `--rtlib=` controls what compiler
runtime will be used, but in my experience, today's compilers don't
make codegen decisions on that value. These are mostly runtime
helpers for "idk how to do <complicated math thing, such as double
word division>" or "maybe you didn't want that inline."

What's brittle about making codegen decisions with regards to these
flags though is that these dependencies grow over time, and yet it's
not possible today (AFAIK) to specify what's the minimum target to
support.

For instance, IIRC glibc recently added support for one of the
kernel's string.h routines, maybe strlcpy or something.
https://sourceware.org/git/?p=glibc.git;a=commit;h=454a20c8756c9c1d55419153255fc7692b3d2199

When is it safe for the compiler to start transforming calls to other
functions into calls to strlcpy? (Guess: year 2033, because:) What
happens when dynamically linking against older versions of glib that
do not provide that symbol?

>
> The classic rule is "Don't do 64-bit divisions using the C '/' operator".
>
> So in the kernel you have to use do_div() and friends, because the
> library versions are often disgusting and might not know that 64/32 is
> much much cheaper and is what you want.

And thus the same problem exists for the kernel wrt --rtlib that I
alluded to above for strlcpy. By providing a partial implementation
of a compiler runtime (--rtlib=), the compiler will frequently emit
libcalls to symbols for which the kernel hasn't provided. You can
avoid open coded double word division in the kernel all you want but:
1. again you're probably pessimizing division by constant remainder by
using div64_u64. GCC is *really* good at replacing these when the
divisor is constant; IME better than clang.
2. even avoiding open coded division, the compiler can still insert
division; loop-elision can replace loops outright if the trip count is
adjusted by a determinable value. (see 3220022038b9).

By providing a partial compiler runtime, and then using every -mno-
flag in the book, you tie the compiler's hands wrt what it can emit vs
libcall. There's not even a way to express to today's compiler that
"we have a compiler runtime, it's just straight up missing things."

Personally, I think a clang-tidy check for open coded division is
perhaps a better way to enforce the kernel's posture than providing
half a compiler runtime then doing gymnastics in the code to work
around the emission of libcalls to __divdi3() or__aeabi_uldivmod()
(mostly on 32b platforms). A linkage failure is nice, but only occurs
on 32b targets and it doesn't disambiguate between the case of
developer open coded division vs compiler inserted division.

>
> And quite often we simply use other names - but then we also do *not*
> build with -freestanding, because -freestanding has at least
> traditionally meant that the compiler won't optimize the simple and
> obvious cases (typically things like "memcpy with a constant size").

Personal opinion: we very much do NOT want to use -ffreestanding for
those libcall optimizations.

I discussed this recently with ARCH=loongarch folks:
commit 3f301dc292eb ("LoongArch: Replace -ffreestanding with
finer-grained -fno-builtin's")

It is my intention to remove -ffreestanding from ARCH=i386.
https://github.com/ClangBuiltLinux/linux/issues/1583

I had to first fix a bug in LLVM though
https://reviews.llvm.org/D125285
So rather than remove it outright, we might need to retain it for
builds with older releases of clang for now.

Though as you allude to down thread, perhaps some things that were the
case in linux 1.0 / gcc 1.40 no longer hold. Which is why adding such
flags to kernel makefiles really ought to be accompanied by a comment
in sources linking to an issue tracker report, so that we might clean
these up one day.

>
> So we mix and match and pick the best option.

Gross, and like *could you not?* I suspect it's more so the case of a
developer not realising it's perhaps a compiler bug, or not reporting
such bug, and trying flags they're heard of once until something
links.

Any use of -ffreestanding for any arch had better have a comment to a
compiler vendor's bug tracker laying out why that's necessary for that
arch and not others.

Many kernel developers are allergic to filing formal compiler bugs in
places where compiler vendors are looking, IME.

>
> The kernel really doesn't care about architecture portability, because
> honestly, something like "ffs()" is entirely *trivial* to get right,
> compared to the real differences between architectures (eg VM and IO
> differences etc).
>
> Linus



--
Thanks,
~Nick Desaulniers