Re: for_each_cpu() is buggy for UP kernel?

From: Linus Torvalds
Date: Sun May 13 2018 - 14:21:49 EST

Next message: jacopo mondi: "Re: [PATCH 2/3] arm64: dts: renesas: r8a77995: Add VIN4"
Previous message: fcami: "[PATCH] libata: Apply NOLPM quirk for SAMSUNG PM830 CXM13D1Q."
In reply to: Thomas Gleixner: "Re: for_each_cpu() is buggy for UP kernel?"
Next in thread: Dmitry Vyukov: "Re: for_each_cpu() is buggy for UP kernel?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <decui@xxxxxxxxxxxxx> wrote:

> Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?

As Thomas points out, this has come up before.

One of the issues is historical - we tried very hard to make the SMP code
not cause code generation problems for UP, and part of that was just that
all these loops were literally designed to entirely go away under UP. It
still *looks* syntactically like a loop, but an optimizing compiler will
see that there's nothing there, and "for_each_cpu(...) x" essentially just
turns into "x" on UP. An empty mask generally just doesn't make sense there,
since on UP you also don't have any masking of CPU ops, so the mask is
ignored, and that helps the code generation immensely.
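
As a rough userspace illustration of the point above (not the kernel source;
the sketch_* names and the one-word mask are assumptions made for this
example), a UP-style iterator can evaluate the mask expression without ever
testing it, and always run the body once for CPU 0:

```c
/* Userspace sketch only; the sketch_* names are hypothetical, not kernel API. */

/*
 * UP-style iterator: the loop runs exactly once with cpu == 0. The mask
 * expression is evaluated and discarded, never tested, so an optimizing
 * compiler can reduce the whole construct to just the loop body.
 */
#define sketch_for_each_cpu_up(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1; (cpu)++, (void)(mask))

/* Counts how many times the loop body runs for a given one-word mask. */
static int sketch_count_visits(unsigned long mask)
{
	int cpu, visits = 0;

	sketch_for_each_cpu_up(cpu, mask)
		visits++;
	return visits;
}
```

With this shape, sketch_count_visits(0) still runs the body once: the empty
mask is not honored, which is exactly the inconsistency being discussed.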

If you have to load and test the mask, you immediately lose out badly in
code generation.

So honestly, I'd really prefer to keep our current behavior. Perhaps with a
debug option that actually tests (on SMP - because that's what every
developer is actually _using_ these days) that the mask isn't empty. But
I'm not sure that would find this case, since presumably on SMP it might
never be empty.
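
One possible shape for such a debug check, as a userspace sketch only: the
sketch_* names, the four-CPU limit, and the warning counter are all
assumptions for illustration, and a real kernel option would presumably
warn through the kernel's own facilities instead.

```c
/* Userspace sketch only; every sketch_* name here is hypothetical. */

#define SKETCH_NR_CPUS 4

/* Number of times an iteration was started over an empty mask. */
static int sketch_empty_mask_warnings;

/* Returns the next set bit after 'cpu', or SKETCH_NR_CPUS if none. */
static int sketch_next_cpu(int cpu, unsigned long mask)
{
	for (cpu++; cpu < SKETCH_NR_CPUS; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return SKETCH_NR_CPUS;
}

/* Debug variant of "first CPU": records a warning if the mask is empty. */
static int sketch_first_cpu_checked(unsigned long mask)
{
	if (mask == 0)
		sketch_empty_mask_warnings++;
	return sketch_next_cpu(-1, mask);
}

/* SMP-style iterator that honors the mask and flags empty ones. */
#define sketch_for_each_cpu_debug(cpu, mask)			\
	for ((cpu) = sketch_first_cpu_checked(mask);		\
	     (cpu) < SKETCH_NR_CPUS;				\
	     (cpu) = sketch_next_cpu((cpu), (mask)))

/* Counts how many CPUs the debug iterator visits for a given mask. */
static int sketch_count_debug(unsigned long mask)
{
	int cpu, visits = 0;

	sketch_for_each_cpu_debug(cpu, mask)
		visits++;
	return visits;
}
```

The check only fires if an empty mask actually shows up at run time, so it
shares the limitation noted above: a caller whose mask is never empty on
SMP would pass silently.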

Now, there is likely a fairly good argument that UP is getting _so_
uninteresting that we shouldn't even worry about code generation. But the
counter-argument to that is that if people are using UP in this day and
age, they probably are using some really crappy hardware that needs all the
help it can get.

At least for now, I'd rather have this inconsistency, because it really
makes a surprisingly *big* difference in code generation. From the little
test I just did, adding that mask testing to a *single* case of
for_each_cpu() added 20 instructions. I didn't look at exactly why that
happened (because the code generation was so radically different), but it
was very noticeable. I used your macro replacement in kernel/taskstats.c in
case you want to try to dig into what happened, but I'm not surprised. It
really turns an unconditional trivial loop into a much more complex thing
that needs to look at and test a value that we didn't care about before.

Maybe we should introduce a "for_each_cpu_maybe_empty()" helper for cases
like this?
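
A minimal sketch of what such a helper could look like on UP, in userspace C
with a one-word mask. Nothing here is existing kernel code: the macro body,
the sketch_ prefix, and the mask representation are assumptions made for
illustration.

```c
/* Userspace sketch only; the sketch_* names are hypothetical, not kernel API. */

/*
 * UP variant that honors an empty mask: bit 0 of the mask is loaded and
 * tested, so the body is skipped when CPU 0 is not set. This is the extra
 * load-and-test that the plain UP loop avoids.
 */
#define sketch_for_each_cpu_maybe_empty(cpu, mask) \
	for ((cpu) = 0; (cpu) < 1 && ((mask) & 1UL); (cpu)++)

/* Counts how many times the loop body runs for a given one-word mask. */
static int sketch_count_maybe_empty(unsigned long mask)
{
	int cpu, visits = 0;

	sketch_for_each_cpu_maybe_empty(cpu, mask)
		visits++;
	return visits;
}
```

Only the call sites that can legitimately see an empty mask would pay for
the test; every other for_each_cpu() user keeps the trivial loop.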

Linus
