RE: for_each_cpu() is buggy for UP kernel?

From: Dexuan Cui
Date: Mon May 14 2018 - 23:02:44 EST


> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Sent: Sunday, May 13, 2018 11:22
> On Tue, May 8, 2018 at 11:24 PM Dexuan Cui <decui@xxxxxxxxxxxxx> wrote:
>
> > Should we fix the for_each_cpu() in include/linux/cpumask.h for UP?
>
> As Thomas points out, this has come up before.
>
> One of the issues is historical - we tried very hard to make the SMP code
> not cause code generation problems for UP, and part of that was just that
> all these loops were literally designed to entirely go away under UP. It
> still *looks* syntactically like a loop, but an optimizing compiler will
> see that there's nothing there, and "for_each_cpu(...) x" essentially just
> turns into "x" on UP. An empty mask simply generally doesn't make sense,
> since opn UP you also don't have any masking of CPU ops, so the mask is
> ignored, and that helps the code generation immensely.
>
> If you have to load and test the mask, you immediately lose out badly in
> code generation.
Thank you all for the insights and the detailed background introduction!

> So honestly, I'd really prefer to keep our current behavior. Perhaps with a
> debug option that actually tests (on SMP - because that's what every
> developer is actually _using_ these days) that the mask isn't empty. But
> I'm not sure that would find this case, since presumably on SMP it might
> never be empty.
I agree.

> Now, there is likely a fairly good argument that UP is getting _so_
> uninteresting that we shouldn't even worry about code generation. But the
> counter-argument to that is that if people are using UP in this day and
> age, they probably are using some really crappy hardware that needs all the
> help it can get.
FWIW, I happened to find this issue in a SMP virtual machine, but the kernel
from a customer was built with CONFIG_SMP disabled. After spending 1 day
debugging the strange boot-up delay, which was caused by the unexpected
PIT interrupt storm, I finally tracked it down to the UP version of for_each_cpu().

The function exposing the issue is kernel/time/tick-broadcast.c:
tick_handle_oneshot_broadcast().

If you're OK with the below fix (not tested yet), I'll submit a patch for it:

--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -616,6 +616,10 @@ static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
now = ktime_get();
/* Find all expired events */
for_each_cpu(cpu, tick_broadcast_oneshot_mask) {
+#ifndef CONFIG_SMP
+ if (cpumask_empty(tick_broadcast_oneshot_mask))
+ break;
+#endif
td = &per_cpu(tick_cpu_device, cpu);
if (td->evtdev->next_event <= now) {
cpumask_set_cpu(cpu, tmpmask);

> At least for now, I'd rather have this inconsistency, because it really
> makes a surprisingly *big* difference in code generation. From the little
> test I just did, adding that mask testing to a *single* case of
> for_each_cpu() added 20 instructions. I didn't look at exactly why that
> happened (because the code generation was so radically different), but it
> was very noticeable. I used your macro replacement in kernel/taskstats.c in
> case you want to try to dig into what happened, but I'm not surprised. It
> really turns an unconditional trivial loop into a much more complex thing
> that needs to look at and test a value that we didn't care about before.
I agree.


> Maybe we should introduce a "for_each_cpu_maybe_empty()" helper for
> cases like this?
> Linus
Sounds like a good idea.

Thanks,
-- Dexuan