Re: [PATCH] a patch to fix the cpu-offline-online problem causedby pm_idle

From: Peter Zijlstra
Date: Sun Jan 30 2011 - 11:38:53 EST

On Sat, 2011-01-29 at 13:44 +0800, Luming Yu wrote:
> On Fri, Jan 28, 2011 at 6:30 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >> We have seen an extremely slow system under the CPU-OFFLINE-ONLIE test
> >> on a 4-sockets NHM-EX system.
> >
> > Slow is OK, cpu-hotplug isn't performance critical by any means.
> Here is one example that the "slow" is not acceptable. Maybe I should
> not use "slow" in the first place. It happnes after I resolved a
> similar NMI watchdog warnning in calibrate_delay_direct..
> Please note, I got the BUG in a 2.6.32-based kernel. Upstream behaves
> similar I guess.

Guessing is totally the wrong thing when you're sending stuff upstream,
esp ugly patches such as this. .32 is more than a year old, anything
could have happened.

> BUG: soft lockup - CPU#63 stuck for 61s! [migration/63:256]

> > If its slow but working, the test is broken, I don't see a reason to do
> > anything to the kernel, let alone the below.
> It not working sometimes, so I think it's not a solid feature right now.

But you didn't say anything about not working, you merely said slow. If
its not working, you need to very carefully explain what is not working,
where its deadlocked and how your patch solves this and how you avoid
wrecking stuff for everybody else.

> >> since it currently unnecessarily implicitly interact with
> >> CPU power management.
> >
> > daft statement at best, because if not for some misguided power
> > management purpose, what are you actually unplugging cpus for?
> > (misguided because unplug doesn't actually safe more power than simply
> > idling the cpu).
> It's a RAS feature and Suspend Resume also hits same code path I think.

That still doesn't say anything, also who in his right mind suspends a
nhm-ex system?

> > So you flip the pm_idle pointer protected unter hotplug mutex, but
> > that's not serialized against module loading, so what happens if you
> > concurrently load a module that sets another idle policy?
> >
> > Your changelog is vague at best, so what exactly is the purpose here? We
> > flip to default_idle(), which uses HLT, which is C1. Then you run
> > cpu_idle_wait(), which will IPI all cpus, all these CPUs (except one)
> > could have been in deep C states (C3+) so you get your slow wakeup
> > anyway.
> >
> > There-after you do the normal stop-machine hot-plug dance, which again
> > will IPI all cpus once, then you flip it back to the saved pm_idle
> > handler and again IPI all cpus.
> it needs 50-100us latency to send one IPI, you could get an idea on a
> large NHM-EX system which contains 64 logical processors. With
> Tickless and APIC timer stopped in C3 on NHM-EX, you could also have
> an idea about the problem I have.

Ok, so one IPI costs 50-100 us, even with 64 cpu, that's at most 6.4ms
nowhere near enough to trigger the NMI watchdog. So what does go wrong?

Why does your patch solve things, like said, it doesn't avoid the slow
IPI at all, you still IPI each cpu right after changing the pm_idle
function. Those IPIs will still hit C3+ states.

> Let me know if there are still questions .

Yeah, what are you smoking? Why do you wreck perfectly fine code for one
backward ass piece of hardware.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at