Re: [Bug #12650] Strange load average and ksoftirqd behavior with2.6.29-rc2-git1

From: Paul E. McKenney
Date: Mon Feb 16 2009 - 17:39:56 EST


On Mon, Feb 16, 2009 at 09:09:23PM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> > Here the calls to rcu_process_callbacks() are only 75
> > microseconds apart, so that this function is consuming more
> > than 10% of a CPU. The strange thing is that I don't see a
> > raise_softirq() in between, though perhaps it gets inlined or
> > something that makes it invisible to ftrace.
>
> look at the latest trace please, that has even the most inline
> raise-softirq method instrumented, so all the raising is
> visible.

Ah, my apologies! This time looking at:

http://damien.wyart.free.fr/ksoftirqd_pb/trace_tip_2009.02.16_ksoftirqd_pb_abstime_proc.txt.gz


799.521187 | 1) <idle>-0 | | rcu_check_callbacks() {
799.521371 | 1) <idle>-0 | | rcu_check_callbacks() {
799.521555 | 1) <idle>-0 | | rcu_check_callbacks() {
799.521738 | 1) <idle>-0 | | rcu_check_callbacks() {
799.521934 | 1) <idle>-0 | | rcu_check_callbacks() {
799.522068 | 1) ksoftir-2324 | | rcu_check_callbacks() {
799.522208 | 1) <idle>-0 | | rcu_check_callbacks() {
799.522392 | 1) <idle>-0 | | rcu_check_callbacks() {
799.522575 | 1) <idle>-0 | | rcu_check_callbacks() {
799.522759 | 1) <idle>-0 | | rcu_check_callbacks() {
799.522956 | 1) <idle>-0 | | rcu_check_callbacks() {
799.523074 | 1) ksoftir-2324 | | rcu_check_callbacks() {
799.523214 | 1) <idle>-0 | | rcu_check_callbacks() {
799.523397 | 1) <idle>-0 | | rcu_check_callbacks() {
799.523579 | 1) <idle>-0 | | rcu_check_callbacks() {
799.523762 | 1) <idle>-0 | | rcu_check_callbacks() {
799.523960 | 1) <idle>-0 | | rcu_check_callbacks() {
799.524079 | 1) ksoftir-2324 | | rcu_check_callbacks() {
799.524220 | 1) <idle>-0 | | rcu_check_callbacks() {
799.524403 | 1) <idle>-0 | | rcu_check_callbacks() {
799.524587 | 1) <idle>-0 | | rcu_check_callbacks() {
799.524770 | 1) <idle>-0 | | rcu_check_callbacks() {
[ . . . ]

Yikes!!!

Why is rcu_check_callbacks() being invoked so often? It should be called
but once per jiffy, and here it is called no less than 22 times in about
3.5 milliseconds, meaning one call every 160 microseconds or so.

Hmmm...

Looks like we never return from:

799.521142 | 1) <idle>-0 | | tick_nohz_stop_sched_tick() {

Perhaps we are taking an interrupt immediately after the
local_irq_restore()? And at 799.521209 deciding to exit nohz mode.
And then deciding to go back into nohz mode at 799.521326, 117
microseconds later, after which we re-invoke rcu_check_callbacks(),
which again raises RCU's softirq.

And the reason we are invoking rcu_check_callbacks() so often appears
to be in in arch/x86/kernel/process_32.c cpu_idle() near line 107,
which explains my failure to reproduce on a 64-bit system:

void cpu_idle(void)
{
int cpu = smp_processor_id();

current_thread_info()->status |= TS_POLLING;

/* endless idle loop with no priority at all */
while (1) {
tick_nohz_stop_sched_tick(1);
while (!need_resched()) {

check_pgt_cache();
rmb();

if (rcu_pending(cpu))
rcu_check_callbacks(cpu, 0);

if (cpu_is_offline(cpu))
play_dead();

local_irq_disable();
__get_cpu_var(irq_stat).idle_timestamp = jiffies;
/* Don't trace irqs off for idle */
stop_critical_timings();
pm_idle();
start_critical_timings();
}
tick_nohz_restart_sched_tick();
preempt_enable_no_resched();
schedule();
preempt_disable();
}
}

If we go in and out of nohz mode quickly, we will invoke rcu_pending()
each time. I would expect rcu_pending() to return 0 most of the time,
but that apparently isn't the case with treercu...

What is the easiest way for me to make it easy to trace the return path
from __rcu_pending()? Make each return path call an empty function
located off where the compiler cannot see it, I guess... Diagnostic
patch along these lines below. Frederic, Damien, could you please give
it a go? (And of course please let me know if something else is
needed.)

Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
---

rcupdate.c | 23 +++++++++++++++++++++++
rcutree.c | 31 +++++++++++++++++++++++++------
2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
index d92a76a..42bbf03 100644
--- a/kernel/rcupdate.c
+++ b/kernel/rcupdate.c
@@ -175,3 +175,26 @@ void __init rcu_init(void)
__rcu_init();
}

+void __rcu_pending_qs_pending(void)
+{
+}
+
+void __rcu_pending_callbacks_ready(void)
+{
+}
+
+void __rcu_pending_needs_gp(void)
+{
+}
+
+void __rcu_pending_new_completed(void)
+{
+}
+
+void __rcu_pending_new_gp(void)
+{
+}
+
+void __rcu_pending_fqs(void)
+{
+}
diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b2fd602..e2d72c3 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1234,6 +1234,13 @@ void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
}
EXPORT_SYMBOL_GPL(call_rcu_bh);

+extern void __rcu_pending_qs_pending(void);
+extern void __rcu_pending_callbacks_ready(void);
+extern void __rcu_pending_needs_gp(void);
+extern void __rcu_pending_new_completed(void);
+extern void __rcu_pending_new_gp(void);
+extern void __rcu_pending_fqs(void);
+
/*
* Check to see if there is any immediate RCU-related work to be done
* by the current CPU, for the specified type of RCU, returning 1 if so.
@@ -1249,30 +1256,42 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
check_cpu_stall(rsp, rdp);

/* Is the RCU core waiting for a quiescent state from this CPU? */
- if (rdp->qs_pending)
+ if (rdp->qs_pending) {
+ __rcu_pending_qs_pending();
return 1;
+ }

/* Does this CPU have callbacks ready to invoke? */
- if (cpu_has_callbacks_ready_to_invoke(rdp))
+ if (cpu_has_callbacks_ready_to_invoke(rdp)) {
+ __rcu_pending_callbacks_ready();
return 1;
+ }

/* Has RCU gone idle with this CPU needing another grace period? */
- if (cpu_needs_another_gp(rsp, rdp))
+ if (cpu_needs_another_gp(rsp, rdp)) {
+ __rcu_pending_needs_gp();
return 1;
+ }

/* Has another RCU grace period completed? */
- if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */
+ if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */ {
+ __rcu_pending_new_completed();
return 1;
+ }

/* Has a new RCU grace period started? */
- if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */
+ if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */ {
+ __rcu_pending_new_gp();
return 1;
+ }

/* Has an RCU GP gone long enough to send resched IPIs &c? */
if (ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum) &&
((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0 ||
- (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0))
+ (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0)) {
+ __rcu_pending_fqs();
return 1;
+ }

/* nothing to do */
return 0;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/