Re: [Bug #12650] Strange load average and ksoftirqd behavior with2.6.29-rc2-git1

From: Frederic Weisbecker
Date: Mon Feb 16 2009 - 23:34:52 EST


On Mon, Feb 16, 2009 at 02:39:44PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 16, 2009 at 09:09:23PM +0100, Ingo Molnar wrote:
> >
> > * Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > > Here the calls to rcu_process_callbacks() are only 75
> > > microseconds apart, so that this function is consuming more
> > > than 10% of a CPU. The strange thing is that I don't see a
> > > raise_softirq() in between, though perhaps it gets inlined or
> > > something that makes it invisible to ftrace.
> >
> > look at the latest trace please, that has even the most inline
> > raise-softirq method instrumented, so all the raising is
> > visible.
>
> Ah, my apologies! This time looking at:
>
> http://damien.wyart.free.fr/ksoftirqd_pb/trace_tip_2009.02.16_ksoftirqd_pb_abstime_proc.txt.gz
>
>
> 799.521187 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.521371 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.521555 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.521738 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.521934 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.522068 | 1) ksoftir-2324 | | rcu_check_callbacks() {
> 799.522208 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.522392 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.522575 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.522759 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.522956 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.523074 | 1) ksoftir-2324 | | rcu_check_callbacks() {
> 799.523214 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.523397 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.523579 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.523762 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.523960 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.524079 | 1) ksoftir-2324 | | rcu_check_callbacks() {
> 799.524220 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.524403 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.524587 | 1) <idle>-0 | | rcu_check_callbacks() {
> 799.524770 | 1) <idle>-0 | | rcu_check_callbacks() {
> [ . . . ]
>
> Yikes!!!
>
> Why is rcu_check_callbacks() being invoked so often? It should be called
> but once per jiffy, and here it is called no less than 22 times in about
> 3.5 milliseconds, meaning one call every 160 microseconds or so.
>
> Hmmm...
>
> Looks like we never return from:
>
> 799.521142 | 1) <idle>-0 | | tick_nohz_stop_sched_tick() {
>
> Perhaps we are taking an interrupt immediately after the
> local_irq_restore()? And at 799.521209 deciding to exit nohz mode.
> And then deciding to go back into nohz mode at 799.521326, 117
> microseconds later, after which we re-invoke rcu_check_callbacks(),
> which again raises RCU's softirq.
>
> And the reason we are invoking rcu_check_callbacks() so often appears
> to be in in arch/x86/kernel/process_32.c cpu_idle() near line 107,
> which explains my failure to reproduce on a 64-bit system:
>
> void cpu_idle(void)
> {
> int cpu = smp_processor_id();
>
> current_thread_info()->status |= TS_POLLING;
>
> /* endless idle loop with no priority at all */
> while (1) {
> tick_nohz_stop_sched_tick(1);
> while (!need_resched()) {
>
> check_pgt_cache();
> rmb();
>
> if (rcu_pending(cpu))
> rcu_check_callbacks(cpu, 0);
>
> if (cpu_is_offline(cpu))
> play_dead();
>
> local_irq_disable();
> __get_cpu_var(irq_stat).idle_timestamp = jiffies;
> /* Don't trace irqs off for idle */
> stop_critical_timings();
> pm_idle();
> start_critical_timings();
> }
> tick_nohz_restart_sched_tick();
> preempt_enable_no_resched();
> schedule();
> preempt_disable();
> }
> }
>
> If we go in and out of nohz mode quickly, we will invoke rcu_pending()
> each time. I would expect rcu_pending() to return 0 most of the time,
> but that apparently isn't the case with treercu...
>
> What is the easiest way for me to make it easy to trace the return path
> from __rcu_pending()? Make each return path call an empty function
> located off where the compiler cannot see it, I guess... Diagnostic
> patch along these lines below. Frederic, Damien, could you please give
> it a go? (And of course please let me know if something else is
> needed.)


No, you don't need that, you can use ftrace_printk, it will generate a C-comment like
inside the functions, ie:

__rcu_pending() {
/* pending_qs */
}

I've converted your below patch with ftrace_printks and tested it under an old P2
with rcu_tree and 1000 Hz. I made a trace during an idle state, and well, looks like I'm
lucky :-)
I guess I successfully reproduced the softirq/rcu overhead.
Please find the below patch to trace the rcu_pending return path, as well as the trace I made.
Sorry, the trace is a bit buggy with sometimes flying orphans C like comments.
When I will have more time, I will fix that.

The trace is here http://dl.free.fr/uyWGgCbx4

It looks like it mostly returns 1 because of the waiting for quiescent state:

$ cat rcutrace | grep "/* pending_none" | wc -l
221
$ cat rcutrace | grep "/* pending_qs" | wc -l
248
$ cat rcutrace | grep "/* pending" | wc -l
469


diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index b2fd602..c9e78f6 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -45,6 +45,7 @@
#include <linux/cpu.h>
#include <linux/mutex.h>
#include <linux/time.h>
+#include <linux/ftrace.h>

#ifdef CONFIG_DEBUG_LOCK_ALLOC
static struct lock_class_key rcu_lock_key;
@@ -1249,31 +1250,44 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
check_cpu_stall(rsp, rdp);

/* Is the RCU core waiting for a quiescent state from this CPU? */
- if (rdp->qs_pending)
+ if (rdp->qs_pending) {
+ ftrace_printk("pending_qs\n");
return 1;
+ }

/* Does this CPU have callbacks ready to invoke? */
- if (cpu_has_callbacks_ready_to_invoke(rdp))
+ if (cpu_has_callbacks_ready_to_invoke(rdp)) {
+ ftrace_printk("pending_ready_invoke\n");
return 1;
+ }

/* Has RCU gone idle with this CPU needing another grace period? */
- if (cpu_needs_another_gp(rsp, rdp))
+ if (cpu_needs_another_gp(rsp, rdp)) {
+ ftrace_printk("pending_gp\n");
return 1;
+ }

/* Has another RCU grace period completed? */
- if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */
+ if (ACCESS_ONCE(rsp->completed) != rdp->completed) {/* outside of lock */
+ ftrace_printk("pending_gp_completed\n");
return 1;
+ }

/* Has a new RCU grace period started? */
- if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */
+ if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) { /* outside of lock */
+ ftrace_printk("pending_gp_new_started\n");
return 1;
+ }

/* Has an RCU GP gone long enough to send resched IPIs &c? */
if (ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum) &&
((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0 ||
- (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0))
+ (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0)) {
+ ftrace_printk("pending_ipi\n");
return 1;
+ }

+ ftrace_printk("pending_none\n");
/* nothing to do */
return 0;
}


> Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> ---
>
> rcupdate.c | 23 +++++++++++++++++++++++
> rcutree.c | 31 +++++++++++++++++++++++++------
> 2 files changed, 48 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/rcupdate.c b/kernel/rcupdate.c
> index d92a76a..42bbf03 100644
> --- a/kernel/rcupdate.c
> +++ b/kernel/rcupdate.c
> @@ -175,3 +175,26 @@ void __init rcu_init(void)
> __rcu_init();
> }
>
> +void __rcu_pending_qs_pending(void)
> +{
> +}
> +
> +void __rcu_pending_callbacks_ready(void)
> +{
> +}
> +
> +void __rcu_pending_needs_gp(void)
> +{
> +}
> +
> +void __rcu_pending_new_completed(void)
> +{
> +}
> +
> +void __rcu_pending_new_gp(void)
> +{
> +}
> +
> +void __rcu_pending_fqs(void)
> +{
> +}
> diff --git a/kernel/rcutree.c b/kernel/rcutree.c
> index b2fd602..e2d72c3 100644
> --- a/kernel/rcutree.c
> +++ b/kernel/rcutree.c
> @@ -1234,6 +1234,13 @@ void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
> }
> EXPORT_SYMBOL_GPL(call_rcu_bh);
>
> +extern void __rcu_pending_qs_pending(void);
> +extern void __rcu_pending_callbacks_ready(void);
> +extern void __rcu_pending_needs_gp(void);
> +extern void __rcu_pending_new_completed(void);
> +extern void __rcu_pending_new_gp(void);
> +extern void __rcu_pending_fqs(void);
> +
> /*
> * Check to see if there is any immediate RCU-related work to be done
> * by the current CPU, for the specified type of RCU, returning 1 if so.
> @@ -1249,30 +1256,42 @@ static int __rcu_pending(struct rcu_state *rsp, struct rcu_data *rdp)
> check_cpu_stall(rsp, rdp);
>
> /* Is the RCU core waiting for a quiescent state from this CPU? */
> - if (rdp->qs_pending)
> + if (rdp->qs_pending) {
> + __rcu_pending_qs_pending();
> return 1;
> + }
>
> /* Does this CPU have callbacks ready to invoke? */
> - if (cpu_has_callbacks_ready_to_invoke(rdp))
> + if (cpu_has_callbacks_ready_to_invoke(rdp)) {
> + __rcu_pending_callbacks_ready();
> return 1;
> + }
>
> /* Has RCU gone idle with this CPU needing another grace period? */
> - if (cpu_needs_another_gp(rsp, rdp))
> + if (cpu_needs_another_gp(rsp, rdp)) {
> + __rcu_pending_needs_gp();
> return 1;
> + }
>
> /* Has another RCU grace period completed? */
> - if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */
> + if (ACCESS_ONCE(rsp->completed) != rdp->completed) /* outside of lock */ {
> + __rcu_pending_new_completed();
> return 1;
> + }
>
> /* Has a new RCU grace period started? */
> - if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */
> + if (ACCESS_ONCE(rsp->gpnum) != rdp->gpnum) /* outside of lock */ {
> + __rcu_pending_new_gp();
> return 1;
> + }
>
> /* Has an RCU GP gone long enough to send resched IPIs &c? */
> if (ACCESS_ONCE(rsp->completed) != ACCESS_ONCE(rsp->gpnum) &&
> ((long)(ACCESS_ONCE(rsp->jiffies_force_qs) - jiffies) < 0 ||
> - (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0))
> + (rdp->n_rcu_pending_force_qs - rdp->n_rcu_pending) < 0)) {
> + __rcu_pending_fqs();
> return 1;
> + }
>
> /* nothing to do */
> return 0;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/