Re: tty^Wrcu/perf lockdep trace.
From: Paul E. McKenney
Date: Sat Oct 05 2013 - 18:03:31 EST
On Sat, Oct 05, 2013 at 09:59:49PM +0200, Peter Zijlstra wrote:
> On Sat, Oct 05, 2013 at 09:28:02AM -0700, Paul E. McKenney wrote:
> > On Sat, Oct 05, 2013 at 06:05:11PM +0200, Peter Zijlstra wrote:
> > > On Fri, Oct 04, 2013 at 02:25:06PM -0700, Paul E. McKenney wrote:
> > > > > Why
> > > > > do we still have a per-cpu kthread in nocb mode? The idea is that we do
> > > > > not disturb the cpu, right? So I suppose these kthreads get to run on
> > > > > another cpu.
> > > >
> > > > Yep, the idea is that usermode figures out where to run them. Even if
> > > > usermode doesn't do that, this has the effect of getting them to be
> > > > more out of the way of real-time tasks.
> > > >
> > > > > Since it's running on another CPU, we get into atomics and memory
> > > > > barriers anyway; so why not keep the logic the same as in the
> > > > > non-NOCB case, but have another CPU check our NOCB CPU's state?
> > > >
> > > > You can do that today by setting rcu_nocb_poll, but that results in
> > > > frequent polling wakeups even when the system is completely idle, which
> > > > is out of the question for the battery-powered embedded guys.
> > >
> > > So it's this polling I don't get... why is the different behaviour
> > > required? And why would you continue polling if the CPUs were actually
> > > idle?
> >
> > The idea is to offload the overhead of doing the wakeup from (say)
> > a real-time thread/CPU onto some housekeeping CPU.
>
> Sure I get that that is the idea; what I don't get is why it needs to
> behave differently depending on NOCB.
>
> Why does a NOCB thingy need to wake up the kthread far more often?
A better question would be "Why do NOCB wakeups have a higher risk of
hitting RCU/sched/perf deadlocks?"
In the !NOCB case, we rely on the scheduling-clock interrupt handler and
on the transition to idle to do the wakeups of ksoftirqd. These two
environments cannot possibly be holding scheduler or perf locks, so there
is no risk of deadlock in the common case.
Now there would be a risk of deadlock in the uncommon !NOCB case where a
huge burst of call_rcu() invocations has caused more than 10,000 callbacks
to pile up on a single CPU -- in that case, call_rcu() itself will do the
wakeup. But it does so only if interrupts are enabled, which again avoids
the deadlock, as sketched below.
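To make that decision concrete, here is a minimal user-space model of the
call_rcu() slow path just described. The 10,000-callback threshold and
the interrupts-enabled check come from the text above; the struct, the
stub helpers, and the function names are hypothetical stand-ins, not the
kernel's actual code.

#include <stdbool.h>
#include <stdio.h>

#define QHIMARK 10000   /* callback-flood threshold from above */

/* Hypothetical stand-ins for kernel state and primitives. */
struct cpu_rcu_state {
        long qlen;      /* callbacks queued on this CPU */
};

static bool irqs_disabled_stub(void) { return false; }
static void wake_ksoftirqd_stub(void) { puts("wake ksoftirqd"); }

/* Sketch of the !NOCB call_rcu() wakeup decision. */
static void call_rcu_wakeup_sketch(struct cpu_rcu_state *s)
{
        if (s->qlen <= QHIMARK)
                return;         /* common case: tick/idle path handles it */
        if (!irqs_disabled_stub())
                wake_ksoftirqd_stub();  /* safe: no sched/perf locks held */
        /* else: defer, relying on the guarantees listed below */
}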
This deferral is OK because we are guaranteed that one of the following
things will eventually happen:
1. Another call_rcu() arrives when more than 10,000 callbacks are
pending on the CPU, but this time with interrupts enabled.
2. In NO_HZ_PERIODIC kernels, or in workloads that remain in the
kernel for a very long time, we eventually take a scheduling-clock
interrupt.
3. In !RCU_FAST_NO_HZ kernels, on attempted transition to an
idle-RCU state (idle and, for NO_HZ_FULL, userspace with
only one task runnable), rcu_needs_cpu() will refuse the
request to turn off the scheduling-clock interrupt, so we
again eventually take a scheduling-clock interrupt (see the
sketch after this list).
4. In RCU_FAST_NO_HZ kernels, on transition to an idle-RCU
state, we advance the new callbacks and inform the RCU
core of their existence.
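Item 3's tick-refusal policy can be modeled in a few lines, reusing the
illustrative struct cpu_rcu_state from the earlier sketch. rcu_needs_cpu()
is a real kernel function, but the body below is a simplified stand-in
that assumes pending callbacks are the only thing keeping the tick alive;
it is not the kernel's exact logic.

/* Sketch of item 3's policy: refuse to stop the scheduling-clock
 * interrupt while callbacks remain, guaranteeing a future tick that
 * can safely do the wakeup.  Nonzero means "keep the tick". */
static int rcu_needs_cpu_sketch(const struct cpu_rcu_state *s)
{
        return s->qlen != 0;
}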
But none of these currently apply in the NOCB case. Instead, the idea
is to wake up the corresponding rcuo kthread when the first callback
arrives at an initially empty per-CPU list. Wakeups are omitted if the
list already has callbacks in it. Given that this produces a wakeup at
most once per grace period (several milliseconds at the very least),
I wasn't worried about it from a performance/scalability viewpoint.
Unfortunately, we have a deadlock issue.
My patch attempts to resolve this by moving the wakeup to the RCU core,
which is invoked by the scheduling-clock interrupt, and to the idle-entry
code.
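As a rough sketch of what the patch does, consider the following
user-space model. The empty-to-nonempty wakeup rule and the recheck from
the scheduling-clock interrupt and idle entry come from the discussion
above; the flag and helper names are hypothetical, not the patch's actual
identifiers.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-CPU NOCB state; names are illustrative only. */
struct nocb_state {
        long qlen;              /* callbacks queued for rcuo */
        bool defer_wakeup;      /* wakeup owed, but unsafe to do now */
};

static bool irqs_disabled_stub(void) { return true; }
static void wake_rcuo_stub(void) { puts("wake rcuo kthread"); }

/* Enqueue path: wake rcuo only on empty->nonempty transitions. */
static void nocb_enqueue_sketch(struct nocb_state *s)
{
        if (s->qlen++ != 0)
                return;                 /* rcuo already has work */
        if (irqs_disabled_stub())
                s->defer_wakeup = true; /* may hold sched/perf locks */
        else
                wake_rcuo_stub();
}

/* Run from the RCU core (scheduling-clock tick) and from idle entry. */
static void nocb_do_deferred_wakeup_sketch(struct nocb_state *s)
{
        if (s->defer_wakeup) {
                s->defer_wakeup = false;
                wake_rcuo_stub();       /* safe: no such locks held here */
        }
}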
> > > Is there some confusion between the nr_running==1 extended quiescent
> > > state and the nr_running==0 extended quiescent state?
> >
> > This is independent of the nr_running=1 extended quiescent state. The
> > wakeups only happen when running in the kernel. That said, a real-time
> > thread might want both rcu_nocb_poll=y and CONFIG_NO_HZ_FULL=y.
>
> So there's 3 behaviours?
>
> - CONFIG_NO_HZ_FULL=n
> - CONFIG_NO_HZ_FULL=y, rcu_nocb_poll=n
> - CONFIG_NO_HZ_FULL=y, rcu_nocb_poll=y
More like the following:
CONFIG_RCU_NOCB_CPU=n: Old behavior.
CONFIG_RCU_NOCB_CPU=y, rcu_nocb_poll=n: Wakeup deadlocks.
CONFIG_RCU_NOCB_CPU=y, rcu_nocb_poll=y: rcuo kthread periodically polls,
so no wakeups and (presumably) no deadlocks.
> What I'm trying to understand is why do all those things behave
> differently? For all 3 configs there's kthreads that do the GP advancing
> and can run on different cpus.
Yes, there are always the RCU grace-period kthreads, but these are
never awakened by call_rcu(). Instead, call_rcu() awakens either the
ksoftirqd kthread (old behavior) or the rcuo callback-offload kthread
(CONFIG_RCU_NOCB_CPU=y, rcu_nocb_poll=n behavior); in both cases it
wakes the kthread for the current CPU.
The old ksoftirqd wakeup is skipped if interrupts were disabled on entry
to call_rcu(), as you noted earlier, which avoids the deadlock. The basic
idea of the patch is to do something similar in the CONFIG_RCU_NOCB_CPU=y,
rcu_nocb_poll=n case, as in the sketch above.
> And why does rcu_nocb_poll=y need to be terrible for power usage; surely
> we know when cpus are actually idle and can stop polling them.
In theory, we could do that. But in practice, what would wake us up
when the CPUs go non-idle?
1. We could do a wakeup on the idle-to-non-idle transition. That
would increase idle-to-non-idle latency, defeating the purpose
of rcu_nocb_poll=y. Plus there are workloads that enter and
exit idle extremely quickly, which would not be good for
performance, scalability, or energy efficiency.
2. We could have some other thread, for example one of the RCU
grace-period kthreads, poll all the CPUs for activity. This
might actually work, but there are some really ugly races
involving a CPU becoming active just long enough to post a
callback and then going back to sleep, with no other RCU
activity in the system. This could easily result in a system
hang.
3. We could post a timeout to check for the corresponding CPU
being idle, but that just moves the wakeups out of idle from
the rcuo kthreads to the other CPUs.
4. I could remove rcu_nocb_poll and see if anyone complains. That
doesn't solve the deadlock problem, but it does simplify RCU a
bit. ;-)
Other thoughts?
Thanx, Paul