Re: INFO: task hung in perf_trace_event_unreg

From: Paul E. McKenney
Date: Mon Apr 09 2018 - 14:10:13 EST

Next message: Alex G.: "Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages"
Previous message: Waiman Long: "[PATCH 1/2] locking/qspinlock: Add stat tracking for pending vs slowpath"
In reply to: Dmitry Vyukov: "Re: INFO: task hung in perf_trace_event_unreg"
Next in thread: Dmitry Vyukov: "Re: INFO: task hung in perf_trace_event_unreg"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Apr 09, 2018 at 06:28:16PM +0200, Dmitry Vyukov wrote:
> On Mon, Apr 9, 2018 at 6:20 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Apr 09, 2018 at 02:54:20PM +0200, Dmitry Vyukov wrote:
> >> On Mon, Apr 2, 2018 at 7:23 PM, Paul E. McKenney
> >> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >> >> >> >> >>
> >> >> >> >> >> > Hello,
> >> >> >> >> >> >
> >> >> >> >> >> > syzbot hit the following crash on upstream commit
> >> >> >> >> >> > 0adb32858b0bddf4ada5f364a84ed60b196dbcda (Sun Apr 1 21:20:27 2018 +0000)
> >> >> >> >> >> > Linux 4.16
> >> >> >> >> >> > syzbot dashboard link:
> >> >> >> >> >> > https://syzkaller.appspot.com/bug?extid=2dbc55da20fa246378fd
> >> >> >> >> >> >
> >> >> >> >> >> > Unfortunately, I don't have any reproducer for this crash yet.
> >> >> >> >> >> > Raw console output:
> >> >> >> >> >> > https://syzkaller.appspot.com/x/log.txt?id=5487937873510400
> >> >> >> >> >> > Kernel config:
> >> >> >> >> >> > https://syzkaller.appspot.com/x/.config?id=-2374466361298166459
> >> >> >> >> >> > compiler: gcc (GCC) 7.1.1 20170620
> >> >> >> >> >> >
> >> >> >> >> >> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> >> >> >> >> >> > Reported-by: syzbot+2dbc55da20fa246378fd@xxxxxxxxxxxxxxxxxxxxxxxxx
> >> >> >> >> >> > It will help syzbot understand when the bug is fixed. See footer for
> >> >> >> >> >> > details.
> >> >> >> >> >> > If you forward the report, please keep this part and the footer.
> >> >> >> >> >> >
> >> >> >> >> >> > REISERFS warning (device loop4): super-6502 reiserfs_getopt: unknown mount
> >> >> >> >> >> > option "g ï;eïKï×>pquota"
> >> >> >> >> >
> >> >> >> >> > Might not hurt to look into the above, though perhaps this is just syzkaller
> >> >> >> >> > playing around with mount options.
> >> >> >> >> >
> >> >> >> >> >> > INFO: task syz-executor3:10803 blocked for more than 120 seconds.
> >> >> >> >> >> > Not tainted 4.16.0+ #10
> >> >> >> >> >> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >> >> >> >> >> > syz-executor3 D20944 10803 4492 0x80000002
> >> >> >> >> >> > Call Trace:
> >> >> >> >> >> > context_switch kernel/sched/core.c:2862 [inline]
> >> >> >> >> >> > __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440
> >> >> >> >> >> > schedule+0xf5/0x430 kernel/sched/core.c:3499
> >> >> >> >> >> > schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777
> >> >> >> >> >> > do_wait_for_common kernel/sched/completion.c:86 [inline]
> >> >> >> >> >> > __wait_for_common kernel/sched/completion.c:107 [inline]
> >> >> >> >> >> > wait_for_common kernel/sched/completion.c:118 [inline]
> >> >> >> >> >> > wait_for_completion+0x415/0x770 kernel/sched/completion.c:139
> >> >> >> >> >> > __wait_rcu_gp+0x221/0x340 kernel/rcu/update.c:414
> >> >> >> >> >> > synchronize_sched.part.64+0xac/0x100 kernel/rcu/tree.c:3212
> >> >> >> >> >> > synchronize_sched+0x76/0xf0 kernel/rcu/tree.c:3213
> >> >> >> >> >>
> >> >> >> >> >> I don't think this is a perf issue. Looks like something is preventing
> >> >> >> >> >> rcu_sched from completing. If there's a CPU that is running in kernel
> >> >> >> >> >> space and never scheduling, that can cause this issue. Or if RCU
> >> >> >> >> >> somehow missed a transition into idle or user space.
> >> >> >> >> >
> >> >> >> >> > The RCU CPU stall warning below strongly supports this position ...
> >> >> >> >>
> >> >> >> >> I think this is this guy then:
> >> >> >> >>
> >> >> >> >> https://syzkaller.appspot.com/bug?id=17f23b094cd80df750e5b0f8982c521ee6bcbf40
> >> >> >> >>
> >> >> >> >> #syz dup: INFO: rcu detected stall in __process_echoes
> >> >> >> >
> >> >> >> > Seems likely to me!
> >> >> >> >
> >> >> >> >> Looking retrospectively at the various hang/stall bugs that we have, I
> >> >> >> >> think we need some kind of priority between them. I.e. we have rcu
> >> >> >> >> stalls, spinlock stalls, workqueue hangs, task hangs, silent machine
> >> >> >> >> hang and maybe something else. It would be useful if they fire
> >> >> >> >> deterministically according to priorities. If there is an rcu stall,
> >> >> >> >> that's always detected as CPU stall. Then if there is no RCU stall,
> >> >> >> >> but a workqueue stall, then that's always detected as workqueue stall,
> >> >> >> >> etc.
> >> >> >> >> Currently if we have an RCU stall (effectively CPU stall), that can be
> >> >> >> >> detected either RCU stall or a task hung, producing 2 different bug
> >> >> >> >> reports (which is bad).
> >> >> >> >> One can say that it's only a matter of tuning timeouts, but at least
> >> >> >> >> task hung detector has a problem that if you set timeout to X, it can
> >> >> >> >> detect hung anywhere between X and 2*X. And on one hand we need quite
> >> >> >> >> large timeout (a minute may not be enough), and on the other hand we
> >> >> >> >> can't wait for an hour just to make sure that the machine is indeed
> >> >> >> >> dead (these things happen every few minutes).
> >> >> >> >
> >> >> >> > I suppose that we could have a global variable that was set to the
> >> >> >> > priority of the complaint in question, which would suppress all
> >> >> >> > lower-priority complaints. Might need to be opt-in, though -- I would
> >> >> >> > guess that not everyone is going to be happy with one complaint suppressing
> >> >> >> > others, especially given the possibility that the two complaints might
> >> >> >> > be about different things.
> >> >> >> >
> >> >> >> > Or did you have something more deft in mind?
> >> >> >>
> >> >> >>
> >> >> >> syzkaller generally looks only at the first report. One does not know
> >> >> >> if/when there will be a second one, or the second one can be induced
> >> >> >> by the first one, and we generally want clean reports on a non-tainted
> >> >> >> kernel. So we don't just need to suppress lower priority ones, we need
> >> >> >> to produce the right report first.
> >> >> >> I am thinking maybe setting:
> >> >> >> - rcu stalls at 1.5 minutes
> >> >> >> - workqueue stalls at 2 minutes
> >> >> >> - task hungs at 2.5 minutes
> >> >> >> - and no output whatsoever at 3 minutes
> >> >> >> Do I miss anything? I think at least spinlocks. Should they go before
> >> >> >> or after rcu?
> >> >> >
> >> >> > That is what I know of, but the Linux kernel being what it is, there is
> >> >> > probably something more out there. If not now, in a few months. The
> >> >> > RCU CPU stall timeout can be set on the kernel-boot command line, but
> >> >> > you probably already knew that.
> >> >>
> >> >> Well, it's all based solely on a large number of patches and stopgaps.
> >> >> If we fix main problems for today, it's already good.
> >> >
> >> > Fair enough!
> >> >
> >> >> > Just for comparison, back in DYNIX/ptx days the RCU CPU stall timeout
> >> >> > was 1.5 -seconds-. ;-)
> >> >>
> >> >> Have you tried to instrument every basic block with a function call to
> >> >> collect coverage, check every damn memory access for validity, enable
> >> >> all thinkable and unthinkable debug configs and put the insanest load
> >> >> one can imagine from a swarm of parallel threads? It makes things a
> >> >> bit slower ;)
> >> >
> >> > Given that we wouldn't have had enough CPU or memory to accommodate
> >> > all of that back in DYNIX/ptx days, I am forced to answer "no". ;-)
> >> >
> >> >> >> This will require fixing task hung. Have not yet looked at workqueue detector.
> >> >> >> Does at least RCU respect the given timeout more or less precisely?
> >> >> >
> >> >> > Assuming that there is at least one CPU capable of taking scheduling-clock
> >> >> > interrupts, it should respect the timeout to within a few jiffies.
> >>
> >>
> >> Hi Paul,
> >>
> >> Speaking of stalls and rcu, we are seeing lots of crashes that go like this:
> >>
> >> INFO: rcu_sched self-detected stall on CPU[ 404.992530] INFO:
> >> rcu_sched detected stalls on CPUs/tasks:
> >> INFO: rcu_sched self-detected stall on CPU[ 454.347448] INFO:
> >> rcu_sched detected stalls on CPUs/tasks:
> >> INFO: rcu_sched self-detected stall on CPU[ 396.073634] INFO:
> >> rcu_sched detected stalls on CPUs/tasks:
> >>
> >> or like this:
> >>
> >> INFO: rcu_sched self-detected stall on CPU
> >> INFO: rcu_sched detected stalls on CPUs/tasks:
> >> 0-....: (125000 ticks this GP) idle=0ba/1/4611686018427387906
> >> softirq=57641/57641 fqs=31151
> >> 0-....: (125000 ticks this GP) idle=0ba/1/4611686018427387906
> >> softirq=57641/57641 fqs=31151
> >> (t=125002 jiffies g=31656 c=31655 q=910)
> >>
> >> INFO: rcu_sched self-detected stall on CPU
> >> INFO: rcu_sched detected stalls on CPUs/tasks:
> >> 0-....: (125000 ticks this GP) idle=49a/1/4611686018427387906
> >> softirq=65194/65194 fqs=31231
> >> 0-....: (125000 ticks this GP) idle=49a/1/4611686018427387906
> >> softirq=65194/65194 fqs=31231
> >> (t=125002 jiffies g=34421 c=34420 q=1119)
> >> (detected by 1, t=125002 jiffies, g=34421, c=34420, q=1119)
> >>
> >>
> >> and then there is an unintelligible mess of 2 reports. Such crashes go
> >> to trash bin, because we can't even say which function hanged. It
> >> seems that in all cases 2 different rcu stall detection facilities
> >> race with each other. Is it possible to make them not race?
> >
> > How about the following (untested, not for mainline) patch? It suppresses
> > all but the "main" RCU flavor, which is rcu_sched for !PREEMPT builds and
> > rcu_preempt otherwise. Either way, this is the RCU flavor corresponding
> > to synchronize_rcu(). This works well in the common case where there
> > is almost always an RCU grace period in flight.
> >
> > One reason that this patch is not for mainline is that I am working on
> > merging the RCU-bh, RCU-preempt, and RCU-sched flavors into one thing,
> > at which point there won't be any races. But that might be a couple
> > merge windows away from now.
> >
> > Thanx, Paul
> >
> > ------------------------------------------------------------------------
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 381b47a68ac6..31f7818f2d63 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1552,7 +1552,7 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
> > struct rcu_node *rnp;
> >
> > if ((rcu_cpu_stall_suppress && !rcu_kick_kthreads) ||
> > - !rcu_gp_in_progress(rsp))
> > + !rcu_gp_in_progress(rsp) || rsp != rcu_state_p)
> > return;
> > rcu_stall_kick_kthreads(rsp);
> > j = jiffies;
>
> But doesn't they both relate to the same rcu flavor? They both say
> rcu_sched. I assumed that the difference is "self-detected" vs "on
> CPUs/tasks", i.e. on the current CPU vs on other CPUs.

Right you are!

One approach would be to increase the value of RCU_STALL_RAT_DELAY,
which is currently two jiffies to (say) 20 jiffies. This is in
kernel/rcu/tree.h. But this would fail on a sufficiently overloaded
system -- and the failure of the two-jiffy delay is a bit of a surprise,
given interrupts disabled and all that. Are you by any chance loaded
heavily enough to see vCPU preemption?

I could avoid at least some of these timing issues instead using cmpxchg()
on ->jiffies_stall to allow only one CPU in, but leave the non-atomic
update to discourage overly long stall prints from running into the
next one. This is not perfect, either, and is roughly equivalent to
setting RCU_STALL_RAT_DELAY to many second's worth of jiffies, but
avoiding that minute's delay. But it should get rid of the duplication
in almost all cases, though it could allow a stall warning to overlap
with a later stall warning for that same grace period. Which can
already happen anyway. Also, a tens-of-seconds vCPU preemption can
still cause concurrent stall warnings, but if that is happening to you,
the concurrent stall warnings are probably the least of your problems.
Besides, we do need at least one CPU to actually report the stall, which
won't happen if that CPU's vCPU is indefinitely preempted. So there is
only so much I can do about that particular corner case.

So how does the following (untested) patch work for you?

Thanx, Paul

------------------------------------------------------------------------

commit 6a5ab1e68f8636d8823bb5a9aee35fc44c2be866
Author: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
Date: Mon Apr 9 11:04:46 2018 -0700

rcu: Exclude near-simultaneous RCU CPU stall warnings

There is a two-jiffy delay between the time that a CPU will self-report
an RCU CPU stall warning and the time that some other CPU will report a
warning on behalf of the first CPU. This has worked well in the past,
but on busy systems, it is possible for the two warnings to overlap,
which makes interpreting them extremely difficult.

This commit therefore uses a cmpxchg-based timing decision that
allows only one report in a given one-minute period (assuming default
stall-warning Kconfig parameters). This approach will of course fail
if you are seeing minute-long vCPU preemption, but in that case the
overlapping RCU CPU stall warnings are the least of your worries.

Reported-by: Dmitry Vyukov <dvyukov@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 381b47a68ac6..b7246bcbf633 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1429,8 +1429,6 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
return;
}
- WRITE_ONCE(rsp->jiffies_stall,
- jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);

/*
@@ -1481,6 +1479,10 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
sched_show_task(current);
}
}
+ /* Rewrite if needed in case of slow consoles. */
+ if (ULONG_CMP_GE(jiffies, READ_ONCE(rsp->jiffies_stall)))
+ WRITE_ONCE(rsp->jiffies_stall,
+ jiffies + 3 * rcu_jiffies_till_stall_check() + 3);

rcu_check_gp_kthread_starvation(rsp);

@@ -1525,6 +1527,7 @@ static void print_cpu_stall(struct rcu_state *rsp)
rcu_dump_cpu_stacks(rsp);

raw_spin_lock_irqsave_rcu_node(rnp, flags);
+ /* Rewrite if needed in case of slow consoles. */
if (ULONG_CMP_GE(jiffies, READ_ONCE(rsp->jiffies_stall)))
WRITE_ONCE(rsp->jiffies_stall,
jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
@@ -1548,6 +1551,7 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
unsigned long gpnum;
unsigned long gps;
unsigned long j;
+ unsigned long jn;
unsigned long js;
struct rcu_node *rnp;

@@ -1586,14 +1590,17 @@ static void check_cpu_stall(struct rcu_state *rsp, struct rcu_data *rdp)
ULONG_CMP_GE(gps, js))
return; /* No stall or GP completed since entering function. */
rnp = rdp->mynode;
+ jn = jiffies + 3 * rcu_jiffies_till_stall_check() + 3;
if (rcu_gp_in_progress(rsp) &&
- (READ_ONCE(rnp->qsmask) & rdp->grpmask)) {
+ (READ_ONCE(rnp->qsmask) & rdp->grpmask) &&
+ cmpxchg(&rsp->jiffies_stall, js, jn) == js) {

/* We haven't checked in, so go dump stack. */
print_cpu_stall(rsp);

} else if (rcu_gp_in_progress(rsp) &&
- ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY)) {
+ ULONG_CMP_GE(j, js + RCU_STALL_RAT_DELAY) &&
+ cmpxchg(&rsp->jiffies_stall, js, jn) == js) {

/* They had a few time units to dump stack, so complain. */
print_other_cpu_stall(rsp, gpnum);

Next message: Alex G.: "Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages"
Previous message: Waiman Long: "[PATCH 1/2] locking/qspinlock: Add stat tracking for pending vs slowpath"
In reply to: Dmitry Vyukov: "Re: INFO: task hung in perf_trace_event_unreg"
Next in thread: Dmitry Vyukov: "Re: INFO: task hung in perf_trace_event_unreg"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]