Re: [stable] 2.6.32.21 - uptime related crashes?

From: john stultz
Date: Tue Oct 25 2011 - 18:44:55 EST


On Sun, 2011-10-23 at 20:31 +0200, Ruben Kerkhof wrote:
> On Mon, Sep 5, 2011 at 01:26, Faidon Liambotis <paravoid@xxxxxxxxxx> wrote:
> > On Tue, Aug 30, 2011 at 03:38:29PM -0700, Greg KH wrote:
> >> On Thu, Aug 25, 2011 at 09:56:16PM +0300, Faidon Liambotis wrote:
> >> > On Thu, Jul 21, 2011 at 08:45:25PM +0200, Ingo Molnar wrote:
> >> > > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >> > >
> >> > > > On Thu, 2011-07-21 at 14:50 +0200, Nikola Ciprich wrote:
> >> > > > > thanks for the patch! I'll put this on our testing boxes...
> >> > > >
> >> > > > With a patch that frobs the starting value close to overflowing I hope,
> >> > > > otherwise we'll not hear from you in like 7 months ;-)
> >> > > >
> >> > > > > Are You going to push this upstream so we can ask Greg to push this to
> >> > > > > -stable?
> >> > > >
> >> > > > Yeah, I think we want to commit this with a -stable tag, Ingo?
> >> > >
> >> > > yeah - and we also want a Reported-by tag and an explanation of how
> >> > > it can crash and why it matters in practice. I can then stick it into
> >> > > the urgent branch for Linus. (probably will only hit upstream in the
> >> > > merge window though.)
> >> >
> >> > Has this been pushed or has the problem been solved somehow? Time is
> >> > against us on this bug as more boxes will crash as they reach 200 days
> >> > of uptime...
> >> >
> >> > In any case, feel free to use me as a Reported-by, my full report of the
> >> > problem being <20110430173905.GA25641@xxxxxx>.
> >> >
> >> > FWIW and if I understand correctly, my symptoms were caused by *two*
> >> > different bugs:
> >> > a) the 54 bits wraparound at 208 days that Peter fixed above,
> >> > b) a kernel crash at ~215 days related to RT tasks, fixed by
> >> > 305e6835e05513406fa12820e40e4a8ecb63743c (already in -stable).
> >>
> >> So, what do I do here as part of the .32-longterm kernel? Is there a
> >> fix that is in Linus's tree that I need to apply here?
> >>
> >> confused,
> >
> > Is this even pushed upstream? I checked Linus' tree and the proposed
> > patch is *not* merged there. I'm not really sure if it was fixed some
> > other way, though. I thought this was intended to be an "urgent" fix or
> > something?
> >
> > Regards,
> > Faidon
>
> I just had two crashes on two different machines, both with an uptime
> of 208 days.
> Both were 5520's running 2.6.34.8, but with a CONFIG_HZ of 1000
>
> 2011-10-23T16:49:18.618029+02:00 phy001 kernel: BUG: soft lockup -
> CPU#0 stuck for 17163091968s! [qemu-kvm:16949]

So were these actual crashes, or just softlockup false positives?

I had thought the earlier crash issue (div by zero) fix from PeterZ had
been already pushed upstream, but maybe that was just against 2.6.32 and
not 2.6.33?

The softlockup false positive issue should have been fixed by Peter's
"x86, intel: Don't mark sched_clock() as stable" below. But I'm not
seeing it upstream. Peter, is this still the right fix?

thanks
-john


From: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Subject: x86, intel: Don't mark sched_clock() as stable

Because the x86 sched_clock() implementation wraps at 54 bits and the
scheduler code assumes it wraps at the full 64bits we can get into
trouble after 208 days (~7 months) of uptime.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
---
arch/x86/kernel/cpu/intel.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index ed6086e..c8dc48b 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -91,8 +91,15 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
+		/*
+		 * Unfortunately our __cycles_2_ns() implementation makes
+		 * the raw sched_clock() interface wrap at 54-bits, which
+		 * makes it unsuitable for direct use, so disable this
+		 * for now.
+		 *
 		if (!check_tsc_unstable())
 			sched_clock_stable = 1;
+		 */
 	}
 
 	/*


