Re: [stable] 2.6.32.21 - uptime related crashes?

From: Ruben Kerkhof
Date: Sun Oct 23 2011 - 14:31:57 EST


On Mon, Sep 5, 2011 at 01:26, Faidon Liambotis <paravoid@xxxxxxxxxx> wrote:
> On Tue, Aug 30, 2011 at 03:38:29PM -0700, Greg KH wrote:
>> On Thu, Aug 25, 2011 at 09:56:16PM +0300, Faidon Liambotis wrote:
>> > On Thu, Jul 21, 2011 at 08:45:25PM +0200, Ingo Molnar wrote:
>> > > * Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> > >
>> > > > On Thu, 2011-07-21 at 14:50 +0200, Nikola Ciprich wrote:
>> > > > > thanks for the patch! I'll put this on our testing boxes...
>> > > >
>> > > > With a patch that frobs the starting value close to overflowing I hope,
>> > > > otherwise we'll not hear from you in like 7 months ;-)
>> > > >
>> > > > > Are You going to push this upstream so we can ask Greg to push this to
>> > > > > -stable?
>> > > >
>> > > > Yeah, I think we want to commit this with a -stable tag, Ingo?
>> > >
>> > > yeah - and we also want a Reported-by tag and an explanation of how
>> > > it can crash and why it matters in practice. I can then stick it into
>> > > the urgent branch for Linus. (probably will only hit upstream in the
>> > > merge window though.)
>> >
>> > Has this been pushed or has the problem been solved somehow? Time is
>> > against us on this bug as more boxes will crash as they reach 200 days
>> > of uptime...
>> >
>> > In any case, feel free to use me as a Reported-by, my full report of the
>> > problem being <20110430173905.GA25641@xxxxxx>.
>> >
>> > FWIW and if I understand correctly, my symptoms were caused by *two*
>> > different bugs:
>> > a) the 54 bits wraparound at 208 days that Peter fixed above,
>> > b) a kernel crash at ~215 days related to RT tasks, fixed by
>> > 305e6835e05513406fa12820e40e4a8ecb63743c (already in -stable).
>>
>> So, what do I do here as part of the .32-longterm kernel? Is there a
>> fix that is in Linus's tree that I need to apply here?
>>
>> confused,
>
> Is this even pushed upstream? I checked Linus' tree and the proposed
> patch is *not* merged there. I'm not really sure if it was fixed some
> other way, though. I thought this was intended to be an "urgent" fix or
> something?
>
> Regards,
> Faidon

I just had two crashes on two different machines, both with an uptime
of 208 days.
Both were 5520s running 2.6.34.8, but with a CONFIG_HZ of 1000.

2011-10-23T16:49:18.618029+02:00 phy001 kernel: BUG: soft lockup -
CPU#0 stuck for 17163091968s! [qemu-kvm:16949]
2011-10-23T16:49:18.618054+02:00 phy001 kernel: Modules linked in:
xt_limit ebt_log ebt_limit ebt_arp ebtable_filter ebtable_nat ebtables
ufs nls_utf8 tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q
garp stp llc bonding xt_comment xt_recent ip6t_REJECT
nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm
ioatdma i2c_i801 igb iTCO_wdt dca iTCO_vendor_support serio_raw
i2c_core 3w_9xxx [last unloaded: scsi_wait_scan]
2011-10-23T16:49:18.618060+02:00 phy001 kernel: CPU 0
2011-10-23T16:49:18.618068+02:00 phy001 kernel: Modules linked in:
xt_limit ebt_log ebt_limit ebt_arp ebtable_filter ebtable_nat ebtables
ufs nls_utf8 tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q
garp stp llc bonding xt_comment xt_recent ip6t_REJECT
nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm
ioatdma i2c_i801 igb iTCO_wdt dca iTCO_vendor_support serio_raw
i2c_core 3w_9xxx [last unloaded: scsi_wait_scan]
2011-10-23T16:49:18.618072+02:00 phy001 kernel:
2011-10-23T16:49:18.618077+02:00 phy001 kernel: Pid: 16949, comm:
qemu-kvm Tainted: G M 2.6.34.8-68.local.fc13.x86_64 #1
X8DTU/X8DTU
2011-10-23T16:49:18.618083+02:00 phy001 kernel: RIP:
0010:[<ffffffffa007f92f>] [<ffffffffa007f92f>]
kvm_arch_vcpu_ioctl_run+0x764/0xa74 [kvm]
2011-10-23T16:49:18.618086+02:00 phy001 kernel: RSP:
0018:ffff880bafa29d18 EFLAGS: 00000202
2011-10-23T16:49:18.618088+02:00 phy001 kernel: RAX: ffff880002000000
RBX: ffff880bafa29dc8 RCX: ffff8805e45128a0
2011-10-23T16:49:18.618091+02:00 phy001 kernel: RDX: 000000000000cb80
RSI: 0000000004b2a3a0 RDI: 000000000b630000
2011-10-23T16:49:18.618093+02:00 phy001 kernel: RBP: ffffffff8100a60e
R08: 000000000000002b R09: 00000000760d0735
2011-10-23T16:49:18.618095+02:00 phy001 kernel: R10: 0000000000000000
R11: 0000000000000000 R12: 0000000000000001
2011-10-23T16:49:18.618097+02:00 phy001 kernel: R13: ffff880bafa29cc8
R14: ffffffffa007b536 R15: ffff880bafa29ca8
2011-10-23T16:49:18.618100+02:00 phy001 kernel: FS:
00007fe92cd38700(0000) GS:ffff880002000000(0000)
knlGS:fffff880009b8000
2011-10-23T16:49:18.618102+02:00 phy001 kernel: CS: 0010 DS: 002b ES:
002b CR0: 0000000080050033
2011-10-23T16:49:18.618104+02:00 phy001 kernel: CR2: 00000000c1a00044
CR3: 00000006b3f2e000 CR4: 00000000000026e0
2011-10-23T16:49:18.618107+02:00 phy001 kernel: DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
2011-10-23T16:49:18.618109+02:00 phy001 kernel: DR3: 0000000000000000
DR6: 00000000ffff0ff0 DR7: 0000000000000400
2011-10-23T16:49:18.618112+02:00 phy001 kernel: Process qemu-kvm (pid:
16949, threadinfo ffff880bafa28000, task ffff880c242e0000)
2011-10-23T16:49:18.618114+02:00 phy001 kernel: Stack:
2011-10-23T16:49:18.618116+02:00 phy001 kernel: ffff88077b1a3ca8
ffffffff81d3cf38 ffff8805e4513f00 ffff880c242e0000
2011-10-23T16:49:18.618119+02:00 phy001 kernel: <0> ffff880c242e0000
ffff880bafa29fd8 ffff8805e4513ef8 0000000000015fd0
2011-10-23T16:49:18.618121+02:00 phy001 kernel: <0> 000000000000cb80
ffff880c242e0000 ffff880bafa28000 ffff880ab43f4038
2011-10-23T16:49:18.618123+02:00 phy001 kernel: Call Trace:
2011-10-23T16:49:18.618126+02:00 phy001 kernel: [<ffffffffa006e5ba>] ?
kvm_vcpu_ioctl+0xfd/0x56e [kvm]
2011-10-23T16:49:18.618129+02:00 phy001 kernel: [<ffffffff81011252>] ?
__switch_to_xtra+0x121/0x141
2011-10-23T16:49:18.618131+02:00 phy001 kernel: [<ffffffff8111ad5f>] ?
vfs_ioctl+0x32/0xa6
2011-10-23T16:49:18.618134+02:00 phy001 kernel: [<ffffffff8111b2d2>] ?
do_vfs_ioctl+0x483/0x4c9
2011-10-23T16:49:18.618137+02:00 phy001 kernel: [<ffffffff8111b36e>] ?
sys_ioctl+0x56/0x79
2011-10-23T16:49:18.618139+02:00 phy001 kernel: [<ffffffff81009c72>] ?
system_call_fastpath+0x16/0x1b
2011-10-23T16:49:18.618142+02:00 phy001 kernel: Code: df ff 90 48 01
00 00 48 8b 55 90 65 48 8b 04 25 90 e8 00 00 f6 04 10 aa 74 05 e8 05
06 f9 e0 f0 41 80 0f 02 fb 66 0f 1f 44 00 00 <ff> 83 b0 00 00 00 48 8b
b5 68 ff ff ff 83 66 14 ef 48 8b 3b 48
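
As a sanity check on the 208-day figure, here is a rough sketch of the
arithmetic, assuming the "54 bits wraparound" mentioned earlier in the
thread refers to a nanosecond counter with 54 usable bits. The helper
name wrap_days is made up for illustration; nothing here is taken from
the kernel source.

```python
# Sketch only: assumes a nanosecond counter with a limited number of
# usable bits, as suggested by "the 54 bits wraparound at 208 days".
NSEC_PER_SEC = 10**9
SEC_PER_DAY = 86_400


def wrap_days(bits: int) -> float:
    """Days until a nanosecond counter with `bits` usable bits wraps."""
    return (1 << bits) / NSEC_PER_SEC / SEC_PER_DAY


print(f"54-bit ns counter wraps after {wrap_days(54):.2f} days")

# The "stuck for" value from the soft lockup message above.
STUCK_SECONDS = 17163091968

# Pure arithmetic observation: the value is exactly 2**34 - 2**24.
print(STUCK_SECONDS == 2**34 - 2**24)
```

The first line printed is about 208.5 days, which matches the uptime on
both machines. The second check shows the reported stuck time is a
difference of two powers of two (roughly 544 years), which looks more
like the result of a wrapped timestamp than a real duration, though
that reading is my own guess.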

Can the necessary fix please be pushed upstream?

Kind regards,

Ruben Kerkhof