Re: 2.6.32.21 - uptime related crashes?
From: Willy Tarreau
Date: Sat Apr 30 2011 - 16:16:09 EST
Hi Faidon,
On Sat, Apr 30, 2011 at 08:39:05PM +0300, Faidon Liambotis wrote:
> If I may add some, hopefully useful, information to the thread:
>
> At work we have an HP c7000 blade enclosure. It's populated with 8 ProLiant
> BL460c G1 (Xeon E5405, constant_tsc but not nonstop_tsc) and 4 ProLiant
> BL460c
> G6 (Xeon E5504, constant_tsc + nonstop_tsc). All were booted on the same day
> with Debian's 2.6.32-23~bpo50+1 kernel, i.e. upstream 2.6.32.21.
>
> We too experienced problems with just the G6 blades at near 215 days
> uptime (on the 19th of April), all at the same time. From our
> investigation, it seems that their cpu_clocks suddenly jumped far into the
> future and then almost immediately rolled over by wrapping around
> 64 bits.
>
> Although the clocks of all the G6s wrapped around *at the same time*, only
> one of them actually crashed right then, with a second one crashing just a
> few days later, on the 28th.
>
> Three of them had the following on their logs:
> Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> present
> Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> stuck for 17163091968s! [kvm:25913]
> [...]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ?
> system_call_fastpath+0x16/0x1b
> Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with
> own address as source address
> (full oops at the end of this mail)
>
> Note that the stuck time of 17163091968s was *the same* (+/- 1s) on *all
> three of them*.
>
> What's also very strange, although we're not sure whether it's related, is
> that when we took the first line of the above log entry and subtracted
> 17966378.581971 from Apr 18 20:56:07, the resulting boot time
> differed by several hours from the actual boot time (the latter
> verified both with syslog and /proc/stat's btime, which were in
> agreement). This was verified post-mortem too, with the date checked to
> be correct.
Well, your report is by far the most complete of the three we have, since
you managed to capture this trace of the wrapping time.
I don't know if it means anything to anyone, but 17163091968 is exactly
0x3FF000000, or 1023*2^24 (i.e. 2^34 - 2^24). Too round to be a coincidence.
It falls just 2^24 seconds short of 1000 * 2^32 jiffies at 250 Hz (2^34 seconds).
> IOW, we have some serious clock drift (calculated at runtime to be
> ~0.1s/min) on these machines, which probably hasn't been apparent
> since they all run NTP. Moreover, we also saw that drift on other
> machines (1U, different vendor, different data center, E5520 CPUs) but
> not on the G1 blades. Our investigation showed that the drift is there
> constantly (it's not a one-time event), but we're not really sure if it's
> related to the Apr 18th jump event. Note that the drift is there even
> if we boot with "clocksource=hpet" but disappears when booting with
> "notsc". Also note that we've verified with 2.6.38 and the drift is
> still there.
>
> Regards,
> Faidon
>
> 1: The full backtrace is:
> Apr 18 20:56:07 hn-05 kernel: [17966378.581971] tap0: no IPv6 routers
> present
> Apr 19 06:25:02 hn-05 kernel: imklog 3.18.6, log source = /proc/kmsg
> started.
> Apr 19 10:15:42 hn-05 kernel: [18446743935.365550] BUG: soft lockup - CPU#4
> stuck for 17163091968s! [kvm:25913]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447056] Modules linked in:
> xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter
> ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
> nf_conntrack iptable_filter ip_tables x_tables 8021q
> garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff
> ipmi_devintf loop snd_pcm snd_timer snd soundcore bnx2x psmouse
> snd_page_alloc serio_raw sd_mod crc_t10dif
> hpilo ipmi_si ipmi_msghandler container evdev crc32c pcspkr shpchp
> pci_hotplug libcrc32c power_meter mdio
> button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
> dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx
> scsi_transport_fc usbcore nls_base tg3 libphy scsi_tgt
> scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447111] CPU 4:
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447112] Modules linked in:
> xt_pkttype ext2 tun kvm_intel kvm nf_conntrack_ipv6 ip6table_filter
> ip6_tables xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state
> nf_conntrack iptable_filter ip_tables x_tables 8021q
> garp bridge stp bonding dm_round_robin dm_multipath scsi_dh ipmi_poweroff
> ipmi_devintf loop snd_pcm snd_timer snd soundcore bnx2x psmouse
> snd_page_alloc serio_raw sd_mod crc_t10dif hpilo ipmi_si ipmi_msghandler
> container evdev crc32c pcspkr shpchp pci_hotplug libcrc32c power_meter mdio
> button processor ext3 jbd mbcache dm_mirror dm_region_hash dm_log
> dm_snapshot dm_mod usbhid hid cciss ehci_hcd uhci_hcd qla2xxx
> scsi_transport_fc usbcore nls_base tg3 libphy scsi_tgt scsi_mod thermal fan
> thermal_sys [last unloaded: scsi_wait_scan]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447154] Pid: 25913, comm: kvm
> Not tainted 2.6.32-bpo.5-amd64 #1 ProLiant BL460c G6
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447157] RIP:
> 0010:[<ffffffffa02e209a>] [<ffffffffa02e209a>]
> kvm_arch_vcpu_ioctl_run+0x785/0xa44 [kvm]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447177] RSP:
> 0018:ffff88070065fd38 EFLAGS: 00000202
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447179] RAX: ffff88070065ffd8
> RBX: ffff8804154ba860 RCX: ffff8804154ba860
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447182] RDX: ffff8804154ba8b9
> RSI: ffff81004c818b10 RDI: ffff8100291a7d48
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447184] RBP: ffffffff8101166e
> R08: 0000000000000000 R09: 0000000000000000
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447187] R10: 00007f1b2a2af078
> R11: ffffffff802f3405 R12: 0000000000000001
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447189] R13: 00000000154ba8b8
> R14: ffff8804136ac000 R15: ffff8804154bbd48
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447192] FS:
> 0000000042cbb950(0000) GS:ffff88042e440000(0000) knlGS:0000000000000000
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447195] CS: 0010 DS: 002b ES:
> 002b CR0: 0000000080050033
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447197] CR2: 0000000000000008
> CR3: 0000000260c41000 CR4: 00000000000026e0
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447200] DR0: 0000000000000000
> DR1: 0000000000000000 DR2: 0000000000000000
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447202] DR3: 0000000000000000
> DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447205] Call Trace:
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447215] [<ffffffffa02e2004>] ?
> kvm_arch_vcpu_ioctl_run+0x6ef/0xa44 [kvm]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447224] [<ffffffff8100f79c>] ?
> __switch_to+0x285/0x297
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447231] [<ffffffffa02d49d1>] ?
> kvm_vcpu_ioctl+0xf1/0x4e6 [kvm]
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447237] [<ffffffff810240da>] ?
> lapic_next_event+0x18/0x1d
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447245] [<ffffffff8106fb77>] ?
> tick_dev_program_event+0x2d/0x95
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447251] [<ffffffff81047e29>] ?
> finish_task_switch+0x3a/0xaf
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447258] [<ffffffff810f9f5a>] ?
> vfs_ioctl+0x21/0x6c
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447261] [<ffffffff810fa4a8>] ?
> do_vfs_ioctl+0x48d/0x4cb
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447268] [<ffffffff81063fcd>] ?
> sys_timer_settime+0x233/0x283
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447272] [<ffffffff810fa537>] ?
> sys_ioctl+0x51/0x70
> Apr 19 10:15:42 hn-05 kernel: [18446743935.447275] [<ffffffff81010b42>] ?
> system_call_fastpath+0x16/0x1b
> Apr 19 10:18:32 hn-05 kernel: [ 31.587025] bond0.13: received packet with
> own address as source address
Regards,
Willy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/