[BUG] machine check Oops on Alpha

From: Bob Tracy
Date: Sun Apr 17 2016 - 17:30:28 EST


Apologies in advance for the "poor" quality of this bug report. No idea
how to proceed, because the issue historically has been intermittent to
non-existent for reasons unknown.

Within 24 hours of booting my Alpha (PWS 433au), I'm pretty much
guaranteed to see a "machine check" Oops, which typically occurs
during a period of high disk activity (for example, during an "apt-get
update / upgrade"). If I want a huge mess to clean up afterward, "git
pull" on the kernel source tree will generally suffice as well :-(.

As long as the "Oops" trace doesn't include evidence of filesystem write
activity (calls to ext3/4 functions), the machine is perfectly stable
afterward for as long as I care to let it run -- days, weeks, whatever
-- no further Oopses will occur, regardless of how hard I flog the
machine. A "bad" Oops will cause an immediate system lockup if any
process attempts to access the region of disk that was active at the
time the Oops occurred.
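
The "good" versus "bad" distinction above could be checked mechanically
against a saved log. What follows is only a minimal sketch: it assumes
syslog lines in the same "[<address>] symbol+0xoff/0xsize" form as the
attached trace, and it assumes that filesystem write activity shows up as
symbols starting with "ext3_" or "ext4_". The function names and the prefix
list are hypothetical, not part of any existing tool.

```python
import re

# Assumption: "evidence of filesystem write activity" means at least one
# symbol in the trace starts with one of these prefixes.
FS_PREFIXES = ("ext3_", "ext4_")

# Matches trace entries such as:
#   [<fffffc00003b70a0>] wp_page_copy.isra.100+0x3c0/0x620
TRACE_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(log_text):
    """Return the symbol names found in the trace, in order of appearance."""
    return TRACE_RE.findall(log_text)


def classify_oops(log_text):
    """Return ("bad", hits) if any ext3/ext4 symbol appears, else ("good", [])."""
    hits = [s for s in trace_symbols(log_text) if s.startswith(FS_PREFIXES)]
    return ("bad", hits) if hits else ("good", hits)
```

Feeding the text of a saved Oops to classify_oops() would report "good"
when no ext3/ext4 symbol appears anywhere in the trace.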

While a "machine check" is normally indicative of an underlying hardware
issue, the fact that this is a one-time-per-boot issue has me thinking
otherwise. I suspect a code path that is traversed prior to the Oops and
gets bypassed afterward. As previously mentioned, there have been
months-long intervals in the past where the issue has either been masked
or non-existent. Currently, the issue has persisted through several 4.X
kernel release candidates and releases.

Attached is an example of precisely what I'm talking about as far as a
"good" Oops. It occurred within a day of the last reboot, and the
machine has been running fine since. Been flogging the devil out of it,
too: lots of updates (hundreds of megabytes), kernel builds, etc.

While any and all help tracking this down will be appreciated, please
know that kernel rebuilds (to turn on debugging or for whatever reason)
are an overnight affair on this system. In other words, turnaround time
on diagnostic iterations involving kernel modifications will be slow.

--Bob

Apr 9 21:40:15 smirkin kernel: Unable to handle kernel paging request at virtual address 0000000000000010
Apr 9 21:40:15 smirkin kernel: dpkg-deb(19404): Oops 0
Apr 9 21:40:15 smirkin kernel: pc = [<fffffc0000316174>] ra = [<fffffc000031df78>] ps = 0007 Not tainted
Apr 9 21:40:15 smirkin kernel: pc is at process_mcheck_info+0x54/0x370
Apr 9 21:40:15 smirkin kernel: ra is at cia_machine_check+0x98/0xb0
Apr 9 21:40:15 smirkin kernel: v0 = 0000000000000004 t0 = 0000000000000000 t1 = 0000000000000001
Apr 9 21:40:15 smirkin kernel: t2 = 0000000000000630 t3 = fffffc0000d405f0 t4 = fffffc0000acf166
Apr 9 21:40:15 smirkin kernel: t5 = 00000000001fffff t6 = 00000000ffffffff t7 = fffffc005cf38000
Apr 9 21:40:15 smirkin kernel: s0 = 0000000000000000 s1 = fffffc0000c61750 s2 = 0000000000000000
Apr 9 21:40:15 smirkin kernel: s3 = 0000000000000000 s4 = fffffc0000cbcef0 s5 = fffffc0000d405d0
Apr 9 21:40:15 smirkin kernel: s6 = fffffc0000c7ef70
Apr 9 21:40:15 smirkin kernel: a0 = 0000000000000630 a1 = fffffc0000aca965 a2 = 0000000000000630
Apr 9 21:40:15 smirkin kernel: a3 = 0000000000000000 a4 = 0000000000000000 a5 = 0000000000000000
Apr 9 21:40:15 smirkin kernel: t8 = 000000000000001f t9 = fffffc0000acbb38 t10= fffffc0000d40608
Apr 9 21:40:15 smirkin kernel: t11= 0000000000000000 pv = fffffc0000316120 at = 0000000000800000
Apr 9 21:40:15 smirkin kernel: gp = fffffc0000cabb38 sp = fffffc005cf3b978
Apr 9 21:40:15 smirkin kernel: Disabling lock debugging due to kernel taint
Apr 9 21:40:15 smirkin kernel: Trace:
Apr 9 21:40:15 smirkin kernel: [<fffffc000031df78>] cia_machine_check+0x98/0xb0
Apr 9 21:40:15 smirkin kernel: [<fffffc0000316100>] do_entInt+0x1c0/0x1e0
Apr 9 21:40:15 smirkin kernel: [<fffffc0000311340>] ret_from_sys_call+0x0/0x10
Apr 9 21:40:15 smirkin kernel: [<fffffc0000398ea4>] get_page_from_freelist+0x504/0xa10
Apr 9 21:40:15 smirkin kernel: [<fffffc00005aa410>] clear_page+0x0/0xc4
Apr 9 21:40:15 smirkin kernel: [<fffffc00005aa428>] clear_page+0x18/0xc4
Apr 9 21:40:15 smirkin kernel: [<fffffc000039949c>] __alloc_pages_nodemask+0xec/0xa00
Apr 9 21:40:15 smirkin kernel: [<fffffc00003b70a0>] wp_page_copy.isra.100+0x3c0/0x620
Apr 9 21:40:15 smirkin kernel: [<fffffc00003b6d3c>] wp_page_copy.isra.100+0x5c/0x620
Apr 9 21:40:15 smirkin kernel: [<fffffc00003b8828>] do_wp_page.isra.102+0x128/0x640
Apr 9 21:40:15 smirkin kernel: [<fffffc00003b8758>] do_wp_page.isra.102+0x58/0x640
Apr 9 21:40:15 smirkin kernel: [<fffffc000036377c>] current_fs_time+0x4c/0x70
Apr 9 21:40:15 smirkin kernel: [<fffffc00003bac6c>] handle_mm_fault+0x73c/0x1180
Apr 9 21:40:15 smirkin kernel: [<fffffc00003bb4f8>] handle_mm_fault+0xfc8/0x1180
Apr 9 21:40:15 smirkin kernel: [<fffffc000036bbe0>] timekeeping_update+0x130/0x200
Apr 9 21:40:15 smirkin kernel: [<fffffc0000365790>] hrtimer_run_queues+0x50/0x210
Apr 9 21:40:15 smirkin kernel: [<fffffc000031ec30>] do_page_fault+0x150/0x500
Apr 9 21:40:15 smirkin kernel: [<fffffc00003bde68>] find_vma+0x28/0xc0
Apr 9 21:40:15 smirkin kernel: [<fffffc000031ebb4>] do_page_fault+0xd4/0x500
Apr 9 21:40:15 smirkin kernel: [<fffffc00003734fc>] tick_periodic.constprop.17+0x3c/0xc0
Apr 9 21:40:15 smirkin kernel: [<fffffc000031eb9c>] do_page_fault+0xbc/0x500
Apr 9 21:40:15 smirkin kernel: [<fffffc0000328244>] __do_softirq+0x184/0x310
Apr 9 21:40:15 smirkin kernel: [<fffffc0000310f7c>] entMM+0x9c/0xc0
Apr 9 21:40:15 smirkin kernel: [<fffffc0000315e8c>] handle_irq+0x8c/0xf0
Apr 9 21:40:15 smirkin kernel: [<fffffc0000315f9c>] do_entInt+0x5c/0x1e0
Apr 9 21:40:15 smirkin kernel:
Apr 9 21:40:15 smirkin kernel: Code: a53e0008 a55e0010 23de0020 6bfa8001 a55de018 47f00412 <a2890010> 261dffe2
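
The word in angle brackets on the "Code:" line is the instruction at the
faulting pc. As a rough sketch, it can be decoded with the standard Alpha
memory-instruction format (opcode in bits 31-26, Ra in bits 25-21, Rb in
bits 20-16, signed 16-bit displacement in bits 15-0). The helper name
decode_mem and the small opcode table are illustrative only, and the
register names follow the usual Alpha calling convention as used in the
register dump above.

```python
# Register names for $0..$31, matching the names in the Oops register dump.
ALPHA_REGS = (
    ["v0"]
    + ["t%d" % i for i in range(8)]        # $1-$8:   t0-t7
    + ["s%d" % i for i in range(7)]        # $9-$15:  s0-s6
    + ["a%d" % i for i in range(6)]        # $16-$21: a0-a5
    + ["t8", "t9", "t10", "t11"]           # $22-$25
    + ["ra", "pv", "at", "gp", "sp", "zero"]  # $26-$31
)

# A few memory-format opcodes; deliberately not a complete table.
MEM_OPCODES = {
    0x08: "lda", 0x09: "ldah",
    0x28: "ldl", 0x29: "ldq",
    0x2C: "stl", 0x2D: "stq",
}


def decode_mem(word):
    """Decode a 32-bit Alpha memory-format instruction word.

    Returns (mnemonic, ra_name, displacement, rb_name).
    Raises KeyError for opcodes missing from MEM_OPCODES.
    """
    opcode = (word >> 26) & 0x3F
    ra = (word >> 21) & 0x1F
    rb = (word >> 16) & 0x1F
    disp = word & 0xFFFF
    if disp & 0x8000:          # sign-extend the 16-bit displacement
        disp -= 0x10000
    return (MEM_OPCODES[opcode], ALPHA_REGS[ra], disp, ALPHA_REGS[rb])


print(decode_mem(0xA2890010))  # ('ldl', 'a4', 16, 's0')
```

That is, the faulting instruction is "ldl a4, 16(s0)". With s0 =
0000000000000000 in the register dump, the load address is 0 + 16 = 0x10,
which matches the reported virtual address 0000000000000010. In other
words, the Oops itself looks like a NULL pointer dereference inside
process_mcheck_info while it was handling the machine check.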