Re: [linux-next][DLPAR CPU][Oops] Kernel crash with CPU hotunplug

From: Michael Ellerman
Date: Thu Oct 05 2017 - 20:33:15 EST

Abdul Haleem <abdhalee@xxxxxxxxxxxxxxxxxx> writes:

> Hi,
> linux-next kernel panic while DLPAR CPU add/remove operation in a loop.
> Test: CPU hot-unplug
> Machine Type: Power8 PowerVM LPAR
> kernel: 4.14.0-rc2-next-20170928
> gcc : 5.2.1
> trace logs
> ----------
> cpu 10 (hwid 10) Ready to die...
> cpu 11 (hwid 11) Ready to die...
> cpu 12 (hwid 12) Ready to die...
> cpu 13 (hwid 13) Ready to die...
> cpu 14 (hwid 14) Ready to die...
> cpu 15 (hwid 15) Ready to die...
> Unable to handle kernel paging request for data at address 0xdead4ead00000030

That's SPINLOCK_MAGIC (0xdead4ead) in the upper 32 bits, plus 0x30.

> Faulting instruction address: 0xc000000001af38e4
> Oops: Kernel access of bad area, sig: 11 [#1]
> LE SMP NR_CPUS=2048 NUMA pSeries
> Modules linked in: rpadlpar_io rpaphp bridge stp llc xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack nfnetlink iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_filter vmx_crypto pseries_rng rng_core binfmt_misc nfsd ip_tables x_tables autofs4
> CPU: 7 PID: 10657 Comm: systemd-udevd Not tainted 4.14.0-rc2-next-20170928-autotest #1
> task: c000000271b7cc00 task.stack: c00000026d504000
> NIP: c000000001af38e4 LR: c000000001af3b48 CTR: c000000001af4270
> REGS: c00000026d5079e0 TRAP: 0380 Not tainted (4.14.0-rc2-next-20170928-autotest)
> MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22008882 XER: 20000000
> CFAR: c000000001af3b44 SOFTE: 1
> GPR00: c000000001af3b48 c00000026d507c60 c000000003572500 c00000026c0d4a80
> GPR04: c00000026c0d4a80 c00000026b56b310 c0000000037d2500 dead4ead00000030
> GPR08: 00000000000016f0 fffffffffffffff0 dead4ead00000000 c000000270b24420
> GPR12: c000000001af4270 c00000000fdc1f80 00000000000029a3 000000000aba9500
> GPR16: 000001000e4134f0 000000000aba9500 000000000000000f 0000000000000001
> GPR20: 0000000120ff68d8 0000000120ff68d0 0000000120ff6a48 0000000120ff33f0
> GPR24: 0000000120ff6550 c00000026b56b310 c00000027286d9b8 c0000000037d4d88
> GPR28: c0000002727b17a0 c00000026c0d4a80 c00000027286da38 c00000026c0d4a80
> NIP [c000000001af38e4] free_pipe_info+0x64/0x200
> LR [c000000001af3b48] put_pipe_info+0xc8/0x140
> Call Trace:
> [c00000026d507c60] [c00000027286da38] 0xc00000027286da38 (unreliable)
> [c00000026d507ca0] [c000000001af3b48] put_pipe_info+0xc8/0x140
> [c00000026d507ce0] [c000000001af43fc] pipe_release+0x18c/0x1e0
> [c00000026d507d20] [c000000001ae0efc] __fput+0x12c/0x4f0
> [c00000026d507d80] [c000000001ae12ec] ____fput+0x2c/0x50
> [c00000026d507da0] [c00000000178eb3c] task_work_run+0x17c/0x200
> [c00000026d507e00] [c00000000160adb8] do_notify_resume+0x1f8/0x220
> [c00000026d507e30] [c0000000015ebec4] ret_from_except_lite+0x70/0x74
> Instruction dump:
> 81230070 e94300b0 39080001 7d2900d0 38ea0030 f9066d98 7c0004ac 3d020026
> e9086da0 3cc20026 39080001 f9066da0 <7d0038a8> 7d094214 7d0039ad 40c2fff4

Which is:
lwz r9,112(r3)
ld r10,176(r3) # r3 = struct pipe_inode_info *pipe, r10 = pipe->user
addi r8,r8,1
neg r9,r9
addi r7,r10,48 # r7 = &(pipe->user->pipe_bufs)
std r8,28056(r6)
addis r8,r2,38
ld r8,28064(r8)
addis r6,r2,38
addi r8,r8,1
std r8,28064(r6)
ldarx r8,0,r7 <- fault
add r8,r9,r8
stdcx. r8,0,r7

Which is the atomic_long_add_return() in account_pipe_buffers().

From the regs we can see:
r3 = c00000026c0d4a80
r7 = dead4ead00000030
r10 = dead4ead00000000

So pipe->user, instead of being a pointer to a user_struct, was actually
part of a spinlock.

There isn't a spinlock in struct pipe_inode_info, so probably pipe is
not actually a pointer to a struct pipe_inode_info at all.

There's not much more to go on, so memory corruption is my best guess.
Can you run with SLUB debugging on?