[REGRESSION v4.7] i915 / drm crash when undocking from DP monitors

From: Vadim Lobanov
Date: Fri Oct 07 2016 - 21:28:09 EST


Hi folks,

I'm seeing a repeatable crash on my HP EliteBook 840 G2/2216. To
reproduce: boot the laptop in a docking station connected to two
external DisplayPort monitors, undock it, and then either log out or
shut down. It crashes regardless of whether I've redocked it
beforehand or not. Both logout and shutdown work fine if I never
undock the laptop, so the badness correlates with the DP monitors
going away.

This is a regression introduced somewhere in the v4.6 -> v4.7
development timeframe: 4.6.0 works, 4.7.0 fails as described, and
4.8.0 crashes even earlier, at undock time.

The graphics hardware involved is:

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 5500 (rev 09) (prog-if 00 [VGA controller])
	Subsystem: Hewlett-Packard Company ZBook 15u G2 Mobile Workstation
	Flags: bus master, fast devsel, latency 0, IRQ 49
	Memory at c0000000 (64-bit, non-prefetchable) [size=16M]
	Memory at b0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 5000 [size=64]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [d0] Power Management version 2
	Capabilities: [a4] PCI Advanced Features
	Kernel driver in use: i915
	Kernel modules: i915

And the crash that I see is similar to this:

Oct 07 17:47:16 localhost.localdomain kernel: BUG: unable to handle kernel paging request at 0000000000018c70
Oct 07 17:47:16 localhost.localdomain kernel: IP: [<ffffffff960ecd48>] queued_spin_lock_slowpath+0x108/0x190
Oct 07 17:47:16 localhost.localdomain kernel: PGD 0
Oct 07 17:47:16 localhost.localdomain kernel: Oops: 0002 [#1] SMP
Oct 07 17:47:16 localhost.localdomain kernel: Modules linked in: rfcomm ccm xt_CHECKSUM tun ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_
Oct 07 17:47:16 localhost.localdomain kernel: sparse_keymap ppdev irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iwlwifi intel_cstate intel_uncore intel_rapl_perf cfg80211 joydev uvcvideo lpc_ich r
Oct 07 17:47:16 localhost.localdomain kernel: CPU: 2 PID: 855 Comm: systemd-logind Not tainted 4.7.5-200.fc24.x86_64 #1
Oct 07 17:47:16 localhost.localdomain kernel: Hardware name: Hewlett-Packard HP EliteBook 840 G2/2216, BIOS M71 Ver. 01.04 02/24/2015
Oct 07 17:47:16 localhost.localdomain kernel: task: ffff88043a120000 ti: ffff880035d50000 task.ti: ffff880035d50000
Oct 07 17:47:16 localhost.localdomain kernel: RIP: 0010:[<ffffffff960ecd48>] [<ffffffff960ecd48>] queued_spin_lock_slowpath+0x108/0x190
Oct 07 17:47:16 localhost.localdomain kernel: RSP: 0018:ffff880035d53908 EFLAGS: 00010202
Oct 07 17:47:16 localhost.localdomain kernel: RAX: 0000000000018c70 RBX: ffff880438716a50 RCX: ffff88044f498c40
Oct 07 17:47:16 localhost.localdomain kernel: RDX: 0000000000001b9a RSI: 000000006e6f746f RDI: ffff880438716a54
Oct 07 17:47:16 localhost.localdomain kernel: RBP: ffff880035d53908 R08: 00000000000c0000 R09: 0000000000000000
Oct 07 17:47:16 localhost.localdomain kernel: R10: ffff880096e4e780 R11: 0000000000000898 R12: ffff88043ab3ec40
Oct 07 17:47:16 localhost.localdomain kernel: R13: ffff880438716a58 R14: ffff880427ebd800 R15: ffff8804396bd000
Oct 07 17:47:16 localhost.localdomain kernel: FS: 00007f22e2cb5900(0000) GS:ffff88044f480000(0000) knlGS:0000000000000000
Oct 07 17:47:16 localhost.localdomain kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 07 17:47:16 localhost.localdomain kernel: CR2: 0000000000018c70 CR3: 000000043a095000 CR4: 00000000003406e0
Oct 07 17:47:16 localhost.localdomain kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 07 17:47:16 localhost.localdomain kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 07 17:47:16 localhost.localdomain kernel: Stack:
Oct 07 17:47:16 localhost.localdomain kernel: ffff880035d53918 ffffffff967ec350 ffff880035d53940 ffffffff967e9f2f
Oct 07 17:47:16 localhost.localdomain kernel: ffff88043ab3ec40 ffff880438716a50 ffff880438714800 ffff880035d53970
Oct 07 17:47:16 localhost.localdomain kernel: ffffffffc00a155e ffff880427e49800 ffff880438716800 ffff880427ebd800
Oct 07 17:47:16 localhost.localdomain kernel: Call Trace:
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff967ec350>] _raw_spin_lock+0x20/0x30
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff967e9f2f>] __ww_mutex_lock+0x6f/0xa0
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc00a155e>] drm_modeset_lock+0x4e/0xd0 [drm]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc00a2044>] drm_atomic_get_connector_state+0x34/0x1c0 [drm]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc014ff90>] __drm_atomic_helper_set_config+0x2a0/0x360 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc01511da>] restore_fbdev_mode+0x22a/0x260 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc01535d4>] drm_fb_helper_restore_fbdev_mode_unlocked+0x34/0x80 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc015364d>] drm_fb_helper_set_par+0x2d/0x50 [drm_kms_helper]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffffc023da4a>] intel_fbdev_set_par+0x1a/0x60 [i915]
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9645a6b6>] fb_set_var+0x236/0x460
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff960d98e8>] ? enqueue_task_fair+0xa8/0x960
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff961bf0df>] ? free_hot_cold_page_list+0x3f/0xa0
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9645074f>] fbcon_blank+0x30f/0x350
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9624c200>] ? chrdev_open+0xb0/0x180
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff964db0b2>] do_unblank_screen+0xd2/0x1a0
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff964d0ef6>] vt_ioctl+0x4f6/0x1270
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9625abf9>] ? fasync_remove_entry+0x29/0xb0
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff964c537a>] tty_ioctl+0x35a/0xc50
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff964cdf79>] ? tty_unlock+0x29/0x50
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9625f909>] ? dput+0xd9/0x260
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff96268ae4>] ? mntput+0x24/0x40
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9625b4b2>] do_vfs_ioctl+0xa2/0x5d0
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff962497ce>] ? ____fput+0xe/0x10
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff960be9b8>] ? task_work_run+0x88/0xb0
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff9625ba59>] SyS_ioctl+0x79/0x90
Oct 07 17:47:16 localhost.localdomain kernel: [<ffffffff967ec572>] entry_SYSCALL_64_fastpath+0x1a/0xa4
Oct 07 17:47:16 localhost.localdomain kernel: Code: 02 89 c2 45 31 c9 c1 e2 10 85 d2 74 41 c1 ea 12 83 e0 03 83 ea 01 48 c1 e0 04 48 63 d2 48 05 40 8c 01 00 48 03 04 d5 40 58 d3 96 <48> 89 08 8b 41 08 85 c0 75 0
Oct 07 17:47:16 localhost.localdomain kernel: RIP [<ffffffff960ecd48>] queued_spin_lock_slowpath+0x108/0x190
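
For anyone reading the register dump, here is a rough sketch of one way
to interpret it; it is a guess, not a diagnosis. It assumes that RSI
holds the spinlock word passed into queued_spin_lock_slowpath(), and
that the qspinlock layout is the one used when NR_CPUS < 16K (locked
byte, pending byte, 2-bit tail index, then tail CPU stored as cpu + 1).
The helper names decode_qspinlock() and as_ascii() are made up for
illustration.

```python
# Sketch only: field offsets are an ASSUMPTION based on the qspinlock
# layout for NR_CPUS < 16K; RSI being the lock word is also an assumption.
Q_TAIL_IDX_OFFSET = 16
Q_TAIL_IDX_MASK = 0x3 << Q_TAIL_IDX_OFFSET
Q_TAIL_CPU_OFFSET = 18


def decode_qspinlock(val):
    """Split a 32-bit qspinlock value into its fields."""
    return {
        "locked": val & 0xFF,
        "pending": (val >> 8) & 0xFF,
        "tail_idx": (val & Q_TAIL_IDX_MASK) >> Q_TAIL_IDX_OFFSET,
        "tail_cpu": (val >> Q_TAIL_CPU_OFFSET) - 1,  # stored as cpu + 1
    }


def as_ascii(val):
    """Render the word's four little-endian bytes as text ('.' if unprintable)."""
    raw = val.to_bytes(4, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


rsi = 0x6E6F746F  # RSI from the oops above (assumed to be the lock word)
rdx = 0x1B9A      # RDX from the oops above

fields = decode_qspinlock(rsi)
print(fields)
print("lock word as ASCII:", as_ascii(rsi))
print("decoded tail CPU equals RDX:", fields["tail_cpu"] == rdx)
```

Under those assumptions the decoded tail CPU comes out as 0x1b9a, which
is the value in RDX and far more CPUs than this laptop has, and the
four bytes of the word are all printable ASCII. That could mean the
lock sits in memory that has been freed and reused for something else
by the time logind touches it, but I have not confirmed that.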

I tried to bisect the crash, since it is nicely reproducible, but
that didn't yield much useful information: many of the intermediate
commits either did not build or produced a kernel that did not
recognize the DP monitors at all (it thought they were disconnected)
when booted.

Here's the bisect log so far; note the skips due to the issues described above:

git bisect start
# good: [2dcd0af568b0cf583645c8a317dd12e344b1c72a] Linux 4.6
git bisect good 2dcd0af568b0cf583645c8a317dd12e344b1c72a
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
# good: [0694f0c9e20c47063e4237e5f6649ae5ce5a369a] radix tree test suite: remove dependencies on height
git bisect good 0694f0c9e20c47063e4237e5f6649ae5ce5a369a
# bad: [e4f7bdc2ec0d0dcc27f7d70db27a620dfdc1f697] Merge branch 'for-4.7-zac' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/libata
git bisect bad e4f7bdc2ec0d0dcc27f7d70db27a620dfdc1f697
# good: [2f37dd131c5d3a2eac21cd5baf80658b1b02a8ac] Merge tag 'staging-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect good 2f37dd131c5d3a2eac21cd5baf80658b1b02a8ac
# bad: [2b669875332fbdff0a7ad559e8662e875e7a1526] drm/msm: Drop load/unload drm_driver ops
git bisect bad 2b669875332fbdff0a7ad559e8662e875e7a1526
# skip: [560ce1dc7c87ade27faaf07d381a9a5a2ffc9934] drm/i915: use drm_crtc_send_vblank_event()
git bisect skip 560ce1dc7c87ade27faaf07d381a9a5a2ffc9934
# good: [bf16200689118d19de1b8d2a3c314fc21f5dc7bb] Linux 4.6-rc3
git bisect good bf16200689118d19de1b8d2a3c314fc21f5dc7bb
# skip: [9cd47424fb410e478e5a97e83ac10263c13ed65c] drm/mode: reduce scope of fb_lock in framebuffer init
git bisect skip 9cd47424fb410e478e5a97e83ac10263c13ed65c
# skip: [187a1c07ec3c19d0c965f95741ed260bbc02040e] drm/i915: Fix oops in vlv_force_pll_on()
git bisect skip 187a1c07ec3c19d0c965f95741ed260bbc02040e
# good: [fbf6d8798fceb1f64eb0e5fd7cd541becfc376cd] drm/i915: Add locking to pll updates, v3.
git bisect good fbf6d8798fceb1f64eb0e5fd7cd541becfc376cd
# skip: [b5bf0f1ea3658254bd72ef64abc97786e8a32255] drm/exynos: clean up register definions for fimd and decon
git bisect skip b5bf0f1ea3658254bd72ef64abc97786e8a32255
# skip: [528948745f6f52f36839b76beeab0632a9f16471] drm/i915: Move gt/pm irq handling out from irq disabled section on VLV
git bisect skip 528948745f6f52f36839b76beeab0632a9f16471
# skip: [71cbf451eb2715865e3dbd0ec55837dac1148d23] drm/radeon: Use lockless gem BO free callback
git bisect skip 71cbf451eb2715865e3dbd0ec55837dac1148d23
# skip: [7c8f6d2577c7565f67ba3f6b9b76f7422710d66e] drm/mode: rework drm_mode_object_put to drm_mode_object_unregister.
git bisect skip 7c8f6d2577c7565f67ba3f6b9b76f7422710d66e
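
To make the state of the bisection easier to see at a glance, here is
a small sketch that tallies the verdicts in a log like the one above.
The function name tally_bisect_log() is made up for illustration, and
the sample string is only a three-command excerpt of the log, not the
whole thing.

```python
import re
from collections import Counter

# Matches the command lines that git writes into a bisect log; the
# "# good: [...]" comment lines are ignored on purpose.
BISECT_CMD = re.compile(r"^git bisect (good|bad|skip) ([0-9a-f]{40})$")


def tally_bisect_log(text):
    """Count good/bad/skip verdicts in the text of a git bisect log."""
    counts = Counter()
    for line in text.splitlines():
        match = BISECT_CMD.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Excerpt of the log above (first good, first bad, first skip).
sample = """\
git bisect start
# good: [2dcd0af568b0cf583645c8a317dd12e344b1c72a] Linux 4.6
git bisect good 2dcd0af568b0cf583645c8a317dd12e344b1c72a
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
git bisect skip 560ce1dc7c87ade27faaf07d381a9a5a2ffc9934
"""

for verdict, count in sorted(tally_bisect_log(sample).items()):
    print(verdict, count)
```

Run over the full log above, the tally is 5 good, 3 bad and 7 skip, so
nearly half of the steps could not be tested. The log itself can be
fed back to git with "git bisect replay" by anyone who wants to pick
up where I left off.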

I think I'm now at the point where it makes sense to raise this as
a general question. So, halp plz! :)

Thanks,

-- Vadim