[PATCH 4.9 07/31] drm/i915: Avoid rcu_barrier() from reclaim paths (shrinker)

From: Greg Kroah-Hartman
Date: Sun Apr 16 2017 - 04:13:33 EST


4.9-stable review patch. If anyone has any objections, please let me know.

------------------

From: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>

commit 3d3d18f086cdda72ee18a454db70ca72c6e3246c upstream.

The rcu_barrier() takes the cpu_hotplug mutex which itself is not
reclaim-safe, and so rcu_barrier() is illegal from inside the shrinker.

[ 309.661373] =========================================================
[ 309.661376] [ INFO: possible irq lock inversion dependency detected ]
[ 309.661380] 4.11.0-rc1-CI-CI_DRM_2333+ #1 Tainted: G W
[ 309.661383] ---------------------------------------------------------
[ 309.661386] gem_exec_gttfil/6435 just changed the state of lock:
[ 309.661389] (rcu_preempt_state.barrier_mutex){+.+.-.}, at: [<ffffffff81100731>] _rcu_barrier+0x31/0x160
[ 309.661399] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 309.661402] (cpu_hotplug.lock){+.+.+.}
[ 309.661404]

and interrupts could create inverse lock ordering between them.

[ 309.661410]
other info that might help us debug this:
[ 309.661414] Possible interrupt unsafe locking scenario:

[ 309.661417] CPU0 CPU1
[ 309.661419] ---- ----
[ 309.661421] lock(cpu_hotplug.lock);
[ 309.661425] local_irq_disable();
[ 309.661432] lock(rcu_preempt_state.barrier_mutex);
[ 309.661441] lock(cpu_hotplug.lock);
[ 309.661446] <Interrupt>
[ 309.661448] lock(rcu_preempt_state.barrier_mutex);
[ 309.661453]
*** DEADLOCK ***

[ 309.661460] 4 locks held by gem_exec_gttfil/6435:
[ 309.661464] #0: (sb_writers#10){.+.+.+}, at: [<ffffffff8120d83d>] vfs_write+0x17d/0x1f0
[ 309.661475] #1: (debugfs_srcu){......}, at: [<ffffffff81320491>] debugfs_use_file_start+0x41/0xa0
[ 309.661486] #2: (&attr->mutex){+.+.+.}, at: [<ffffffff8123a3e7>] simple_attr_write+0x37/0xe0
[ 309.661495] #3: (&dev->struct_mutex){+.+.+.}, at: [<ffffffffa0091b4a>] i915_drop_caches_set+0x3a/0x150 [i915]
[ 309.661540]
the shortest dependencies between 2nd lock and 1st lock:
[ 309.661547] -> (cpu_hotplug.lock){+.+.+.} ops: 829 {
[ 309.661553] HARDIRQ-ON-W at:
[ 309.661560] __lock_acquire+0x5e5/0x1b50
[ 309.661565] lock_acquire+0xc9/0x220
[ 309.661572] __mutex_lock+0x6e/0x990
[ 309.661576] mutex_lock_nested+0x16/0x20
[ 309.661583] get_online_cpus+0x61/0x80
[ 309.661590] kmem_cache_create+0x25/0x1d0
[ 309.661596] debug_objects_mem_init+0x30/0x249
[ 309.661602] start_kernel+0x341/0x3fe
[ 309.661607] x86_64_start_reservations+0x2a/0x2c
[ 309.661612] x86_64_start_kernel+0x173/0x186
[ 309.661619] verify_cpu+0x0/0xfc
[ 309.661622] SOFTIRQ-ON-W at:
[ 309.661627] __lock_acquire+0x611/0x1b50
[ 309.661632] lock_acquire+0xc9/0x220
[ 309.661636] __mutex_lock+0x6e/0x990
[ 309.661641] mutex_lock_nested+0x16/0x20
[ 309.661646] get_online_cpus+0x61/0x80
[ 309.661650] kmem_cache_create+0x25/0x1d0
[ 309.661655] debug_objects_mem_init+0x30/0x249
[ 309.661660] start_kernel+0x341/0x3fe
[ 309.661664] x86_64_start_reservations+0x2a/0x2c
[ 309.661669] x86_64_start_kernel+0x173/0x186
[ 309.661674] verify_cpu+0x0/0xfc
[ 309.661677] RECLAIM_FS-ON-W at:
[ 309.661682] mark_held_locks+0x6f/0xa0
[ 309.661687] lockdep_trace_alloc+0xb3/0x100
[ 309.661693] kmem_cache_alloc_trace+0x31/0x2e0
[ 309.661699] __smpboot_create_thread.part.1+0x27/0xe0
[ 309.661704] smpboot_create_threads+0x61/0x90
[ 309.661709] cpuhp_invoke_callback+0x9c/0x8a0
[ 309.661713] cpuhp_up_callbacks+0x31/0xb0
[ 309.661718] _cpu_up+0x7a/0xc0
[ 309.661723] do_cpu_up+0x5f/0x80
[ 309.661727] cpu_up+0xe/0x10
[ 309.661734] smp_init+0x71/0xb3
[ 309.661738] kernel_init_freeable+0x94/0x19e
[ 309.661743] kernel_init+0x9/0xf0
[ 309.661748] ret_from_fork+0x2e/0x40
[ 309.661752] INITIAL USE at:
[ 309.661757] __lock_acquire+0x234/0x1b50
[ 309.661761] lock_acquire+0xc9/0x220
[ 309.661766] __mutex_lock+0x6e/0x990
[ 309.661771] mutex_lock_nested+0x16/0x20
[ 309.661775] get_online_cpus+0x61/0x80
[ 309.661780] __cpuhp_setup_state+0x44/0x170
[ 309.661785] page_alloc_init+0x23/0x3a
[ 309.661790] start_kernel+0x124/0x3fe
[ 309.661794] x86_64_start_reservations+0x2a/0x2c
[ 309.661799] x86_64_start_kernel+0x173/0x186
[ 309.661804] verify_cpu+0x0/0xfc
[ 309.661807] }
[ 309.661813] ... key at: [<ffffffff81e37690>] cpu_hotplug+0xb0/0x100
[ 309.661817] ... acquired at:
[ 309.661821] lock_acquire+0xc9/0x220
[ 309.661825] __mutex_lock+0x6e/0x990
[ 309.661829] mutex_lock_nested+0x16/0x20
[ 309.661833] get_online_cpus+0x61/0x80
[ 309.661837] _rcu_barrier+0x9f/0x160
[ 309.661841] rcu_barrier+0x10/0x20
[ 309.661847] netdev_run_todo+0x5f/0x310
[ 309.661852] rtnl_unlock+0x9/0x10
[ 309.661856] default_device_exit_batch+0x133/0x150
[ 309.661862] ops_exit_list.isra.0+0x4d/0x60
[ 309.661866] cleanup_net+0x1d8/0x2c0
[ 309.661872] process_one_work+0x1f4/0x6d0
[ 309.661876] worker_thread+0x49/0x4a0
[ 309.661881] kthread+0x107/0x140
[ 309.661884] ret_from_fork+0x2e/0x40

[ 309.661890] -> (rcu_preempt_state.barrier_mutex){+.+.-.} ops: 179 {
[ 309.661896] HARDIRQ-ON-W at:
[ 309.661901] __lock_acquire+0x5e5/0x1b50
[ 309.661905] lock_acquire+0xc9/0x220
[ 309.661910] __mutex_lock+0x6e/0x990
[ 309.661914] mutex_lock_nested+0x16/0x20
[ 309.661919] _rcu_barrier+0x31/0x160
[ 309.661923] rcu_barrier+0x10/0x20
[ 309.661928] netdev_run_todo+0x5f/0x310
[ 309.661932] rtnl_unlock+0x9/0x10
[ 309.661936] default_device_exit_batch+0x133/0x150
[ 309.661941] ops_exit_list.isra.0+0x4d/0x60
[ 309.661946] cleanup_net+0x1d8/0x2c0
[ 309.661951] process_one_work+0x1f4/0x6d0
[ 309.661955] worker_thread+0x49/0x4a0
[ 309.661960] kthread+0x107/0x140
[ 309.661964] ret_from_fork+0x2e/0x40
[ 309.661968] SOFTIRQ-ON-W at:
[ 309.661972] __lock_acquire+0x611/0x1b50
[ 309.661977] lock_acquire+0xc9/0x220
[ 309.661981] __mutex_lock+0x6e/0x990
[ 309.661986] mutex_lock_nested+0x16/0x20
[ 309.661990] _rcu_barrier+0x31/0x160
[ 309.661995] rcu_barrier+0x10/0x20
[ 309.661999] netdev_run_todo+0x5f/0x310
[ 309.662003] rtnl_unlock+0x9/0x10
[ 309.662008] default_device_exit_batch+0x133/0x150
[ 309.662013] ops_exit_list.isra.0+0x4d/0x60
[ 309.662017] cleanup_net+0x1d8/0x2c0
[ 309.662022] process_one_work+0x1f4/0x6d0
[ 309.662027] worker_thread+0x49/0x4a0
[ 309.662031] kthread+0x107/0x140
[ 309.662035] ret_from_fork+0x2e/0x40
[ 309.662039] IN-RECLAIM_FS-W at:
[ 309.662043] __lock_acquire+0x638/0x1b50
[ 309.662048] lock_acquire+0xc9/0x220
[ 309.662053] __mutex_lock+0x6e/0x990
[ 309.662058] mutex_lock_nested+0x16/0x20
[ 309.662062] _rcu_barrier+0x31/0x160
[ 309.662067] rcu_barrier+0x10/0x20
[ 309.662089] i915_gem_shrink_all+0x33/0x40 [i915]
[ 309.662109] i915_drop_caches_set+0x141/0x150 [i915]
[ 309.662114] simple_attr_write+0xc7/0xe0
[ 309.662119] full_proxy_write+0x4f/0x70
[ 309.662124] __vfs_write+0x23/0x120
[ 309.662128] vfs_write+0xc6/0x1f0
[ 309.662133] SyS_write+0x44/0xb0
[ 309.662138] entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 309.662142] INITIAL USE at:
[ 309.662147] __lock_acquire+0x234/0x1b50
[ 309.662151] lock_acquire+0xc9/0x220
[ 309.662156] __mutex_lock+0x6e/0x990
[ 309.662160] mutex_lock_nested+0x16/0x20
[ 309.662165] _rcu_barrier+0x31/0x160
[ 309.662169] rcu_barrier+0x10/0x20
[ 309.662174] netdev_run_todo+0x5f/0x310
[ 309.662178] rtnl_unlock+0x9/0x10
[ 309.662183] default_device_exit_batch+0x133/0x150
[ 309.662188] ops_exit_list.isra.0+0x4d/0x60
[ 309.662192] cleanup_net+0x1d8/0x2c0
[ 309.662197] process_one_work+0x1f4/0x6d0
[ 309.662202] worker_thread+0x49/0x4a0
[ 309.662206] kthread+0x107/0x140
[ 309.662210] ret_from_fork+0x2e/0x40
[ 309.662214] }
[ 309.662220] ... key at: [<ffffffff81e4e1c8>] rcu_preempt_state+0x508/0x780
[ 309.662225] ... acquired at:
[ 309.662229] check_usage_forwards+0x12b/0x130
[ 309.662233] mark_lock+0x360/0x6f0
[ 309.662237] __lock_acquire+0x638/0x1b50
[ 309.662241] lock_acquire+0xc9/0x220
[ 309.662245] __mutex_lock+0x6e/0x990
[ 309.662249] mutex_lock_nested+0x16/0x20
[ 309.662253] _rcu_barrier+0x31/0x160
[ 309.662257] rcu_barrier+0x10/0x20
[ 309.662279] i915_gem_shrink_all+0x33/0x40 [i915]
[ 309.662298] i915_drop_caches_set+0x141/0x150 [i915]
[ 309.662303] simple_attr_write+0xc7/0xe0
[ 309.662307] full_proxy_write+0x4f/0x70
[ 309.662311] __vfs_write+0x23/0x120
[ 309.662315] vfs_write+0xc6/0x1f0
[ 309.662319] SyS_write+0x44/0xb0
[ 309.662323] entry_SYSCALL_64_fastpath+0x1c/0xb1

[ 309.662329]
stack backtrace:
[ 309.662335] CPU: 1 PID: 6435 Comm: gem_exec_gttfil Tainted: G W 4.11.0-rc1-CI-CI_DRM_2333+ #1
[ 309.662342] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[ 309.662348] Call Trace:
[ 309.662354] dump_stack+0x67/0x92
[ 309.662359] print_irq_inversion_bug.part.19+0x1a4/0x1b0
[ 309.662365] check_usage_forwards+0x12b/0x130
[ 309.662369] mark_lock+0x360/0x6f0
[ 309.662374] ? print_shortest_lock_dependencies+0x1a0/0x1a0
[ 309.662379] __lock_acquire+0x638/0x1b50
[ 309.662383] ? __mutex_unlock_slowpath+0x3e/0x2e0
[ 309.662388] ? trace_hardirqs_on+0xd/0x10
[ 309.662392] ? _rcu_barrier+0x31/0x160
[ 309.662396] lock_acquire+0xc9/0x220
[ 309.662400] ? _rcu_barrier+0x31/0x160
[ 309.662404] ? _rcu_barrier+0x31/0x160
[ 309.662409] __mutex_lock+0x6e/0x990
[ 309.662412] ? _rcu_barrier+0x31/0x160
[ 309.662416] ? _rcu_barrier+0x31/0x160
[ 309.662421] ? synchronize_rcu_expedited+0x35/0xb0
[ 309.662426] ? _raw_spin_unlock_irqrestore+0x52/0x60
[ 309.662434] mutex_lock_nested+0x16/0x20
[ 309.662438] _rcu_barrier+0x31/0x160
[ 309.662442] rcu_barrier+0x10/0x20
[ 309.662464] i915_gem_shrink_all+0x33/0x40 [i915]
[ 309.662484] i915_drop_caches_set+0x141/0x150 [i915]
[ 309.662489] simple_attr_write+0xc7/0xe0
[ 309.662494] full_proxy_write+0x4f/0x70
[ 309.662498] __vfs_write+0x23/0x120
[ 309.662503] ? rcu_read_lock_sched_held+0x75/0x80
[ 309.662507] ? rcu_sync_lockdep_assert+0x2a/0x50
[ 309.662512] ? __sb_start_write+0x102/0x210
[ 309.662516] ? vfs_write+0x17d/0x1f0
[ 309.662520] vfs_write+0xc6/0x1f0
[ 309.662524] ? trace_hardirqs_on_caller+0xe7/0x200
[ 309.662529] SyS_write+0x44/0xb0
[ 309.662533] entry_SYSCALL_64_fastpath+0x1c/0xb1
[ 309.662537] RIP: 0033:0x7f507eac24a0
[ 309.662541] RSP: 002b:00007fffda8720e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 309.662548] RAX: ffffffffffffffda RBX: ffffffff81482bd3 RCX: 00007f507eac24a0
[ 309.662552] RDX: 0000000000000005 RSI: 00007fffda8720f0 RDI: 0000000000000005
[ 309.662557] RBP: ffffc9000048bf88 R08: 0000000000000000 R09: 000000000000002c
[ 309.662561] R10: 0000000000000014 R11: 0000000000000246 R12: 00007fffda872230
[ 309.662566] R13: 00007fffda872228 R14: 0000000000000201 R15: 00007fffda8720f0
[ 309.662572] ? __this_cpu_preempt_check+0x13/0x20

Fixes: 0eafec6d3244 ("drm/i915: Enable lockless lookup of request tracking via RCU")
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100192
Signed-off-by: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Cc: Daniel Vetter <daniel.vetter@xxxxxxxx>
Link: http://patchwork.freedesktop.org/patch/msgid/20170314115019.18127-1-chris@xxxxxxxxxxxxxxxxxx
Reviewed-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
(cherry picked from commit bd784b7cc41af7a19cfb705fa6d800e511c4ab02)
Signed-off-by: Jani Nikula <jani.nikula@xxxxxxxxx>
Link: http://patchwork.freedesktop.org/patch/msgid/20170321144531.12344-1-chris@xxxxxxxxxxxxxxxxxx
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>

---
drivers/gpu/drm/i915/i915_gem_shrinker.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/gpu/drm/i915/i915_gem_shrinker.c
+++ b/drivers/gpu/drm/i915/i915_gem_shrinker.c
@@ -233,7 +233,7 @@ unsigned long i915_gem_shrink_all(struct
I915_SHRINK_BOUND |
I915_SHRINK_UNBOUND |
I915_SHRINK_ACTIVE);
- rcu_barrier(); /* wait until our RCU delayed slab frees are completed */
+ synchronize_rcu(); /* wait for our earlier RCU delayed slab frees */

return freed;
}