Re: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.

From: Konrad Rzeszutek Wilk
Date: Wed May 28 2014 - 14:55:19 EST


On Sat, May 24, 2014 at 11:22:09PM +0900, Tetsuo Handa wrote:
> Hello.
>
> I tried to test whether it is OK (from point of view of reentrant) to use
> mutex_lock() or mutex_lock_killable() inside shrinker functions when shrinker
> functions do memory allocation, for drivers/gpu/drm/ttm/ttm_page_alloc_dma.c is
> doing memory allocation with mutex lock held inside ttm_dma_pool_shrink_scan().
>
> If I compile a test module shown below which mimics extreme case of what
> ttm_dma_pool_shrink_scan() will do

And ttm_pool_shrink_scan.

>
> ---------- test.c start ----------
> #include <linux/module.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
> #include <linux/mm.h>
>
> static DEFINE_MUTEX(lock);
>
> static unsigned long shrink_test_count(struct shrinker *shrinker, struct shrink_control *sc)
> {
> if (mutex_lock_killable(&lock)) {
> printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
> "\n", current->pid, current->comm);
> return 0;
> }
> mutex_unlock(&lock);
> return 1;
> }
>
> static unsigned long shrink_test_scan(struct shrinker *shrinker, struct shrink_control *sc)
> {
> LIST_HEAD(list);
> int i = 0;
> if (mutex_lock_killable(&lock)) {
> printk(KERN_WARNING "Process %u (%s) gave up waiting for mutex"
> "\n", current->pid, current->comm);
> return 0;
> }
> while (1) {
> struct list_head *l = kmalloc(PAGE_SIZE, sc->gfp_mask);
> if (!l)
> break;
> list_add_tail(l, &list);
> i++;
> }
> printk(KERN_WARNING "Process %u (%s) allocated %u pages\n",
> current->pid, current->comm, i);
> while (i--) {
> struct list_head *l = list.next;
> list_del(l);
> kfree(l);
> }
> mutex_unlock(&lock);
> return 1;
> }
>
> static struct shrinker recursive_shrinker = {
> .count_objects = shrink_test_count,
> .scan_objects = shrink_test_scan,
> .seeks = DEFAULT_SEEKS,
> };
>
> static int __init recursive_shrinker_init(void)
> {
> register_shrinker(&recursive_shrinker);
> return 0;
> }
>
> static void recursive_shrinker_exit(void)
> {
> unregister_shrinker(&recursive_shrinker);
> }
>
> module_init(recursive_shrinker_init);
> module_exit(recursive_shrinker_exit);
> MODULE_LICENSE("GPL");
> ---------- test.c end ----------
>
> and load the test module and do
>
> # echo 3 > /proc/sys/vm/drop_caches
>
> the system stalls with 0% CPU usage because of mutex deadlock
> (with prior lockdep warning).
>
> Is this because wrong gfp flags are passed to kmalloc() ? Is this because
> the test module's shrinker functions return wrong values? Is this because
> doing memory allocation with mutex held inside shrinker functions is
> forbidden? Can anybody tell me what is wrong with my test module?

What is the sc->gfp_flags? What if you use GFP_ATOMIC?

In regards to the lockdep warning below it looks like
>
> Regards.
>
> [ 48.077353]
> [ 48.077999] =================================
> [ 48.080023] [ INFO: inconsistent lock state ]
> [ 48.080023] 3.15.0-rc6-00190-g1ee1cea #203 Tainted: G OE
> [ 48.080023] ---------------------------------
> [ 48.080023] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [ 48.086745] kswapd0/784 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 48.086745] (lock#2){+.+.?.}, at: [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [ 48.086745] {RECLAIM_FS-ON-W} state was registered at:


You have the scenario you described below, that is:

shrink_test_scan
mutex_lock_killable()
-> kmalloc
-> shrink_test_count
mutex_lock_killable()

And 'mutex_lock_killable' is the same (in at least this context)
the same as 'mutex_lock'. In other words, your second 'mutex_lock'
is going to spin - which is a deadlock.

Perhaps a way of not getting in this scenario is:

1). Try to take the mutex (ie, one that won't spin if it can't
get it).

2). Use the GFP_ATOMIC in the shrinker so that we never
end up calling ourselves in case of memory pressure

?

> [ 48.086745] [<c1089c18>] mark_held_locks+0x68/0x90
> [ 48.086745] [<c1089cda>] lockdep_trace_alloc+0x9a/0xe0
> [ 48.086745] [<c110b7f3>] kmem_cache_alloc+0x23/0x170
> [ 48.086745] [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
> [ 48.086745] [<c10e59be>] shrink_slab_node+0x13e/0x1d0
> [ 48.086745] [<c10e6911>] shrink_slab+0x61/0xe0
> [ 48.086745] [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
> [ 48.086745] [<c117275a>] proc_sys_call_handler+0x6a/0xa0
> [ 48.086745] [<c11727aa>] proc_sys_write+0x1a/0x20
> [ 48.086745] [<c1110ac0>] vfs_write+0xa0/0x190
> [ 48.086745] [<c1110ca6>] SyS_write+0x56/0xc0
> [ 48.086745] [<c15201d6>] syscall_call+0x7/0xb
> [ 48.086745] irq event stamp: 39
> [ 48.086745] hardirqs last enabled at (39): [<c10f3480>] count_shadow_nodes+0x20/0x40
> [ 48.086745] hardirqs last disabled at (38): [<c10f346c>] count_shadow_nodes+0xc/0x40
> [ 48.086745] softirqs last enabled at (0): [<c1040627>] copy_process+0x2e7/0x1400
> [ 48.086745] softirqs last disabled at (0): [< (null)>] (null)
> [ 48.086745]
> [ 48.086745] other info that might help us debug this:
> [ 48.086745] Possible unsafe locking scenario:
> [ 48.086745]
> [ 48.086745] CPU0
> [ 48.086745] ----
> [ 48.086745] lock(lock#2);
> [ 48.086745] <Interrupt>
> [ 48.086745] lock(lock#2);
> [ 48.086745]
> [ 48.086745] *** DEADLOCK ***
> [ 48.086745]
> [ 48.086745] 1 lock held by kswapd0/784:
> [ 48.086745] #0: (shrinker_rwsem){++++.+}, at: [<c10e68da>] shrink_slab+0x2a/0xe0
> [ 48.086745]
> [ 48.086745] stack backtrace:
> [ 48.086745] CPU: 1 PID: 784 Comm: kswapd0 Tainted: G OE 3.15.0-rc6-00190-g1ee1cea #203
> [ 48.086745] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 08/15/2008
> [ 48.086745] c1ab9c20 dd187c94 c151a48f dd184250 dd187cd0 c1088f33 c165aa02 c165ac9d
> [ 48.086745] 00000310 00000000 00000000 00000000 00000000 00000001 00000001 c165ac9d
> [ 48.086745] dd1847dc 0000000a 00000008 dd187cfc c1089ae1 00000008 000001b8 31a0987d
> [ 48.086745] Call Trace:
> [ 48.086745] [<c151a48f>] dump_stack+0x48/0x61
> [ 48.086745] [<c1088f33>] print_usage_bug+0x1f3/0x250
> [ 48.086745] [<c1089ae1>] mark_lock+0x331/0x400
> [ 48.086745] [<c1088f90>] ? print_usage_bug+0x250/0x250
> [ 48.086745] [<c108a583>] __lock_acquire+0x283/0x1640
> [ 48.086745] [<c108b9bb>] lock_acquire+0x7b/0xa0
> [ 48.086745] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 48.086745] [<c151c544>] mutex_lock_killable_nested+0x64/0x3e0
> [ 48.086745] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 48.086745] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 48.086745] [<c11119f1>] ? put_super+0x21/0x30
> [ 48.086745] [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [ 48.086745] [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
> [ 48.086745] [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [ 48.086745] [<c10e6911>] shrink_slab+0x61/0xe0
> [ 48.086745] [<c10e8416>] kswapd+0x5f6/0x8e0
> [ 48.086745] [<c1062e0f>] kthread+0xaf/0xd0
> [ 48.086745] [<c10e7e20>] ? try_to_free_pages+0x540/0x540
> [ 48.086745] [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
> [ 48.086745] [<c1525d41>] ret_from_kernel_thread+0x21/0x30
> [ 48.086745] [<c1062d60>] ? __init_kthread_worker+0x60/0x60
>
> [ 77.958388] SysRq : Show State
> [ 77.959377] task PC stack pid father
> [ 77.960803] bash D dfa6ae80 5068 1 0 0x00000000
> [ 77.962348] ded35c30 00000046 dfa6ae90 dfa6ae80 322a9328 00000000 00000000 0000000b
> [ 77.962348] ded32010 c191c8c0 ded34008 32319d5a 0000000b c191c8c0 32319d5a 0000000b
> [ 77.962348] ded32010 00000001 ded35c04 3230ea9d 0000000b e0863060 ded325c4 ded32010
> [ 77.962348] Call Trace:
> [ 77.962348] [<c1073ab7>] ? local_clock+0x17/0x30
> [ 77.962348] [<c151b41e>] schedule+0x1e/0x60
> [ 77.962348] [<c151b6df>] schedule_preempt_disabled+0xf/0x20
> [ 77.962348] [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
> [ 77.962348] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 77.962348] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 77.962348] [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [ 77.962348] [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
> [ 77.962348] [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [ 77.962348] [<c10e6911>] shrink_slab+0x61/0xe0
> [ 77.962348] [<c10e7b48>] try_to_free_pages+0x268/0x540
> [ 77.962348] [<c10df529>] __alloc_pages_nodemask+0x3e9/0x720
> [ 77.962348] [<c110bcbd>] cache_alloc_refill+0x37d/0x720
> [ 77.962348] [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
> [ 77.962348] [<c110b902>] kmem_cache_alloc+0x132/0x170
> [ 77.962348] [<e08610aa>] ? shrink_test_scan+0x3a/0xf90 [test]
> [ 77.962348] [<e08610aa>] shrink_test_scan+0x3a/0xf90 [test]
> [ 77.962348] [<c151c4d8>] ? mutex_unlock+0x8/0x10
> [ 77.962348] [<c10e59be>] shrink_slab_node+0x13e/0x1d0
> [ 77.962348] [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [ 77.962348] [<c10e6911>] shrink_slab+0x61/0xe0
> [ 77.962348] [<c115f849>] drop_caches_sysctl_handler+0x69/0xf0
> [ 77.962348] [<c151fe6d>] ? _raw_spin_unlock+0x1d/0x30
> [ 77.962348] [<c117275a>] proc_sys_call_handler+0x6a/0xa0
> [ 77.962348] [<c11727aa>] proc_sys_write+0x1a/0x20
> [ 77.962348] [<c1110ac0>] vfs_write+0xa0/0x190
> [ 77.962348] [<c1172790>] ? proc_sys_call_handler+0xa0/0xa0
> [ 77.962348] [<c112d0fd>] ? __fdget+0xd/0x10
> [ 77.962348] [<c1110ca6>] SyS_write+0x56/0xc0
> [ 77.962348] [<c15201d6>] syscall_call+0x7/0xb
>
> [ 77.962348] kswapd0 D 00000246 6200 784 2 0x00000000
> [ 77.962348] dd187d9c 00000046 c109d091 00000246 00000086 00000000 00000246 dd184250
> [ 77.962348] dd184250 c191c8c0 dd186008 318e2084 0000000b c191c8c0 37e97ef1 0000000b
> [ 77.962348] dd184250 dd184250 dd187d70 c10880aa dd187dac 00000000 0000007b ffffffff
> [ 77.962348] Call Trace:
> [ 77.962348] [<c109d091>] ? rcu_irq_exit+0x71/0xc0
> [ 77.962348] [<c10880aa>] ? print_lock_contention_bug+0x1a/0xf0
> [ 77.962348] [<c151b41e>] schedule+0x1e/0x60
> [ 77.962348] [<c151b6df>] schedule_preempt_disabled+0xf/0x20
> [ 77.962348] [<c151c63f>] mutex_lock_killable_nested+0x15f/0x3e0
> [ 77.962348] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 77.962348] [<e0861022>] ? shrink_test_count+0x12/0x60 [test]
> [ 77.962348] [<e0861022>] shrink_test_count+0x12/0x60 [test]
> [ 77.962348] [<c10e58ae>] shrink_slab_node+0x2e/0x1d0
> [ 77.962348] [<c10e68da>] ? shrink_slab+0x2a/0xe0
> [ 77.962348] [<c10e6911>] shrink_slab+0x61/0xe0
> [ 77.962348] [<c10e8416>] kswapd+0x5f6/0x8e0
> [ 77.962348] [<c1062e0f>] kthread+0xaf/0xd0
> [ 77.962348] [<c10e7e20>] ? try_to_free_pages+0x540/0x540
> [ 77.962348] [<c108a08b>] ? trace_hardirqs_on+0xb/0x10
> [ 77.962348] [<c1525d41>] ret_from_kernel_thread+0x21/0x30
> [ 77.962348] [<c1062d60>] ? __init_kthread_worker+0x60/0x60
>
> Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > From e314a1a1583e585d062dfc30c8aad8bf5380510b Mon Sep 17 00:00:00 2001
> > > From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> > > Date: Mon, 19 May 2014 18:43:21 +0900
> > > Subject: [PATCH] gpu/drm/ttm: Use mutex_lock_killable() for shrinker functions.
> > >
> > > I can observe that RHEL7 environment stalls with 100% CPU usage when a
> > > certain type of memory pressure is given. While the shrinker functions
> > > are called by shrink_slab() before the OOM killer is triggered, the stall
> > > lasts for many minutes.
> > >
> > > I added debug printk() and observed that many threads are blocked for more
> > > than 10 seconds at ttm_dma_pool_shrink_count()/ttm_dma_pool_shrink_scan()
> > > functions. Since the kswapd can call these functions later, the current
> > > thread can return from these functions as soon as chosen by the OOM killer.
> > >
> > > This patch changes "mutex_lock();" to "if (mutex_lock_killable()) return ...;"
> > > so that any threads can promptly give up. (By the way, as far as I tested,
> > > changing to "if (!mutex_trylock()) return ...;" likely shortens the duration
> > > of stall. Maybe we don't need to wait for mutex if someone is already calling
> > > these functions.)
> > >
> >
> > While discussing about XFS problem, I got a question. Is it OK (from point
> > of view of reentrant) to use mutex_lock() or mutex_lock_killable() inside
> > shrinker's entry point functions? Can senario shown below possible?
> >
> > (1) kswapd is doing memory reclaim which does not need to hold mutex.
> >
> > (2) Someone in GFP_KERNEL context (not kswapd) calls
> > ttm_dma_pool_shrink_count() and then calls ttm_dma_pool_shrink_scan()
> > from direct reclaim path.
> >
> > (3) Inside ttm_dma_pool_shrink_scan(), GFP_KERNEL allocation is issued
> > while mutex is held by the someone.
> >
> > (4) GFP_KERNEL allocation cannot be completed immediately due to memory
> > pressure.
> >
> > (5) kswapd calls ttm_dma_pool_shrink_count() which need to hold mutex.
> >
> > (6) Inside ttm_dma_pool_shrink_count(), kswapd is blocked waiting for
> > mutex held by the someone, and the someone is waiting for GFP_KERNEL
> > allocation to complete, but GFP_KERNEL allocation cannot be completed
> > until mutex held by the someone is released?

Ewww. Perhaps if we used GFP_ATOMIC for the array allocation we do in
ttm_dma_page_pool_free and ttm_page_pool_free?

That would avoid the 4) problem.
> >
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/