Re: mm: swap: hang in lru_add_drain_all

From: Hugh Dickins
Date: Sat Jul 19 2014 - 19:27:06 EST


On Fri, 18 Jul 2014, Sasha Levin wrote:

> Hi all,
>
> While fuzzing with trinity inside a KVM tools guest running the latest -next
> kernel, I've stumbled on the following spew:
>
> [ 729.682257] INFO: task trinity-c158:13508 blocked for more than 120 seconds.
> [ 729.683191] Not tainted 3.16.0-rc5-next-20140718-sasha-00052-g4d34feb #902
> [ 729.683843] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 729.684633] trinity-c158 D ffff880d32169000 12520 13508 8729 0x00000000
> [ 729.685323] ffff880de6d3fe38 0000000000000002 0000000000000006 ffff880c6c33b000
> [ 729.686039] 000000785eb36cd1 ffff880de6d3c010 ffff880de6d3c000 ffff880e2b270000
> [ 729.686761] ffff880c6c33b000 0000000000000000 ffffffffb070b908 0000000026e426e2
> [ 729.687483] Call Trace:
> [ 729.687736] schedule_preempt_disabled (kernel/sched/core.c:2874)
> [ 729.688544] mutex_lock_nested (kernel/locking/mutex.c:532 kernel/locking/mutex.c:584)
> [ 729.689127] ? lru_add_drain_all (mm/swap.c:867)
> [ 729.689666] ? lru_add_drain_all (mm/swap.c:867)
> [ 729.690334] lru_add_drain_all (mm/swap.c:867)
> [ 729.690946] SyS_mlockall (./arch/x86/include/asm/current.h:14 include/linux/sched.h:2978 mm/mlock.c:813 mm/mlock.c:798)
> [ 729.691437] tracesys (arch/x86/kernel/entry_64.S:541)
> [ 729.691883] 1 lock held by trinity-c158/13508:
> [ 729.692333] #0: (lock#3){+.+...}, at: lru_add_drain_all (mm/swap.c:867)
>
> There are quite a few tasks with the same stacktrace. Since the lock we're blocking
> on is a static local to the function, it's easy to figure out which task actually holds it:
>
> [ 739.587839] trinity-c135 D ffff880e1318c000 13096 21051 8729 0x00000000
> [ 739.589080] ffff880bf659fcc0 0000000000000002 ffff880c74223cf0 ffff880c74223000
> [ 739.590544] 000000781c3378ed ffff880bf659c010 ffff880bf659c000 ffff880e2b153000
> [ 739.591815] ffff880c74223000 0000000000000000 7fffffffffffffff ffff880bf659fe80
> [ 739.593165] Call Trace:
> [ 739.593588] schedule (kernel/sched/core.c:2847)
> [ 739.594396] schedule_timeout (kernel/time/timer.c:1476)
> [ 739.595354] ? mark_lock (kernel/locking/lockdep.c:2894)
> [ 739.596229] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 739.597297] ? get_parent_ip (kernel/sched/core.c:2561)
> [ 739.598185] wait_for_completion (include/linux/spinlock.h:328 kernel/sched/completion.c:76 kernel/sched/completion.c:93 kernel/sched/completion.c:101 kernel/sched/completion.c:122)
> [ 739.599191] ? wake_up_state (kernel/sched/core.c:2942)
> [ 739.600246] flush_work (kernel/workqueue.c:503 kernel/workqueue.c:2762)
> [ 739.601171] ? flush_work (kernel/workqueue.c:2733 kernel/workqueue.c:2760)
> [ 739.602084] ? destroy_worker (kernel/workqueue.c:2348)
> [ 739.603035] ? wait_for_completion (kernel/sched/completion.c:64 kernel/sched/completion.c:93 kernel/sched/completion.c:101 kernel/sched/completion.c:122)
> [ 739.604043] lru_add_drain_all (include/linux/cpumask.h:180 mm/swap.c:883)
> [ 739.605076] SyS_mlockall (./arch/x86/include/asm/current.h:14 include/linux/sched.h:2978 mm/mlock.c:813 mm/mlock.c:798)
> [ 739.605943] tracesys (arch/x86/kernel/entry_64.S:541)
>
> Now, you'd expect to see lru_add_drain_per_cpu in one of the tasks, but
> that's not the case.
>
> Full log attached.

I'm happy to find no "shmem" and no "fallocate" in that log: I agree
that this hang is something different, and I'll sound the all-clear
on that one, once I've made a suggestion on this one.

I know next to nothing about scheduler matters, so the less I say, the
less I'll make a fool of myself. But I believe this problem may be
self-inflicted by trinity, which may be using its privilege to abuse
RT priority in a way that hangs the system here.
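
To make that concrete, here is a minimal sketch of the kind of thing a
task with CAP_SYS_NICE can do to take a cpu away from everyone else.
This is my invention, not trinity's actual code; and the prio 97 shown
for trinity-c13 in the stats below would correspond to sched_priority 2,
if I'm decoding it right:

#define _GNU_SOURCE
#include <sched.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 2 };
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(13, &set);
	sched_setaffinity(0, sizeof(set), &set);	/* pin to cpu#13 */
	sched_setscheduler(0, SCHED_FIFO, &sp);		/* needs CAP_SYS_NICE */

	for (;;)
		;	/* spin: SCHED_OTHER tasks on cpu#13 never run again */
}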

Appended below are relevant cpu#13 lines from your log.txt. You took
10 watchdog dumps across 20 seconds: here's the first trace and stats,
eight intervening "rt_time" lines, and the last trace and stats.

Bear in mind that I know nothing about RT or "rt_time", but it seems
fair to show it cycling around between the first and the last (and
.rt_runtime shows the full 1000ms, so if I'm reading it right, RT
throttling never kicks in and rt_time simply wraps each period). And
it seems interesting that "nr_switches" is 49357 in the first and
still 49357 in the last, and that the three cpu#13 runnable kworkers
show the same "switches" counts in the first and the last stats.

I've not checked whether "switches" means what I'd expect it to mean,
but I'm guessing trinity-c13 is looping at RT priority on cpu#13,
and locking out everything else.

In particular, locking out the kworker which would just love to get
in and drain cpu#13's lru_add pagevec, then wake up trinity-c135 to
say the drain is complete; whereupon trinity-c135 could drop the mutex,
letting trinity-c158 and many others in to do... much the same again.
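
For reference, lru_add_drain_all() in that tree looks roughly like this
(abbreviated from mm/swap.c, with some of the pagevec checks omitted):
line 867 in your traces is the mutex_lock(), line 883 the flush_work().
Until cpu#13's kworker gets to run lru_add_drain_per_cpu() and complete
the work item, flush_work() never returns and the mutex never drops:

void lru_add_drain_all(void)
{
	static DEFINE_MUTEX(lock);
	static struct cpumask has_work;
	int cpu;

	mutex_lock(&lock);	/* mm/swap.c:867: trinity-c158 etc. block here */
	get_online_cpus();
	cpumask_clear(&has_work);

	for_each_online_cpu(cpu) {
		struct work_struct *work = &per_cpu(lru_add_drain_work, cpu);

		if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
		    need_activate_page_drain(cpu)) {
			INIT_WORK(work, lru_add_drain_per_cpu);
			schedule_work_on(cpu, work);
			cpumask_set_cpu(cpu, &has_work);
		}
	}

	for_each_cpu(cpu, &has_work)
		flush_work(&per_cpu(lru_add_drain_work, cpu));
			/* mm/swap.c:883: trinity-c135 waits here */

	put_online_cpus();
	mutex_unlock(&lock);
}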

Hugh

[ 739.534086] trinity-c13 R running task 13312 20846 8729 0x00080000
[ 739.534875] 000000000000024e ffffffffa9ffc7f7 ffff880ba796bd28 ffffffff00000001
[ 739.535696] ffff880ba796b000 ffff880c00000001 0000000000000000 0000000000000000
[ 739.536519] 0000000000000002 0000000000000000 ffffffffa9248135 ffff880c2bd97f28
[ 739.537336] Call Trace:
[ 739.537613] ? lock_release (kernel/locking/lockdep.c:3475 kernel/locking/lockdep.c:3498 kernel/locking/lockdep.c:3619)
[ 739.538202] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[ 739.538859] ? _raw_spin_unlock_irqrestore (include/linux/spinlock_api_smp.h:159 kernel/locking/spinlock.c:191)
[ 739.539549] ? lock_release (kernel/locking/lockdep.c:3475 kernel/locking/lockdep.c:3498 kernel/locking/lockdep.c:3619)
[ 739.540339] ? get_vtime_delta (kernel/sched/cputime.c:649 (discriminator 8))
[ 739.541110] ? vtime_account_user (kernel/sched/cputime.c:684)
[ 739.541737] ? context_tracking_user_exit (include/linux/vtime.h:89 include/linux/jump_label.h:115 include/trace/events/context_tracking.h:47 kernel/context_tracking.c:180)
[ 739.542435] ? vtime_user_enter (kernel/sched/cputime.c:693)
[ 739.543044] ? SyS_wait4 (kernel/exit.c:1619 kernel/exit.c:1587)
[ 739.543586] ? syscall_trace_leave (arch/x86/kernel/ptrace.c:1531)
[ 739.544240] ? tracesys (arch/x86/kernel/entry_64.S:530)
[ 739.544782] ? SyS_kill (kernel/signal.c:2890)

[ 740.663871] cpu#13, 2260.998 MHz
[ 740.664431] .nr_running : 3
[ 740.665197] .load : 1024
[ 740.666005] .nr_switches : 49357
[ 740.666806] .nr_load_updates : 48621
[ 740.667613] .nr_uninterruptible : -55
[ 740.668653] .next_balance : 4295.011457
[ 740.669538] .curr->pid : 20846
[ 740.670452] .clock : 740670.058306
[ 740.671471] .cpu_load[0] : 62
[ 740.673656] .cpu_load[1] : 62
[ 740.674147] .cpu_load[2] : 62
[ 740.674636] .cpu_load[3] : 62
[ 740.675131] .cpu_load[4] : 62
[ 740.675625] .yld_count : 125
[ 740.676138] .sched_count : 51104
[ 740.676650] .sched_goidle : 9784
[ 740.677158] .avg_idle : 1300456
[ 740.677695] .max_idle_balance_cost : 2394144
[ 740.678226] .ttwu_count : 14256
[ 740.678743] .ttwu_local : 2997
[ 740.679257]
[ 740.679257] cfs_rq[13]:/
[ 740.679741] .exec_clock : 170539.871709
[ 740.680387] .MIN_vruntime : 956608.992999
[ 740.680387] .min_vruntime : 956620.348251
[ 740.680387] .max_vruntime : 956608.992999
[ 740.680387] .spread : 0.000000
[ 740.680387] .spread0 : -393069.336040
[ 740.680387] .nr_spread_over : 1015
[ 740.680387] .nr_running : 1
[ 740.680387] .load : 1024
[ 740.680387] .runnable_load_avg : 62
[ 740.680387] .blocked_load_avg : 0
[ 740.680387] .tg_load_contrib : 62
[ 740.680387] .tg_runnable_contrib : 1020
[ 740.680387] .tg_load_avg : 72
[ 740.680387] .tg->runnable_avg : 2046
[ 740.680387] .tg->cfs_bandwidth.timer_active: 0
[ 740.680387] .throttled : 0
[ 740.680387] .throttle_count : 0
[ 740.680387] .avg->runnable_avg_sum : 48432
[ 740.680387] .avg->runnable_avg_period : 48432
[ 740.680387]
[ 740.680387] rt_rq[13]:/
[ 740.680387] .rt_nr_running : 2
[ 740.680387] .rt_throttled : 0
[ 740.680387] .rt_time : 696.812126
[ 740.680387] .rt_runtime : 1000.000000
[ 740.680387]
[ 740.680387] runnable tasks:
[ 740.680387] task PID tree-key switches prio exec-runtime sum-exec sum-sleep
[ 740.680387] ----------------------------------------------------------------------------------------------------------
[ 740.680387] rcuos/3 53 520765.589523 5 120 520765.589523 0.808313 404091.464092 0 /
[ 740.680387] watchdog/13 201 -11.682136 22 0 -11.682136 6.875011 0.000000 0 /
[ 740.680387] migration/13 202 0.000000 82 0 0.000000 47.887990 0.000000 0 /
[ 740.680387] rcuc/13 203 -12.000000 4 98 -12.000000 0.265959 0.000000 0 /
[ 740.680387] ksoftirqd/13 204 947403.072378 59 120 947403.072378 16.475254 505123.439436 0 /
[ 740.680387] kworker/13:0 205 29.113239 30 120 29.113239 4.337088 9646.893156 0 /
[ 740.680387] kworker/13:0H 206 378.140827 8 100 378.140827 0.489706 14056.057142 0 /
[ 740.680387] kworker/13:1 2406 956608.992999 4136 120 956608.992999 2540.330017 498215.463746 0 /
[ 740.680387] fcoethread/13 5843 6315.295777 2 100 6315.295777 0.320255 0.342003 0 /
[ 740.680387] bnx2fc_thread/1 5888 6315.423773 2 100 6315.423773 0.850324 0.574025 0 /
[ 740.680387] bnx2i_thread/13 5978 6340.922884 3 100 6340.922884 0.546603 0.496749 0 /
[ 740.680387] trinity-c182 9328 301.495168 1054 120 301.495168 10553.961542 76826.313305 0 /autogroup-133
[ 740.680387] trinity-c94 9640 284.370532 531 120 284.370532 6160.616065 55406.177681 0 /autogroup-90
[ 740.680387] trinity-c161 10919 63.826999 494 120 63.826999 3866.730276 55130.167304 0 /autogroup-235
[ 740.680387] R trinity-c13 20846 0.000000 52 97 0.000000 225300.270687 0.000000 0 /
[ 740.680387] trinity-c13 21073 0.000000 0 97 0.000000 0.000000 0.000000 0 /

Then over the next 20 seconds, cpu#13 shows:
[ 752.910531] .rt_time : 919.040543
[ 759.959078] .rt_time : 956.934204
[ 760.167586] .rt_time : 169.606765
[ 760.375072] .rt_time : 379.091881
[ 760.582864] .rt_time : 588.926167
[ 760.790587] .rt_time : 798.310732
[ 760.998034] .rt_time : 998.004809
[ 761.201971] .rt_time : 209.376323
and ends up with this trace and stats:

[ 761.390639] trinity-c13 R running task 13312 20846 8729 0x00080000
[ 761.390649] ffff880ba796b070 ffffffffa9ffc7f7 ffff880c2bd97e40 ffffffffa92119f8
[ 761.390658] 0000000000000082 ffffffffa924fa35 0000000000000022 0000000000000000
[ 761.390668] ffff880dd9d80020 0000000000000282 ffffffffadea411c ffff880c2bd97e48
[ 761.390670] Call Trace:
[ 761.390679] ? rcu_read_lock_held (kernel/rcu/update.c:165)
[ 761.390686] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 761.390694] ? rcu_read_lock_held (kernel/rcu/update.c:165)
[ 761.390703] ? _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 761.390710] ? rcu_eqs_enter (kernel/rcu/tree.c:556)
[ 761.390719] ? context_tracking_user_exit (include/linux/vtime.h:89 include/linux/jump_label.h:115 include/trace/events/context_tracking.h:47 kernel/context_tracking.c:180)
[ 761.390727] ? SyS_wait4 (kernel/exit.c:1619 kernel/exit.c:1587)
[ 761.390735] ? syscall_trace_leave (arch/x86/kernel/ptrace.c:1531)
[ 761.390742] ? int_with_check (arch/x86/kernel/entry_64.S:555)

[ 761.412307] cpu#13, 2260.998 MHz
[ 761.412311] .nr_running : 3
[ 761.412314] .load : 1024
[ 761.412317] .nr_switches : 49357
[ 761.412320] .nr_load_updates : 50696
[ 761.412323] .nr_uninterruptible : -55
[ 761.412326] .next_balance : 4295.013537
[ 761.412329] .curr->pid : 20846
[ 761.412333] .clock : 761410.040099
[ 761.412335] .cpu_load[0] : 62
[ 761.412338] .cpu_load[1] : 62
[ 761.412341] .cpu_load[2] : 62
[ 761.412344] .cpu_load[3] : 62
[ 761.412347] .cpu_load[4] : 62
[ 761.412349] .yld_count : 125
[ 761.412352] .sched_count : 51104
[ 761.412355] .sched_goidle : 9784
[ 761.412358] .avg_idle : 1300456
[ 761.412361] .max_idle_balance_cost : 2153150
[ 761.412364] .ttwu_count : 14256
[ 761.412366] .ttwu_local : 2997
[ 761.412376]
[ 761.412376] cfs_rq[13]:/
[ 761.412379] .exec_clock : 170539.871709
[ 761.412387] .MIN_vruntime : 956608.992999
[ 761.412390] .min_vruntime : 956620.348251
[ 761.412393] .max_vruntime : 956608.992999
[ 761.412396] .spread : 0.000000
[ 761.412399] .spread0 : -395611.774772
[ 761.412402] .nr_spread_over : 1015
[ 761.412404] .nr_running : 1
[ 761.412407] .load : 1024
[ 761.412409] .runnable_load_avg : 62
[ 761.412412] .blocked_load_avg : 0
[ 761.412414] .tg_load_contrib : 62
[ 761.412417] .tg_runnable_contrib : 1020
[ 761.412419] .tg_load_avg : 597
[ 761.412422] .tg->runnable_avg : 2132
[ 761.412424] .tg->cfs_bandwidth.timer_active: 0
[ 761.412427] .throttled : 0
[ 761.412429] .throttle_count : 0
[ 761.412432] .avg->runnable_avg_sum : 48386
[ 761.412434] .avg->runnable_avg_period : 48386
[ 761.412530]
[ 761.412530] rt_rq[13]:/
[ 761.412532] .rt_nr_running : 2
[ 761.412535] .rt_throttled : 0
[ 761.412538] .rt_time : 419.008925
[ 761.412541] .rt_runtime : 1000.000000
[ 761.412547]
[ 761.412547] runnable tasks:
[ 761.412547] task PID tree-key switches prio exec-runtime sum-exec sum-sleep
[ 761.412547] ----------------------------------------------------------------------------------------------------------
[ 761.412607] rcuos/3 53 520765.589523 5 120 520765.589523 0.808313 404091.464092 0 /
[ 761.412736] watchdog/13 201 -11.682136 22 0 -11.682136 6.875011 0.000000 0 /
[ 761.412751] migration/13 202 0.000000 82 0 0.000000 47.887990 0.000000 0 /
[ 761.412767] rcuc/13 203 -12.000000 4 98 -12.000000 0.265959 0.000000 0 /
[ 761.412784] ksoftirqd/13 204 947403.072378 59 120 947403.072378 16.475254 505123.439436 0 /
[ 761.412800] kworker/13:0 205 29.113239 30 120 29.113239 4.337088 9646.893156 0 /
[ 761.412816] kworker/13:0H 206 378.140827 8 100 378.140827 0.489706 14056.057142 0 /
[ 761.412953] kworker/13:1 2406 956608.992999 4136 120 956608.992999 2540.330017 498215.463746 0 /
[ 761.413046] fcoethread/13 5843 6315.295777 2 100 6315.295777 0.320255 0.342003 0 /
[ 761.413095] bnx2fc_thread/1 5888 6315.423773 2 100 6315.423773 0.850324 0.574025 0 /
[ 761.413141] bnx2i_thread/13 5978 6340.922884 3 100 6340.922884 0.546603 0.496749 0 /
[ 761.413298] trinity-c182 9328 301.495168 1054 120 301.495168 10553.961542 76826.313305 0 /autogroup-133
[ 761.413316] trinity-c94 9640 284.370532 531 120 284.370532 6160.616065 55406.177681 0 /autogroup-90
[ 761.413343] trinity-c161 10919 63.826999 494 120 63.826999 3866.730276 55130.167304 0 /autogroup-235
[ 761.413538] R trinity-c13 20846 0.000000 52 97 0.000000 245950.924182 0.000000 0 /
[ 761.413560] trinity-c13 21073 0.000000 0 97 0.000000 0.000000 0.000000 0 /