Re: [PATCH] sched: tg_set_cfs_bandwidth() causes rq->lock deadlock

From: Roman Gushchin
Date: Fri May 16 2014 - 04:38:42 EST


At Thu, 15 May 2014 10:43:14 -0700,
bsegall@xxxxxxxxxx wrote:
>
> Roman Gushchin <klamm@xxxxxxxxxxxxxx> writes:
>
> > tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
> > force the period timer restart. It's not safe, because it
> > can lead to the deadlock described in commit 927b54fccbf0:
> > "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
> > waiting for the hrtimer to finish. However, if sched_cfs_period_timer
> > runs for another loop iteration, the hrtimer can attempt to take
> > rq->lock, resulting in deadlock."
> > If tg_set_cfs_bandwidth() resets cfs_b->timer_active concurrently
> > with the second iteration of sched_cfs_period_timer, the deadlock
> > occurs.
>
> I do not see this deadlock. cfs_b->timer_active is protected by
> cfs_b->lock on both read and write, and I don't think it would even
> directly cause deadlock if it wasn't. In particular this patch makes us
> run for strictly longer outside of the interrupt (although I think the
> patched version is also correct). The old issue was calling
> hrtimer_cancel, which would mean we hold cfs_b->lock while waiting for
> the interrupt to complete, while the interrupt was waiting to take
> cfs_b->lock.

I still think there is a deadlock. I'll try to explain.
Three CPUs must be involved:
CPU0                            CPU1                            CPU2

take rq->lock                   period timer fired
...                             take cfs_b lock
...                             ...                             tg_set_cfs_bandwidth()
throttle_cfs_rq()               release cfs_b lock              take cfs_b lock
...                             distribute_cfs_runtime()        timer_active = 0
take cfs_b->lock                wait for rq->lock               ...
__start_cfs_bandwidth()
{wait for timer callback
 break if timer_active == 1}

So, CPU0 and CPU1 are deadlocked.
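
For reference, the wait loop in __start_cfs_bandwidth() that CPU0 gets
stuck in looks roughly like this in the pre-patch code (I'm sketching it
from the diff context below, so treat the exact wording as approximate):

/* requires cfs_b->lock, may release to reprogram timer */
void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
        /*
         * The timer may be active because we're trying to set a new
         * bandwidth period or because we're racing with the tear-down
         * path. In either case, ensure it's re-programmed.
         */
        while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
               hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
                /* bounce the lock to allow do_sched_cfs_period_timer to run */
                raw_spin_unlock(&cfs_b->lock);
                cpu_relax();
                raw_spin_lock(&cfs_b->lock);
                /* if someone else restarted the timer then we're done */
                if (cfs_b->timer_active)
                        return;
        }

        start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
}

As long as the callback is running, hrtimer_try_to_cancel() keeps
returning -1, and with timer_active forced to 0 by CPU2, CPU0 never hits
the early return and spins here while holding rq->lock.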

If you recall the do_sched_cfs_period_timer() code, timer_active is set
to 1 there precisely to avoid a similar deadlock:
static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
        ...
        /*
         * if we have relooped after returning idle once, we need to update our
         * status as actually running, so that other cpus doing
         * __start_cfs_bandwidth will stop trying to cancel us.
         */
        cfs_b->timer_active = 1;
        ...
        while (throttled && runtime > 0) {
                raw_spin_unlock(&cfs_b->lock);
                /* we can't nest cfs_b->lock while distributing bandwidth */
>               runtime = distribute_cfs_runtime(cfs_b, runtime,
                                                 runtime_expires);
        ...

tg_set_cfs_bandwidth() can actually reset timer_active to 0 at the moment marked with '>' above.
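
To make it concrete, the pre-patch fragment of tg_set_cfs_bandwidth() in
question (also visible in the diff below) is:

        /* restart the period timer (if active) to handle new period expiry */
        if (runtime_enabled && cfs_b->timer_active) {
                /* force a reprogram */
                cfs_b->timer_active = 0;
                __start_cfs_bandwidth(cfs_b);
        }

If the timer callback has already dropped cfs_b->lock at the marked line,
it doesn't see this store, while any other CPU that holds rq->lock and
enters __start_cfs_bandwidth() now finds timer_active == 0 and keeps
waiting for a callback that itself needs rq->lock.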

This is what it looks like in practice:
CPU0:
[ 1967.348938] WARNING: at kernel/watchdog.c:245 watchdog_overflow_callback+0x98/0xc0()
[ 1967.356663] Watchdog detected hard LOCKUP on cpu 8
[ 1967.361269] Modules linked in: vhost_net macvtap macvlan 8021q mrp garp ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc intel_powerclamp coretemp kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 gpio_ich hid_generic dm_multipath scsi_dh usbhid ast ttm hid drm_kms_helper microcode drm sysimgblt sysfillrect syscopyarea shpchp pci_hotplug sb_edac edac_core lpc_ich tcp_htcp raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov igb dca i2c_algo_bit ahci libahci i2c_core ptp pps_core raid6_pq async_tx raid1 raid0 multipath linear
[ 1967.433217] CPU: 8 PID: 9745 Comm: kvm Tainted: G W 3.10.38+ #17
[ 1967.440074] Hardware name: GIGABYTE GS-CFZ1/GS-CCZ/GS-CCZH/GA-7PTSH, BIOS D06 03/25/2013
[ 1967.448142] ffffffff819faa10 ffff88207fc07bf8 ffffffff81676473 ffff88207fc07c38
[ 1967.455587] ffffffff81054fcb 0000000000000000 ffff882029338000 0000000000000000
[ 1967.463033] ffff88207fc07d80 0000000000000000 ffff88207fc07ef8 ffff88207fc07c98
[ 1967.470480] Call Trace:
[ 1967.472922] <NMI> [<ffffffff81676473>] dump_stack+0x19/0x1b
[ 1967.478691] [<ffffffff81054fcb>] warn_slowpath_common+0x6b/0xa0
[ 1967.484689] [<ffffffff810550a1>] warn_slowpath_fmt+0x41/0x50
[ 1967.490422] [<ffffffff810e2a28>] watchdog_overflow_callback+0x98/0xc0
[ 1967.496933] [<ffffffff811171bc>] __perf_event_overflow+0x9c/0x310
[ 1967.503097] [<ffffffff810245c8>] ? x86_perf_event_set_period+0xd8/0x150
[ 1967.509780] [<ffffffff811181f4>] perf_event_overflow+0x14/0x20
[ 1967.515686] [<ffffffff8102a71e>] intel_pmu_handle_irq+0x1ae/0x390
[ 1967.521852] [<ffffffff8133c31d>] ? do_raw_spin_unlock+0x5d/0xb0
[ 1967.527852] [<ffffffff810236c1>] perf_event_nmi_handler+0x21/0x30
[ 1967.534025] [<ffffffff810174ae>] nmi_handle.isra.2+0x9e/0x1e0
[ 1967.539852] [<ffffffff81017410>] ? unregister_nmi_handler+0x180/0x180
[ 1967.546370] [<ffffffff81017708>] do_nmi+0x118/0x390
[ 1967.551324] [<ffffffff8167e0c1>] end_repeat_nmi+0x1e/0x2e
[ 1967.556804] [<ffffffff810b2721>] ? __lock_acquire+0x201/0x1fb0
[ 1967.562716] [<ffffffff810b2721>] ? __lock_acquire+0x201/0x1fb0
[ 1967.568621] [<ffffffff810b2721>] ? __lock_acquire+0x201/0x1fb0
[ 1967.574525] <<EOE>> [<ffffffff810b293f>] ? __lock_acquire+0x41f/0x1fb0
[ 1967.581235] [<ffffffff810b293f>] ? __lock_acquire+0x41f/0x1fb0
[ 1967.587144] [<ffffffff81085b3b>] ? lock_hrtimer_base.isra.25+0x2b/0x60
[ 1967.593748] [<ffffffff810b4b4c>] lock_acquire+0xac/0x140
[ 1967.599143] [<ffffffff81098c18>] ? __start_cfs_bandwidth+0x68/0x90
[ 1967.605402] [<ffffffff8167cdac>] _raw_spin_lock+0x2c/0x40
[ 1967.610882] [<ffffffff81098c18>] ? __start_cfs_bandwidth+0x68/0x90
[ 1967.617132] [<ffffffff8167d236>] ? _raw_spin_unlock+0x26/0x30
[ 1967.622952] [<ffffffff81098c18>] __start_cfs_bandwidth+0x68/0x90
[ 1967.629038] [<ffffffff81098d27>] __account_cfs_rq_runtime+0xe7/0x190
[ 1967.635462] [<ffffffff81098ea8>] update_curr+0xd8/0x1c0
[ 1967.640768] [<ffffffff8109aca8>] put_prev_task_fair+0x228/0x350
[ 1967.646760] [<ffffffff8101bb99>] ? sched_clock+0x9/0x10
[ 1967.652060] [<ffffffff8109558d>] ? sched_clock_cpu+0xbd/0x110
[ 1967.657879] [<ffffffff8167b01f>] __schedule+0x13f/0x8a0
[ 1967.663187] [<ffffffff810b0fb0>] ? mark_held_locks+0x80/0x150
[ 1967.669011] [<ffffffff8167c405>] ? yield_to+0x145/0x210
[ 1967.674312] [<ffffffff8167c1a4>] schedule+0x24/0x70
[ 1967.679273] [<ffffffff8167c41a>] yield_to+0x15a/0x210
[ 1967.684419] [<ffffffffa0270e68>] kvm_vcpu_yield_to+0xc8/0x1c0 [kvm]
[ 1967.690767] [<ffffffffa0270da0>] ? kvm_vcpu_fault+0xf0/0xf0 [kvm]
[ 1967.696939] [<ffffffffa027107b>] kvm_vcpu_on_spin+0x11b/0x150 [kvm]
[ 1967.703283] [<ffffffffa0236da9>] handle_pause+0x19/0x30 [kvm_intel]
[ 1967.709630] [<ffffffffa023908e>] vmx_handle_exit+0xfe/0x8e0 [kvm_intel]
[ 1967.716330] [<ffffffffa028b085>] ? kvm_arch_vcpu_ioctl_run+0xe35/0x1330 [kvm]
[ 1967.723548] [<ffffffffa028adc5>] kvm_arch_vcpu_ioctl_run+0xb75/0x1330 [kvm]
[ 1967.730594] [<ffffffffa028ad5a>] ? kvm_arch_vcpu_ioctl_run+0xb0a/0x1330 [kvm]
[ 1967.737807] [<ffffffffa0273136>] kvm_vcpu_ioctl+0x456/0x640 [kvm]
[ 1967.743974] [<ffffffff810ba085>] ? do_futex+0xa45/0xbb0
[ 1967.749274] [<ffffffff8119a37f>] do_vfs_ioctl+0x8f/0x550
[ 1967.754667] [<ffffffff811a5bce>] ? fget_light+0xee/0x4d0
[ 1967.760059] [<ffffffff811a5b23>] ? fget_light+0x43/0x4d0
[ 1967.765445] [<ffffffff8167e355>] ? sysret_check+0x22/0x5d
[ 1967.770917] [<ffffffff8119a8d1>] SyS_ioctl+0x91/0xb0
[ 1967.775956] [<ffffffff81334cbe>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1967.782381] [<ffffffff8167e329>] system_call_fastpath+0x16/0x1b

CPU1:
[ 2044.045281] NMI backtrace for cpu 9
[ 2044.048763] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 3.10.38+ #17
[ 2044.055878] Hardware name: GIGABYTE GS-CFZ1/GS-CCZ/GS-CCZH/GA-7PTSH, BIOS D06 03/25/2013
[ 2044.063949] task: ffff88102956af40 ti: ffff881029570000 task.ti: ffff881029570000
[ 2044.071411] RIP: 0010:[<ffffffff81333a1d>] [<ffffffff81333a1d>] delay_tsc+0x1d/0x80
[ 2044.079153] RSP: 0018:ffff88207fc23d58 EFLAGS: 00000006
[ 2044.084451] RAX: 0000000000000000 RBX: ffff88207fdb45c0 RCX: 000000005ae55164
[ 2044.091567] RDX: 000000000000c7c6 RSI: 000000000002a000 RDI: 0000000000000001
[ 2044.098685] RBP: ffff88207fc23d78 R08: 0000000000000002 R09: 0000000000000000
[ 2044.105801] R10: 0000000000000000 R11: 0000000000000016 R12: 00000000831982c5
[ 2044.112919] R13: 0000000000000009 R14: 0000000000000001 R15: ffff88207fdb45c0
[ 2044.120035] FS: 0000000000000000(0000) GS:ffff88207fc20000(0000) knlGS:0000000000000000
[ 2044.128105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2044.133835] CR2: 00007f561cd6fc30 CR3: 0000000001c0c000 CR4: 00000000000427e0
[ 2044.140954] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2044.148072] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2044.155186] Stack:
[ 2044.157196] ffff88207fdb45c0 00000000831982c5 00000000833a1014 ffff881ec1a33400
[ 2044.164641] ffff88207fc23d88 ffffffff8133396a ffff88207fc23dc8 ffffffff8133c22c
[ 2044.172087] ffff88207fc23dc8 ffff88207fdb45c0 ffff881024dc8258 ffff88207fdb45c0
[ 2044.179534] Call Trace:
[ 2044.181974] <IRQ>
[ 2044.183898] [<ffffffff8133396a>] __delay+0xa/0x10
[ 2044.188886] [<ffffffff8133c22c>] do_raw_spin_lock+0xcc/0x120
[ 2044.194618] [<ffffffff8167cdb4>] _raw_spin_lock+0x34/0x40
[ 2044.200091] [<ffffffff8109be87>] ? distribute_cfs_runtime+0xb7/0x290
[ 2044.206523] [<ffffffff8109be87>] distribute_cfs_runtime+0xb7/0x290
[ 2044.212783] [<ffffffff8109bdd0>] ? rq_offline_fair+0x140/0x140
[ 2044.218695] [<ffffffff8109c288>] sched_cfs_period_timer+0x118/0x190
[ 2044.225035] [<ffffffff81085c2c>] __run_hrtimer+0x6c/0x1d0
[ 2044.230514] [<ffffffff810a626d>] ? ktime_get_update_offsets+0x4d/0xe0
[ 2044.237025] [<ffffffff8109c170>] ? sched_cfs_slack_timer+0x110/0x110
[ 2044.243450] [<ffffffff810864d7>] hrtimer_interrupt+0x107/0x260
[ 2044.249362] [<ffffffff81038684>] smp_apic_timer_interrupt+0x64/0xa0
[ 2044.255702] [<ffffffff8167efb2>] apic_timer_interrupt+0x72/0x80
[ 2044.261698] <EOI>
[ 2044.263622] [<ffffffff8150baae>] ? cpuidle_enter_state+0x5e/0xe0
[ 2044.269910] [<ffffffff8150baaa>] ? cpuidle_enter_state+0x5a/0xe0
[ 2044.275994] [<ffffffff8150bbe9>] cpuidle_idle_call+0xb9/0x250
[ 2044.281813] [<ffffffff8101d109>] arch_cpu_idle+0x9/0x30
[ 2044.287112] [<ffffffff810a3fc0>] cpu_startup_entry+0xe0/0x2d0
[ 2044.292930] [<ffffffff816662f9>] start_secondary+0x204/0x20b

CPU2:
[ 2043.793852] CPU: 12 PID: 1487 Comm: libvirtd Tainted: G W 3.10.38+ #17
[ 2043.801227] Hardware name: GIGABYTE GS-CFZ1/GS-CCZ/GS-CCZH/GA-7PTSH, BIOS D06 03/25/2013
[ 2043.809298] task: ffff882025785e80 ti: ffff882026518000 task.ti: ffff882026518000
[ 2043.816760] RIP: 0010:[<ffffffff81333a45>] [<ffffffff81333a45>] delay_tsc+0x45/0x80
[ 2043.824500] RSP: 0018:ffff882026519cd8 EFLAGS: 00000002
[ 2043.829800] RAX: 000011815ae55170 RBX: ffff881024dc8170 RCX: 000000005ae55170
[ 2043.836917] RDX: 0000000000001181 RSI: 0000000dc4934000 RDI: 0000000000000001
[ 2043.844033] RBP: ffff882026519cf8 R08: 0000000000000003 R09: 0000000000000000
[ 2043.851151] R10: 0000000000000000 R11: 000000000000049b R12: 000000005ae55170
[ 2043.858267] R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
[ 2043.865384] FS: 00007f67cf908700(0000) GS:ffff88207fc80000(0000) knlGS:0000000000000000
[ 2043.873453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2043.879185] CR2: 00007f67d5c22000 CR3: 0000001026737000 CR4: 00000000000427e0
[ 2043.886301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2043.893418] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2043.900534] Stack:
[ 2043.902543] ffff881024dc8170 0000000000000032 00000000833a1014 0000000000000001
[ 2043.909990] ffff882026519d08 ffffffff8133396a ffff882026519d48 ffffffff8133c22c
[ 2043.917437] ffff882026519d38 ffff881024dc8170 ffff881024dc81d8 ffff881024dc8170
[ 2043.924883] Call Trace:
[ 2043.927326] [<ffffffff8133396a>] __delay+0xa/0x10
[ 2043.932114] [<ffffffff8133c22c>] do_raw_spin_lock+0xcc/0x120
[ 2043.937846] [<ffffffff8167cdb4>] _raw_spin_lock+0x34/0x40
[ 2043.943325] [<ffffffff81098c18>] ? __start_cfs_bandwidth+0x68/0x90
[ 2043.949577] [<ffffffff8167d236>] ? _raw_spin_unlock+0x26/0x30
[ 2043.955395] [<ffffffff81098c18>] __start_cfs_bandwidth+0x68/0x90
[ 2043.961474] [<ffffffff8108f9a8>] tg_set_cfs_bandwidth+0x2e8/0x330
[ 2043.967638] [<ffffffff8108f768>] ? tg_set_cfs_bandwidth+0xa8/0x330
[ 2043.973897] [<ffffffff810952e7>] tg_set_cfs_period+0x17/0x20
[ 2043.979629] [<ffffffff81095300>] cpu_cfs_period_write_u64+0x10/0x20
[ 2043.985969] [<ffffffff810caa63>] cgroup_file_write+0x143/0x2e0
[ 2043.991882] [<ffffffff8118773b>] ? vfs_write+0x1bb/0x1e0
[ 2043.997274] [<ffffffff8118773b>] ? vfs_write+0x1bb/0x1e0
[ 2044.002665] [<ffffffff812c8cb7>] ? security_file_permission+0x27/0xb0
[ 2044.009180] [<ffffffff81187649>] vfs_write+0xc9/0x1e0
[ 2044.014315] [<ffffffff81187b10>] SyS_write+0x50/0xa0
[ 2044.019360] [<ffffffff8167e329>] system_call_fastpath+0x16/0x1b

>
> This patch /does/ ensure that we call
> start_bandwidth_timer/__hrtimer_start_range_ns with the updated period,
> but if I'm remembering my hrtimer mechanics correctly that is not
> necessary.
>
> I just did notice however that sched_cfs_period_timer reads
> cfs_b->period without any locks, which can in fact race with an update
> to period (and isn't fixed by this patch). If this write doesn't happen
> to be atomic (which is certainly not guaranteed, and is entirely
> plausible on say 32-bit x86, much less other archs), we could read a
> partially written value and move the timer incorrectly. Pulling the
> lock/unlock out of do_sched_cfs_period_timer should fix this easily
> enough.

Looks like another issue.
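
For reference, I believe you mean the hrtimer_forward() call in
sched_cfs_period_timer(); roughly (again a sketch from my reading of
this tree):

static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
{
        struct cfs_bandwidth *cfs_b =
                container_of(timer, struct cfs_bandwidth, period_timer);
        ktime_t now;
        int overrun;
        int idle = 0;

        for (;;) {
                now = hrtimer_cb_get_time(timer);
                /* cfs_b->period is read here without cfs_b->lock held */
                overrun = hrtimer_forward(timer, now, cfs_b->period);

                if (!overrun)
                        break;

                idle = do_sched_cfs_period_timer(cfs_b, overrun);
        }

        return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
}

A torn read of cfs_b->period here could indeed move the timer to a bogus
expiry.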

Thanks,
Roman

>
> unthrottle_offline_cfs_rqs could cause similar issues with its unlocked
> read of quota, and can also be easily fixed.
>
>
> >
> > Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
> > wait for period timer callbacks (ignoring cfs_b->timer_active) and
> > restart the timer explicitly.
> >
> > Signed-off-by: Roman Gushchin <klamm@xxxxxxxxxxxxxx>
> > Cc: Ben Segall <bsegall@xxxxxxxxxx>
> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > Cc: pjt@xxxxxxxxxx
> > Cc: Chris J Arges <chris.j.arges@xxxxxxxxxxxxx>
> > Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> > ---
> >  kernel/sched/core.c  | 3 +--
> >  kernel/sched/fair.c  | 8 ++++----
> >  kernel/sched/sched.h | 2 +-
> >  3 files changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d9d8ece..e9b9c66 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7717,8 +7717,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> >          /* restart the period timer (if active) to handle new period expiry */
> >          if (runtime_enabled && cfs_b->timer_active) {
> >                  /* force a reprogram */
> > -                cfs_b->timer_active = 0;
> > -                __start_cfs_bandwidth(cfs_b);
> > +                __start_cfs_bandwidth(cfs_b, true);
> >          }
> >          raw_spin_unlock_irq(&cfs_b->lock);
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7570dd9..55a0a5b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3129,7 +3129,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >           */
> >          if (!cfs_b->timer_active) {
> >                  __refill_cfs_bandwidth_runtime(cfs_b);
> > -                __start_cfs_bandwidth(cfs_b);
> > +                __start_cfs_bandwidth(cfs_b, false);
> >          }
> >
> >          if (cfs_b->runtime > 0) {
> > @@ -3308,7 +3308,7 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> >          raw_spin_lock(&cfs_b->lock);
> >          list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
> >          if (!cfs_b->timer_active)
> > -                __start_cfs_bandwidth(cfs_b);
> > +                __start_cfs_bandwidth(cfs_b, false);
> >          raw_spin_unlock(&cfs_b->lock);
> >  }
> >
> > @@ -3690,7 +3690,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >  }
> >
> >  /* requires cfs_b->lock, may release to reprogram timer */
> > -void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > +void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
> >  {
> >          /*
> >           * The timer may be active because we're trying to set a new bandwidth
> > @@ -3705,7 +3705,7 @@ void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >                  cpu_relax();
> >                  raw_spin_lock(&cfs_b->lock);
> >                  /* if someone else restarted the timer then we're done */
> > -                if (cfs_b->timer_active)
> > +                if (!force && cfs_b->timer_active)
> >                          return;
> >          }
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 456e492..369b4d6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -278,7 +278,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> >  extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
> >
> >  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
> > -extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> > +extern void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force);
> >  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
> >
> >  extern void free_rt_sched_group(struct task_group *tg);