Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()

From: Konstantin Khlebnikov
Date: Fri Apr 03 2015 - 08:51:36 EST


On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
pick_next_task_fair() must be sure that there is at least one runnable
task before calling put_prev_task(), but put_prev_task() can expire the
last remains of the cfs quota and throttle all currently runnable tasks.
As a result, pick_next_task_fair() cannot find the next task and crashes.
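
The ordering problem described above can be illustrated with a small
user-space sketch. This is not kernel code: struct toy_cfs_rq,
toy_put_prev() and toy_pick_next() are hypothetical names, throttling is
collapsed into the accounting step, and the global pool is assumed to be
empty so that no refill can happen.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for a throttleable cfs_rq. */
struct toy_cfs_rq {
	long long runtime_remaining;	/* local quota left, in ns */
	unsigned int nr_running;	/* entities visible to the picker */
	bool throttled;
};

enum toy_pick {
	TOY_IDLE,		/* nothing runnable at entry, safe idle path */
	TOY_PICKED,		/* an entity is still there to pick */
	TOY_NULL_ENTITY,	/* where the real kernel would hit a NULL entity */
};

/* Models put_prev_task(): charge delta_exec and optionally expire quota. */
static void toy_put_prev(struct toy_cfs_rq *cfs_rq, long long delta_exec,
			 bool quota_expired)
{
	assert(delta_exec >= 0);

	cfs_rq->runtime_remaining -= delta_exec;
	if (quota_expired && cfs_rq->runtime_remaining > 0)
		cfs_rq->runtime_remaining = 0;	/* pre-patch behaviour */

	if (cfs_rq->runtime_remaining <= 0) {
		/* Throttling hides every runnable entity from the picker. */
		cfs_rq->throttled = true;
		cfs_rq->nr_running = 0;
	}
}

/*
 * Models the ordering in pick_next_task_fair(): nr_running is checked
 * first, then prev is put back, and only then is the next entity
 * looked up.
 */
static enum toy_pick toy_pick_next(struct toy_cfs_rq *cfs_rq,
				   long long delta_exec, bool quota_expired)
{
	if (!cfs_rq->nr_running)
		return TOY_IDLE;

	toy_put_prev(cfs_rq, delta_exec, quota_expired);

	/* The earlier nr_running check may be stale by now. */
	return cfs_rq->nr_running ? TOY_PICKED : TOY_NULL_ENTITY;
}
```

In this model, a call with quota_expired set on a queue that still had
runnable entities at entry ends in TOY_NULL_ENTITY, which corresponds to
the NULL dereference in set_next_entity() shown in the trace.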

Kernel crash looks like this:

<1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
<1>[50288.719538] IP: [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.719567] PGD 0
<4>[50288.719578] Oops: 0000 [#1] SMP
<4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost 8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath
<4>[50288.719956] linear
<4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7
<4>[50288.719987] Hardware name:
<4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti: ffff8808445f8000
<4>[50288.720041] RIP: 0010:[<ffffffff81097b8c>] [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.720072] RSP: 0018:ffff8808445fbbb8 EFLAGS: 00010086
<4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000bcb8
<4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88107fd72af0
<4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09: 0000000000000001
<4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
<4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15: ffff88107fd72af0
<4>[50288.720214] FS: 00007f55467fc700(0000) GS:ffff88107fd60000(0000) knlGS:ffff8802175e0000
<4>[50288.720242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4: 00000000000427e0
<4>[50288.720287] Stack:
<4>[50288.720296] ffff88107fd72a80 ffff88107fd72a80 0000000000000000 0000000000000000
<4>[50288.720327] ffff8808445fbc68 ffffffff8109ead8 ffff880800000000 ffffffffa1438990
<4>[50288.720357] ffff880b6f250000 0000000000000000 0000000000012a80 ffff880036acbaa0
<4>[50288.720388] Call Trace:
<4>[50288.720402] [<ffffffff8109ead8>] pick_next_task_fair+0x88/0x5d0
<4>[50288.720429] [<ffffffffa1438990>] ? vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel]
<4>[50288.720457] [<ffffffff81096b95>] ? sched_clock_cpu+0x85/0xc0
<4>[50288.720479] [<ffffffff816b5b99>] __schedule+0xf9/0x7d0
<4>[50288.720500] [<ffffffff816bb210>] ? reboot_interrupt+0x80/0x80
<4>[50288.720522] [<ffffffff816b630a>] _cond_resched+0x2a/0x40
<4>[50288.720549] [<ffffffffa03dd8c5>] __vcpu_run+0xd35/0xf30 [kvm]
<4>[50288.720573] [<ffffffff81075fc7>] ? __set_task_blocked+0x37/0x80
<4>[50288.720595] [<ffffffff8109387e>] ? try_to_wake_up+0x21e/0x360
<4>[50288.720622] [<ffffffffa03ddb65>] kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm]
<4>[50288.720650] [<ffffffffa03c48b2>] kvm_vcpu_ioctl+0x2c2/0x620 [kvm]
<4>[50288.720675] [<ffffffff811c01c6>] do_vfs_ioctl+0x86/0x4f0
<4>[50288.720697] [<ffffffff810d14a2>] ? SyS_futex+0x142/0x1a0
<4>[50288.720717] [<ffffffff811c06c1>] SyS_ioctl+0x91/0xb0
<4>[50288.720737] [<ffffffff816ba489>] system_call_fastpath+0x12/0x17
<4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[50288.722636] RIP [<ffffffff81097b8c>] set_next_entity+0x1c/0x80
<4>[50288.723533] RSP <ffff8808445fbbb8>
<4>[50288.724406] CR2: 0000000000000038

In pick_next_task_fair() cfs_rq->nr_running was non-zero, but after
put_prev_task(rq, prev) the kernel cannot find any task to schedule next.

It crashes from time to time on a strange libvirt/kvm setup where
cfs_quota is set on two levels: at the parent cgroup which contains kvm
and at each per-vcpu child cgroup.

This patch isn't verified yet, but I haven't found any other possible
reason for that crash.


This patch leaves 1 in ->runtime_remaining when the current assignment
expires and tries to refill it right after that. In the worst case the
task will be scheduled once and throttled at the end of its slice.
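
A rough user-space sketch of the patched accounting may help to follow
that worst case. It is an illustration only: struct toy_rq, struct
toy_bw, TOY_SLICE_NS and the toy_*() helpers are hypothetical, the
"extend local deadline" branch is left out, and the refill step is only
assumed to behave like a simple "take up to one slice from the global
pool".

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLICE_NS 5000000LL	/* hypothetical refill slice, 5 ms */

struct toy_bw {
	long long pool;			/* global runtime left in this period */
};

struct toy_rq {
	long long runtime_remaining;	/* local runtime left, in ns */
};

/* Patched expiry: keep 1 ns instead of dropping the local runtime to 0. */
static void toy_expire(struct toy_rq *rq, bool deadline_passed)
{
	if (rq->runtime_remaining <= 0)
		return;			/* nothing to expire */
	if (!deadline_passed)
		return;			/* deadline is still ahead */
	rq->runtime_remaining = 1;
}

/* Assumed refill: top the local runtime up to one slice if the pool allows. */
static bool toy_assign(struct toy_rq *rq, struct toy_bw *bw)
{
	long long want = TOY_SLICE_NS - rq->runtime_remaining;
	long long amount = want < bw->pool ? want : bw->pool;

	if (amount < 0)
		amount = 0;
	bw->pool -= amount;
	rq->runtime_remaining += amount;

	return rq->runtime_remaining > 0;
}

/* Returns true while the queue may keep running, false once it must throttle. */
static bool toy_account(struct toy_rq *rq, struct toy_bw *bw,
			long long delta_exec, bool deadline_passed)
{
	assert(delta_exec >= 0);

	rq->runtime_remaining -= delta_exec;
	toy_expire(rq, deadline_passed);

	if (rq->runtime_remaining > 1)
		return true;

	/* 1 ns or less left: try to refill right away. */
	return toy_assign(rq, bw);
}
```

With an empty pool, the first toy_account() call after the deadline
leaves exactly 1 in runtime_remaining and still reports the queue as
runnable; the next call with a non-zero delta_exec drives it negative
and reports a throttle, which matches "scheduled once and throttled at
the end of its slice".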

Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
---
kernel/sched/fair.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3c097a..91785d077db4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

-	/* if the deadline is ahead of our clock, nothing to do */
-	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
+	/* nothing to expire */
+	if (cfs_rq->runtime_remaining <= 0)
 		return;

-	if (cfs_rq->runtime_remaining < 0)
+	/* if the deadline is ahead of our clock, nothing to do */
+	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
 		return;

 	/*
@@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		/* extend local deadline, drift is bounded above by 2 ticks */
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
-		/* global deadline is ahead, expiration has passed */
-		cfs_rq->runtime_remaining = 0;
+		/*
+		 * Global deadline is ahead, expiration has passed.
+		 *
+		 * Do not expire runtime completely. Otherwise put_prev_task()
+		 * can throttle all tasks when we already checked nr_running or
+		 * put_prev_entity() can throttle already chosen next entity.
+		 */
+		cfs_rq->runtime_remaining = 1;
 	}
 }

@@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 	cfs_rq->runtime_remaining -= delta_exec;
 	expire_cfs_rq_runtime(cfs_rq);

-	if (likely(cfs_rq->runtime_remaining > 0))
+	if (likely(cfs_rq->runtime_remaining > 1))
 		return;

 	/*



--
Konstantin