Divide-by-zero in post_init_entity_util_avg

From: Chris Wilson
Date: Thu Jun 09 2016 - 05:02:08 EST



[15774.966082] divide error: 0000 [#1] SMP
[15774.966137] Modules linked in: i915 intel_gtt
[15774.966208] CPU: 1 PID: 15319 Comm: gemscript Not tainted 4.7.0-rc1+ #330
[15774.966252] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[15774.966317] task: ffff880276a55e80 ti: ffff880272c10000 task.ti: ffff880272c10000
[15774.966377] RIP: 0010:[<ffffffff810b12e3>] [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80
[15774.966463] RSP: 0018:ffff880272c13e30 EFLAGS: 00010057
[15774.966504] RAX: 0000000022f00000 RBX: ffff880276a51f80 RCX: 00000000000000e8
[15774.966550] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880276a52000
[15774.966593] RBP: ffff880272c13e30 R08: 0000000000000000 R09: 0000000000000008
[15774.966635] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880276a52544
[15774.966678] R13: ffff880276a52000 R14: ffff880276a521d8 R15: 0000000000003bda
[15774.966721] FS: 00007fe4df95a740(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[15774.966781] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15774.966822] CR2: 00007fe4de7f0378 CR3: 0000000275e15000 CR4: 00000000001006e0
[15774.966860] Stack:
[15774.966894] ffff880272c13e68 ffffffff810a69c5 0000000000000206 0000000000003bd7
[15774.966987] 0000000000000000 0000000000000000 ffff880276a51f80 ffff880272c13ee8
[15774.967073] ffffffff8107a364 0000000000003bd7 ffffffffffffffff 0000000000000000
[15774.967162] Call Trace:
[15774.967208] [<ffffffff810a69c5>] wake_up_new_task+0x95/0x130
[15774.967256] [<ffffffff8107a364>] _do_fork+0x184/0x400
[15774.967302] [<ffffffff8107a6b7>] SyS_clone+0x37/0x50
[15774.967353] [<ffffffff8100197d>] do_syscall_64+0x5d/0xe0
[15774.967402] [<ffffffff815719bc>] entry_SYSCALL64_slow_path+0x25/0x25
[15774.967444] Code: 89 d1 48 c1 e9 3f 48 01 d1 48 d1 f9 48 85 c9 7e 38 48 85 c0 74 35 48 0f af 07 31 d2 48 89 87 e0 00 00 00 48 8b 76 70 48 83 c6 01 <48> f7 f6 48 39 c8 77 18 48 89 c1 48 89 87 e0 00 00 00 69 c9 7e
[15774.968086] RIP [<ffffffff810b12e3>] post_init_entity_util_avg+0x43/0x80
[15774.968142] RSP <ffff880272c13e30>

I've presumed commit 2b8c41daba327 ("sched/fair: Initiate a new task's
util avg to a bounded value") to be at fault, hence the CCs. Though it
may just be a victim.

gdb says 0x43/0x80 is

725 if (cfs_rq->avg.util_avg != 0) {
726 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
-> 727 sa->util_avg /= (cfs_rq->avg.load_avg + 1);
728
729 if (sa->util_avg > cap)
730 sa->util_avg = cap;
731 } else {

I've run the same fork-heavy workload that seemed to hit the initial
fault under kasan. kasan has not reported any errors, nor has the bug
reoccurred after a day (earlier I had a couple of panics within a few
hours).

Is it possible for a race window where cfg_rq->avg.load_avg is indeed
-1? Any evidence of other memcorruption in the above?
-Chris

--
Chris Wilson, Intel Open Source Technology Centre