Re: [PROBLEM] possible divide by 0 in kernel/sched/cputime.cscale_stime()
From: Stanislaw Gruszka
Date: Mon Nov 18 2013 - 08:05:24 EST
Hi Christian
On Sat, Nov 16, 2013 at 10:37:40PM +0100, Christian Engelmayer wrote:
> Since upgrading from v3.8 to v3.12 I see random crashes in function scale_stime()
> in kernel/sched/cputime.c:
>
> divide error: 0000 [#1] SMP
> Modules linked in: parport_pc(F) ppdev(F) bnep rfcomm bluetooth binfmt_misc(F)
> zl10353 cx88_dvb cx88_vp3054_i2c videobuf_dvb dvb_core intel_powerclamp coretemp
> kvm_intel(F) tuner_xc2028 kvm(F) i915 snd_hda_codec_hdmi snd_hda_codec_realtek
> cx8800 cx8802 tuner snd_hda_intel snd_hda_codec cx88_alsa crct10dif_pclmul(F)
> crc32_pclmul(F) snd_hwdep(F) snd_pcm(F) snd_page_alloc(F) ghash_clmulni_intel(F)
> aesni_intel(F) snd_seq_midi(F) snd_seq_midi_event(F) snd_rawmidi(F) snd_seq(F)
> joydev(F) cx88xx snd_seq_device(F) snd_timer(F) aes_x86_64(F) lrw(F) gf128mul(F)
> glue_helper(F) video(F) btcx_risc drm_kms_helper ablk_helper(F) tveeprom cryptd(F)
> lp(F) videobuf_dma_sg rc_core drm v4l2_common videobuf_core mei_me parport(F)
> snd(F) mei soundcore(F) videodev i2c_algo_bit serio_raw(F) microcode(F) mac_hid
> lpc_ich asus_atk0110 hid_generic usbhid hid usb_storage(F) firewire_ohci ahci(F)
> libahci(F) firewire_core r8169 crc_itu_t(F) mii(F)
> CPU: 3 PID: 15367 Comm: htop Tainted: GF 3.12.0-031200-generic #201311031935
> Hardware name: System manufacturer System Product Name/P7H55-M PRO, BIOS 1709 01/04/2011
> task: ffff8800cc09e000 ti: ffff8800af620000 task.ti: ffff8800af620000
> RIP: 0010:[<ffffffff81099de0>] [<ffffffff81099de0>] cputime_adjust+0xf0/0x110
> RSP: 0018:ffff8800af621cc8 EFLAGS: 00010847
> RAX: 85fdc1fef4047c00 RBX: 0000000000000000 RCX: ffff8800af621df8
> RDX: 0000000000000000 RSI: ffff8800cc0634d0 RDI: 0000000000000000
> RBP: ffff8800af621cd8 R08: 00000000fffffffe R09: 0000000000000000
> R10: 0000000000000000 R11: fffffffe03427acc R12: ffff8800af621df0
> R13: ffff8800af621df8 R14: 0000000000000000 R15: ffff8800cc063300
> FS: 00007f22a387d740(0000) GS:ffff880117c60000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f22a3892000 CR3: 0000000097023000 CR4: 00000000000007e0
> Stack:
> ffff8800c37f0000 ffff8800af621df0 ffff8800af621d18 ffffffff8109aa51
> 000000000a7d8c00 0000000042fee100 fffffffe03427acc ffff8800bf112a80
> ffff8800c37f0000 ffff8800c307c280 ffff8800af621e50 ffffffff8121f74b
> Call Trace:
> [<ffffffff8109aa51>] thread_group_cputime_adjusted+0x41/0x50
> [<ffffffff8121f74b>] do_task_stat+0x8eb/0xb60
> [<ffffffff81176400>] ? vma_compute_subtree_gap+0x50/0x50
> [<ffffffff81220314>] proc_tgid_stat+0x14/0x20
> [<ffffffff8121b16d>] proc_single_show+0x4d/0x90
> [<ffffffff811d6eee>] seq_read+0x14e/0x390
> [<ffffffff811b3725>] vfs_read+0x95/0x160
> [<ffffffff811b4239>] SyS_read+0x49/0xa0
> [<ffffffff81723ced>] system_call_fastpath+0x1a/0x1f
> Code: 89 fa 49 c1 ea 20 4d 85 d2 74 ca 4c 89 c2 48 d1 ef 49 89 c0 48 d1 ea 48
> 89 d0 eb 9f 0f 1f 80 00 00 00 00 89 c0 31 d2 49 0f af c0 <48> f7 f7 4c
> 89 df 48 29 c7 49 89 c3 e9 31 ff ff ff 66 66 66 66
> RIP [<ffffffff81099de0>] cputime_adjust+0xf0/0x110
> RSP <ffff8800af621cc8>
> ---[ end trace dbafd2159a385dd6 ]---
>
> The affected LOC performing the division by 0 was introduced in commit
>
> commit 55eaa7c1f511af5fb6ef808b5328804f4d4e5243
> Author: Stanislaw Gruszka <sgruszka@xxxxxxxxxx>
> Date: Tue Apr 30 17:14:42 2013 +0200
> sched: Avoid cputime scaling overflow
>
> For the problem to occur the function is called eg. with the following
> input parameters
>
> stime: 0x3567e00
> rtime: 0xffffffffbf1abfdb
> total: 0x3938700
>
> which leads to 'total' being shifted to 0 during the adaption of the precision
> and is then used without further check in
>
> scaled = div_u64((u64) (u32) stime * (u64) (u32) rtime, (u32)total);
>
> The root cause triggering this issue seems to be an overflowed value of
>
> rtime = nsecs_to_cputime(curr->sum_exec_runtime);
>
> On the affected machine the problem can be triggered by loading the
> previously idle system by starting a full kernel build. The problem occurs
> within a minute after the ondemand frequency scaling governor adjusts the
> frequency from the minimum to the maximum.
>
> The x86 init check whether all booted CPUs have their TSC's synchronized, never
> failed so far, however, the tsc clocksource is sporadically marked unstable.
>
> Clocksource tsc unstable (delta = -74994678 ns)
>
> The used CPU provides an Intel Invariant TSC as stated by CPUID.80000007H:EDX[8]:
>
> eax in eax ebx ecx edx
> 00000000 0000000b 756e6547 6c65746e 49656e69
> 00000001 00020652 04100800 0298e3ff bfebfbff
> 00000002 55035a01 00f0b2e3 00000000 09ca212c
> 00000003 00000000 00000000 00000000 00000000
> 00000004 00000000 00000000 00000000 00000000
> 00000005 00000040 00000040 00000003 00001120
> 00000006 00000005 00000002 00000001 00000000
> 00000007 00000000 00000000 00000000 00000000
> 00000008 00000000 00000000 00000000 00000000
> 00000009 00000000 00000000 00000000 00000000
> 0000000a 07300403 00000004 00000000 00000603
> 0000000b 00000000 00000000 0000002c 00000004
> 80000000 80000008 00000000 00000000 00000000
> 80000001 00000000 00000000 00000001 28100800
> 80000002 65746e49 2952286c 726f4320 4d542865
> 80000003 35692029 55504320 20202020 20202020
> 80000004 30353620 20402020 30322e33 007a4847
> 80000005 00000000 00000000 00000000 00000000
> 80000006 00000000 00000000 01006040 00000000
> 80000007 00000000 00000000 00000000 00000100
> 80000008 00003024 00000000 00000000 00000000
>
> Vendor ID: "GenuineIntel"; CPUID level 11
>
> Intel-specific functions:
> Version 00020652:
> Type 0 - Original OEM
> Family 6 - Pentium Pro
> Model 5 - Pentium II Model 5/Xeon/Celeron
> Stepping 2
> Reserved 8
>
> Extended brand string: "Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz"
> CLFLUSH instruction cache line size: 8
> Initial APIC ID: 4
> Hyper threading siblings: 16
>
> Feature flags bfebfbff:
> FPU Floating Point Unit
> VME Virtual 8086 Mode Enhancements
> DE Debugging Extensions
> PSE Page Size Extensions
> TSC Time Stamp Counter
> MSR Model Specific Registers
> PAE Physical Address Extension
> MCE Machine Check Exception
> CX8 COMPXCHG8B Instruction
> APIC On-chip Advanced Programmable Interrupt Controller present and enabled
> SEP Fast System Call
> MTRR Memory Type Range Registers
> PGE PTE Global Flag
> MCA Machine Check Architecture
> CMOV Conditional Move and Compare Instructions
> FGPAT Page Attribute Table
> PSE-36 36-bit Page Size Extension
> CLFSH CFLUSH instruction
> DS Debug store
> ACPI Thermal Monitor and Clock Ctrl
> MMX MMX instruction set
> FXSR Fast FP/MMX Streaming SIMD Extensions save/restore
> SSE Streaming SIMD Extensions instruction set
> SSE2 SSE2 extensions
> SS Self Snoop
> HT Hyper Threading
> TM Thermal monitor
> 31 reserved
>
> Nevertheless, when looking into the issue I saw occurences of sched_clock going
> backwards as if the TSCs were read out of sync. Accordingly the problem does
> not occur when either booting with option 'nosmp' or when forcing the TSC to be
> marked as unstable for sched_clock. Booting with 'acpi=off' and no frequency
> scaling works too.
>
> Having a look at the scheduler code I see the following pattern that would also
> catch a time warp, eg. kernel/sched/rt.c update_curr_rt():
>
> u64 delta_exec;
> delta_exec = rq_clock_task(rq) - curr->se.exec_start;
> if (unlikely((s64)delta_exec <= 0))
> return;
>
> However, there are still vulnerable places, eg. kernel/sched/fair.c update_curr():
>
> /*
> * Get the amount of time the current task was running
> * since the last time we changed load (this cannot
> * overflow on 32 bits):
> */
> delta_exec = (unsigned long)(now - curr->exec_start);
> if (!delta_exec)
> return;
Thanks for great analyse. I'm not sure where this problem should be
fixed (in TSC clocksource or sched/fair.c or sched/cputime.c), however
since RT scheduler is protected for sched_clock going backward, we
probably can protect fair scheduler as well (also do_task_delta_exec()
has similar check).
Does the below patch fixes the issue on affected machine?
Stanislaw
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7c70201..f02a567 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -727,7 +727,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
u64 now = rq_clock_task(rq_of(cfs_rq));
unsigned long delta_exec;
- if (unlikely(!curr))
+ if (unlikely(!curr || now <= curr->exec_start))
return;
/*
@@ -736,8 +736,6 @@ static void update_curr(struct cfs_rq *cfs_rq)
* overflow on 32 bits):
*/
delta_exec = (unsigned long)(now - curr->exec_start);
- if (!delta_exec)
- return;
__update_curr(cfs_rq, curr, delta_exec);
curr->exec_start = now;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/