Re: sched: divide error in sg_capacity_factor
From: Sasha Levin
Date: Tue Mar 10 2015 - 06:56:35 EST
On 03/10/2015 12:29 AM, Ingo Molnar wrote:
>
> * Sasha Levin <sasha.levin@xxxxxxxxxx> wrote:
>
>> Hi all,
>>
>> While fuzzing with trinity inside the latest -next kernel using trinity I've stumbled on:
>>
>> [ 936.784266] divide error: 0000 [#1] PREEMPT SMP KASAN
>> [ 936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
>
> Hm, these line numbers don't seem to match up very well with my
> version of linux-next:
>
> 28855005be1d Add linux-next specific files for 20150306
>
> and the Git version info included in the oops seems useless:
>
> 4.0.0-rc1-sasha-00044-ge21109a
>
> $ git log e21109a
> fatal: ambiguous argument 'e21109a': unknown revision or path not in the working tree.
>
> I think the kernel's SHA1 should be made at least 12 char wide,
> regardless of the user's gitconfig::core.abbrev settings.
>
> Also, latest linux-next is -rc2 based, while your version string says
> -rc1.
>
>> [ 936.829403] load_balance (kernel/sched/fair.c:6857)
>
> this does not match up either.
>
>> [ 936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 kernel/sched/fair.c:7659)
>
> The line numbers are not even close to anything related:
> run_rebalance_domains() starts at line 7666 and ends at 7680.
Right, this is my fuck up. It seems that I was fuzzing 4.0-rc1 rather than
-next as I thought I was. I forgot to go back to -next after I tested a few
things on Linus's tree.
So the line numbers should match correctly with Linus's tree as you've already
guessed below.
> Also, why are the offsets into the function missing from the output?
> Those would allow the rough determination of the crash site, even if
> debuginfo is crap.
I found that offsets are useless here because of the really odd things the
compiler does based on my config. There's so many things that got inlined
in this case that I think offsets wouldn't mean anything to you here.
For example, in this case the division by 0 happened on load_balance+0x88a/0x2399.
> I also checked Linus's latest, and they do seem to match up better:
>
> affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
>
> and the line number gives:
>
> capacity_factor = min_t(unsigned,
> capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
>
> but that's a division with a constant? Should not trap.
>
> So I rebuild a kernel with debug info, pattern matched the disassembly
> you provided, and that gave me this division:
>
> (gdb) list *0xffffffff8107d958
> 0xffffffff8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
> 6157 capacity = group->sgc->capacity;
> 6158 capacity_orig = group->sgc->capacity_orig;
> 6159 cpus = group->group_weight;
> 6160
> 6161 /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
> 6162 smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
> 6163 capacity_factor = cpus / smt; /* cores */
> 6164
> 6165 capacity_factor = min_t(unsigned,
> 6166 capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
This is the division I was seeing as well.
> So this too seems not very plausible: 'capacity_orig' comes straight
> from group->sgc->capacity_orig, which is:
>
> - boot time initialized
>
> - sometimes recalculated during CPU hot-plug: not sure how much of
> that your tests are doing?
I'm not forcing them, but they do happen pretty often.
> - but otherwise it's fairly constant and should have crashed your
> system early on if it was set up wrong
>
> unless I missed something that is.
>
>> [ 936.829403] __do_softirq (kernel/softirq.c:273 include/linux/jump_label.h:114 include/trace/events/irq.h:126 kernel/softirq.c:274)
>> [ 936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> [ 936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
>> [ 936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
>> [ 936.829403] <EOI>
>> [ 936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
>> [ 936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
>> [ 936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
>> [ 936.829403] __set_page_owner (mm/page_owner.c:72)
>> [ 936.829403] get_page_from_freelist (include/linux/page_owner.h:26 mm/page_alloc.c:2176)
>> [ 936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
>> [ 936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
>> [ 936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 mm/memory.c:3269 mm/memory.c:3298)
>> [ 936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
>> [ 936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
>> [ 936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
>> [ 936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)
>
> So debug info weirdnesses aside, other divisions in
> find_busiest_group():
>
> sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
> / sds.total_capacity;
>
> total_capacity ought to be zero only on a totally borked machine
> (unlikely to boot), or on memory corruption.
>
> if calculate_imbalance() got inlined, then:
>
> load_above_capacity /= busiest->group_capacity;
>
> that too ought to only get corrupted in the most serious cases, we
> don't recalculate it runtime.
>
> So I'm baffled. Some tentative handwaving, pointing away from the
> scheduler:
>
> - Your stack trace is 'weird' not just due to debug info: an async
> page fault doing allocations, doing a stack trace, interruted by a
> timer irq, doing scheduler rebalancing...
Yeah, it's not the most straightforward trace, but it's not "broken" -
it's a plausible scenario.
> - The (spectacularly misnamed [*] ) CONFIG_PAGE_OWNER=y page lifetime
> tracing facility got enabled explicitly via the page_owner=on boot
> parameter, right? Not many people are doing that I suspect.
Right. But all it did here was save a stack trace, no?
> - CONFIG_KASAN=y is enabled in your kernel. New, invasive option,
> using compiler features that weren't used by kernel code before.
Agreed, I've Cc'ed Andrey and hope that he could see if there's anything
up with the assembly that's fishy.
> - async page faults are virtualization specials: not used much
> elsewhere.
>
> - There's a 'W' taint in your oops. Probably some harmless prior
> warning?
Yup, just one of my debug patches that helps me track down a different
bug.
> So your crash signature has the combination of 3 'uncommon' kernel
> features, and a scheduler crash with a relatively constant value that
> should never be zero and which should crash everywhere.
>
> So right now I'd blame the other 3 guys, I wasn't even there that
> night, officer!
Fair enough, I've only seen it once myself so I can ignore it until (if)
I hit it again.
Sorry again about the tree mixup. I'll go back to fuzzing -next now :)
Thanks,
Sasha
> Cc:-ed others as well.
>
> Thanks,
>
> Ingo
>
> [*] Please name debugging features accordingly: CONFIG_DEBUG_PAGE_OWNER.
> Maybe even prefix them with the subsystem: CONFIG_DEBUG_VM_PAGE_OWNER.
>
> We already have a nice set of CONFIG_DEBUG_VM* options:
>
> CONFIG_DEBUG_VM
> CONFIG_DEBUG_VM_RB
> CONFIG_DEBUG_VM_VMACACHE
>
> Also, there's no penalty for including a verb, so that people know
> wth it's doing, at a glance: CONFIG_DEBUG_VM_TRACK_PAGE_OWNER?
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/