Re: sched: odd values for effective load calculations

From: Sasha Levin
Date: Mon Dec 15 2014 - 23:52:30 EST


On 12/15/2014 07:12 AM, Peter Zijlstra wrote:
>
> Sorry for the long delay, I was out for a few weeks due to having become
> a dad for the second time.

Congrats! May you be able to sleep at night sooner rather than later.

> On Sat, Dec 13, 2014 at 09:30:12AM +0100, Ingo Molnar wrote:
>> * Sasha Levin <levinsasha928@xxxxxxxxx> wrote:
>>
>>> Hi all,
>>>
>>> I was fuzzing with trinity inside a KVM tools guest, running the latest -next
>>> kernel along with the undefined behaviour sanitizer patch, and hit the following:
>>>
>>> [ 787.894288] ================================================================================
>>> [ 787.897074] UBSan: Undefined behaviour in kernel/sched/fair.c:4541:17
>>> [ 787.898981] signed integer overflow:
>>> [ 787.900066] 361516561629678 * 101500 cannot be represented in type 'long long int'
>
> So that's:
>
>         this_eff_load *= this_load +
>                 effective_load(tg, this_cpu, weight, weight);
>
> Going by the numbers the 101500 must be 'this_eff_load', 100 * ~1024
> makes that. Which makes the rhs 'large'. Do you have
> CONFIG_FAIR_GROUP_SCHED enabled? If so, what kind of cgroup hierarchy
> are you using?

CONFIG_FAIR_GROUP_SCHED is enabled. There's no cgroup set-up initially,
but I figure that trinity is able to do crazy things here.
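
As a sanity check on the arithmetic (not a fix), here is a minimal userspace
sketch of the multiplication from the report. It assumes GCC 5+ or Clang for
__builtin_mul_overflow; the helper names are made up for illustration and are
not kernel functions. The two constants are the ones from the UBSan splat.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): multiply two signed 64-bit
 * values and report whether the result fits in int64_t.
 * __builtin_mul_overflow is a GCC/Clang builtin, assumed available.
 */
static bool mul_would_overflow(int64_t a, int64_t b, int64_t *out)
{
	return __builtin_mul_overflow(a, b, out);
}

/*
 * Largest non-negative right-hand side that can be multiplied by
 * 'factor' without overflowing int64_t. Assumes factor > 0.
 */
static int64_t max_safe_rhs(int64_t factor)
{
	return INT64_MAX / factor;
}

#ifdef DEMO_MAIN
int main(void)
{
	/* Values taken from the UBSan report; 101500 == 100 * 1015. */
	const int64_t rhs = 361516561629678LL;
	const int64_t eff = 101500LL;
	int64_t product;

	if (mul_would_overflow(rhs, eff, &product))
		printf("%lld * %lld overflows s64\n",
		       (long long)rhs, (long long)eff);
	else
		printf("product = %lld\n", (long long)product);

	printf("largest safe rhs for %lld: %lld\n",
	       (long long)eff, (long long)max_safe_rhs(eff));
	return 0;
}
#endif
```

Built with -DDEMO_MAIN, this should flag the overflow for the two reported
values. The largest right-hand side that still fits is INT64_MAX / 101500,
which is well below the 361516561629678 in the report.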

> In any case, bit sad this doesn't have a register dump included :/
>
> Is this easy to reproduce or something that happened once?

It's fairly reproducible; I've seen it happen quite a few times. What other
information might be useful?

>>> The values for effective load seem a bit off (and are overflowing!).
>>
>> It definitely looks like a bug in SMP load balancing!
>
> Yeah, although theoretically (and to some extent in practice) this can be
> triggered in more places if you manage to run up the 'weight' with
> enough tasks.
>
> That said, it should at worst result in 'funny' balancing behaviour, not
> anything else.

I'm not sure if you've caught up on the RCU stall issue we've been trying
to track down (https://lkml.org/lkml/2014/11/14/656), but could this "funny"
balancing behaviour be "funny" enough to cause a stall?


Thanks,
Sasha
