Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems
From: bsegall
Date: Fri Dec 11 2015 - 14:19:05 EST
Dietmar Eggemann <dietmar.eggemann@xxxxxxx> writes:
> On 11/12/15 17:57, Morten Rasmussen wrote:
>> On Fri, Dec 11, 2015 at 05:00:01PM +0300, Andrey Ryabinin wrote:
>>>
>>>
>>> On 12/11/2015 04:36 PM, Peter Zijlstra wrote:
>>>> On Fri, Dec 11, 2015 at 02:25:51PM +0100, Peter Zijlstra wrote:
>>>>> On Fri, Dec 11, 2015 at 03:55:18PM +0300, Andrey Ryabinin wrote:
>>>>>> Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
>>>>>> on 32-bit systems:
>>>>>> UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
>>>>>> signed integer overflow:
>>>>>> 87950 * 47742 cannot be represented in type 'int'
>>>>>>
>>>>>> Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
>>>>>> Signed-off-by: Andrey Ryabinin <aryabinin@xxxxxxxxxxxxx>
>>>>>> ---
>>>>>> kernel/sched/fair.c | 4 ++--
>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index e3266eb..733f0b8 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -2780,14 +2780,14 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>>>>>> int decayed, removed = 0;
>>>>>>
>>>>>> if (atomic_long_read(&cfs_rq->removed_load_avg)) {
>>>>>> - long r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>>>>>> + s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
>>>>>> sa->load_avg = max_t(long, sa->load_avg - r, 0);
>>>>>> sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
>>>>>
>>>>> This makes sense, because sched_avg::load_sum is u64.
>>
>> A single removed nice=-20 task should be sufficient to cause the
>> overflow.
>
> yeah, this 87950 could be related to a single nice=-20 task
> (prio_to_weight[0]) or it is a value aggregated from more than one task.
> In any case the error is related to load not util.
>
>>
>>>>>
>>>>>> removed = 1;
>>>>>> }
>>>>>>
>>>>>> if (atomic_long_read(&cfs_rq->removed_util_avg)) {
>>>>>> - long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
>>>>>> + s64 r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
>>>>>> sa->util_avg = max_t(long, sa->util_avg - r, 0);
>>>>>> sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
>>>>>> }
>>>>>
>>>>> However sched_avg::util_sum is u32, so this is still wrecked.
>>>>
>>>> I seems to have wrecked that in:
>>>>
>>>> 006cdf025a33 ("sched/fair: Optimize per entity utilization tracking")
>>>>
>>>> maybe just make util_load u64 too?
>>
>> It isn't as bad, but the optimization does increase the normal range
>> (not guaranteed) for util_sum from 47742 to
>> scale_down(SCHED_LOAD_SCALE)*47742 (= 1024*47742, unless you mess with
>> the scaling).
>>
>>> Is there any guarantee that the final result of expression 'util_sum - r * LOAD_AVG_MAX' always can be represented by s32?
>>>
>>> If yes, than we could just do this:
>>> max_t(s32, (u64)sa->util_sum - r * LOAD_AVG_MAX, 0)
>>
>> In most cases 'r' shouldn't exceed 1024 and util_sum not significantly
>> exceed 1024*47742, but in extreme cases like spawning lots of new tasks
>> it may potentially overflow 32 bit. Newly created tasks contribute
>> 1024*47742 each to the rq util_sum, which means that more than ~87 new
>> tasks on a single rq will get us in trouble I think.
>>
>> Without Peter's optimization referenced above, that number should
>> increase to ~87k tasks as each task only contributed 47742 before, but
>> 'r' could still cause 32-bit overflow if we remove more than ~87 newly
>> created tasks in one go. But I'm not sure if that is a situation we
>> should worry about?
>>
>> I think we have to either make util_sum u64 too or look at the
>> optimizations again.
>
> But for me the question here is if 'r' for util has to be changed from
> long to s64 as well.
>
> IMHO, on 32bit machine we can deal with (2147483648/47742/1024 = 43.9)
> 43 tasks before overflowing.
>
> Can we have a scenario where >43 tasks with se->avg.util_avg=1024 value
> get migrated (migrate_task_rq_fair()) or die (task_dead_fair()) or a
> task group dies (free_fair_sched_group()) which has a se->avg.util_avg >
> 44981 for a specific cpu before the atomic_long_xchg() happens in
> update_cfs_rq_load_avg()? Never saw this in my tests so far on ARM
> machines.
First, I believe in theory util_avg on a cpu should add up to 100% or
1024 or whatever. However, recently migrated-in tasks don't have their
utilization cleared, so if they were quickly migrated again you could
have up to the number of cpus or so times 100%, which could lead to
overflow here. This just leads to more questions though:
The whole removed_util_avg thing doesn't seem to make a ton of sense -
the code doesn't add util_avg for a migrating task onto
cfs_rq->avg.util_avg, and doing so would regularly give >100% values (it
does so on attach/detach where it's less likely to cause issues, but not
migration). Removing it only makes sense if the task has accumulated all
that utilization on this cpu, and even then mostly only makes sense if
this is the only task on the cpu (and then it would make sense to add it
on migrate-enqueue). The whole add-on-enqueue-migrate,
remove-on-dequeue-migrate thing comes from /load/, where doing so is a
more globally applicable approximation than it is for utilization,
though it could still be useful as a fast-start/fast-stop approximation,
if the add-on-enqueue part was added. It could also I guess be cleared
on migrate-in, as basically the opposite assumption (or do something
like add on enqueue, up to 100% and then set the se utilization to the
amount actually added or something).
If the choice was to not do the add/remove thing, then se->avg.util_sum
would be unused except for attach/detach, which currently do the
add/remove thing. It's not unreasonable for them, except that currently
nothing uses anything other than the root's utilization, so migration
between cgroups wouldn't actually change the relevant util number
(except it could because changing the cfs_rq util_sum doesn't actually
do /anything/ unless it's the root, so you'd have to wait until the
cgroup se actually changed in utilization).
So uh yeah, my initial impression is "rip it out", but if being
immediately-correct is important in the case of one task being most of
the utilization, rather than when it is more evenly distributed, it
would probably make more sense to instead put in the add-on-enqueue
code.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/