Re: [PATCH] sched/fair: fix mul overflow on 32-bit systems

From: Morten Rasmussen
Date: Mon Dec 14 2015 - 07:32:38 EST


On Fri, Dec 11, 2015 at 11:18:56AM -0800, bsegall@xxxxxxxxxx wrote:
> Dietmar Eggemann <dietmar.eggemann@xxxxxxx> writes:
> > IMHO, on 32bit machine we can deal with (2147483648/47742/1024 = 43.9)
> > 43 tasks before overflowing.
> >
> > Can we have a scenario where >43 tasks with se->avg.util_avg=1024 value
> > get migrated (migrate_task_rq_fair()) or die (task_dead_fair()) or a
> > task group dies (free_fair_sched_group()) which has a se->avg.util_avg >
> > 44981 for a specific cpu before the atomic_long_xchg() happens in
> > update_cfs_rq_load_avg()? Never saw this in my tests so far on ARM
> > machines.
>
> First, I believe in theory util_avg on a cpu should add up to 100% or
> 1024 or whatever. However, recently migrated-in tasks don't have their
> utilization cleared, so if they were quickly migrated again you could
> have up to the number of cpus or so times 100%, which could lead to
> overflow here.

Not only that, just creating new tasks can cause the overflow. As Yuyang
already pointed out in this thread, tasks are initialized to 100%, so
spawning n_cpus*44 tasks should almost guarantee overflow for at least
one rq in the system.
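
To put numbers on it, here is a minimal user-space sketch of the
arithmetic (not kernel code). It assumes LOAD_AVG_MAX = 47742, a 32-bit
signed long, and util_avg = 1024 for a freshly created task. The helper
names are made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_MAX  47742  /* assumed value, as used in the 43.9 estimate */
#define UTIL_AVG_FULL 1024   /* util_avg of a task initialized to 100% */

/*
 * Hypothetical helper: largest number of 100% tasks whose summed
 * util_avg, multiplied by LOAD_AVG_MAX, still fits in a signed
 * 32-bit long.
 */
static int64_t max_tasks_before_overflow(void)
{
	return (int64_t)INT32_MAX / LOAD_AVG_MAX / UTIL_AVG_FULL;
}

/*
 * Hypothetical helper: returns 1 if removed_util_avg * LOAD_AVG_MAX
 * would not fit in a signed 32-bit long, 0 otherwise. The product is
 * computed in 64 bits so the check itself cannot overflow.
 */
static int mul_overflows_32bit(int64_t removed_util_avg)
{
	assert(removed_util_avg >= 0);
	return removed_util_avg * LOAD_AVG_MAX > INT32_MAX;
}
```

With these assumptions 43 tasks at 1024 still fit and the 44th pushes
the product past 2^31, which is where the n_cpus*44 figure comes from.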

> This just leads to more questions though:
>
> The whole removed_util_avg thing doesn't seem to make a ton of sense -
> the code doesn't add util_avg for a migrating task onto
> cfs_rq->avg.util_avg, and doing so would regularly give >100% values (it
> does so on attach/detach where it's less likely to cause issues, but not
> migration). Removing it only makes sense if the task has accumulated all
> that utilization on this cpu, and even then mostly only makes sense if
> this is the only task on the cpu (and then it would make sense to add it
> on migrate-enqueue). The whole add-on-enqueue-migrate,
> remove-on-dequeue-migrate thing comes from /load/, where doing so is a
> more globally applicable approximation than it is for utilization,
> though it could still be useful as a fast-start/fast-stop approximation,
> if the add-on-enqueue part was added. It could also I guess be cleared
> on migrate-in, as basically the opposite assumption (or do something
> like add on enqueue, up to 100% and then set the se utilization to the
> amount actually added or something).

Migrated tasks are already added to cfs_rq->avg.util_avg (as Yuyang
already pointed out), which gives us a very responsive metric for cpu
utilization. util_avg > 100% is currently quite a common transient
scenario. It happens very often when creating new tasks. Unless we
always clear util_avg on migration (including wake-up migration) we
will have to deal with util_avg > 100%, but clearing would make
per-entity utilization tracking useless. The whole point, as I see it,
is to have a utilization metric which can deal with task migrations.

We do however have to be very clear about the meaning of util_avg. It
has very little meaning for either the sched_entities or the cfs_rq
when cfs_rq->avg.util_avg > 100%. All we can say is that the cpu
is quite likely overutilized. But for lightly utilized systems it gives
us a very responsive and fairly accurate estimate of the cpu utilization
and can be used to estimate the cpu utilization change caused by
migrating a task.
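
As a rough illustration of that last use, something along these lines
could estimate the destination cpu utilization after a migration. This
is only a sketch in user-space C. The function names are hypothetical,
the 1024 scale is defined locally, and clamping at 100% is my own
choice here to reflect that values above it carry little meaning.

```c
#include <assert.h>

#define CAPACITY_SCALE 1024L  /* assumed: 100% utilization of one cpu */

/*
 * Hypothetical helper, not an existing kernel function: estimate the
 * utilization of a cpu after a task with se_util_avg is migrated to
 * it. The result is clamped to 100% because anything above that only
 * tells us the cpu is likely overutilized.
 */
static long estimate_util_after_migration(long cfs_rq_util_avg,
					  long se_util_avg)
{
	assert(cfs_rq_util_avg >= 0 && se_util_avg >= 0);

	long util = cfs_rq_util_avg + se_util_avg;

	return util > CAPACITY_SCALE ? CAPACITY_SCALE : util;
}

/* Hypothetical helper: util_avg above 100% means "likely overutilized". */
static int cpu_likely_overutilized(long cfs_rq_util_avg)
{
	return cfs_rq_util_avg > CAPACITY_SCALE;
}
```

For a lightly utilized cpu at 200 receiving a task at 300 this gives
500, while a cpu at 900 receiving a task at 512 simply saturates at
1024.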

> If the choice was to not do the add/remove thing, then se->avg.util_sum
> would be unused except for attach/detach, which currently do the
> add/remove thing. It's not unreasonable for them, except that currently
> nothing uses anything other than the root's utilization, so migration
> between cgroups wouldn't actually change the relevant util number
> (except it could because changing the cfs_rq util_sum doesn't actually
> do /anything/ unless it's the root, so you'd have to wait until the
> cgroup se actually changed in utilization).

We use util_avg extensively in the energy model RFC patches, and I think
it is worth considering using both cfs_rq->avg.util_avg and
se->avg.util_avg to improve select_task_rq_fair().

util_avg for task groups has a quite different meaning than load_avg.
Where load_avg is scaled to ensure that the combined contribution of a
group never exceeds that of a single always-running task, util_avg for
groups reflects the true cpu utilization of the group. I agree that
tracking util_avg for groups is redundant and could be removed if it can
be done in a clean way.

> So uh yeah, my initial impression is "rip it out", but if being
> immediately-correct is important in the case of one task being most of
> the utilization, rather than when it is more evenly distributed, it
> would probably make more sense to instead put in the add-on-enqueue
> code.

I would prefer if it stayed in. There are several patch sets posted for
review that use util_avg.