Re: [PATCH 2/2 v4] sched: Rewrite per entity runnable load average tracking

From: Peter Zijlstra
Date: Mon Jul 28 2014 - 13:19:37 EST


On Mon, Jul 28, 2014 at 09:58:19AM -0700, bsegall@xxxxxxxxxx wrote:
> Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
>
> >> @@ -4551,18 +4382,34 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> >> {
> >> struct sched_entity *se = &p->se;
> >> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >> + u64 last_update_time;
> >>
> >> /*
> >> + * Task on old CPU catches up with its old cfs_rq, and subtracts itself from
> >> + * the cfs_rq (task must be off the queue now).
> >> */
> >> +#ifndef CONFIG_64BIT
> >> + u64 last_update_time_copy;
> >> +
> >> + do {
> >> + last_update_time_copy = cfs_rq->load_last_update_time_copy;
> >> + smp_rmb();
> >> + last_update_time = cfs_rq->avg.last_update_time;
> >> + } while (last_update_time != last_update_time_copy);
> >> +#else
> >> + last_update_time = cfs_rq->avg.last_update_time;
> >> +#endif
> >> + __update_load_avg(last_update_time, &se->avg, 0);
> >> + atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
> >> +
> >> + /*
> >> + * We are supposed to update the task to "current" time, then it's up to date
> >> + * and ready to go to new CPU/cfs_rq. But we have difficulty in getting
> >> + * what current time is, so simply throw away the out-of-date time. This
> >> + * will result in the wakee task being less decayed, but giving the wakee
> >> + * more load doesn't sound bad.
> >> + */
> >> + se->avg.last_update_time = 0;
> >>
> >> /* We have migrated, no longer consider this task hot */
> >> se->exec_start = 0;
> >
> >
> > And here we try and make good on that assumption. The thing I worry
> > about is what happens if the machine is entirely idle...
> >
> > What guarantees a semi up-to-date cfs_rq->avg.last_update_time?
>
> update_blocked_averages I think should do just as good a job as the old
> code, which isn't perfect but is about as good as you can get worst case.

Right, that's called from rebalance_domains() which should more or less
update this value on tick boundaries or thereabouts for most 'active'
cpus.

But if the entire machine is idle, the first wakeup (if it's a cross-CPU
one) might see a very stale timestamp.
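
To put a rough number on "very stale", here is a user-space sketch of
the effect, not the kernel code. It assumes the usual geometric decay
(load halves every 32 periods) and a period of about 1 ms; it uses
floating point where the kernel uses fixed point, and the names
decay_load(), catch_up(), PELT_Y and PELT_PERIOD_NS are made up for
illustration.

```c
#include <stdint.h>

/* Assumption: one period is 2^20 ns, roughly 1 ms. */
#define PELT_PERIOD_NS (1ULL << 20)
/* Assumption: y = 0.5^(1/32), so 32 periods halve the load. */
#define PELT_Y 0.97857206

/* Hypothetical helper: decay `load` over `periods` whole periods. */
static double decay_load(double load, uint64_t periods)
{
	uint64_t i;

	/* Every full group of 32 periods halves the load. */
	for (i = periods / 32; i > 0 && load > 0.0; i--)
		load *= 0.5;
	/* The remaining 0..31 periods each multiply by y. */
	for (i = periods % 32; i > 0; i--)
		load *= PELT_Y;
	return load;
}

/*
 * Hypothetical helper: bring a task's load from its own timestamp
 * `last_ns` up to the timestamp `stamp_ns` read from the old cfs_rq.
 * If `stamp_ns` is stale, the task is decayed less than it should be.
 */
static double catch_up(double load, uint64_t last_ns, uint64_t stamp_ns)
{
	if (stamp_ns <= last_ns)
		return load;	/* nothing to catch up with */
	return decay_load(load, (stamp_ns - last_ns) / PELT_PERIOD_NS);
}
```

In this model a task that blocked 64 periods ago with a load of 1024
should arrive with 256. If the old cfs_rq stamp is 32 periods behind,
it arrives with 512 instead, so the error grows with however long the
machine sat idle.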

If we can fix that, that would be good, I suppose, but I'm not
immediately seeing anything pretty there. You're right, though: it would
be no worse than the current situation.
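
For reference, the !CONFIG_64BIT read loop in the quoted hunk can be
modelled in user space roughly as below. This is only a sketch with C11
atomics standing in for smp_rmb()/smp_wmb(); struct stamp and the two
functions are hypothetical names, and the writer side is an assumption,
since the hunk above only shows the reader.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the two timestamp fields on the cfs_rq. */
struct stamp {
	_Atomic uint64_t last_update_time;
	_Atomic uint64_t last_update_time_copy;
};

/* Writer (assumed ordering): value first, barrier, then the copy. */
static void stamp_write(struct stamp *s, uint64_t now)
{
	atomic_store_explicit(&s->last_update_time, now,
			      memory_order_relaxed);
	/* Plays the role of smp_wmb(). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->last_update_time_copy, now,
			      memory_order_relaxed);
}

/* Reader: copy first, barrier, then the value; retry until they agree. */
static uint64_t stamp_read(struct stamp *s)
{
	uint64_t copy, val;

	do {
		copy = atomic_load_explicit(&s->last_update_time_copy,
					    memory_order_relaxed);
		/* Plays the role of smp_rmb(). */
		atomic_thread_fence(memory_order_acquire);
		val = atomic_load_explicit(&s->last_update_time,
					   memory_order_relaxed);
	} while (val != copy);

	return val;
}
```

Note that the loop only protects against reading a half-written 64-bit
value; it does nothing about the value being old, which is the problem
discussed above.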
