Re: [PATCH 2/2] sched: Rewrite per entity runnable load average tracking
From: Peter Zijlstra
Date: Tue Jul 08 2014 - 08:51:12 EST
On Mon, Jul 07, 2014 at 03:25:07PM -0700, bsegall@xxxxxxxxxx wrote:
> >> +static inline void enqueue_entity_load_avg(struct sched_entity *se)
> >>  {
> >> +	struct sched_avg *sa = &se->avg;
> >> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >> +	u64 now = cfs_rq_clock_task(cfs_rq);
> >> +	u32 old_load_avg = cfs_rq->avg.load_avg;
> >> +	int migrated = 0;
> >>
> >> +	if (entity_is_task(se)) {
> >> +		if (sa->last_update_time == 0) {
> >> +			sa->last_update_time = now;
> >> +			migrated = 1;
> >>  		}
> >> +		else
> >> +			__update_load_avg(now, sa, se->on_rq * se->load.weight);
> >>  	}
> >>
> >> +	__update_load_avg(now, &cfs_rq->avg, cfs_rq->load.weight);
> >>
> >> +	if (migrated)
> >> +		cfs_rq->avg.load_avg += sa->load_avg;
> >>
> >> +	synchronize_tg_load_avg(cfs_rq, old_load_avg);
> >>  }
> >
> > So here you add the task to the cfs_rq avg when it gets migrated in,
> > however:
> >
> >> @@ -4552,17 +4326,9 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
> >>  	struct sched_entity *se = &p->se;
> >>  	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> >>
> >> +	/* Update task on old CPU, then ready to go (entity must be off the queue) */
> >> +	__update_load_avg(cfs_rq_clock_task(cfs_rq), &se->avg, 0);
> >> +	se->avg.last_update_time = 0;
> >>
> >>  	/* We have migrated, no longer consider this task hot */
> >>  	se->exec_start = 0;
> >
> > there you don't remove it first..
>
> Yeah, the issue is that you can't remove it, because you don't hold the
> lock. Thus the whole runnable/blocked split iirc. Also the
> cfs_rq_clock_task read is incorrect for the same reason (and while
> rq_clock_task could certainly be fixed min_vruntime-style,
> cfs_rq_clock_task would be harder).
>
> The problem with just working around the clock issue somehow and then using an
> atomic to do this subtraction is that you have no idea when the /cfs_rq/
> last updated - there's no guarantee it is up to date, and if it's not
> then the subtraction is wrong. You can't update it to make it up to date
> like the se->avg, because you don't hold any locks. You would need
> decay_counter stuff like the current code, and I'm not certain how well
> that would work out without the runnable/blocked split.

Right; so the current code jumps through a few nasty hoops because of
this. But I think the proposed code got this wrong (understandably).

But yes, we spent a lot of time and effort to remove the rq->lock from
the remote wakeup path, which makes all this very tedious indeed.

Like you said, we can indeed make the time thing work, but the remote
subtraction is going to be messy. Can't seem to come up with anything
sane there either.
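
To make the "remote subtraction" problem concrete, here is a minimal
user-space sketch of one way a lockless remover could defer its
subtraction to whoever next holds the queue lock. It is an illustration
under stated assumptions, not the kernel's code and not the patch's
code: struct model_cfs_rq, model_remove_remote() and
model_update_locked() are hypothetical names, and C11 atomics stand in
for the kernel's atomic types.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_cfs_rq {
	long load_avg;            /* only touched by the lock holder */
	atomic_long removed_load; /* may be bumped remotely, without the lock */
};

/*
 * Remote side: the migrating CPU does not hold the old queue's lock,
 * so it cannot touch load_avg. It only records how much must go away.
 */
static void model_remove_remote(struct model_cfs_rq *rq, long contrib)
{
	assert(contrib >= 0);
	atomic_fetch_add(&rq->removed_load, contrib);
}

/*
 * Owner side: called with the (modelled) queue lock held. Pending
 * removals are claimed atomically and folded into load_avg, clamped
 * at zero so a too-large removal cannot drive the average negative.
 */
static void model_update_locked(struct model_cfs_rq *rq)
{
	long removed = atomic_exchange(&rq->removed_load, 0);

	rq->load_avg -= removed;
	if (rq->load_avg < 0)
		rq->load_avg = 0;
}
```

Note what the sketch leaves out, which is exactly the hard part raised
above: the recorded contribution and the queue's average are not
guaranteed to be decayed to the same point in time, so the subtraction
can be off unless something like a decay counter aligns them first.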