Re: [RFC PATCH 14/14] sched: implement usage tracking

From: Paul Turner
Date: Fri Feb 17 2012 - 06:33:09 EST


On Thu, Feb 16, 2012 at 8:58 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Wed, 2012-02-01 at 17:38 -0800, Paul Turner wrote:
>>  struct sched_avg {
>>         u64 runnable_avg_sum, runnable_avg_period;
>>         u64 last_runnable_update, decay_count;
>> +       u32 usage_avg_sum;
>
> 4 byte hole
>
>>         unsigned long load_avg_contrib;
>>
>>         int contributes_blocked_load;
>>};
>
>
> A) Running/Runnable
>    - uses last_runnable_update to track
>        runnable_avg_sum, usage_avg_sum and runnable_avg_period

Yes, with the runnable average being runnable_avg_sum / runnable_avg_period
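
Purely as an illustration (standalone sketch, kernel-ish types, not the
in-tree helpers), that ratio is just:

	/* sketch only: result in a 1024-based fixed point */
	static inline unsigned int runnable_avg(u64 sum, u64 period)
	{
		/* +1 avoids dividing by zero before the first full period */
		return (sum << 10) / (period + 1);
	}

so an entity that was runnable half the time comes out around 512/1024.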

>
> B) Blocked
>    - uses decay_count to keep track of sleeptime

Not quite (we could track sleep-time explicitly against clock_task after all).

We need decay_count to synchronize with entity_tick /
update_blocked_shares, which account decays against
cfs_rq->blocked_load_sum.

Since we do not know where we slept relative to the tick, unless we
explicitly track where this occurred we don't know, for an "n.y" jiffy
sleep, whether the "0.y" fraction included an extra decay. It's
important to get this right to reduce instability, since when the
entity wakes back up we need to charge an equivalent decay against
its contribution before we can remove it from the blocked_load_sum.

For the common case of a short sleep, whether or not we charge an extra
period (if y included a jiffy edge) is significant to stability, since
getting it wrong means we leave a little extra load in blocked_load.
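
To make that concrete (field names invented here, this is not the patch
itself):

	/* wake-up side: catch the stale contribution up by however many
	 * whole decay periods were charged against blocked_load_sum while
	 * we slept, then pull it back out. */
	u64 periods = cfs_rq->decay_counter - se->avg.decay_count;

	/* the real thing multiplies by y^periods; >> is just shorthand */
	se->avg.load_avg_contrib >>= periods;
	cfs_rq->blocked_load_sum -= se->avg.load_avg_contrib;

Miscounting that by a single period is exactly the "did 0.y cross a
jiffy edge" question above.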

Further complicating this, we need to be able to do said
synchronization in the annoying cases of wake-up migration (where we
don't hold the lock on the previous rq) and 32-bit machines (where
everything sucks).

>
> C) 'Migrate'
>    - uses contributes_blocked_load to check if we actually did migrate?
>

We actually use se->avg.decay_count to check whether we were migrated
(setting it appropriately at migration); contributes_blocked_load is
just a convenience variable to track whether we're a part of
cfs_rq->blocked_load_avg or not.

I have to double-check everything, but I think the sign of
se->decay_count can be used equivalently, so I'll likely just remove it
in favor of that. (Although, to keep things readable, I'll replace it
with a similarly named function that does this sign check.)
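
Something like (name invented, and assuming decay_count goes signed):

	static inline bool entity_was_migrated(struct sched_entity *se)
	{
		/* the migration path would set decay_count negative */
		return se->avg.decay_count < 0;
	}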

> So there's a number of things I don't get, why have
> remove_task_load_avg_async() in set_task_rq()? enqueue_entity_load_avg()
> can do the __synchronize_entity_decay() thing (and actually does) just
> fine, right?

Wake-up migration is the nuisance here.

In this case we:
A) Don't know where we came from in enqueue_entity_load_avg to
synchronize the decay against
B) Need to be able to remove the blocked load possessed by said waking
entity from its former cfs_rq->blocked_load_sum

There are one or two other paths (besides ttwu) that this matters for;
using set_task_rq() gets them all.
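
The shape of the async removal is roughly (field name invented; not the
actual code):

	/* remote wakeup: we can't take the old rq's lock, so just record
	 * how much blocked load needs to come out... */
	atomic64_add(se->avg.load_avg_contrib,
		     &old_cfs_rq->removed_blocked_load);

	/* ...and let the owning cpu fold it in later, under its own lock */
	cfs_rq->blocked_load_sum -=
		atomic64_xchg(&cfs_rq->removed_blocked_load, 0);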

>
> Similarly, why not use last_runnable_update (as last_update) to track
> both A and B since a task cannot be in both states at the same time
> anyway. If you always update the timestamp you can use it to either
> track runnable/usage or decay for sleep.

This is pretty much exactly what we do (use it to track both A and B);
modulo the complications surrounding decay for sleep explained above
and below.

There are two things at play here:
1) An entity's runnable_avg := runnable_avg_sum / runnable_avg_period
2) An entity's load contribution :=
   runnable_avg_sum / runnable_avg_period * entity_weight

The load contribution is what we added to blocked_load_sum. When we
wake back up we synchronize decays against this and remove it from the
sum. However, the decay applied there is really just an imperfect
approximation (it depends on how the actual sleep aligned with the
tick).

As soon as it's actually running again we:
- Adjust our load contrib by decay so that we can remove the right
amount from blocked_load_sum
- Charge the time since last_runnable_update as not running
- Compute a new, *accurate* load_contrib from (1) above, throwing away
the decayed previous one
- Add the new accurate calculation into the running load sum
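
In rough pseudo-C (helper and field names are approximations, not the
literal patch):

	/* 1) catch the stale contribution up on decay and remove it */
	decays = __synchronize_entity_decay(se);
	cfs_rq->blocked_load_sum -= se->avg.load_avg_contrib;

	/* 2) charge the time since last_runnable_update as not running */
	__update_entity_runnable_avg(now, &se->avg, 0);

	/* 3) recompute an accurate contribution, throwing the decayed
	 *    approximation away */
	se->avg.load_avg_contrib = se->load.weight *
		se->avg.runnable_avg_sum / (se->avg.runnable_avg_period + 1);

	/* 4) add it back into the running (non-blocked) sum */
	cfs_rq->runnable_load_sum += se->avg.load_avg_contrib;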

The nice thing is that when we start this cycle again, as the entity
next blocks, we don't compound the precision losses from aligning decay
with ticks, since we get to throw that number away and do accurate
accounting based on runnable_avg again.

>
> That would get rid of decay_count and contributes_blocked_load, no?