Re: [PATCH 04/17] sched/fair: Add avg_vruntime

From: Peter Zijlstra
Date: Wed Apr 05 2023 - 15:14:47 EST


On Wed, Mar 29, 2023 at 09:50:51AM +0200, Peter Zijlstra wrote:
> On Tue, Mar 28, 2023 at 04:57:49PM -0700, Josh Don wrote:
> > On Tue, Mar 28, 2023 at 4:06 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > [...]
> > > +/*
> > > + * Compute virtual time from the per-task service numbers:
> > > + *
> > > + * Fair schedulers conserve lag: \Sum lag_i = 0
> > > + *
> > > + * lag_i = S - s_i = w_i * (V - v_i)
> > > + *
> > > + * \Sum lag_i = 0 -> \Sum w_i * (V - v_i) = V * \Sum w_i - \Sum w_i * v_i = 0
> >
> > Small note: I think it would be helpful to label these symbols
> > somewhere :) Weight and vruntime are fairly obvious, but I don't
> > think 'S' and 'V' are as clear. Are these non-virtual ideal service
> > time, and average vruntime, respectively?
>
> Yep, they are. I'll see what I can do with the comments.


/*
* Compute virtual time from the per-task service numbers:
*
* Fair schedulers conserve lag:
*
* \Sum lag_i = 0
*
* Where lag_i is given by:
*
* lag_i = S - s_i = w_i * (V - v_i)
*
* Where S is the ideal service time and V is its virtual time counterpart.
* Therefore:
*
* \Sum lag_i = 0
* \Sum w_i * (V - v_i) = 0
* \Sum (w_i * V - w_i * v_i) = 0
*
* From which we can solve for V in terms of v_i (which we have in
* se->vruntime):
*
*      \Sum v_i * w_i   \Sum v_i * w_i
* V = -------------- = --------------
*         \Sum w_i            W
*
* Specifically, this is the weighted average of all entity virtual runtimes.
*
* [[ NOTE: this is only equal to the ideal scheduler under the condition
* that join/leave operations happen at lag_i = 0, otherwise the
* virtual time has non-contiguous motion equivalent to:
*
* V +-= lag_i / W
*
* Also see the comment in place_entity() that deals with this. ]]
*
* However, since v_i is u64, and the multiplication could easily overflow,
* transform it into a relative form that uses smaller quantities:
*
* Substitute: v_i == (v_i - v0) + v0
*
*      \Sum ((v_i - v0) + v0) * w_i   \Sum (v_i - v0) * w_i
* V = ---------------------------- = --------------------- + v0
*                   W                            W
*
* Which we track using:
*
*                    v0 := cfs_rq->min_vruntime
* \Sum (v_i - v0) * w_i := cfs_rq->avg_vruntime
*              \Sum w_i := cfs_rq->avg_load
*
* Since min_vruntime is a monotonically increasing variable that closely
* tracks the per-task service, these deltas: (v_i - v0), will be on the order
* of the maximal (virtual) lag induced in the system due to quantisation.
*
* Also, we use scale_load_down() to reduce the size.
*
* As measured, the max (key * weight) value was ~44 bits for a kernel build.
*/
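
For anyone who wants to poke at the relative form outside the kernel, here is
a minimal userspace sketch of the arithmetic described in the comment above.
It is not the code from the patch: struct toy_entity and toy_avg_vruntime()
are made-up names, there is no scale_load_down(), and the sums are recomputed
from scratch rather than tracked incrementally. It only shows that
v0 + (\Sum (v_i - v0) * w_i) / W gives the weighted average while keeping the
products small.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for a scheduling entity. */
struct toy_entity {
	uint64_t vruntime;	/* v_i */
	uint64_t weight;	/* w_i */
};

/*
 * Weighted average of the v_i, computed relative to a reference point v0:
 *
 *   V = v0 + (\Sum (v_i - v0) * w_i) / W
 *
 * The keys (v_i - v0) are treated as signed, so v0 does not have to be the
 * minimum; it only has to be close enough that key * weight fits in s64.
 */
static uint64_t toy_avg_vruntime(const struct toy_entity *se, int nr,
				 uint64_t v0)
{
	int64_t sum = 0;	/* \Sum (v_i - v0) * w_i */
	int64_t load = 0;	/* W = \Sum w_i */
	int i;

	for (i = 0; i < nr; i++) {
		int64_t key = (int64_t)(se[i].vruntime - v0);

		sum += key * (int64_t)se[i].weight;
		load += (int64_t)se[i].weight;
	}

	/* Empty queue: nothing to average (an assumption of this sketch). */
	if (!load)
		return v0;

	return v0 + (uint64_t)(sum / load);
}
```

With entities (v=1000, w=1) and (v=1006, w=2) and v0=1000, the sum is 12 and
W is 3, so V comes out as 1004, the same as (1000*1 + 1006*2) / 3. Shifting
both vruntimes up by 2^62 with weights 1024 and 2048 still works, whereas the
direct \Sum v_i * w_i would overflow u64.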


And the comment in place_entity() (slightly updated since this morning):


/*
* If we want to place a task and preserve lag, we have to
* consider the effect of the new entity on the weighted
* average and compensate for this, otherwise lag can quickly
* evaporate.
*
* Lag is defined as:
*
* lag_i = S - s_i = w_i * (V - v_i)
*
* To avoid the 'w_i' term all over the place, we only track
* the virtual lag:
*
* vl_i = V - v_i <=> v_i = V - vl_i
*
* And we take V to be the weighted average of all v:
*
* V = (\Sum w_j*v_j) / W
*
* Where W is: \Sum w_j
*
* Then, the weighted average after adding an entity with lag
* vl_i is given by:
*
*   V' = (\Sum w_j*v_j + w_i*v_i) / (W + w_i)
*      = (W*V + w_i*(V - vl_i)) / (W + w_i)
*      = (W*V + w_i*V - w_i*vl_i) / (W + w_i)
*      = (V*(W + w_i) - w_i*vl_i) / (W + w_i)
*      = V - w_i*vl_i / (W + w_i)
*
* And the actual lag after adding an entity with vl_i is:
*
*   vl'_i = V' - v_i
*         = V - w_i*vl_i / (W + w_i) - (V - vl_i)
*         = vl_i - w_i*vl_i / (W + w_i)
*
* Which is strictly smaller in magnitude than vl_i. So in order to preserve lag
* we should inflate the lag before placement such that the
* effective lag after placement comes out right.
*
* As such, invert the above relation for vl'_i to get the vl_i
* we need to use such that the lag after placement is the lag
* we computed before dequeue.
*
*   vl'_i = vl_i - w_i*vl_i / (W + w_i)
*         = ((W + w_i)*vl_i - w_i*vl_i) / (W + w_i)
*
*   (W + w_i)*vl'_i = (W + w_i)*vl_i - w_i*vl_i
*                   = W*vl_i
*
*   vl_i = (W + w_i)*vl'_i / W
*/
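
The same kind of userspace sketch for the placement math, in case a numeric
check helps. Again this is not the code from the patch: toy_inflate_vlag()
and toy_avg_after_add() are made-up helpers, everything is plain s64 integer
arithmetic, and the numbers are picked so the divisions are exact.

```c
#include <stdint.h>

/*
 * Inflate the lag we want to end up with (vl'_i) into the lag to place
 * with (vl_i), per the last line of the comment above:
 *
 *   vl_i = (W + w_i) * vl'_i / W
 *
 * W is the queue weight before the entity is added.
 */
static int64_t toy_inflate_vlag(int64_t vlag, int64_t W, int64_t w_i)
{
	/* Empty queue: no average to disturb (an assumption of this sketch). */
	if (!W)
		return vlag;

	return (W + w_i) * vlag / W;
}

/*
 * Weighted average after adding an entity with vruntime v_i, weight w_i:
 *
 *   V' = (W*V + w_i*v_i) / (W + w_i)
 */
static int64_t toy_avg_after_add(int64_t V, int64_t W, int64_t v_i,
				 int64_t w_i)
{
	return (W * V + w_i * v_i) / (W + w_i);
}
```

Take V = 1004, W = 3, and a new entity with w_i = 1 that should end up with a
virtual lag of 3. Inflating gives vl_i = (3 + 1) * 3 / 3 = 4, so it is placed
at v_i = 1004 - 4 = 1000. The new average is (3*1004 + 1000) / 4 = 1003, and
1003 - 1000 = 3 as intended. Placing it naively at 1004 - 3 = 1001 instead
leaves it with a lag of about 2.25 (2 after integer truncation).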