Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
From: bsegall
Date: Tue Sep 22 2015 - 13:18:39 EST
Yuyang Du <yuyang.du@xxxxxxxxx> writes:
> On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@xxxxxxxxxx wrote:
>> > But first, I think both load_sum and load_avg can afford NICE_0_LOAD at either high
>> > or low resolution. So we have no reason to keep load_avg at low resolution (10 bits)
>> > when NICE_0_LOAD has high resolution (20 bits), because load_avg = runnable% * load,
>> > as opposed to the current load_avg = runnable% * scale_load_down(load).
>> >
>> > Shall we get rid of all scale_load_down() for the runnable load average?
>>
>> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
>> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
>> going to give errors on 32-bit (even with the old code in fact). This
>> should probably be fixed... somehow (dividing by 4 for load_sum on
>> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
>> 32-bit might have made sense but would be a weird difference between 32
>> and 64, and could break userspace anyway, so it's presumably too late
>> for that).
>>
>> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
>> 32-bit.
>>
>
> load_avg has no LOAD_AVG_MAX term in it; it is runnable% * load, IOW load_avg <= load.
> So, on 32-bit, a cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47 tasks of
> the highest priority with 20-bit load resolution. This is OK, because struct
> load_weight's load is also an unsigned long: if it overflows, cfs_rq->load.weight will
> have overflowed in the first place.
>
> However, on second thought, this is not quite right, because load_avg is not
> necessarily <= load: load_avg includes blocked load. Although load_avg stays at
> roughly the same level as load (converging to <= load), we may not want to risk
> an overflow on 32-bit.
Yeah, I missed that load_sum was u64 and only load_avg was long. This
means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
runnable_load_avg can overflow, but only when cfs_rq load.weight does,
so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
tasks before load.weight overflows, and ~2^31 tasks before
runnable_load_avg does, which is obviously fine (and in fact you'd hit
PID_MAX_LIMIT first, even if you had the cpu/ram/etc to not fall over).
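
(For reference, here are the two worst-case products from my earlier mail,
checked with a throwaway user-space snippet; the constants are copied from a
~v4.2 tree, and this is just the arithmetic, not kernel code:)

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_MAX	47742		/* saturated decayed-sum maximum */
#define NICE_M20_WEIGHT	88761ULL	/* prio_to_weight[0] */
#define MAX_SHARES	(1ULL << 18)

int main(void)
{
	uint64_t one_task  = LOAD_AVG_MAX * NICE_M20_WEIGHT;
	uint64_t one_group = LOAD_AVG_MAX * MAX_SHARES;

	/* 4237627662: just barely fits in a 32-bit unsigned long */
	printf("LOAD_AVG_MAX * prio_to_weight[0] = %llu, fits in u32: %s\n",
	       (unsigned long long)one_task,
	       one_task <= UINT32_MAX ? "yes" : "no");

	/* well past 2^32: a MAX_SHARES group entity already overflows 32-bit */
	printf("LOAD_AVG_MAX * MAX_SHARES        = %llu, fits in u32: %s\n",
	       (unsigned long long)one_group,
	       one_group <= UINT32_MAX ? "yes" : "no");
	return 0;
}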
Now, applying SLR to runnable_load_avg would cut this down to ~2^21
tasks running at once, or 2^20 with cgroups, which is technically
allowed, though it seems utterly implausible (especially since this
would all have to be on one cpu). If SLR were increased as peterz asked
about, this could be an issue though.
All that said, using SLR on load_sum/load_avg, as opposed to cfs_rq
runnable_load_avg, would be fine, as they're limited to a single
task/cgroup's weight. Having them SLRed while the cfs_rq sums are not
would be a little odd, but not impossible.
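
To spell out the bound I'm leaning on, here's the same kind of throwaway
arithmetic for a single entity, assuming SLR stays at 10 bits (again just a
sketch with copied constants, not kernel code):

#include <stdio.h>
#include <stdint.h>

#define LOAD_AVG_MAX	47742
#define MAX_SHARES	(1ULL << 18)
#define scale_load(w)	((w) << 10)	/* assumed SLR = 10 */

int main(void)
{
	/*
	 * Worst case for one entity: the largest possible group weight,
	 * runnable long enough for the decayed sum to saturate.
	 */
	uint64_t max_load_sum = (uint64_t)LOAD_AVG_MAX * scale_load(MAX_SHARES);

	/* load_avg has no LOAD_AVG_MAX term, so it is bounded by the weight */
	uint64_t max_load_avg = scale_load(MAX_SHARES);		/* 2^28 */

	printf("per-entity load_sum worst case: %llu (well under 2^64)\n",
	       (unsigned long long)max_load_sum);
	printf("per-entity load_avg worst case: %llu (2^28, fine for long)\n",
	       (unsigned long long)max_load_avg);
	return 0;
}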
>
>> > +/*
>> > + * NICE_0's weight (visible to user) and its load (invisible to user) have
>> > + * independent resolution, but they should be well calibrated. We use scale_load()
>> > + * and scale_load_down(w) to convert between them; the following must be true:
>> > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
>> > + */
>> > #define NICE_0_LOAD SCHED_LOAD_SCALE
>> > #define NICE_0_SHIFT SCHED_LOAD_SHIFT
>>
>> I still think tying the scale_load shift to the SCHED_CAPACITY/etc shift
>> is silly, and tying NICE_0_LOAD/SHIFT in is worse. Honestly, if I were
>> going to change anything, it would be to define NICE_0_LOAD/SHIFT
>> entirely separately from SCHED_LOAD_SCALE/SHIFT.
>
> If NICE_0_LOAD is nice-0's load, and SCHED_LOAD_SHIFT says how to get
> nice-0's load, I don't understand why you want to separate them.
SCHED_LOAD_SHIFT is not how to get nice-0's load; it just happens to
have the same value as NICE_0_SHIFT. (I think, anyway: SCHED_LOAD_* is
used in precisely one place other than the newish util_avg, and as I
mentioned it's not remotely clear what compute_imbalance is doing there.)
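
To be concrete about what I mean by defining them separately, something shaped
roughly like the following (SCHED_WEIGHT_SHIFT is a name I'm making up just for
this sketch; the values only mirror the current 10/20-bit split):

#include <assert.h>

/* user-visible weight resolution (prio_to_weight[] etc.): 10 bits */
#define SCHED_WEIGHT_SHIFT	10	/* made-up name, sketch only */

/* internal load resolution: its own knob, no longer SCHED_LOAD_SHIFT */
#define NICE_0_SHIFT		(SCHED_WEIGHT_SHIFT + 10)
#define NICE_0_LOAD		(1UL << NICE_0_SHIFT)

#define scale_load(w)		((w) << (NICE_0_SHIFT - SCHED_WEIGHT_SHIFT))
#define scale_load_down(w)	((w) >> (NICE_0_SHIFT - SCHED_WEIGHT_SHIFT))

int main(void)
{
	unsigned long nice_0_weight = 1024;	/* prio_to_weight[20] */

	/* the calibration rule from the patch comment still holds */
	assert(scale_load(nice_0_weight) == NICE_0_LOAD);
	assert(scale_load_down(NICE_0_LOAD) == nice_0_weight);
	return 0;
}

Capacity/utilization would then keep their own SCHED_CAPACITY_* scale,
unrelated to NICE_0_LOAD.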