Re: [PATCH] sched/fair: favor non-idle group in tick preemption

From: Josh Don
Date: Mon Oct 31 2022 - 18:45:09 EST


On Mon, Oct 31, 2022 at 1:39 AM Chuyi Zhou <zhouchuyi@xxxxxxxxxxxxx> wrote:
>
> >> 在 2022/10/28 07:34, Josh Don 写道:
> > The reason for limiting the control of weight for idle cgroups is to
> > match the semantics of the per-task SCHED_IDLE api, which gives
> > SCHED_IDLE threads minimum weight. The idea behind SCHED_IDLE is that
> > these entities are intended to soak "unused" cpu cycles, and should
> > give minimal interference to any non-idle thread. However, we don't
> > have strict priority between idle and non-idle, due to the potential
> > for starvation issues.
> >
> > Perhaps you could clarify your use case a bit further. Why do you want
> > to change the weight? Is it to adjust the competition between two idle
> > groups, or something else?
> >
> Suppose we have two cgroups(idle & non-idle)in /sys/fs/cgroup/cpu.
> Idle cgroup contains some offline service, such as beg data processing;
> non-idle cgroup contains some online service which have
> higher priority to users and are sensitive to latency. We set
> quota/period for idle cgroup which indicates it's *cpu limit*.
> In general, we consider that the idle cgroup's cpu usage
> closer to the limit, the better. However, when the system is busy,
> the idle cgroups can only get little cpu resources with minimum weight.
> To cope with the above situation, we changed the default weight.

I see. So you want the part of SCHED_IDLE that makes the entity highly
preemptible (and avoids preemption of non idle entities), but want to
adjust weight to reach a target cpu split? That seems a bit
counterintuitive to me, since by giving the idle entities higher
weight, you'll end up pushing out the round-robin latency for the
non-idle entities.

Worth noting that SCHED_IDLE is a bit of a CFS hack, but the intended
semantics of it are that these threads soak only "remaining cycles".
This comes with many implications beyond just weight. For example, a
cpu running only SCHED_IDLE entities is considered as "idle" from the
perspective of non-idle entities. If we give these idle entities
meaningful weight, we start to break assumptions there, for example
see sched_idle_cpu() and load balancing.

I wonder if maybe dusting off SCHED_BATCH is a better answer here, for
this type of use case (some amount of throughput "guarantee", but with
preemption properties similar to SCHED_IDLE). Peter, thoughts?

> One more question is, why you think this patch can strave idle entity?
>
> /*
> * Ensure that a task that missed wakeup preemption by a
> * narrow margin doesn't have to wait for a full slice.
> * This also mitigates buddy induced latencies under load.
> */
> se = __pick_first_entity(cfs_rq);
> delta = curr->vruntime - se->vruntime;
>
> if (delta < 0)
> return;
>
> if (delta > ideal_runtime)
> resched_curr(rq_of(cfs_rq));
>
> se can preempt curr only when
> curr->vruntime > se->vruntime &&
> curr->vruntime - se->vruntime > ideal_runtime
> is true. I think the original purpose is that se doesn't have to wait
> for a full slice, reduce response time if se is latency sensitive.
> This patch just let curr exhaust it's idleal_runtime when se is idle and
> curr is non-idle. Normally se will be choosed by pick_next_entity().
>
> Maybe I missed something ?
> Thanks

No that was my mistake, I accidentally thought this delta was being
applied to the 'if (delta_exec > ideal_runtime) {' above in
check_preempt_tick().

Some weirdness about this change though, is that if there is a
non-idle current entity, and the two next entities on the cfs_rq are
idle and non-idle respectively, we'll now take longer to preempt the
on-cpu non-idle entity, because the non-idle entity on the cfs_rq is
'hidden' by the idle 'first' entity. Wakeup preemption is different
because we're always directly comparing the current entity with the
newly woken entity.

Best,
Josh