Re: [PATCH 4/4] sched/fair: limit sched slice duration

From: Vincent Guittot
Date: Fri Sep 09 2022 - 10:07:40 EST

Next message: Song Liu: "Re: [PATCH v5 bpf-next 2/4] ftrace: Allow IPMODIFY and DIRECT ops on the same function"
Previous message: Vitaly Kuznetsov: "Re: [PATCH 2/3] pci_ids: Add Microsoft PCI Vendor ID, and remove redundant definitions"
In reply to: Peter Zijlstra: "Re: [PATCH 4/4] sched/fair: limit sched slice duration"
Next in thread: Vincent Guittot: "Re: [PATCH 4/4] sched/fair: limit sched slice duration"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 9 Sept 2022 at 13:14, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>
> Picked up the first three.
>
> On Thu, Aug 25, 2022 at 02:27:26PM +0200, Vincent Guittot wrote:
> > In presence of a lot of small weight tasks like sched_idle tasks, normal
> > or high weight tasks can see their ideal runtime (sched_slice) to increase
> > to hundreds ms whereas it normally stays below sysctl_sched_latency.
> >
> > 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> > (half of the sched_period). This means that they will make progress
> > every sysctl_sched_latency period.
> >
> > If we now add 1000 idle tasks on the CPU, the sched_period becomes
>
> Surely people aren't actually having that many runnable tasks and this
> is a device for the argument?
>
> > 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> > It will even become 1500ms if the idle tasks belongs to an idle cgroup.
> > This means that the scheduler will look for picking another waiting task
> > after 609ms running time (1500ms respectively). The idle tasks change
> > significantly the way the 2 normal tasks interleave their running time
> > slot whereas they should have a small impact.
> >
> > Such long sched_slice can delay significantly the release of resources
> > as the tasks can wait hundreds of ms before the next running slot just
> > because of idle tasks queued on the rq.
> >
> > Cap the ideal_runtime to sysctl_sched_latency when comparing to the next
> > waiting task to make sure that tasks will regularly make progress and will
> > not be significantly impacted by idle/background tasks queued on the rq.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> >
> > While studying the problem, I have also considered to substract
> > cfs.idle_h_nr_running before computing the sched_slice but we can have
> > quite similar problem with low weight bormal task/cgroup so I have decided
> > to keep this solution.
>
> That ^... my proposal below has the same problem.
>
> This:
>
> > Also, this solution doesn't completly remove the impact of idle tasks
> > in the scheduling pattern but cap the running slice of a task to a max
> > value of 2*sysctl_sched_latency.
>
> I'm failing to see how.

The 1st part of check_preempt_tick ensures that we wait at least
sysctl_sched_min_granularity but not more than ideal_runtime before
possibly picking another entity.

Once both conditions above tested, we check that the vruntime diff
with the 1st pending entity is not larger than a sysctl_sched_latency.

Normally sched_slice should return an ideal_runtime value less than
sysctl_sched_latency. But we also want to provide a minimum runtime to
all tasks so we increase the sched_period when the number of tasks
increases too much.

The case described above is a corner case because of the large
difference between the tasks' prio.

Now, let assume that we have only 1 normal task and 1000 idle tasks, I
don't see any problem with providing a large ideal runtime for this
normal task. The problem comes when you have at least 2 normal tasks
as we don't expect the other normal task to wait for several hundreds
of ms before running.

That's why the comparison is done against the diff of vruntime; idle
task running for a 4ms tick will increase its vruntime with + 1366ms
which is comparable with the slice duration of the normal task. On the
other side, a 4ms tick will increase the vruntime of a nice 0 task to
4ms only. So the vruntime diff will quickly move above the
sysctl_sched_latency.

That being said, it doesn't completely fix the case of 2 nice -20 task runnings

>
> > kernel/sched/fair.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 260a55ac462f..96fedd0ab5fa 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4599,6 +4599,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > if (delta < 0)
> > return;
>
> (I'm thinking that early return is a bit pointless, a negative value
> won't be larger than ideal_time anyway)

yes

>
> > + ideal_runtime = min_t(u64, ideal_runtime, sysctl_sched_latency);
> > +
>
> (superfluous whitespace -- twice, once after the = once this whole extra
> line)

sorry for that...

>
> > if (delta > ideal_runtime)
> > resched_curr(rq_of(cfs_rq));
> > }
>
> Urgghhhh..
>
> so delta is in vtime here, while sched_latency is not, so the heavier
> the queue, the larger this value becomes.
>
> Given those 1000 idle tasks, rq-weight would be around 2048; however due
> to nr_running being insane, sched_slice() ends up being something like:

rq weight will be 1000*3+2*1024=5048
sched_period becomes 1002 * min_gran = 3006ms

idle task got a slice of weight(3) * (1002 min_gran) / 5048 =
3002/5048 * min_gran

normal task got a slice of weight(1024) * (1002 min_gran) / 5048 =
1024*1002*5048 * min_gran ~ 200 min_gran

if the 1000 task are in a idle sched group, that even worth because
the rq weight decrease to 3+2*1024 = 2051 and the slice increase to
500 min_gran

note that if we use 2 tasks nice -20 and 1000 tasks with nice 19 we
have similar slice duration (around 500 min_gran) so we can't really
rely on idle_nr_running

>
> 1000 * min_gran * 2 / 2048
>
> which is around ~min_gran and so won't come near to latency.
>
>
> since we already have idle_min_gran; how about something like this?

the idl_min gran will divide by 4 the sched_slice which can still
remain quite high

The main problem with my proposal is that task with negative nice prio
can still get larger sched_slice because vruntime increases slower
than real time

>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index efceb670e755..8dd18fc0affa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -706,12 +706,14 @@ static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
> *
> * p = (nr <= nl) ? l : l*nr/nl
> */
> -static u64 __sched_period(unsigned long nr_running)
> +static u64 __sched_period(unsigned long nr_running, unsigned long nr_idle)
> {
> - if (unlikely(nr_running > sched_nr_latency))
> - return nr_running * sysctl_sched_min_granularity;
> - else
> - return sysctl_sched_latency;
> + u64 period = 0;
> +
> + period += nr_running * sysctl_sched_min_granularity;
> + period += nr_idle * sysctl_sched_idle_min_granularity;
> +
> + return max_t(u64, period, sysctl_sched_latency);
> }
>
> static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> @@ -724,15 +726,25 @@ static bool sched_idle_cfs_rq(struct cfs_rq *cfs_rq);
> */
> static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - unsigned int nr_running = cfs_rq->nr_running;
> + unsigned int nr_idle = cfs_rq->idle_nr_running;
> + unsigned int nr_running = cfs_rq->nr_running - nr_idle;
> struct sched_entity *init_se = se;
> unsigned int min_gran;
> u64 slice;
>
> - if (sched_feat(ALT_PERIOD))
> - nr_running = rq_of(cfs_rq)->cfs.h_nr_running;
> + if (sched_feat(ALT_PERIOD)) {
> + nr_idle = rq_of(cfs_rq)->cfs.idle_h_nr_running;
> + nr_running = rq_of(cfs_rq)->cfs.h_nr_running - nr_idle;
> + }
> +
> + if (!se->on_rq) {
> + if (se_is_idle(se))
> + nr_idle++;
> + else
> + nr_running++;
> + }
>
> - slice = __sched_period(nr_running + !se->on_rq);
> + slice = __sched_period(nr_running, nr_idle);
>
> for_each_sched_entity(se) {
> struct load_weight *load;
>
>
> This changes how the compute the period depending on the composition. It
> suffers the exact same problem you had earlier though in that it doesn't
> work for the other low-weight cases. But perhaps we can come up with a
> better means of computing the period that *does* consider them?
>
> As said before;... urgh! bit of a sticky problem this.

Next message: Song Liu: "Re: [PATCH v5 bpf-next 2/4] ftrace: Allow IPMODIFY and DIRECT ops on the same function"
Previous message: Vitaly Kuznetsov: "Re: [PATCH 2/3] pci_ids: Add Microsoft PCI Vendor ID, and remove redundant definitions"
In reply to: Peter Zijlstra: "Re: [PATCH 4/4] sched/fair: limit sched slice duration"
Next in thread: Vincent Guittot: "Re: [PATCH 4/4] sched/fair: limit sched slice duration"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]