Re: [PATCH V2] sched: Forward deadline for early tick

From: Vincent Guittot
Date: Tue Jan 14 2025 - 05:16:44 EST


On Thu, 9 Jan 2025 at 06:49, zihan zhou <15645113830zzh@xxxxxxxxx> wrote:
>
> Thank you for your reply!
>
> > > There are two reasons for tick error: clockevent precision and
> > > CONFIG_IRQ_TIME_ACCOUNTING. With CONFIG_IRQ_TIME_ACCOUNTING every tick
> > > will be less than 1ms, but even without it, because of clockevent
> > > precision, a tick is still often less than 1ms. The system above does
> > > not have this config, yet the task still often takes more than 3ms.
> > >
> > > To solve this problem, we add a sched feature FORWARD_DEADLINE and
> > > consider forwarding the deadline appropriately. When vruntime is very
> > > close to the deadline and the task is ineligible, we consider that the
> > > task should be resched; the tolerance is set to min(vslice/128, tick/2).
> >
> > I'm worried about this approximation because the task didn't get the
> > slice it has requested, due to the time stolen by irq or a shorter
> > tick duration.
>
> Yes, you are right. Forwarding the deadline is not a good approach;
> although the error is small, the task will get less exec time.
>
>
> > Yes, but it also doesn't say that you can move its deadline forward
> > before it has consumed its requested slice. A task is only ensured to
> > get a discrete time quantum but can be preempted after each quantum
> > even if its slice has not elapsed.
> >
> > What you want is to trigger a need_resched after a minimum time
> > quantum has elapsed, but not to update the deadline before the slice
> > has elapsed.
> >
> > Now the question is what the minimum time quantum is for us. Should
> > it be a tick, whatever its real duration for the task? Should it be
> > longer?
>
> I have also been thinking about this question: what is the appropriate
> minimum time quantum? I think neither tick nor slice is a good choice.
> The task should have an atomic runtime greater than a tick to avoid
> frequent switching, but if that size is the slice, it is not flexible:
> tasks often use up the requested slice at once without caring about
> whether they are

The current implementation tries to minimize the number of context
switches by letting current exhaust its slice unless a waking eligible
task has a shorter deadline (a small sketch of this rule is below,
after the quoted paragraph).

> eligible. So can we add a new kernel parameter, like
> sysctl_sched_min_granularity? Adding a min_granularity between tick and
> slice seems like a good choice.
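
As a sketch of the rule I described above (untested userspace
illustration; the struct and field names are mine, not the kernel's):
current keeps the CPU until its slice is exhausted, unless the waking
task is both eligible and has an earlier virtual deadline.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for a scheduling entity, not a kernel type. */
struct task {
	int64_t deadline;   /* virtual deadline (vruntime at request + vslice) */
	bool    eligible;   /* non-negative lag, i.e. allowed to be picked */
};

/* Should the waking task 'p' preempt the currently running 'curr'? */
static bool should_preempt_on_wakeup(const struct task *curr,
				     const struct task *p)
{
	/* An ineligible waking task never preempts. */
	if (!p->eligible)
		return false;

	/* Only an earlier virtual deadline justifies a context switch. */
	return p->deadline < curr->deadline;
}

int main(void)
{
	struct task curr  = { .deadline = 300, .eligible = true };
	struct task waker = { .deadline = 250, .eligible = true };

	/* Prints 1: the waker is eligible and has an earlier deadline. */
	printf("preempt: %d\n", should_preempt_on_wakeup(&curr, &waker));
	return 0;
}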

We try to avoid adding more knobs.
We are back to changing the default slice so that it is not a multiple
of the tick, to leave room for interrupt context and others.

default slice = 0.75 msec * (1 + ilog(ncpus)), with ncpus capped at 8,
which means that we have a default slice of
0.75 msec for 1 cpu
1.50 msec up to 3 cpus
2.25 msec up to 7 cpus
3.00 msec for 8 cpus and above
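
For reference, a small standalone sketch of that computation (just to
show the scaling, not the kernel's own code):

#include <stdio.h>

/* integer log2, ilog2_u(1) == 0 */
static unsigned int ilog2_u(unsigned int v)
{
	unsigned int l = 0;

	while (v >>= 1)
		l++;
	return l;
}

/* default slice = 0.75 msec * (1 + ilog(ncpus)), ncpus capped at 8 */
static double default_slice_msec(unsigned int ncpus)
{
	if (ncpus > 8)
		ncpus = 8;
	return 0.75 * (1 + ilog2_u(ncpus));
}

int main(void)
{
	unsigned int cpus[] = { 1, 2, 4, 8, 64 };

	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
		printf("%2u cpus -> %.2f msec\n",
		       cpus[i], default_slice_msec(cpus[i]));
	return 0;
}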

For HZ=250 and HZ=100, all values are "ok". By "ok", I mean that tasks
will not get an extra tick, but their runtime remains far higher than
their slice. The only config that has an issue is HZ=1000 with 8 cpus
or more.

Using 0.70 instead of 0.75 should not change much for other configs
and would fix this config too, with a default slice equal to 2.80 msec:
0.70 msec for 1 cpu
1.40 msec up to 3 cpus
2.10 msec up to 7 cpus
2.80 msec for 8 cpus and above

That being said, the problem remains the same if a task sets a custom
slice that is a multiple of the tick, or if the time stolen by
interrupts is higher than 66us per tick on average.
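
To make the arithmetic concrete, here is a quick standalone sketch
(assumed numbers, not kernel code) that counts how many ticks a task
actually runs before its accounted runtime reaches its slice, when each
HZ=1000 tick accounts slightly less than 1000us. A 3.00 msec slice then
needs a 4th tick, while a 2.80 msec slice tolerates up to ~66us of loss
per tick across its 3 ticks:

#include <stdio.h>

/* Ticks needed until accounted runtime reaches the requested slice. */
static unsigned int ticks_needed(double slice_us, double effective_tick_us)
{
	double runtime = 0.0;
	unsigned int ticks = 0;

	while (runtime < slice_us) {
		runtime += effective_tick_us;
		ticks++;
	}
	return ticks;
}

int main(void)
{
	/* HZ=1000: nominal tick 1000us; assume ~10us lost per tick. */
	double eff = 1000.0 - 10.0;

	printf("slice 3.00 ms -> %u ticks\n", ticks_needed(3000.0, eff)); /* 4 */
	printf("slice 2.80 ms -> %u ticks\n", ticks_needed(2800.0, eff)); /* 3 */
	return 0;
}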

>
> We can let a task exec for min_granularity time at once; then we need
> to consider whether the task is eligible or whether the deadline needs
> to be updated.
>
>
> But I don't know how to handle wakeup preemption more appropriately.
> Is it necessary to wait for the preempted task to complete an atomic
> time quantum?
>
>