Re: [RFC PATCH] sched/eevdf: Use tunable knob sysctl_sched_base_slice as explicit time quanta

From: Ze Gao
Date: Tue Feb 06 2024 - 02:50:37 EST


On Mon, Feb 5, 2024 at 3:37 PM Vishal Chourasia <vishalc@xxxxxxxxxxxxx> wrote:
>
> On Sun, Feb 04, 2024 at 11:05:22AM +0800, Ze Gao wrote:
> > On Fri, Feb 2, 2024 at 7:50 PM Vishal Chourasia <vishalc@xxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, Jan 24, 2024 at 10:32:08AM +0800, Ze Gao wrote:
> > > > > Hi, How are you setting custom request values for process A and B?
> > > >
> > > > I cherry-picked Peter's commit [1] and added a SCHED_QUANTA feature control
> > > > for testing with and without my patch. You can check out [2] to see how it works.
> > > >
> > > Thank you for sharing your setup.
> > >
> > > Built the kernel according to [2] keeping v6.8.0-rc1 as base
> > >
> > > // NO_SCHED_QUANTA
> > > # perf script -i perf.data.old -s perf-latency.py
> > > PID 355045: Average Delta = 87.72726154385964 ms, Max Delta = 110.015044 ms, Count = 57
> > > PID 355044: Average Delta = 92.2655679245283 ms, Max Delta = 110.017182 ms, Count = 53
> > >
> > > // SCHED_QUANTA
> > > # perf script -i perf.data -s perf-latency.py
> > > PID 355065: Average Delta = 10.00 ms, Max Delta = 10.012708 ms, Count = 500
> > > PID 355064: Average Delta = 9.959 ms, Max Delta = 10.023588 ms, Count = 501
> > >
> > > # cat /sys/kernel/debug/sched/base_slice_ns
> > > 3000000
> > >
> > > base slice is not being enforced.
> > >
> > > Next, looking closely at the perf.data file:
> > >
> > > # perf script -i perf.data -C 1 | grep switch
> > > ...
> > > stress-ng-cpu 355064 [001] 776706.003222: sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
> > > stress-ng-cpu 355065 [001] 776706.013218: sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
> > > stress-ng-cpu 355064 [001] 776706.023218: sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
> > > stress-ng-cpu 355065 [001] 776706.033218: sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
> > > ...
> > >
> > > Delta wait time is approx 0.01s or 10ms
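
A minimal sketch of how the ~10ms gap can be read off the sched_switch lines quoted above. The regular expression and the helper name switch_deltas_us are illustrative assumptions, not part of perf-latency.py, and the parsing assumes the perf script timestamp layout shown above, with six fractional digits.

```python
import re

# Sample lines copied from the `perf script` output quoted above.
SAMPLE = """\
stress-ng-cpu 355064 [001] 776706.003222: sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
stress-ng-cpu 355065 [001] 776706.013218: sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
stress-ng-cpu 355064 [001] 776706.023218: sched:sched_switch: stress-ng-cpu:355064 [120] R ==> stress-ng-cpu:355065 [120]
stress-ng-cpu 355065 [001] 776706.033218: sched:sched_switch: stress-ng-cpu:355065 [120] R ==> stress-ng-cpu:355064 [120]
"""

# The timestamp is the "seconds.microseconds:" field right after "[cpu]".
# Assumption: the fractional part always has six digits (microseconds).
TS_RE = re.compile(r"\[\d+\]\s+(\d+)\.(\d{6}):\s+sched:sched_switch:")


def switch_deltas_us(text):
    """Return the gaps, in microseconds, between consecutive sched_switch events."""
    stamps = []
    for line in text.splitlines():
        m = TS_RE.search(line)
        if m:
            # Integer arithmetic avoids float rounding on large timestamps.
            stamps.append(int(m.group(1)) * 1_000_000 + int(m.group(2)))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


print(switch_deltas_us(SAMPLE))  # [9996, 10000, 10000]
```

Every gap is within a few microseconds of 10000us, i.e. one switch per 10ms.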
> >
> > You can check your HZ, which I would guess is 100 in your configuration.
> > That would explain your results.
> Yes. What is it in your case, if I may ask?

As I mentioned in the changelog: HZ=1000, sysctl_sched_base_slice=3ms,
nr_cpu=42.
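
For reference, a rough userspace sketch of why HZ matters here. It mirrors the max_t(u64, TICK_NSEC, sysctl_sched_base_slice) expression from the RFC; the helper names are hypothetical, and tick_nsec() assumes the kernel's rounded definition of TICK_NSEC, (NSEC_PER_SEC + HZ/2) / HZ.

```python
NSEC_PER_SEC = 1_000_000_000


def tick_nsec(hz):
    """Length of one scheduler tick in ns (assumed rounding, see lead-in)."""
    return (NSEC_PER_SEC + hz // 2) // hz


def effective_quantum_ns(hz, base_slice_ns):
    """Userspace stand-in for max_t(u64, TICK_NSEC, sysctl_sched_base_slice)."""
    return max(tick_nsec(hz), base_slice_ns)


BASE_SLICE_NS = 3_000_000  # value read from /sys/kernel/debug/sched/base_slice_ns

for hz in (100, 1000):
    print(f"HZ={hz}: effective quantum = {effective_quantum_ns(hz, BASE_SLICE_NS)} ns")
```

With base_slice_ns = 3000000 this gives 10ms at HZ=100 and 3ms at HZ=1000, which is consistent with the 10ms switches reported above.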

> > > So, switch is not happening at base_slice_ns boundary.
> > >
> > > But why? Is it possible that base_slice_ns is not properly used on
> > > arch != x86?
> >
> > The thing is, in my RFC the effective quantum is actually
> >
> > max_t(u64, TICK_NSEC, sysctl_sched_base_slice)
> >
> > where sysctl_sched_base_slice is precisely a handy tunable knob
> > for users (maybe I should make this clearer).
> >
> > See what I do in update_entity_lag() and you will understand.
> Thanks. I will look into it.
> >
> > Note we have 3 time-related concepts here:
> > 1. TIME TICK: (schedule) accounting time unit.
> > 2. TIME QUANTA (not necessarily the effective one): scheduling time unit
> > 3. USER SLICE: time slice per request
> To double check:
> USER SLICE is the request size submitted by a task competing with other
> tasks for the time-shared resource (here, the processor).
> The scheduler allocates the time-shared resource in units of a quantum `q`,
> which is our TIME QUANTA.
> TIME TICK is the time period between two scheduler ticks.

Yeah, that is how I see them.

Note that we don't necessarily allocate time quanta continuously to fulfil a
user's request.

To quote from the paper, "by decoupling the request size from the size of a time
quantum, ... gives a client possibility of trading between allocation accuracy
and scheduling overhead". This is the very reason why this patch proposes to
bring the concept of time quanta into existence.
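
As a simplified illustration of that decoupling (a toy model, not scheduler code): if a request is served in units of the quantum, a long request spans many quanta while a short one fits within a single quantum. The 100ms and 0.1ms request sizes are the ones used in the changelog test; the 3ms quantum and the function name quanta_per_request are assumptions made for this example.

```python
def quanta_per_request(request_ns, quantum_ns):
    """Number of whole quanta needed to cover one request (ceiling division)."""
    if quantum_ns <= 0:
        raise ValueError("quantum must be positive")
    return -(-request_ns // quantum_ns)


QUANTUM_NS = 3_000_000  # assumed quantum: 3ms

# Request sizes from the changelog test: 100ms and 0.1ms.
for request_ns in (100_000_000, 100_000):
    print(f"request={request_ns} ns -> {quanta_per_request(request_ns, QUANTUM_NS)} quanta")
```

In this model the 100ms request needs 34 quanta and the 0.1ms request needs 1, and those quanta do not have to be handed out back to back.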

Cheers,
-- Ze

> Thanks,
> -- vishal.c
> >
> > To implement latency-nice while being as fair as possible, we must
> > carefully consider the size relationship between them, and especially
> > the value range of USER SLICE, because the lag (unfairness) depends on
> > both the time quantum and the user-requested slices.
> >
> >
> > Regards,
> > -- Ze
> >
> > > >
> > > > echo NO_SCHED_QUANTA > /sys/kernel/debug/sched/features
> > > > test
> > > > sleep 2
> > > > echo SCHED_QUANTA > /sys/kernel/debug/sched/features
> > > > test
> > > >
> > > >
> > > > [1]: https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/kernel/sched?h=sched/eevdf&id=98866150f92f268a2f08eb1d884de9677eb4ec8f
> > > > [2]: https://github.com/zegao96/linux/tree/sched-eevdf
> > > >
> > > >
> > > > Regards,
> > > > -- Ze
> > > >
> > > > > >
> > > > > >                     stress-ng-cpu:10705   stress-ng-cpu:10706
> > > > > > ---------------------------------------------------------------------
> > > > > > Slices(ms)          100                   0.1
> > > > > > Runtime(ms)         4934.206              5025.048
> > > > > > Switches            58                    67
> > > > > > Average delay(ms)   87.074                73.863
> > > > > > Maximum delay(ms)   101.998               101.010
> > > > > >
> > > > > > In contrast, using sysctl_sched_base_slice as the size of a 'quantum'
> > > > > > in this patch gives us better control of the allocation accuracy and
> > > > > > the average latency:
> > > > > >
> > > > > >                     stress-ng-cpu:10584   stress-ng-cpu:10583
> > > > > > ---------------------------------------------------------------------
> > > > > > Slices(ms)          100                   0.1
> > > > > > Runtime(ms)         4980.309              4981.356
> > > > > > Switches            1253                  1254
> > > > > > Average delay(ms)   3.990                 3.990
> > > > > > Maximum delay(ms)   5.001                 4.014
> > > > > >
> > > > > > Furthermore, with sysctl_sched_base_slice = 10ms, we might benefit from
> > > > > > fewer switches at the cost of worse delay:
> > > > > >
> > > > > >                     stress-ng-cpu:11208   stress-ng-cpu:11207
> > > > > > ---------------------------------------------------------------------
> > > > > > Slices(ms)          100                   0.1
> > > > > > Runtime(ms)         4983.722              4977.035
> > > > > > Switches            456                   456
> > > > > > Average delay(ms)   10.963                10.939
> > > > > > Maximum delay(ms)   19.002                21.001
> > > > > >
> > > > > > By tuning the sysctl_sched_base_slice knob, we can strike a good
> > > > > > balance between throughput and latency by adjusting the frequency
> > > > > > of context switches, and the conclusions are very close to what's
> > > > > > covered in [1] with the explicit definition of a time quantum. It
> > > > > > also gives more freedom to choose the eligible request length range
> > > > > > (either through nice value or raw value) without worrying too much
> > > > > > about overscheduling or underscheduling.
> > > > > >
> > > > > > Note that this change should introduce no obvious regression, because
> > > > > > all processes have the same request length, sysctl_sched_base_slice,
> > > > > > as in the status quo. The benchmark results show this as well.
> > > > > >
> > > > > > schbench -m2 -F128 -n10 -r90        w/patch    tip/6.7-rc7
> > > > > > Wakeup  (usec): 99.0th:             3028       95
> > > > > > Request (usec): 99.0th:             14992      21984
> > > > > > RPS    (count): 50.0th:             5864       5848
> > > > > >
> > > > > > hackbench -s 512 -l 200 -f 25 -P    w/patch    tip/6.7-rc7
> > > > > > -g 10                               0.212      0.223
> > > > > > -g 20                               0.415      0.432
> > > > > > -g 30                               0.625      0.639
> > > > > > -g 40                               0.852      0.858
> > > > > >
> > > > > > [1]: https://dl.acm.org/doi/10.5555/890606
> > > > > > [2]: https://lore.kernel.org/all/20230420150537.GC4253@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#u
> > > > > >
> > > > > > Signed-off-by: Ze Gao <zegao@xxxxxxxxxxx>
> > > > > > ---
> > > > >
> > > >