Re: [PATCH v12 8/8] sched/fair: Add latency list

From: Vincent Guittot
Date: Tue Mar 07 2023 - 05:20:11 EST


Le mardi 07 mars 2023 à 00:34:49 (+0530), Shrikanth Hegde a écrit :
>
> > Le lundi 06 mars 2023 à 17:03:30 (+0530), Shrikanth Hegde a écrit :
> >>
> >> On 3/5/23 6:33 PM, Vincent Guittot wrote:
> >>> On Sat, 4 Mar 2023 at 16:13, Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 3/3/23 10:01 PM, Vincent Guittot wrote:
> >>>>> Le jeudi 02 mars 2023 à 23:37:52 (+0530), Shrikanth Hegde a écrit :
> >>>>>> On 3/2/23 8:30 PM, Shrikanth Hegde wrote:
> >>>>>>> On 3/2/23 6:47 PM, Vincent Guittot wrote:
> >>>>>>>> On Thu, 2 Mar 2023 at 12:00, Shrikanth Hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>>> On 3/2/23 1:20 PM, Vincent Guittot wrote:
> >>>>>>>>>> On Wed, 1 Mar 2023 at 19:48, shrikanth hegde <sshegde@xxxxxxxxxxxxxxxxxx> wrote:
> >>>>>>>>>>> On 2/24/23 3:04 PM, Vincent Guittot wrote:
> >>>>> [...]
> >>>>>
> >>>>>>>>>>> Ran the schbench and hackbench with this patch series. Here comparison is
> >>>>>>>>>>> between 6.2 stable tree, 6.2 + Patch and 6.2 + patch + above re-arrange of
> >>>>>>>>>>> latency_node. Ran two cgroups, in one cgroup running stress-ng at 50%(group1)
> >>>>>>>>>>> and other is running these benchmarks (group2). Set the latency nice
> >>>>>>>>>>> of group2 to -20. These are run on Power system with 12 cores with SMT=8.
> >>>>>>>>>>> Total of 96 CPU.
> >>>>>>>>>>>
> >>>>>>>>>>> schbench gets lower latency compared to stabletree. Whereas hackbench seems
> >>>>>>>>>>> to regress under this case. Maybe i am doing something wrong. I will re-run
> >>>>>>>>>>> and attach the numbers to series.
> >>>>>>>>>>> Please suggest if any variation in the test i need to try.
> >>>>>>>>>> hackbench takes advanatge of a latency nice 19 as it mainly wants to
> >>>>>>>>>> run longer slice to move forward rather than preempting others all the
> >>>>>>>>>> time
> >>>>>>>>> hackbench still seems to regress in different latency nice values compared to
> >>>>>>>>> baseline of 6.2 in this case. up to 50% in some cases.
> >>>>>>>>>
> >>>>>>>>> 12 core powerpc system with SMT=8 i.e 96 CPU
> >>>>>>>>> running 2 CPU cgroups. No quota assigned.
> >>>>>>>>> 1st cgroup is running stress-ng with 48 threads. Consuming 50% of CPU.
> >>>>>>>>> latency is not changed for this cgroup.
> >>>>>>>>> 2nd cgroup is running hackbench. This cgroup is assigned the different latency
> >>>>>>>>> nice values of 0, -20 and 19.
> >>>>>>>> According to your other emails, you are using the cgroup interface and
> >>>>>>>> not the task's one. Do I get it right ?
> >>>>>>> right. I create cgroup, attach bash command with echo $$,
> >>>>>>> assign the latency nice to cgroup, and run hackbench from that bash prompt.
> >>>>>>>
> >>>>>>>> I haven't run test such tests in a cgroup but at least the test with
> >>>>>>>> latency_nice == 0 should not make any noticeable difference. Does this
> >>>>>>>> include the re-arrange patch that you have proposed previously ?
>
> Ran the test on a different system altogether. I don't see similar regression there.
> In fact latency nice is helping in reducing the latency as expected.
> It is much bigger system with 60 cores. i.e total of 480 cpu.
>
> Tested in the same way. created two cgroups. one is running the micro benchmarks
> and other is running stress-ng at different utilization point.
> This data is at 50% utilization point. Similar observations w.r.t latency
> is seen at 0%, 25%, 75% and 100% utilization as well.
>

Thanks for testing on a different system which seems to get results aligned with what
Prateek and I have seen on our system.


> ==========
> schbench
> ==========
> 6.2 6.2 + V12 + LN=0
> Groups: 1
> 50.0th: 14.0 12.5
> 75.0th: 16.5 14.0
> 90.0th: 18.5 15.5
> 95.0th: 20.5 17.0
> 99.0th: 27.5 21.0
> 99.5th: 36.0 23.5
> Groups: 2
> 50.0th: 14.0 16.0
> 75.0th: 17.0 18.0
> 90.0th: 20.0 21.0
> 95.0th: 23.0 23.0
> 99.0th: 71.0 34.0
> 99.5th: 1170.0 96.0
> 99.9th: 5088.0 3212.0
> Groups: 4
> 50.0th: 20.5 19.5
> 75.0th: 24.5 22.5
> 90.0th: 31.0 26.0
> 95.0th: 260.5 28.0
> 99.0th: 3644.0 35.0
> 99.5th: 5152.0 44.5
> 99.9th: 8076.0 168.5
> Groups: 8
> 50.0th: 26.0 25.5
> 75.0th: 32.5 31.5
> 90.0th: 41.5 36.5
> 95.0th: 794.0 39.5
> 99.0th: 5992.0 66.0
> 99.5th: 7208.0 159.0
> 99.9th: 9392.0 1604.0
> Groups: 16
> 50.0th: 37.5 34.0
> 75.0th: 49.5 44.5
> 90.0th: 70.0 53.5
> 95.0th: 1284.0 58.5
> 99.0th: 5600.0 102.5
> 99.5th: 7216.0 368.5
> 99.9th: 9328.0 5192.0
> Groups: 32
> 50.0th: 59.0 54.5
> 75.0th: 83.0 74.5
> 90.0th: 118.5 91.0
> 95.0th: 1921.0 100.5
> 99.0th: 6672.0 317.0
> 99.5th: 8252.0 2264.0
> 99.9th: 10448.0 8388.0
>
>
> ===========
> hackbench
> ==========
>
> type Groups 6.2 | 6.2 + V12 + LN=0
> Process 10 0.19 | 0.19
> Process 20 0.34 | 0.34
> Process 30 0.45 | 0.44
> Process 40 0.58 | 0.57
> Process 50 0.70 | 0.69
> Process 60 0.82 | 0.80
> thread 10 0.20 | 0.20
> thread 20 0.36 | 0.36
> Process(Pipe) 10 0.24 | 0.21
> Process(Pipe) 20 0.46 | 0.40
> Process(Pipe) 30 0.65 | 0.58
> Process(Pipe) 40 0.90 | 0.68
> Process(Pipe) 50 1.04 | 0.83
> Process(Pipe) 60 1.16 | 0.86
> thread(Pipe) 10 0.19 | 0.18
> thread(Pipe) 20 0.46 | 0.37
>
>

[...]

> >>>>
> >>>>
> >>>> Do you want me to try any other experiment on this further?
> >>> Yes, would be good to know which of the 3 changes in the patch create
> >>> the regression
> >>>
> >>> I suspect the 1st change to be the root cause of your problem but It
> >>> would be good if you can confirm my assumption with some tests
> >>>
> >>> Thanks
> >> Applied each change individually. 3rd change seems to cause the regression.
> >> Kept only the 3rd change and numbers are same as stable 6.2 for latency nice
> >> value of 0.
> > Ok, it's the patch 1 that aims to prevent some unfairness with low weight
> > waking task. And your platform probably falls in the last part of the commit:
> >
> > " Strictly speaking, we should use cfs->min_vruntime instead of
> > curr->vruntime but it doesn't worth the additional overhead and complexity
> > as the vruntime of current should be close to min_vruntime if not equal."
> >
> > Could you try the patch below on top of v12 ?
> >
> > ---
> > kernel/sched/fair.c | 21 +++++++++++----------
> > 1 file changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e2aeb4511686..77b03a280912 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5049,7 +5049,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > }
> >
> > static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> > +wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se, struct cfs_rq *cfs_rq);
> >
> > /*
> > * Pick the next process, keeping these things in mind, in this order:
> > @@ -5088,16 +5088,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > second = curr;
> > }
> >
> > - if (second && wakeup_preempt_entity(second, left) < 1)
> > + if (second && wakeup_preempt_entity(second, left, cfs_rq) < 1)
> > se = second;
> > }
> >
> > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > + if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left, cfs_rq) < 1) {
> > /*
> > * Someone really wants this to run. If it's not unfair, run it.
> > */
> > se = cfs_rq->next;
> > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > + } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left, cfs_rq) < 1) {
> > /*
> > * Prefer last buddy, try to return the CPU to a preempted task.
> > */
> > @@ -5107,7 +5107,7 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > /* Check for latency sensitive entity waiting for running */
> > latency = __pick_first_latency(cfs_rq);
> > if (latency && (latency != se) &&
> > - wakeup_preempt_entity(latency, se) < 1)
> > + wakeup_preempt_entity(latency, se, cfs_rq) < 1)
> > se = latency;
> >
> > return se;
> > @@ -7808,7 +7808,7 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> > *
> > */
> > static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > +wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se, struct cfs_rq *cfs_rq)
> > {
> > s64 gran, vdiff = curr->vruntime - se->vruntime;
> > s64 offset = wakeup_latency_gran(curr, se);
> > @@ -7818,6 +7818,8 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> >
> > gran = offset + wakeup_gran(se);
> >
> > + if (vdiff > gran)
> > + return 1;
> > /*
> > * At wake up, the vruntime of a task is capped to not be older than
> > * a sched_latency period compared to min_vruntime. This prevents long
> > @@ -7827,9 +7829,8 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > * for low priority task. Make sure that long sleeping task will get a
> > * chance to preempt current.
> > */
> > - gran = min_t(s64, gran, get_latency_max());
> > -
> > - if (vdiff > gran)
> > + vdiff = cfs_rq->min_vruntime - se->vruntime;
> > + if (vdiff > get_latency_max())
> > return 1;
> >
> > return 0;
> > @@ -7933,7 +7934,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > return;
> >
> > update_curr(cfs_rq_of(se));
> > - if (wakeup_preempt_entity(se, pse) == 1) {
> > + if (wakeup_preempt_entity(se, pse, cfs_rq_of(se)) == 1) {
> > /*
> > * Bias pick_next to pick the sched entity that is
> > * triggering this preemption.
> > --
> > 2.34.1
>
> Tried above patch on top of V12. Numbers are worse than V12. We maybe running into
> a corner case on this system.

Yes it can be a corner case.

Nevertheless, the patch above has a problem and does an unsigned comparison instead of a signed
one. I have forced the signed comparison in the patch below to be applied on top of
previous one:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 77b03a280912..22a497f92dbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7830,7 +7830,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se, struct
* chance to preempt current.
*/
vdiff = cfs_rq->min_vruntime - se->vruntime;
- if (vdiff > get_latency_max())
+ if (vdiff > (s64)get_latency_max())
return 1;

return 0;


>
> Type Groups 6.2 | 6.2+V12
>
> Process 10 0.33 | 0.44
> Process 20 0.61 | 0.90
> Process 30 0.87 | 1.29
> Process 40 1.10 | 1.69
> Process 50 1.34 | 2.08
> Process 60 1.58 | 2.39
> thread 10 0.36 | 0.53
> thread 20 0.64 | 0.94
> Process(Pipe) 10 0.18 | 0.46
> Process(Pipe) 20 0.32 | 0.75
> Process(Pipe) 30 0.42 | 1.01
> Process(Pipe) 40 0.56 | 1.15
> Process(Pipe) 50 0.68 | 1.38
> Process(Pipe) 60 0.80 | 1.56
>
>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index cdcd991bbcf1..c89c201dd164 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7827,7 +7827,6 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> >> * for low priority task. Make sure that long sleeping task will get a
> >> * chance to preempt current.
> >> */
> >> - gran = min_t(s64, gran, get_latency_max());
> >>
> >> if (vdiff > gran)
> >> return 1;
> >>
> >>
> >>

[...]

> >>>>>>>>>>>> * 'curr' points to currently running entity on this cfs_rq.
>