Re: [PATCH RFC 1/2] sched: Minimize the idle cpu selection race window.

From: Uladzislau Rezki
Date: Thu Nov 23 2017 - 05:53:06 EST


Hello, Atish, Peter, all.

I have a question about the case when a task's nr_cpus_allowed is 1.
In that scenario we do not go through the fair-class ->select_task_rq()
callback, therefore even though a task "p" is placed on an idle CPU,
that CPU will not be marked as claimed for wake-up.

What do you think about setting per_cpu(claim_wakeup, cpu) = 1; in
select_task_rq() instead, and possibly getting rid of the assignments in
the other places (it increases the race window a bit)?
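
For reference, the placement decision on the wake-up path in v4.14 looks
roughly like this (a simplified sketch from memory, not the exact code):

/* kernel/sched/core.c, simplified sketch */
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
	if (p->nr_cpus_allowed > 1)
		/* fair path: select_task_rq_fair() -> select_idle_sibling() */
		cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
	else
		/* pinned task: the fair path (and its claim marking) is skipped */
		cpu = cpumask_any(&p->cpus_allowed);

	/* fallback handling omitted */
	return cpu;
}

Both branches go through this function, so setting the flag here would also
cover tasks pinned to a single CPU.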

It would also improve the ILB (idle load balancer) decision for NO_HZ idle
balancing. To be more clear, I mean:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d17c5da523a0..8111cfa20075 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1559,6 +1559,7 @@ int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
 		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
+	per_cpu(claim_wakeup, cpu) = 1;
 	return cpu;
 }

@@ -3888,6 +3889,7 @@ int task_prio(const struct task_struct *p)
 	return p->prio - MAX_RT_PRIO;
 }
 
+DEFINE_PER_CPU(int, claim_wakeup);
 /**
  * idle_cpu - is a given CPU idle currently?
  * @cpu: the processor in question.
@@ -3909,6 +3911,9 @@ int idle_cpu(int cpu)
 		return 0;
 #endif
 
+	if (per_cpu(claim_wakeup, cpu))
+		return 0;
+
 	return 1;
 }

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5c09ddf8c832..55ad7fa97e95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8596,7 +8596,17 @@ static struct {

 static inline int find_new_ilb(void)
 {
-	int ilb = cpumask_first(nohz.idle_cpus_mask);
+	cpumask_t cpumask;
+	int ilb;
+
+	cpumask_copy(&cpumask, nohz.idle_cpus_mask);
+
+	/*
+	 * This is probably not a good approach on systems with many CPUs.
+	 */
+	for_each_cpu(ilb, &cpumask)
+		if (!per_cpu(claim_wakeup, ilb))
+			break;
 
 	if (ilb < nr_cpu_ids && idle_cpu(ilb))
 		return ilb;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d518664cce4f..91cf50cab836 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -29,6 +29,7 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf
 {
 	put_prev_task(rq, prev);
 	update_idle_core(rq);
+	this_cpu_write(claim_wakeup, 0);
 	schedstat_inc(rq->sched_goidle);
 	return rq->idle;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3b448ba82225..0e0bb7536725 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1055,6 +1055,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(int, claim_wakeup);
 
 struct sched_group_capacity {
 	atomic_t ref;

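For completeness, with the hunks above applied, a second waker backs off in
the v4.14 fast path roughly like this (a sketch; the LLC scan steps are
omitted):

/* kernel/sched/fair.c, simplified sketch */
static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
	/* idle_cpu() now returns 0 for a CPU already claimed by another waker */
	if (idle_cpu(target))
		return target;

	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;

	/* ... scan the LLC domain for another idle CPU, else fall back ... */
	return target;
}
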
--
Uladzislau Rezki

On Tue, Nov 21, 2017 at 11:23:18PM -0600, Atish Patra wrote:
> Here are the results of schbench (scheduler latency benchmark) and uperf
> (networking benchmark).
>
> Hardware config: 20 core (40 hyperthreaded cpus) x86 box.
> schbench config: message threads = 2; time = 180s; worker threads = variable
> uperf config: ping-pong test on the loopback interface with message size = 8k
>
> Overall, both benchmarks seem to be happiest when the number of threads is
> close to the number of cpus.
> --------------------------------------------------------------------------------------------------------------------------
> schbench Maximum Latency (lower is better):
>                     Base (4.14)               Base+pcpu
> Num Workers      Mean        Stdev         Mean        Stdev        Improvement (%)
> 10               3026.8      4987.12       523         474.35       82.7210255055
> 18               13854.6     1841.61       12945.6     125.19       6.5609977913
> 19               16457       2046.51       12985.4     48.46        21.0949747828
> 20               14995       2368.84       15838       2038.82      -5.621873958
> 25               29952.2     107.72        29673.6     337.57       0.9301487036
> 30               30084       19.768        30096.2     7.782        -0.0405531179
>
> -------------------------------------------------------------------------------------------------------------------
>
> The proposed fix seems to improve the maximum latency for lower numbers of
> threads. It also seems to reduce the variation (lower stdev).
>
> If the number of threads is equal to or higher than the number of cpus, the
> benchmark produces significantly higher latencies because of its nature.
> Results for the higher-thread use cases are presented to provide a complete
> picture, but it is difficult to conclude anything from them.
>
> Next, individual percentile results are presented for each use case. The
> proposed fix also improves latency across all percentiles for the
> configuration (19 worker threads) that should saturate the system.
> ---------------------------------------------------------------------------------------------------------------------
> schbench Latency in usec (lower is better)
>                     Baseline (4.14)           Base+pcpu
> Num Workers      Mean        stdev         Mean        stdev        Improvement (%)
>
> 50th
> 10               64.2        2.039         63.6        1.743        0.934
> 18               57.6        5.388         57          4.939        1.041
> 19               63          4.774         58          4            7.936
> 20               59.6        4.127         60.2        5.153        -1.006
> 25               78.4        0.489         78.2        0.748        0.255
> 30               96.2        0.748         96.4        1.019        -0.207
>
> 75th
> 10               72          3.033         71.6        2.939        0.555
> 18               78          2.097         77.2        2.135        1.025
> 19               81.6        1.2           79.4        0.8          2.696
> 20               81          1.264         80.4        2.332        0.740
> 25               109.6       1.019         110         0            -0.364
> 30               781.4       50.902        731.8       70.6382      6.3475
>
> 90th
> 10               80.4        3.666         80.6        2.576        -0.248
> 18               87.8        1.469         88          1.673        -0.227
> 19               92.8        0.979         90.6        0.489        2.370
> 20               92.6        1.019         92          2            0.647
> 25               8977.6      1277.160      9014.4      467.857      -0.409
> 30               9558.4      334.641       9507.2      320.383      0.5356
>
> 95th
> 10               86.8        3.867         87.6        4.409        -0.921
> 18               95.4        1.496         95.2        2.039        0.209
> 19               102.6       1.624         99          0.894        3.508
> 20               103.2       1.326         102.2       2.481        0.968
> 25               12400       78.383        12406.4     37.318       -0.051
> 30               12336       40.477        12310.4     12.8         0.207
>
> 99th
> 10               99.2        5.418         103.4       6.887        -4.233
> 18               115.2       2.561         114.6       3.611        0.5208
> 19               126.25      4.573         120.4       3.872        4.6336
> 20               145.4       3.09          133         1.41         8.5281
> 25               12988.8     15.676        12924.8     25.6         0.4927
> 30               12988.8     15.676        12956.8     32.633       0.2463
>
> 99.50th
> 10               104.4       5.161         109.8       7.909        -5.172
> 18               127.6       7.391         124.2       4.214        2.6645
> 19               2712.2      4772.883      133.6       5.571        95.074
> 20               3707.8      2831.954      2844.2      4708.345     23.291
> 25               14032       1283.834      13008       0            7.2976
> 30               16550.4     886.382       13840       1218.355     16.376
> ------------------------------------------------------------------------------------------------------------------------
>
>
> Results from uperf
> uperf config: Loopback ping pong test with message size = 8k
>
>                 Baseline (4.14)           Baseline+pcpu
> Threads      Mean        stdev         Mean        stdev        Improvement (%)
> 1            9.056       0.02          8.966       0.083        -0.993
> 2            17.664      0.13          17.448      0.303        -1.222
> 4            32.03       0.22          31.972      0.129        -0.181
> 8            58.198      0.31          58.588      0.198        0.670
> 16           101.018     0.67          100.056     0.455        -0.952
> 32           148.1       15.41         164.494     2.312        11.069
> 64           203.66      1.16          203.042     1.348        -0.3073
> 128          197.12      1.04          194.722     1.174        -1.2165
>
> The race window fix seems to help uperf at 32 threads (closest to the number
> of cpus) as well.
>
> Regards,
> Atish
>
> On 11/04/2017 07:58 PM, Joel Fernandes wrote:
> > Hi Peter,
> >
> > On Tue, Oct 31, 2017 at 1:20 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > On Tue, Oct 31, 2017 at 12:27:41AM -0500, Atish Patra wrote:
> > > > Currently, multiple tasks can wake up on the same cpu from the
> > > > select_idle_sibling() path in case they wake up simultaneously
> > > > and last ran on the same llc. This happens because an idle cpu
> > > > is not updated until the idle task is scheduled out. Any task waking
> > > > during that period may potentially select that cpu as a wakeup
> > > > candidate.
> > > >
> > > > Introduce a per cpu variable that is set as soon as a cpu is
> > > > selected for wakeup for any task. This prevents other tasks
> > > > from selecting the same cpu again. Note: this does not close the race
> > > > window but minimizes it to the access of the per-cpu variable. If two
> > > > wakee tasks access the per cpu variable at the same time, they may
> > > > select the same cpu again. But it minimizes the race window
> > > > considerably.
> > > The very most important question: does it actually help? Which
> > > benchmarks give what numbers?
> > I collected some numbers with an Android benchmark called Jankbench.
> > Most tests didn't show an improvement or degradation with the patch.
> > However, one of the tests, called "list view", consistently shows an
> > improvement. Particularly striking is the improvement at the mean and the
> > 25th percentile.
> >
> > For the list_view test, Jankbench pulls up a list of text and scrolls the
> > list; this exercises the display pipeline in Android to render and
> > display the animation as the scroll happens. For Android, lower frame
> > times are considered quite important, as that means we are less likely
> > to drop frames, giving the user a good experience rather than a
> > perceivably poor one.
> >
> > For each frame, Jankbench measures the total time the frame takes and
> > stores it in a DB (the time from when the app starts drawing to when
> > the rendering completes and the frame is submitted for display).
> > Following is the distribution of frame times in ms.
> >
> > count 16304 (@60 fps, 4.5 minutes)
> >
> >                  Without patch     With patch
> > mean             5.196633          4.429641  (+14.75%)
> > std              2.030054          2.310025
> > 25%              5.606810          1.991017  (+64.48%)
> > 50%              5.824013          5.716631  (+1.84%)
> > 75%              5.987102          5.932751  (+0.90%)
> > 95%              6.461230          6.301318  (+2.47%)
> > 99%              9.828959          9.697076  (+1.34%)
> >
> > Note that although Android uses energy aware scheduling patches, I
> > turned those off to bring the test as close to mainline as possible. I
> > also backported Vincent's and Brendan's slow path fixes to the 4.4
> > kernel that the Pixel 2 uses.
> >
> > Personally I am in favor of this patch, considering this test data and
> > also that, in the past, our teams had to deal with the same race issue
> > and used cpusets to avoid it (although they probably tested with
> > "energy aware" CPU selection kept on).
> >
> > thanks,
> >
> > - Joel
>