Re: sched: tweak select_idle_sibling to look for idle threads
From: Chris Mason
Date: Tue May 03 2016 - 11:12:31 EST
On Tue, May 03, 2016 at 04:32:25PM +0200, Peter Zijlstra wrote:
> On Mon, May 02, 2016 at 11:47:25AM -0400, Chris Mason wrote:
> > On Mon, May 02, 2016 at 04:58:17PM +0200, Peter Zijlstra wrote:
> > > On Mon, May 02, 2016 at 04:50:04PM +0200, Mike Galbraith wrote:
> > > > Oh btw, did you know single socket boxen have no sd_busy? That doesn't
> > > > look right.
> > >
> > > I suspected; didn't bother looking at it yet. The 'problem' is that the LLC
> > > domain is the top-most, so it doesn't have a parent domain. I'm sure we
> > > can come up with something if we can get this all working right.
> > >
> > > And yes, I can get gains on various workloads with various options, I
> > > can even break all workloads, but I've so far completely failed on
> > > getting a win for everyone :/
> >
> > Adding in the task_hot() check to decide if scanning idle was a good
> > idea ended up being really important
>
> So I'm conflicted on this patch:
>
> +static int bounce_to_target(struct task_struct *p, int cpu)
> +{
> + s64 delta;
> +
> + /*
> + * as the run queue gets bigger, it's more and more likely that
> + * balance will have distributed things for us, and less likely
> + * that scanning all our CPUs for an idle one will find one.
> + * So, if nr_running > 1, just call this CPU good enough
> + */
> + if (cpu_rq(cpu)->cfs.nr_running > 1)
> + return 1;

The nr_running check is interesting. It is supposed to give the same
benefit as your "do we have anything idle?" variable, but without having
to constantly update a variable somewhere. I'll have to do a few runs
to verify (maybe an idle_scan_failed counter).

> +
> + /* taken from task_hot() */
> + delta = rq_clock_task(task_rq(p)) - p->se.exec_start;
> + return delta < (s64)sysctl_sched_migration_cost;
> +}
>
> This will work for your schbench workload because it sleeps for 30ms while
> the migration_cost thingy is 500us, therefore you'll trigger the full
> LLC scan.

The task_hot() checks don't do much for the sleeping schbench runs, but
they help a lot for this:

# pick a single core, in my case cpus 0,20 are the same core
# cpu_hog is any program that spins
#
taskset -c 20 cpu_hog &

# schbench -p 4 means message passing mode with 4 byte messages (like
# pipe test), no sleeps, just bouncing as fast as it can.
#
# make the scheduler choose between the sibling of the hog and cpu 1
#
taskset -c 0,1 schbench -p 4 -m 1 -t 1

Current mainline will stuff both schbench threads onto CPU 1, leaving
CPU 0 100% idle. My first patch with the minimal task_hot() checks
would sometimes pick CPU 0. My second patch that just directly calls
task_hot() sticks to CPU 1, which is ~3x faster than spreading it.

The full task_hot() checks also really help tbench.

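To make the two cases concrete, here is a rough userspace model of the
bounce_to_target() hunk quoted above. It is only a sketch, not kernel
code: the function names, the plain integer arguments and the example
numbers (a 10us gap for the message passing case in particular) are
stand-ins I picked for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the quoted bounce_to_target(); not kernel code.
 * The arguments are assumed stand-ins for kernel state:
 *   nr_running        ~ cpu_rq(cpu)->cfs.nr_running
 *   now_ns            ~ rq_clock_task(task_rq(p))
 *   exec_start_ns     ~ p->se.exec_start
 *   migration_cost_ns ~ sysctl_sched_migration_cost
 * Returns 1 to keep the target CPU, 0 to go scan for an idle one.
 */
static int bounce_to_target_model(unsigned int nr_running, int64_t now_ns,
				  int64_t exec_start_ns,
				  int64_t migration_cost_ns)
{
	/* A busy run queue: call this CPU good enough. */
	if (nr_running > 1)
		return 1;

	/* Same comparison as the "taken from task_hot()" lines. */
	return (now_ns - exec_start_ns) < migration_cost_ns;
}

/* Illustrative numbers only: 500us migration cost, as in the thread. */
static void bounce_examples(void)
{
	const int64_t cost = 500 * 1000;

	/* Sleeping schbench: 30ms since exec_start, so do the full scan. */
	assert(bounce_to_target_model(1, 30 * 1000 * 1000, 0, cost) == 0);

	/* Message passing with no sleeps (assumed 10us gap): stay put. */
	assert(bounce_to_target_model(1, 10 * 1000, 0, cost) == 1);

	/* More than one runnable task: stay put regardless of the delta. */
	assert(bounce_to_target_model(2, 30 * 1000 * 1000, 0, cost) == 1);
}
```

With these numbers the sleepy workload always falls through to the scan,
while the no-sleep ping-pong case stays on the target CPU.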
>
> _However_, the migration_cost is supposed to model the cost of leaving
> the LLC, so testing against that here seems wrong.
>
> Let me go play with something that measures the cost of doing that LLC
> scan and compares that against the sleepy time -- of course, now need to
> go figure out how to do this clock thing without rq-lock pain.
>
>
>
> + if (package_sd && !bounce_to_target(p, target)) {
> + for_each_cpu_and(i, sched_domain_span(package_sd), tsk_cpus_allowed(p)) {
> + if (idle_cpu(i)) {
> + target = i;
> + break;
> + }
> +
> + }
> + }
>
> Also note your s/sd/package_sd/ rename is, strictly speaking, wrong.
> Sure, on your current Intel system the LLC is the entire package, but
> this is not true in general.
>
> Take for instance the Intel Core2Quad and AMD Bulldozer thingies, they
> had two dies in one package, and correspondingly two LLC domains in one
> package.
>
> (also, the Intel cluster-on-die thing can split the thing in two)
>
> There were also the old P6 era SMP boards which had external LLC, where
> you could have an LLC shared across multiple packages -- although I'm
> thinking we'll never see that again, due to off package being far
> toooooo slooooooow these days.
Gotcha, makes sense. I'll switch to llc_sd ;)
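
For anyone following along, the difference between scanning the package
and scanning the LLC can be shown with a small userspace model of the
loop in the hunk above. This is a sketch under assumptions, not kernel
code: CPU sets are modeled as 64-bit masks (bit i is CPU i) in place of
sched_domain_span(), tsk_cpus_allowed() and idle_cpu(), and the
two-LLC topology in the example is made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the quoted idle scan; not kernel code.
 * Walks the CPUs present in both span and allowed, lowest bit first,
 * and returns the first one that is idle, or target if none is.
 */
static int scan_for_idle_model(uint64_t span, uint64_t allowed,
			       uint64_t idle, int target)
{
	uint64_t candidates = span & allowed;
	int i;

	for (i = 0; i < 64; i++) {
		if (!(candidates & (UINT64_C(1) << i)))
			continue;
		if (idle & (UINT64_C(1) << i))
			return i;
	}
	return target;
}

/*
 * Hypothetical package with two LLC domains: CPUs 0-1 and CPUs 2-3.
 * Only CPU 2 is idle and the wakeup target is CPU 1.
 */
static void llc_vs_package_example(void)
{
	const uint64_t llc0 = 0x3, package = 0xf, allowed = 0xf;
	const uint64_t idle = 0x4;

	/* Scanning the LLC span finds nothing idle and keeps the target. */
	assert(scan_for_idle_model(llc0, allowed, idle, 1) == 1);

	/* Scanning the whole package would leave the LLC for CPU 2. */
	assert(scan_for_idle_model(package, allowed, idle, 1) == 2);
}
```

On a box where the LLC is the whole package the two spans are the same
mask, which is why the name only matters on the other topologies.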
-chris