Re: [PATCH RFC] select_idle_sibling experiments

From: Bastien Philbert
Date: Tue Apr 05 2016 - 17:06:19 EST

On 2016-04-05 04:03 PM, Matt Fleming wrote:
> On Tue, 05 Apr, at 02:08:22PM, Chris Mason wrote:
>>
>> I started with a small-ish program to benchmark wakeup latencies. The
>> basic idea is a bunch of worker threads who sit around and burn CPU.
>> Every once and a while they send a message to a message thread.
>
> This reminds me of something I've been looking at recently; a similar
> workload in Mel's mmtests based on pgbench with 1-client that also has
> this problem of idle_cpu() being false at an inconvenient time in
> select_idle_sibling(), so we move the task off the cpu and the cpu
> then immediately goes idle.
>
> This leads to tasks bouncing around the socket as we search for idle
> cpus.
>
>> It has knobs for cpu think time, and for how long the messenger thread
>> waits before replying. Here's how I'm running it with my patch:
>
> [...]
>
> Cool, I'll go have a play with this.
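
For anyone without the benchmark handy, here is a minimal pthread
sketch of the workload shape being described. It is not the actual
program; the thread count and think-time values below are made up, and
the reply-delay knob and the latency measurement itself are elided:

#include <pthread.h>
#include <time.h>

#define NR_WORKERS	4
#define THINK_USEC	10000L	/* made-up "cpu think time" knob */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;
static int pending;

/* burn cpu for roughly @usec microseconds */
static void burn_cpu(long usec)
{
	struct timespec start, now;

	clock_gettime(CLOCK_MONOTONIC, &start);
	do {
		clock_gettime(CLOCK_MONOTONIC, &now);
	} while ((now.tv_sec - start.tv_sec) * 1000000L +
		 (now.tv_nsec - start.tv_nsec) / 1000L < usec);
}

/* workers: burn cpu, then occasionally wake the message thread */
static void *worker(void *arg)
{
	for (;;) {
		burn_cpu(THINK_USEC);
		pthread_mutex_lock(&lock);
		pending++;
		pthread_cond_signal(&wake);	/* wakeup latency would be measured here */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* message thread: sleep until a worker pings us */
static void *messenger(void *arg)
{
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!pending)
			pthread_cond_wait(&wake, &lock);
		pending--;
		pthread_mutex_unlock(&lock);
		/* the reply back to the worker is elided */
	}
	return NULL;
}

int main(void)
{
	pthread_t msg, tid;
	int i;

	pthread_create(&msg, NULL, messenger, NULL);
	for (i = 0; i < NR_WORKERS; i++)
		pthread_create(&tid, NULL, worker, NULL);
	pthread_join(msg, NULL);	/* threads run forever */
	return 0;
}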
>
>> Now, on to the patch. I pushed some code around and narrowed the
>> problem down to select_idle_sibling(). We have cores going into and out
>> of idle fast enough that even this cut our latencies in half:
>>
>> static int select_idle_sibling(struct task_struct *p, int target)
>> 				goto next;
>>
>> 			for_each_cpu(i, sched_group_cpus(sg)) {
>> -				if (i == target || !idle_cpu(i))
>> +				if (!idle_cpu(i))
>> 					goto next;
>> 			}
>>
>> IOW, by the time we get down to for_each_cpu(), the idle_cpu() check
>> done at the top of the function is no longer valid.
>
> Yeah. The problem is that we're racing with the cpu going in and out
> of idle, and since you're exploiting that race condition, this is
> highly tuned to your specific workload.
>
> Which is a roundabout way of saying, this is probably going to
> negatively impact other workloads.
>
>> I tried a few variations on select_idle_sibling() that preserved the
>> underlying goal of returning idle cores before idle SMT threads. They
>> were all horrible in different ways, and none of them were fast.
>
> I toyed with ignoring idle_cpu() in select_idle_sibling() for my
> workload. That actually was faster ;)
>
>> The patch below just makes select_idle_sibling pick the first idle
>> thread it can find. When I ran it through production workloads here, it
>> was faster than the patch we've been carrying around for the last few
>> years.
>
> It would be really nice if we had a lightweight way to gauge the
> "idleness" of a cpu, and whether we expect it to be idle again soon.
>
The best way to do this is probably to embed the hint in a structure we
already touch, so that checking it is quick; otherwise I am curious
whether a macro for the check would prove useful. Idleness seems to
need to be accounted for at scheduling time. To keep this lightweight
enough that we avoid doing the work during a context switch, the
challenge is making the reference counting cheap enough that doing it
in the current scheduling functions does not outweigh the benefit.
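Something along these lines is what I have in mind. A rough sketch
only, and every name in it is hypothetical rather than an existing
kernel field:

struct rq {
	/* ... existing fields ... */
	u64	idle_stamp_hint;	/* hypothetical: sched_clock() at idle entry */
};

/* hypothetical macro: cheap "was this cpu idle very recently" test */
#define cpu_recently_idle(rq)						\
	((rq)->idle_stamp_hint &&					\
	 sched_clock() - (rq)->idle_stamp_hint < 500000ULL /* ~500us */)

Updating the stamp at idle entry would be nearly free; the open
question is whether a slightly stale hint actually beats the race we
have now in select_idle_sibling().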
> Failing that, could we just force the task onto 'target' when it makes
> sense and skip the idle search (and the race) altogether?
>
Doesn't this possibly cause a context switch, or even an expensive
cross-CPU migration on certain architectures? It seems we need to add
reference counting or some other tracking of idle CPUs somewhere.
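For example, something along these lines; a hypothetical sketch that
ignores the locking and update races it would need to handle:

/* mask of cpus that were idle as of their last idle entry/exit */
static cpumask_t idle_cpus_mask;

static inline void mark_cpu_idle(int cpu)
{
	cpumask_set_cpu(cpu, &idle_cpus_mask);	/* called on idle entry */
}

static inline void mark_cpu_busy(int cpu)
{
	cpumask_clear_cpu(cpu, &idle_cpus_mask);	/* called on idle exit */
}

/* then in select_idle_sibling(), test a bit instead of rescanning: */
i = cpumask_first_and(&idle_cpus_mask, sched_group_cpus(sg));
if (i < nr_cpu_ids)
	return i;

That keeps the cost on the idle entry/exit path rather than on the
wakeup path.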
Bastien