Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds

From: Jan Stancek
Date: Fri Jan 29 2016 - 05:35:12 EST

----- Original Message -----
> From: "Peter Zijlstra" <peterz@xxxxxxxxxxxxx>
> To: "Jan Stancek" <jstancek@xxxxxxxxxx>
> Cc: "alex shi" <alex.shi@xxxxxxxxx>, "guz fnst" <guz.fnst@xxxxxxxxxxxxxx>, mingo@xxxxxxxxxx, jolsa@xxxxxxxxxx,
> riel@xxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx
> Sent: Friday, 29 January, 2016 11:15:22 AM
> Subject: Re: [BUG] scheduler doesn't balance thread to idle cpu for 3 seconds
>
> On Thu, Jan 28, 2016 at 01:43:13PM -0500, Jan Stancek wrote:
> > > How long should I have to wait for a fail?
> >
> > It's about 1000-2000 iterations for me, which I think you covered
> > by now in those 2 hours.
>
> So I've been running:
>
> while ! ./pthread_cond_wait_1 ; do sleep 1; done
>
> overnight on the machine, and have yet to hit a wobbly -- that is, its
> still running.

I have seen a similar result.

Then, instead of turning CPUs off, I spawned more low-priority threads, so
that their number scales with the number of CPUs on the system:

@@ -213,10 +213,14 @@
 		printf(ERROR_PREFIX "pthread_attr_setschedparam\n");
 		exit(PTS_UNRESOLVED);
 	}
-	rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
-	if (rc != 0) {
-		printf(ERROR_PREFIX "pthread_create\n");
-		exit(PTS_UNRESOLVED);
+
+	int i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
+	for (i = 0; i < ncpus - 1; i++) {
+		rc = pthread_create(&low_id, &low_attr, low_priority_thread, NULL);
+		if (rc != 0) {
+			printf(ERROR_PREFIX "pthread_create\n");
+			exit(PTS_UNRESOLVED);
+		}

and let this run on 3 bare-metal x86 systems overnight (v4.5-rc1). It
failed on 2 systems (12 and 24 CPUs) in about 1 of every 1000 runs; it never
failed on the 3rd one (4 CPUs).
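
For anyone who wants to try the "one busy thread per CPU, minus one" idea
outside the test suite, here is a minimal standalone sketch. It is not the
test's code: spin_thread(), spinner_count(), start_spinners() and
stop_spinners() are hypothetical names, the threads use default attributes
rather than the test's low_attr, and each thread id is kept in an array
(the patch above reuses a single low_id) so the threads can be joined.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the test's low_priority_thread():
 * burn CPU until asked to stop. */
static atomic_int stop_spinning;

static void *spin_thread(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop_spinning))
		;
	return NULL;
}

/* One spinner per online CPU, minus one; 0 if sysconf() fails
 * or only a single CPU is online. */
static long spinner_count(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);

	return ncpus > 1 ? ncpus - 1 : 0;
}

/* Start up to n spinners with default attributes (an assumption;
 * the test uses its own low_attr). Returns how many were started,
 * which is less than n if pthread_create() failed. */
static long start_spinners(pthread_t *ids, long n)
{
	long i;

	assert(ids != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (pthread_create(&ids[i], NULL, spin_thread, NULL) != 0)
			break;
	}
	return i;
}

/* Ask the spinners to exit and wait for the first n of them. */
static void stop_spinners(pthread_t *ids, long n)
{
	long i;

	atomic_store(&stop_spinning, 1);
	for (i = 0; i < n; i++)
		pthread_join(ids[i], NULL);
}
```

A caller would allocate spinner_count() pthread_t slots, call
start_spinners(), run the workload under test, then call stop_spinners()
with the count that start_spinners() returned.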

>
> Also note that I don't think failing this test is a bug per se.
> Undesirable maybe, but within spec, since SIGALRM is process wide, so it
> being delivered to the SCHED_OTHER task is accepted, and SCHED_OTHER has
> no timeliness guarantees.
>
> That said; if I could reliably reproduce I'd have a go at fixing this, I
> suspect there's a 'fun' problem at the bottom of this.

Thanks for trying. I'll see if I can find a more reliable way to reproduce it.

Regards,
Jan