Re: Scheduler(?) regression from 2.6.22 to 2.6.24 for short-livedthreads
From: Olof Johansson
Date: Sun Feb 10 2008 - 01:59:28 EST
On Sun, Feb 10, 2008 at 07:15:58AM +0100, Willy Tarreau wrote:
> On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
> > 40M:
> > 2.6.22 time 94315 ms
> > 2.6.23 time 107930 ms
> > 2.6.24 time 113291 ms
> > 2.6.24-git19 time 110360 ms
> >
> > So with more work per thread, the differences become less but they're
> > still there. At the 40M loop, with 500 threads it's quite a bit of
> > runtime per thread.
>
> No, it's really nothing. I had to push the loop to 1 billion to make the load
> noticeable. You don't have 500 threads, you have 2 threads and that load is
> repeated 500 times. And if we look at the numbers, let's take the worst one :
> > 40M:
> > 2.6.24 time 113291 ms
> 113291/500 = 227 microseconds/loop. This is still very low compared to the
> smallest timeslice you would have (1 ms at HZ=1000).
>
> So your threads are still completing *before* the scheduler has to preempt
> them.
Hmm? I get that to be 227ms per loop, which is way more than a full
timeslice. Running the program took in the range of 2 minutes, so it's
110000 milliseconds, not microseconds.
> > It seems generally unfortunate that it takes longer for a new thread to
> > move over to the second cpu even when the first is busy with the original
> > thread. I can certainly see cases where this causes suboptimal overall
> > system behaviour.
>
> In fact, I don't think it takes longer, I think it does not do it at their
> creation, but will do it immediately after the first slice is consumed. This
> would explain the important differences here. I don't know how we could ensure
> that the new thread is created on the second CPU from the start, though.
The math doesn't add up for me. Even if it rebalanced at the end of
the first slice (i.e. after 1ms), that would be a 1ms penalty per
iteration. With 500 threads that'd be a total penalty of 500ms.
> I tried inserting a sched_yield() at the top of the busy loop (1M loops).
> By default, it did not change a thing. Then I simply set sched_compat_yield
> to 1, and the two threads then ran simultaneously with a stable low time
> (2700 ms instead of 10-12 seconds).
>
> Doing so with 10k loops (initial test) shows times in the range 240-300 ms
> only instead of 2200-6500 ms.
Right, likely because the long-running cases got stuck at the busy loop
at the end, which would end up aborting quicker if the other thread got
scheduled for just a bit. It was a mistake to post that variant of the
testcase, it's not as relevant and doesn't mimic the original workload I
was trying to mimic as well as if the first loop was made larger.
> Ingo, would it be possible (and wise) to ensure that a new thread being
> created gets immediately rebalanced in order to emulate what is done here
> with sched_compat_yield=1 and sched_yield() in both threads just after the
> thread creation ? I don't expect any performance difference doing this,
> but maybe some shell scripts reliying on short-lived pipes would get faster
> on SMP.
There's always the tradeoff of losing cache warmth whenever a thread is
moved, so I'm not sure if it's a good idea to always migrate it at
creation time. It's not a simple problem, really.
> > I agree that the testcase is highly artificial. Unfortunately, it's
> > not uncommon to see these kind of weird testcases from customers tring
> > to evaluate new hardware. :( They tend to be pared-down versions of
> > whatever their real workload is (the real workload is doing things more
> > appropriately, but the smaller version is used for testing). I was lucky
> > enough to get source snippets to base a standalone reproduction case on
> > for this, normally we wouldn't even get copies of their binaries.
>
> I'm well aware of that. What's important is to be able to explain what is
> causing the difference and why the test case does not represent anything
> related to performance. Maybe the code author wanted to get 500 parallel
> threads and got his code wrong ?
I believe it started out as a simple attempt to parallelize a workload
that sliced the problem too low, instead of slicing it in larger chunks
and have each thread do more work at a time. It did well on 2.6.22 with
almost a 2x speedup, but did worse than the single-treaded testcase on a
2.6.24 kernel.
So yes, it can clearly be handled through explanations and education
and fixen the broken testcase, but I'm still not sure the new behaviour
is desired.
-Olof
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/