Re: 2.6.8-rc2-mm2 performance improvements (scheduler?)

From: Con Kolivas
Date: Mon Aug 09 2004 - 23:40:33 EST


Andrew Theurer writes:

On Monday 09 August 2004 22:40, you wrote:
Rick showed me schedstats graphs of the two ... it seems to have lower
latency, does less rebalancing, fewer pull_tasks, etc, etc. Everything
looks better ... he'll send them out soon, I think (hint, hint).

Okay, they're done. Here's the URL of the graphs:

http://eaglet.rain.com/rick/linux/staircase/scase-vs-noscase.html

General summary: as Martin reported, we're seeing improvements in a number
of areas, at least with sdet. The graphs as listed there represent stats
from four separate sdet runs run sequentially with an increasing load.
(We're trying to see if we can get the information from each run
separately, rather than the aggregate -- one of the hazards of an automated
test harness :)

What's quite interesting is that there is a very noticeable surge in load_balance with staircase in the early stage of the test, yet there appear to be -no- direct policy changes to load balancing at all in Con's patch (or at least I didn't notice any - please tell me if you did!). You can see it in busy load_balance, sched_balance_exec, and pull_task. The runslice and latency stats confirm this: no-staircase does not balance early on, and the tasks suffer, waiting on a cpu that is already loaded up. I do not have an explanation for this; perhaps it has something to do with eliminating the expired queue.
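
(For reference, these counters come straight from /proc/schedstat. If anyone wants to watch them without the full harness, a rough sampler like the one below would do - untested, and it only diffs the per-cpu lines, since the field meanings vary between schedstats versions:)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define MAXCPUS   16
#define MAXFIELDS 40

/* Parse the "cpuN ..." lines of /proc/schedstat into raw counters.
 * Returns the number of cpu lines seen, or -1 on error. */
static int read_schedstat(unsigned long long v[MAXCPUS][MAXFIELDS],
                          int nfields[MAXCPUS])
{
        FILE *f = fopen("/proc/schedstat", "r");
        char line[1024];
        int cpu = 0;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f) && cpu < MAXCPUS) {
                char *p = line + 3;
                int n = 0;

                if (strncmp(line, "cpu", 3))
                        continue;       /* skip version/timestamp lines */
                while (n < MAXFIELDS) {
                        char *end;
                        unsigned long long x = strtoull(p, &end, 10);

                        if (end == p)
                                break;
                        v[cpu][n++] = x;
                        p = end;
                }
                nfields[cpu] = n;
                cpu++;
        }
        fclose(f);
        return cpu;
}

int main(void)
{
        static unsigned long long a[MAXCPUS][MAXFIELDS], b[MAXCPUS][MAXFIELDS];
        int nf[MAXCPUS];
        int ncpus, cpu, i;

        ncpus = read_schedstat(a, nf);
        if (ncpus <= 0)
                return 1;
        sleep(10);                      /* sample interval */
        read_schedstat(b, nf);

        for (cpu = 0; cpu < ncpus; cpu++) {
                printf("cpu%d:", cpu);
                /* field 0 is the cpu number itself on these lines */
                for (i = 1; i < nf[cpu]; i++)
                        printf(" %llu", b[cpu][i] - a[cpu][i]);
                printf("\n");
        }
        return 0;
}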

To be honest I have no idea why that's the case. One of the first things I did was eliminate the expired array, and in my testing (up to 8x at OSDL) I did not notice that this in and of itself made any big difference - of course, that could be because the removal of the expired array was not done in a way that entitled starved tasks to run within reasonable timeframes.
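
To illustrate for anyone who hasn't read sched.c: mainline keeps two priority arrays per runqueue and swaps them when the active one drains, so a task pushed onto the expired array can wait a long while under load. A toy userspace model of just that swap - obviously nothing like the real code, only the shape of it:

#include <stdio.h>

/* Toy model of the mainline two-array runqueue. Each "tick" expires one
 * task's timeslice; expired tasks only become runnable again once the
 * whole active array has drained and the arrays are swapped. */
struct prio_array {
        int nr_active;
};

struct runqueue {
        struct prio_array arrays[2];
        struct prio_array *active, *expired;
};

static void timeslice_expired(struct runqueue *rq)
{
        struct prio_array *tmp;

        rq->active->nr_active--;
        rq->expired->nr_active++;

        if (!rq->active->nr_active) {
                tmp = rq->active;       /* the array switch */
                rq->active = rq->expired;
                rq->expired = tmp;
        }
}

int main(void)
{
        struct runqueue rq = { .arrays = { { 5 }, { 0 } } };
        int t;

        rq.active = &rq.arrays[0];
        rq.expired = &rq.arrays[1];

        /* a task expiring at tick 1 waits until tick 5 to run again */
        for (t = 1; t <= 10; t++) {
                timeslice_expired(&rq);
                printf("tick %2d: active=%d expired=%d\n",
                       t, rq.active->nr_active, rq.expired->nr_active);
        }
        return 0;
}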

It would be nice to have per-cpu runqueue lengths logged to see how this plays out - do the cpus on staircase reach a runqueue length close to nr_running()/nr_online_cpus sooner than on no-staircase?

/me looks the schedstats people's way
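
It'd be trivial to hack in, too. An untested sketch, assuming the 2.6 runqueue layout - it has to live in kernel/sched.c since cpu_rq() is local there. Called periodically (say from scheduler_tick()) it would show how quickly each cpu converges on the even split:

static void log_rq_lengths(void)
{
        unsigned long even_split = nr_running() / num_online_cpus();
        int cpu;

        for_each_online_cpu(cpu)
                printk(KERN_DEBUG "cpu%d: nr_running=%lu even_split=%lu\n",
                       cpu, cpu_rq(cpu)->nr_running, even_split);
}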

Also, one big change apparent to me: the elimination of TIMESLICE_GRANULARITY.

Ah well, I tuned the timeslice granularity and I can tell you it isn't quite what most people think. With more than 4 cpus the granularity is effectively _disabled_. So in fact the timeslices are shorter in staircase (in the normal interactive=1, compute=0 mode, which is how Martin would have tested it), not longer. But this is not the reason either, since in "compute" mode they are ten times longer and that improves throughput further.
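
To put numbers on that: mainline's TIMESLICE_GRANULARITY scales with num_online_cpus(), so the mid-timeslice requeue points vanish as the cpu count grows. A toy calculation, assuming the 10ms granularity and 100ms default timeslice from 2.6 (the real macro also scales with the task's interactivity bonus, which makes it larger still for most tasks):

#include <stdio.h>

int main(void)
{
        const int granularity_ms = 10;  /* assumed GRANULARITY */
        const int timeslice_ms = 100;   /* assumed default timeslice */
        int ncpus;

        for (ncpus = 1; ncpus <= 8; ncpus *= 2) {
                /* the SMP scaling is the part that matters here */
                int gran = granularity_ms * ncpus;
                int requeues = timeslice_ms / gran - 1;

                if (requeues < 0)
                        requeues = 0;
                printf("%d cpus: granularity %3dms -> %d mid-slice requeue points\n",
                       ncpus, gran, requeues);
        }
        return 0;
}

That prints 9 requeue points on 1 cpu, 1 on 4 cpus, and 0 on 8 cpus - which is what I mean by effectively disabled.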

Do you have cswitch data? I would not be surprised if it's a lot higher on no-staircase, and the cache is thrashed a lot more. This may be something you can pull out of the no-staircase kernel quite easily.

Well, from what I got on the 8x, the optimal load (-j set to 4x the number of cpus) and maximal load (-j) kernbench runs give surprisingly similar context switch rates. It's only when I enable compute mode that the context switches drop compared to default staircase mode and mainline. You'd have to ask Martin and Rick about what they got.
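
The raw numbers are easy to grab, by the way - the "ctxt" line in /proc/stat is a global context switch counter, so something like this run alongside the benchmark gives the rate (untested sketch):

#include <stdio.h>
#include <unistd.h>

/* Read the global context switch count from the "ctxt" line of /proc/stat. */
static unsigned long long read_ctxt(void)
{
        char line[256];
        unsigned long long v = 0;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return 0;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "ctxt %llu", &v) == 1)
                        break;
        fclose(f);
        return v;
}

int main(void)
{
        const int interval = 10;        /* seconds; run during the benchmark */
        unsigned long long before = read_ctxt();

        sleep(interval);
        printf("context switches/sec: %llu\n",
               (read_ctxt() - before) / interval);
        return 0;
}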

-Andrew Theurer

Cheers,
Con
