Also, one big change that is apparent to me is the elimination of
TIMESLICE_GRANULARITY.
Ah well, I tuned the timeslice granularity, and I can tell you it isn't quite
what most people think. Once you get to more than 4 CPUs, the granularity is
effectively _disabled_. So in fact the timeslices are shorter in staircase
(in the normal interactive=1, compute=0 mode, which is how Martin would have
tested it), not longer. But this is not the reason either, since in "compute"
mode they are ten times longer and that improves throughput further still.
Interesting; I forgot about the "* nr_cpus" that was in the granularity calculation. That does make me wonder whether the timeslices you are calculating could use something similar, but more appropriate.
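To make that concrete, here is a rough user-space sketch of the scaling behaviour only; the constants are made up (they are not the kernel's actual GRANULARITY or timeslice values), but they show how a granularity multiplied by the CPU count overtakes the timeslice on larger machines, at which point the intra-slice requeue simply never fires.

/*
 * Illustrative only: a requeue granularity that is multiplied by the
 * number of CPUs grows past the timeslice on bigger SMP boxes, so the
 * intra-slice round-robin never triggers. Constants are assumptions,
 * not the kernel's real values.
 */
#include <stdio.h>

#define BASE_GRANULARITY_MS 20   /* assumed base granularity */
#define TIMESLICE_MS        100  /* assumed default timeslice */

int main(void)
{
        for (int nr_cpus = 1; nr_cpus <= 16; nr_cpus *= 2) {
                int granularity = BASE_GRANULARITY_MS * nr_cpus;
                /* number of requeue points that fit inside one slice */
                int requeues = (TIMESLICE_MS - 1) / granularity;

                printf("%2d cpus: granularity %3d ms -> %d intra-slice requeue(s)%s\n",
                       nr_cpus, granularity, requeues,
                       requeues ? "" : "  (effectively disabled)");
        }
        return 0;
}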
Since the number of runnable tasks on a CPU should play a part in latency (the more tasks, the longer the potential latency), I wonder if the timeslice would benefit from a modifier like " / task_cpu(p)->nr_running ". With that, the base timeslice could start out quite a bit larger for better cache warmth, and as more tasks are added to that CPU the timeslices would get smaller, so an acceptable latency is preserved.
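Just to sketch that modifier (plain user-space C with made-up base and minimum values, not a patch against any actual timeslice code): start from a larger base slice, divide by the runnable count on that CPU, and clamp to a floor.

/*
 * Sketch of the "/ nr_running" idea: a larger base timeslice for cache
 * warmth, scaled down by the number of runnable tasks on the CPU and
 * clamped to a floor. Constants are assumptions for illustration.
 */
#include <stdio.h>

#define BASE_TIMESLICE_MS 200  /* assumed, generous for cache warmth */
#define MIN_TIMESLICE_MS   10  /* assumed floor so slices never vanish */

static unsigned int scaled_timeslice(unsigned int nr_running)
{
        unsigned int ts;

        if (nr_running == 0)
                nr_running = 1;  /* avoid dividing by zero */
        ts = BASE_TIMESLICE_MS / nr_running;
        return ts < MIN_TIMESLICE_MS ? MIN_TIMESLICE_MS : ts;
}

int main(void)
{
        for (unsigned int n = 1; n <= 32; n *= 2)
                printf("%2u runnable -> %3u ms slice, ~%u ms per full rotation\n",
                       n, scaled_timeslice(n), n * scaled_timeslice(n));
        return 0;
}

Until the floor kicks in, one full rotation of the runqueue costs roughly the base timeslice, which is the latency bound the modifier is meant to preserve.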
Do you have cswitch data? I would not be surprised if it's a lot higher
on -no-staircase, and cache is thrashed a lot more. This may be
something you can pull out of the -no-staircase kernel quite easily.
Well, from what I got on the 8x, the optimal load (-j x4cpus) and the maximal
load (-j) on kernbench give surprisingly similar context switch rates. It's
only when I enable compute mode that the context switches drop compared to
default staircase mode and mainline. You'd have to ask Martin and Rick what
they got.
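For anyone who wants to pull the same numbers, one low-effort way on any kernel is the system-wide "ctxt" counter in /proc/stat, sampled before and after the run; a minimal sketch (the benchmark invocation itself is left as a placeholder):

/*
 * Reads the total context-switch count (the "ctxt" line) from
 * /proc/stat before and after a run and prints the difference.
 * Self-contained user-space C; independent of the scheduler in use.
 */
#include <stdio.h>

/* Return context switches since boot, or -1 on error. */
static long long read_ctxt(void)
{
        char line[256];
        long long ctxt = -1;
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "ctxt %lld", &ctxt) == 1)
                        break;
        fclose(f);
        return ctxt;
}

int main(void)
{
        long long before = read_ctxt();

        /* ... run the benchmark here (placeholder) ... */

        long long after = read_ctxt();
        if (before >= 0 && after >= 0)
                printf("context switches during run: %lld\n", after - before);
        return 0;
}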
OK, thanks!
-Andrew Theurer