Re: [PATCH 0/18] sched: simplified fork, enable load average into LB and power awareness scheduling

From: Arjan van de Ven
Date: Tue Dec 11 2012 - 11:40:32 EST


On 12/11/2012 8:13 AM, Borislav Petkov wrote:
> On Tue, Dec 11, 2012 at 08:03:01AM -0800, Arjan van de Ven wrote:
>> On 12/11/2012 7:48 AM, Borislav Petkov wrote:
>>> On Tue, Dec 11, 2012 at 08:10:20PM +0800, Alex Shi wrote:
>>>> Another testing of parallel compress with pigz on Linus' git tree.
>>>> results show we get much better performance/power with powersaving and
>>>> balance policy:
>>>>
>>>> testing command:
>>>> #pigz -k -c -p$x -r linux* &> /dev/null
>>>>
>>>> On a NHM EP box
>>>>          powersaving          balance              performance
>>>> x = 4    166.516 /88 68       170.515 /82 71       165.283 /103 58
>>>> x = 8    173.654 /61 94       177.693 /60 93       172.31  /76  76
>>>
>>> This looks funny: so "performance" is eating less watts than
>>> "powersaving" and "balance" on NHM. Could it be that the average watts
>>> measurements on NHM are not correct/precise...? On SNB they look as
>>> expected, according to your scheme.
>>
>> well... it's not always beneficial to group or to spread out
>> it depends on cache behavior mostly which is best
>
> Let me try to understand what this means: so "performance" above with
> 8 threads means that those threads are spread out across more than one
> socket, no?
>
> If so, this would mean that you have a smaller amount of tasks on each
> socket, thus the smaller wattage.
>
> The "powersaving" method OTOH fills up the one socket up to the brim,
> thus the slightly higher consumption due to all threads being occupied.
>
> Is that it?

not sure.

by and large, power efficiency is the same as performance efficiency, with some twists.
or, to reword that more clearly:
if you waste performance due to something that runs inefficiently, you're wasting power as well.
now, you might have some hardware effects that can then save you power... but those effects
first need to overcome the waste from the performance inefficiency... and that almost never happens.
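
to put rough numbers on that (figures invented for illustration, not Alex's
measurements): energy is just average watts times seconds. a run of 100
seconds at 80 watts costs 8000 joules; make it 10% slower (110 seconds)
and whatever hardware trick you used has to push average power below
8000 / 110 ~= 72.7 watts, i.e. save more than 9%, just to break even.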

for example, if you have two workloads that each barely fit inside the last level cache...
it's much more efficient to spread these over two sockets... where each has its own full LLC
to use.
If you'd group these together, both would thrash the cache all the time and run inefficiently --> bad for power.
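
to make that concrete, here's a toy userspace sketch (the cpu numbers and
the 20 MB working set are assumptions; size them for the actual box): pin
two threads that each stream over a near-LLC-sized buffer, and compare
same-socket with cross-socket placement:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdlib.h>

#define WSET (20 * 1024 * 1024)	/* assumption: roughly one LLC worth */

static volatile long sink;

static void *worker(void *arg)
{
	int cpu = *(int *)arg;
	cpu_set_t set;
	char *buf = calloc(1, WSET);
	long sum = 0;

	if (!buf)
		return NULL;

	/* pin this thread to the requested cpu */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* stream over the working set, one read per cache line */
	for (int pass = 0; pass < 100; pass++)
		for (size_t i = 0; i < WSET; i += 64)
			sum += buf[i];

	sink = sum;
	free(buf);
	return NULL;
}

int main(void)
{
	/* {0, 1}: two cores sharing an LLC; on a two-socket box try
	 * something like {0, 8} so each thread gets a full LLC */
	int cpus[2] = { 0, 1 };
	pthread_t t[2];

	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, worker, &cpus[i]);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	return 0;
}

time both placements (or watch the LLC miss counts in perf stat) and you
see the thrashing directly.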

now, on the other hand, if you have two threads of a process that share a bunch of data structures,
and you'd spread these over 2 sockets, you end up bouncing data between the two sockets a lot,
running inefficiently --> bad for power.
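
the bouncing is just as easy to demo; another toy sketch (nothing from the
patch set): two threads hammering fields that sit on the same cache line
force that line to ping-pong between the caches, and cross-socket that
gets expensive:

#include <pthread.h>

struct counters {
	long a;
	/* char pad[64];	uncomment: a line each, bouncing mostly gone */
	long b;
};

static struct counters c;

static void *bump_a(void *unused)
{
	for (long i = 0; i < 100000000L; i++)
		__sync_fetch_and_add(&c.a, 1);
	return NULL;
}

static void *bump_b(void *unused)
{
	for (long i = 0; i < 100000000L; i++)
		__sync_fetch_and_add(&c.b, 1);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, bump_a, NULL);
	pthread_create(&t2, NULL, bump_b, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}

pad the fields apart, or keep both threads on one socket, and most of the
cost disappears.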


having said all this, if you have two tasks that don't have such cache effects, the most efficient way
of running things will be on 2 hyperthreading halves... it's very hard to beat the power efficiency of that.
But this assumes the tasks don't compete much for resources at the HT level, and achieve good scaling.
and this still has to compete with "race to halt", because if you're done quicker, you can put the memory
in self-refresh quicker.
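
"race to halt" is easy to put numbers on, too (all the wattages below are
invented for illustration): over a fixed window, sprinting at higher power
and then dropping into deep idle can burn less total energy than crawling
the whole time:

#include <stdio.h>

int main(void)
{
	double window = 100.0;	/* seconds we account energy over */
	double idle_w = 5.0;	/* deep idle, memory in self-refresh */

	/* sprint: 50 W for 60 s, then idle for the rest of the window */
	double sprint = 50.0 * 60.0 + idle_w * (window - 60.0);
	/* crawl: 35 W for the full 100 s */
	double crawl = 35.0 * window;

	printf("sprint: %.0f J\n", sprint);	/* 3200 J */
	printf("crawl:  %.0f J\n", crawl);	/* 3500 J */
	return 0;
}

flip the numbers a bit and the crawl wins instead, which is exactly why
there's no clear-cut rule here.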

none of this stuff is easy for humans or computer programs to determine ahead of time... or sometimes even afterwards.
heck, even for just performance it's really really hard already, never mind adding power.

my personal gut feeling is that we should just optimize this scheduler stuff for performance, and that
we're going to be doing quite well on power already if we achieve that.

