Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
From: david
Date: Thu Sep 27 2012 - 12:49:00 EST
On Thu, 27 Sep 2012, Peter Zijlstra wrote:
On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first.
That's about SMT, it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.
Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain, if you want more we'll need to fix that. For
now it's a fixed:
SMT
MC (llc)
CPU (package/machine-for-!numa)
NUMA
So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.
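To make that fixed hierarchy concrete, here is a rough, self-contained userspace model -- not the kernel's real sched_domain structures or for_each_domain() macro, just an illustration of walking from the innermost (SMT) level outward, one level at a time:

#include <stdio.h>

struct domain {
	const char *name;	/* SMT, MC, CPU, NUMA */
	struct domain *parent;	/* next-larger domain, NULL at the top */
};

int main(void)
{
	struct domain numa = { "NUMA", NULL };
	struct domain cpu  = { "CPU (package)", &numa };
	struct domain mc   = { "MC (LLC)", &cpu };
	struct domain smt  = { "SMT", &mc };

	/* Walk innermost -> outermost, the way a balance pass would
	 * widen its search one level at a time. */
	for (struct domain *d = &smt; d; d = d->parent)
		printf("searching for an idle CPU at the %s level\n", d->name);

	return 0;
}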
I think you are being too smart for your own good. You don't know if it's
best to move them further apart or not. I'm arguing that you can't know,
so I'm saying do the simple thing.
If a core is overloaded, move a task to an idle core that is as close as
possible to the core you start from (sharing as much as possible).
If this does not overload the shared resource, you did the right thing.
If this does overload the shared resource, it's still no worse than
leaving it on the original core (which shared everything, so you've
reduced the sharing a little bit).
The next balancing cycle you then work to move something again, and since
both the original and new core show as overloaded (due to the contention
on the shared resources), you move something to another core that shares
just a little less.
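A minimal sketch of that "always make the shortest move, then re-measure" policy follows. Everything in it is invented for illustration: four cores whose pairwise "distance" stands in for how much they share (SMT sibling < same LLC < other package), a plain per-core task count as the stand-in for whatever "overloaded" measurement would really be used, and one move per balancing cycle.

#include <stdio.h>

#define NCORES 4

/* distance[a][b]: 1 = SMT sibling, 2 = same LLC, 3 = other package */
static const int distance[NCORES][NCORES] = {
	{ 0, 1, 2, 3 },
	{ 1, 0, 2, 3 },
	{ 2, 2, 0, 3 },
	{ 3, 3, 3, 0 },
};

static int load[NCORES] = { 3, 0, 0, 0 };	/* three tasks piled on core 0 */

/* Nearest idle core to 'src', or -1 if none is idle. */
static int nearest_idle(int src)
{
	int best = -1;

	for (int c = 0; c < NCORES; c++) {
		if (c == src || load[c] != 0)
			continue;
		if (best < 0 || distance[src][c] < distance[src][best])
			best = c;
	}
	return best;
}

int main(void)
{
	for (int cycle = 1; ; cycle++) {
		/* Find the most loaded core; stop when nobody is overloaded. */
		int src = 0;
		for (int c = 1; c < NCORES; c++)
			if (load[c] > load[src])
				src = c;
		if (load[src] <= 1)
			break;

		int dst = nearest_idle(src);
		if (dst < 0)
			break;		/* nowhere idle to move to */

		load[src]--;
		load[dst]++;
		printf("cycle %d: moved a task from core %d to core %d (distance %d)\n",
		       cycle, src, dst, distance[src][dst]);
	}
	return 0;
}

Running this, the first cycle moves a task to the SMT sibling (distance 1); the source core is still overloaded, so the second cycle moves another task one level further out (distance 2), which is exactly the "each move as short as possible, widen only if the last move wasn't enough" behaviour described above.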
Yes, this means that it may take more balancing cycles to move things far
enough apart to reduce the sharing enough to avoid overloading the shared
resource, but I don't see any way that you can possibly guess ahead of
time whether two processes are going to overload the shared resource.
It may be that simply moving to an HT core (and no longer contending for
registers) is enough to let both processes fly, or it may be that the
overload is in a shared floating point unit or L1 cache and you need to
move further away, or you may find the contention is in the L2 cache and
need to move further away still, or it could be in the L3 cache, or it
could be in the memory interface (NUMA).
Without being able to predict the future, you don't know how far away you
need to move the tasks to have them operate at the optimal level. All that
you do know is that the shorter the move, the less expensive the move. So
make each move as short as possible, and measure again to see if that
was enough.
For some workloads it will be. For many workloads, the least expensive
move won't be enough.
The question is whether doing multiple cheap moves (requiring only simple
checking for each move) ends up being a win compared to doing better
guessing about when the more expensive moves are worth it.
Given how chips change from year to year, I don't see how the 'better
guessing' is going to survive more than a couple of chip releases in any
case.
David Lang