Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

From: david
Date: Thu Sep 27 2012 - 12:49:00 EST


On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
>
>> For example, it starts with the maximum target scheduling domain, and
>> works its way in over the scheduling groups within that domain. What
>> the f*ck is the logic of that kind of crazy thing? It never makes
>> sense to look at the biggest domain first.
>
> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
> only has the LLC domain; if you want more we'll need to fix that. For
> now it's a fixed:
>
>   SMT
>   MC (llc)
>   CPU (package/machine-for-!numa)
>   NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.

I think you are being too smart for your own good. You don't know whether it's best to move them further apart or not; I'm arguing that you can't know.

So I'm saying: do the simple thing.

If a core is overloaded, move the task to an idle core that is as close as possible to the core you started from (sharing as much as possible).

If this does not overload the shared resource, you did the right thing.

If this does overload the shared resource, you are still no worse off than leaving the task on the original core (which shared everything, so you've reduced the sharing a little bit).

On the next balancing cycle you work to move something again, and since both the original and the new core show as overloaded (due to contention on the shared resources), you move something to another core that shares just a little less.
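The "as close as possible" selection above can be sketched in a few lines. This is an illustrative toy, not kernel code: the 2-package / 2-core / 2-thread topology, the CPU numbering, and both helper names are assumptions made up for the example.

```python
# Toy sketch: pick the idle CPU "nearest" to an overloaded one, where
# nearer means sharing more of the topology (SMT core, then LLC, then
# package). Topology is an assumed 8-CPU layout: CPUs 2n and 2n+1 are
# SMT siblings, CPUs 0-3 and 4-7 each share an LLC/package.

def topo_level(a, b):
    """Smallest domain containing both CPUs:
    0 = same CPU, 1 = SMT siblings, 2 = same LLC, 3 = different package."""
    if a == b:
        return 0
    if a // 2 == b // 2:   # SMT siblings share a core
        return 1
    if a // 4 == b // 4:   # cores in the same package share the LLC
        return 2
    return 3

def nearest_idle_cpu(src, idle_cpus):
    """Return the idle CPU closest to src, or None if none are idle."""
    return min(idle_cpus, key=lambda c: topo_level(src, c), default=None)
```

With CPU 0 overloaded and CPUs 1, 2, and 4 idle, this picks the SMT sibling (CPU 1) first; if 0 and 1 then both show as overloaded on the next cycle, the same rule escalates to CPU 2 (same LLC), and after that to CPU 4 (other package).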

Yes, this means it may take more balancing cycles to move things far enough apart to reduce the sharing enough to avoid overloading the shared resource, but I don't see any way you can possibly guess ahead of time whether two processes are going to overload a shared resource.

It may be that simply moving to the HT sibling (no longer contending for registers) is enough to let both processes fly; or the overload may be in a shared floating-point unit or the L1 cache, so you need to move further away; or the contention may be in the L2 cache, or the L3 cache, or the memory interface (NUMA).

Without being able to predict the future, you don't know how far away you need to move the tasks for them to operate at the optimal level. All you do know is that the shorter the move, the less expensive it is. So make each move as short as possible, and measure again to see whether that was enough.
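The escalation this describes can be modeled abstractly: try the cheapest separation first, and if measurement shows the shared resource at that level is still saturated, the next cycle separates one level further. The level names, the `settle` helper, and the `contended_at` oracle (standing in for the scheduler's load measurement) are all assumptions for this toy model.

```python
# Toy model of "move the shortest distance, measure, repeat": walk the
# separation levels from cheapest to most expensive, one per balancing
# cycle, stopping at the first level where contention disappears.

LEVELS = ["SMT", "MC", "package", "NUMA"]

def settle(contended_at):
    """Return the first separation level at which the task pair stops
    contending, or None if they contend at every level."""
    for level in LEVELS:
        if not contended_at(level):
            return level
    return None  # contended everywhere; no amount of separation helps
```

For a pair that only contends when timesharing SMT resources, `settle` stops at the MC level after one extra cycle; a pair that also thrashes the shared LLC keeps moving until it reaches separate packages. No prediction is needed, only re-measurement after each cheap move.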

For some workloads, it will be. For many workloads the least expensive move won't be.

The question is whether doing multiple cheap moves (each requiring only a simple check) ends up being a win compared to doing better guessing about when a more expensive move is worth it.

Given how chips change from year to year, I don't see how the 'better guessing' is going to survive more than a couple of chip releases in any case.

David Lang
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/