Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected

From: Linus Torvalds
Date: Tue Sep 25 2012 - 15:08:32 EST


On Tue, Sep 25, 2012 at 11:42 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>>
>> Is this literally just removing it entirely?
>
> Basically yes:

Ok, so you make it just always select 'target'. Fine. I wondered if
you just removed the calling logic entirely.
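
To make the distinction concrete, here is a deliberately simplified toy
model of the two behaviours. It is not the kernel code: every name in it
(toy_cpu_idle, TOY_LLC_SIZE, the toy_select_* functions) is hypothetical,
and it assumes a flat machine where CPUs are grouped into fixed-size
cache domains.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS  8
#define TOY_LLC_SIZE 4	/* assumed: CPUs per shared last-level cache */

/* Hypothetical toy state: true if the CPU currently has nothing to run. */
static bool toy_cpu_idle[TOY_NR_CPUS];

/* "Always select target": no search at all, the wakeup target is kept. */
static int toy_select_always_target(int target)
{
	assert(target >= 0 && target < TOY_NR_CPUS);
	return target;
}

/*
 * Simplified idle search: keep target if it is idle, otherwise take the
 * first idle CPU in target's cache domain, otherwise fall back to target.
 */
static int toy_select_idle_in_llc(int target)
{
	int first, cpu;

	assert(target >= 0 && target < TOY_NR_CPUS);

	if (toy_cpu_idle[target])
		return target;

	first = (target / TOY_LLC_SIZE) * TOY_LLC_SIZE;
	for (cpu = first; cpu < first + TOY_LLC_SIZE; cpu++) {
		if (toy_cpu_idle[cpu])
			return cpu;
	}

	return target;
}
```

In this sketch the first variant never moves the task, while the second
moves it whenever target is busy and a CPU sharing its cache is idle.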

>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case.
>
> Let me run pgbench tomorrow (I had run it only on an older family 0x10
> single-node box) on Bulldozer to check that out. And we haven't started
> the multi-node measurements at all.

Ack, this clearly needs much more testing. That said, I really would
*love* to just get rid of the function entirely.

>> Sad, because it really would be lovely to just remove that thing ;)
>
> Right, so why did we need it at all, in the first place? There has to be
> some reason for it.

I'm not entirely convinced.

Looking at the history of that thing, it's long and tortuous, and has
a few commits completely fixing the "logic" of it (eg see commit
99bd5e2f245d).

To the point where I don't think it necessarily even matches what the
original cause for it was. So it's *possible* that we have a case of
historical code that may have improved performance originally on at
least some machines, but (a) it has been changed due to it being
broken and (b) CPUs have changed too, so it may well be that it
simply doesn't help any more.

And we've had problems with this function before. See for example:
- 4dcfe1025b51: sched: Avoid SMT siblings in select_idle_sibling() if possible
- 518cd6234178: sched: Only queue remote wakeups when crossing cache boundaries

so we've basically had odd special-case "tuning" of this function from
the original. I do not think that there is any solid reason to believe
that it does what it used to do, or that what it used to do makes
sense any more.

It's entirely possible that "prev_cpu" basically ends up being the
better choice for spreading things out.

That said, my *guess* is that when you run pgbench, you'll see the
same regression that we saw due to Mike's patch too. It simply looks
like tbench wants to have minimal cpu selection and avoid moving
things around, while pgbench probably wants to spread out maximally.

Linus