[ Sunday, January 30, 2000 ] pwalt@onthenet.com.au wrote:
> My tests on one of those dual celerons show that in the extreme case of a set
> of processes that use all the available cache Linux is still far too eager to
> swap processes across CPU's.
>
> It's relatively easy to get 2X speedups in the extreme cases.
This is excellent news as it means the tunable SMP scheduling should
provide great benefit. It also means Andrea's idea of self-tuning
scheduling would also appear to be necessary in the long-term as the
current process mix should determine where ont he flexibility <-> performance
line we want to scheduler to be living at for any particular moment.
> What worries me is that this should be much worse with more
> processors/larger cache. And it gets worse in a non-linear fashion. When
> you cause a cache reload it not only slows the processor thats running
> from much slower ram for a while, but it's also using a lot of bus
> bandwidth. (Which slows the other processors).
Exactly. To be honest, I'm impressed you could get 2X speedups, but not
overly surprised. What frightens me (as you mention) is that if two
128K L2 caches can generate this large of a gap, then the 2MB (and
up to 8MB as the Athlon @ 0.18 will have) caches exacerbate this by
a *huge* amount.
> I'll admit that the attached code is an extreme case, but it's simillar to a
> lot of signal/image processing operations.
It's not all that extreme from a cache behavior standpoint. Loop blocking
is a technique taught in most good CS or comp. arch. classes for exactly
the huge cache benefits it can provide, and code that utilizes it or
other cache-efficient techniques (DSP, image, other scientific) will
need a scheduler to be much more sensitive to their cache behavior than
something like an office suite.
> I was expecting a few % change from fiddling with scheduling and running this
> code, but it's a lot more dramatic than that in practice.
Great, glad to hear it :)
Trying to make the self-tuning part might be difficult, though, as
even getting the L2 miss counts from P6 MSR's during a slice doesn't
necessarily mean the code was cache-inefficient or could have done better
on the processor it started on. Andrea may already have some ideas for
this, though, and we can certainly do things by-hand for now and add in
some more intelligence later. After all, the most performance-sensitive
processes that we care about are going to tend to be long-running so
tuning by hand shouldn't be too painful, and is still more flexible than
the current method :)
James
-- Miscellaneous Engineer --- IBM Netfinity Performance Development- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/
This archive was generated by hypermail 2b29 : Sun Jan 23 2000 - 21:00:25 EST