Re: SMP scalability: 8 -> 32 CPUs

Jason Riedy
Mon, 30 Nov 1998 22:17:38 -0500

Oh well. And Chuck Lever writes:
- although, lots of enterprise NT customers want to scale their network
- servers by running them on 4-way and 8-way hardware.

Lots of enterprise NT customers are in for a shock when they realize
their apps weren't written for that many processors, then. Scaling
isn't trivial. They can get better throughput, but not necessarily
better performance on a single task.
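
One common way to make the throughput-versus-single-task point concrete is Amdahl's law, which the post itself does not mention. The sketch below is only an illustration: the 50% parallel fraction is a made-up number, and `throughput_gain` assumes fully independent jobs with no shared bottleneck.

```python
def amdahl_speedup(parallel_fraction: float, n_cpus: int) -> float:
    """Speedup of ONE task when only `parallel_fraction` of it can use extra CPUs."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


def throughput_gain(n_cpus: int) -> int:
    """Idealized gain for many independent jobs, one per CPU (an assumption)."""
    return n_cpus


if __name__ == "__main__":
    # Hypothetical app: half of its work is inherently serial.
    for cpus in (1, 2, 4, 8):
        print(cpus, round(amdahl_speedup(0.5, cpus), 2), throughput_gain(cpus))
```

Under these assumptions the single task never reaches a 2x speedup no matter how many CPUs are added, while the count of independent jobs served keeps growing.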

Besides, from the Linux point of view, each Intel-based 8-way system
has its own funky motherboard and chipset. I don't believe Intel has
released any chipset that goes that high. (I'd like to know if I'm wrong, btw.)

Thus, there would be a huge amount of effort expended to support only
a single company's not-widespread machine. Not a good choice, imho.
Compare this to the Suns, where they pretty much use the same system
architecture and chipsets (more or less) up the xx00 line. Less effort
to support more machines. Nicer. When there's a major Intel chipset
that supports >4-way, people might change their minds. Maybe.

- what's the difference between an n-way cluster and an n-way SMP, besides
- cost and shared memory bandwidth?

Programming model. You can make either one look like the other,
but the fastest code will generally treat an SMP as a shared playpen
and a distributed system as a bunch of separate ones. It seems
obvious, but there have been many attempts to make distributed systems
look like SMPs. NUMA architectures are about the only reasonably
successful ones, but I'm biased (scientific computing, where you often
treat SMPs as distributed systems).
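
The difference in programming model can be sketched with a toy sum. This is a rough illustration, not real parallel code: Python threads stand in for both CPUs and cluster nodes, the function names are invented here, and an actual distributed version would use something like MPI or sockets instead of an in-process queue.

```python
import queue
import threading


def shared_sum(data, n_workers):
    """SMP style: workers read the same list in place and update one shared total."""
    total = [0]
    lock = threading.Lock()

    def worker(start, stop):
        partial = sum(data[start:stop])  # reads the shared data directly
        with lock:                       # synchronizes on the shared result
            total[0] += partial

    step = max(1, (len(data) + n_workers - 1) // n_workers)
    threads = [
        threading.Thread(target=worker, args=(i, min(i + step, len(data))))
        for i in range(0, len(data), step)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def message_passing_sum(data, n_workers):
    """Distributed style: each worker owns a private copy of its chunk and
    communicates only by sending one message with its partial result."""
    results = queue.Queue()

    def worker(chunk):
        results.put(sum(chunk))

    step = max(1, (len(data) + n_workers - 1) // n_workers)
    chunks = [list(data[i:i + step]) for i in range(0, len(data), step)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results.get() for _ in chunks)


if __name__ == "__main__":
    numbers = list(range(1000))
    print(shared_sum(numbers, 4), message_passing_sum(numbers, 4))
```

The second version copies data up front and never touches shared state afterwards, which is the structure that carries over to a cluster and, per the post, often to an SMP as well.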

- won't this also change over time as NUMA becomes more economical?

NUMA architectures are not a panacea. They basically implement
hardware support for a shared memory illusion on a distributed
memory system. The Origins are pretty cool, but it's easy to
write software that causes heavy page bouncing between the
processors. The fastest codes (remember my bias) treat the Origins
as distributed systems. You can think of NUMAs as having one more
level in the memory hierarchy, with local memory sometimes acting
as a cache for remote memory.
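
The "one more level in the memory hierarchy" view can be put into a back-of-the-envelope formula: average access time is a weighted mix of local and remote latency. The latencies below are invented round numbers for illustration, not measurements of an Origin or any other machine.

```python
def average_access_ns(local_fraction: float, local_ns: float, remote_ns: float) -> float:
    """Average memory access time when `local_fraction` of accesses hit local memory."""
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    LOCAL_NS = 100.0   # hypothetical local-memory latency
    REMOTE_NS = 400.0  # hypothetical remote-memory latency
    for frac in (1.0, 0.9, 0.5):
        print(frac, average_access_ns(frac, LOCAL_NS, REMOTE_NS))
```

With these assumed numbers, code that keeps only half of its accesses local pays 2.5 times the local latency on average, which is why treating the machine as a distributed system (keeping data local) tends to win.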

There's another interesting paradigm out there. Consider clusters
of SMPs... Many problems break up into smaller ones that fit current
SMPs quite nicely. IBM's latest SP/2s are clump machines... The
Linux kernel scaling to 4 cpus could fit this beautifully on the right

- wouldn't that imply that scaling CPUs might also scale I/O capacity,
- as long as the system bus could keep up?

And so long as the system bus isn't a single bus, otherwise it's a point
of contention. The high-end Sun and Digital, err, Compaq boxes handle
this by using multiple bridges onto their PCI subsystems. Most dual-
processor PCs have only one PCI bridge, so there's only one path from PCI
to CPU. (Again, I'd love to find out I'm wrong.) That limits the I/O.
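
A crude model of the bridge bottleneck: total I/O is capped by what the bridges can carry, however many CPUs sit behind them. The device demands are made up, the model assumes traffic spreads perfectly across bridges, and 133 MB/s is roughly the theoretical peak of 32-bit, 33 MHz PCI, which real hardware does not sustain.

```python
def io_throughput_mb_s(device_demand_mb_s, n_bridges, bridge_mb_s):
    """Aggregate I/O rate: devices cannot push more than the bridges can carry."""
    demand = sum(device_demand_mb_s)
    capacity = n_bridges * bridge_mb_s  # assumes perfect balancing across bridges
    return min(demand, capacity)


if __name__ == "__main__":
    devices = [80, 80, 80]  # hypothetical per-device demand in MB/s
    for bridges in (1, 2):
        print(bridges, io_throughput_mb_s(devices, bridges, 133))
```

In this toy setup a single bridge caps the three devices at 133 MB/s, while a second bridge lets all 240 MB/s through.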

As the other person pointed out, multiprocessor efficiency is very much
a system issue. That's why Sun's doing so well even though their CPUs
aren't necessarily better in price/performance.

- you would also want a single multi-threaded application to take
- good advantage of multiple CPUs.

It's possible to support multi-threaded apps pretty well on a multi-
programmed system. You might want to check the work at U Texas,
particularly Their results
are quite nice. They demonstrated linear speedups until processor
saturation, and then a nice, horizontal line.
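
The curve described here (linear, then flat) has a simple idealized shape. The function below is just that shape written down, assuming perfectly parallel threads and no scheduling overhead; it is not the U Texas system or a model of its internals.

```python
def ideal_speedup(n_threads: int, n_cpus: int) -> float:
    """Linear speedup until every CPU is busy, then a horizontal line."""
    return float(min(n_threads, n_cpus))


if __name__ == "__main__":
    CPUS = 4  # hypothetical machine size
    for threads in range(1, 9):
        print(threads, ideal_speedup(threads, CPUS))
```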

Most thread systems' performance drops through the floor when you
have more threads than CPUs. You do lose the potential for superlinear
speedups in their system, however. Many apps probably won't care.
Wish they'd get around to releasing their code... It's user-level and
can probably be ported to Linux pretty easily.

