Re: Interesting analysis of linux kernel threading by IBM

From: Steve Underwood (steveu@coppice.org)
Date: Sun Jan 23 2000 - 04:25:06 EST


Hi,

Davide Libenzi wrote:

> [...]
>
> The rendering pipeline ( as the keyword states ) is a highly parallel
> environment in which a subsystem takes one type of data, transforms it into a
> new kind of data, and passes the result to the next subsystem. This is true for
> a scanline renderer ( using shadow maps and environment mapping ), not for a
> raytracer. In this environment I'd expect ( You're right, I've only coded
> single thread renderers ) that if I decompose the pipeline into N steps and
> I have an N way SMP system I'll get good performance. Where good does not mean
> TotalTime / N , but a time T :
>
> (TotalTime / N) < T << TotalTime
>
> If even a highly parallel job like a renderer cannot be coded well for SMP,
> what do we keep it for ?
>
> OK, probably the solution You push is clusters of SMPs.
>
> But recalling what I asked You at the head of this message: given a cluster of
> N computers, each an M way SMP system, exchanging data through an
> ethernet, have You measured that ( cost apart ) a single M x N SMP system will
> perform ( scale ) worse than the cluster ?
> I can't believe that cache effects are bigger than the ethernet bottleneck.
>
> Unfortunately I have neither a Beowulf system nor a 32 way SMP system on which
> to try my thoughts ( only a poor 2 way ).

The rendering pipeline should be considered a concept, not a physical reality.
Many mathematical processes look like a pipeline - computational chemistry and
pharmacy, signal processing, and so on. Implementing them as a physical pipeline
usually sucks. There was a period in the '80s when pure vector processing
machines, such as the Cray and FPS machines (are FPS still in business?), caused a
number of modelling packages to be rewritten to be more pipeline-like. Since then
they have been changed back, to regain performance on today's architectures. These
problems have a special quality - the blocks of data to be processed are between
99% and 100% independent of each other. The problem can truly be broken into
pieces across a large number of processors or computers with almost no (in some
cases no) interaction at all. To grasp what will work well for these problems with
today's technology, just consider a basic, sound concept in computing and group
work - keep communication to a minimum, and just get on with your own work.

Let's say you have one N CPU SMP machine with which to perform rendering. You could
render using N different processes which each perform 1/Nth of the rendering
function. Alternatively, you could run N copies of a program which does the whole
rendering job. Which would be best? Well, the program which does the whole job, if
sensibly structured, will load each pixel into a processor cache just once, and
unload it just once. The programs which break up the work have to keep storing
data into main memory, so it can be reloaded into another processor's cache, which
causes a _huge_ performance hit with current processor to main memory speed
ratios. Now, which strategy sounds like it will keep the processors busy, and
which sounds like it will merely keep the communications channels through main
memory busy? In this case it should be obvious that the low communications
strategy is best - one program which does the whole rendering job, carefully
structured to avoid cache thrashing.
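
To make the cache argument concrete, here is a toy sketch of the two structures
on one box. The pixel format and the two stage functions are invented for
illustration (a real renderer's stages are of course far heavier); the point is
only where the data crosses the memory bus:

#include <stdio.h>

#define WIDTH  1920
#define HEIGHT 1080

static unsigned int frame[WIDTH * HEIGHT];

/* Stand-ins for real rendering stages. */
static unsigned int stage_shade(unsigned int p)  { return p ^ 0x00ff00ffu; }
static unsigned int stage_filter(unsigned int p) { return (p >> 1) & 0x7f7f7f7fu; }

/* Pipeline style: each stage sweeps the whole frame, so every pixel is
 * evicted from cache between stages and fetched from main memory again. */
static void render_pipelined(void)
{
    long i;

    for (i = 0; i < WIDTH * HEIGHT; i++)
        frame[i] = stage_shade(frame[i]);
    for (i = 0; i < WIDTH * HEIGHT; i++)
        frame[i] = stage_filter(frame[i]);
}

/* Whole-job style: each pixel passes through every stage while it is
 * still hot in the cache, so it crosses the memory bus only once. */
static void render_whole_job(void)
{
    long i;

    for (i = 0; i < WIDTH * HEIGHT; i++)
        frame[i] = stage_filter(stage_shade(frame[i]));
}

int main(void)
{
    render_pipelined();
    render_whole_job();
    printf("checksum %u\n", frame[0]);
    return 0;
}

Both versions do the same arithmetic; only the second keeps the processors
working on data they already hold.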

Is the above SMP solution the most sensible, though? Well no. A better solution
exists. The N identical processes on the N CPUs are contending for main memory
bandwidth. As N increases this becomes intolerable. Even for small values of N it
is wasteful. N uniprocessor machines, each given a section of a movie (say a frame
at a time) to render, wouldn't do such a wasteful thing. If you assume the raw
movie data comes in by LAN, and the rendered results will be shipped out to
somewhere by LAN, N uniprocessor machines dividing up the work do the best
possible job. They will each perform 1/Nth of the network I/O, which is needed for
both strategies anyway, but avoid all main memory contention. It's still critical
that the rendering program avoid cache thrashing, but that is achievable with
careful design and optimisation.
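
A minimal sketch of that frame-farming arrangement, assuming each box is simply
told its rank and the node count at startup; render_frame() is a placeholder,
and the LAN I/O is left out:

#include <stdio.h>
#include <stdlib.h>

#define TOTAL_FRAMES 100000

static void render_frame(int frame)
{
    /* ...fetch the raw frame data from the LAN, render it, ship the
     * result back out.  Placeholder only. */
    printf("rendered frame %d\n", frame);
}

int main(int argc, char **argv)
{
    int rank, nodes, frame;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <rank> <nodes>\n", argv[0]);
        return 1;
    }
    rank  = atoi(argv[1]);
    nodes = atoi(argv[2]);

    /* Node k takes frames k, k+N, k+2N, ...  No coordination is needed
     * beyond handing out the rank at startup, and nothing is shared, so
     * the only communication left is the frame data itself. */
    for (frame = rank; frame < TOTAL_FRAMES; frame += nodes)
        render_frame(frame);
    return 0;
}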

So, that is why the movie industry, pharmaceutical industry, and various other
groups with a huge demand for simulation and modelling computation, like
uniprocessor farms of the Beowulf type. I have seen descriptions of a couple of
Beowulf systems composed of dual-processor Alphas, but I am puzzled as to why. It
doesn't usually work out to be as effective as using more single-CPU machines.

These types of problems have little in common with most everyday computing
problems. They require lots of compute power, have a remarkably high degree of
independence between blocks of data, are very regular in their use of memory and
compute resources (in many cases every data block takes precisely the same number
of instructions and bytes of memory to process), and latency is almost totally
irrelevant.

SMP machines suit a totally different, but far more widespread class of problems.
With a sensible OS, they can give a good interactive feel running a poorly
structured mixture of everyday business activities. Design the OS around the needs
of a physical rendering pipeline and performance will suffer badly. Introduce more
threads than are needed to spread work across the available processors and they
will suffer badly. Structure their use so the number of instantaneously runnable
tasks is large, and they will have all but ground to a halt.
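
If you do thread such a job, the rule of thumb falls straight out of the above:
size the worker pool to the machine, not to the problem. A minimal sketch,
assuming a glibc-ish system where sysconf(_SC_NPROCESSORS_ONLN) reports the
online CPU count; do_work() stands in for the real job:

#include <pthread.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

static void *do_work(void *arg)
{
    long id = (long)arg;

    /* ...process this worker's share of the job... */
    printf("worker %ld running\n", id);
    return NULL;
}

int main(void)
{
    long ncpus, i;
    pthread_t *workers;

    ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (ncpus < 1)
        ncpus = 1;

    /* One runnable thread per CPU: more than this only adds scheduling
     * overhead and cache churn, it cannot add throughput. */
    workers = malloc(ncpus * sizeof(*workers));
    for (i = 0; i < ncpus; i++)
        pthread_create(&workers[i], NULL, do_work, (void *)i);
    for (i = 0; i < ncpus; i++)
        pthread_join(workers[i], NULL);
    free(workers);
    return 0;
}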

Contention for main memory means, of course, that SMP won't scale well for any
class of problem. Since everyone should realise that, I won't comment further.

Steve

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
