Re: Interesting analysis of linux kernel threading by IBM

From: Larry McVoy (lm@bitmover.com)
Date: Fri Jan 21 2000 - 16:44:34 EST


: why You all speak about worse designed multithreaded apps.
: Every technology bad applied is worse !
: Suppose You have a rendering job to be done.
: It can be subdivided in a highly parallel system with distinct threads that
: can run together :
:
: 1) Viewing transformation
: 2) Triangulation
: 3) Scan conversion
: 4) Texturing
: 5) Illumination
: 6) Frame output
:
: just to keep it simple.
: Now can someone tell me why I would not split my job into threads ?

Rendering is what we call an embarrassingly parallel application.
In other words, very, very coarse grained parallelism works great for
this, in fact, it works orders of magnitude better than what you descibed.
Talk to Disney, Pixar, ILM, RFX - all of whom are heavily into this space,
all of whom I've visited personally to talk about their computing needs,
and all of whom use farms of uniprocessors for rendering. There are
a bunch of other ones too, Digital Design, Pacific something (used to be
Walnut Creek now are in Palo Alto), etc. All the production and post
production digital houses know that farms of machines that share nothing
but a network are the highest performance and least cost way to do
rendering.

If you suggested a multithreaded application to do that to any of those
guys in a job interview, and stuck to your opinion that it was a good
idea, my predicition is that you would be standing on the street wondering
what happened in less than 5 minutes. Those people are doing hard work
on short schedules and and really don't have time to waste.

I am starting to wonder if you've ever coded up an application both ways
and tested it. If you had tried the rendering model that you suggested
and then tried the same thing all in one process, I believe that your
way would show dramatically lower performance. It's been shown that
while the model of fine grained parallelism, especially in data parallel
applications like what you are talking about, while that model can be
supported, the cache effects of doing so on an SMP dramatically _REDUCE_
the performance. It's always been seen that you are better off to divide
up the data, do all the different transformations to a chunk of data by
one process on one processor in one cache, rather than by spreading the
same data over a bunch of caches. In fact, all the research in parallel
applications boils down to ``how much can you divide up the data''.
If there is so much focus on that, all of it performance related, why
is it that you believe something that certainly seems to fly the face
of both theory and practice?

If you don't mind, can you tell me if you have actually tried any of
these ideas you are advocating, tried them both ways and compared the
performance? It's been a couple of years since I looked at this in
depth and perhaps the technology or the transformations have changed
their characteristics and I'm all wrong, this stuff works great the way
you say it does. But I'd sure like to understand why that is, if that
is the case, because it shakes up some pretty basic understanding that
I have of how the apps and the SMPs work.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Jan 23 2000 - 21:00:27 EST