Re: Help in DSM design

From: Larry McVoy (lm@bitmover.com)
Date: Sat Mar 04 2000 - 13:15:19 EST


On Sat, Mar 04, 2000 at 04:44:04PM +0100, Jamie Lokier wrote:
> Larry McVoy wrote:
> > : Pages are like huge cache lines. Don't gripe about cache lines, OK?
> > : Good SMP code avoids cache line ping pong. Bad code doesn't bother.
> > : (then again, if the "bad" code was cheap, maybe it isn't so bad)
> >
> > So you like a world where your cache line misses take 10,000
> > times longer than normal, do you? You know what, you're right.
> > If you have code that is so cache insensitive that it wouldn't
> > notice the difference, DSM would work great.
>
> Larry, you're deliberately ignoring the fact that a DSM "cache line
> transfer" doesn't block the requesting CPU from executing another thread
> while it waits.
>
> SMP cache line requests don't have that property.

I wasn't ignoring it; I hadn't considered it. OK, so let's think
about that. So now I need to have a shared memory model, with multiple
threads, but the threads need to be split up such that the right threads
are wherever the data is, and I need to have enough of them to cover up
huge memory latencies.
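Just to be concrete about what that model demands, here's a minimal
pthreads sketch (mine, not Jamie's; the thread and chunk counts are
invented for illustration). The idea is to oversubscribe the CPUs so
that when one thread stalls on a remote page, the scheduler has
something else to run:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 64   /* many more threads than CPUs */
    #define CHUNK 4096

    static double data[NTHREADS][CHUNK]; /* under DSM, spread over nodes */

    static void *worker(void *arg)
    {
        long me = (long)arg;
        double sum = 0.0;
        int i;

        /* each touch of data[me] may be a remote page fetch; the
         * hope is that another worker runs while this one waits */
        for (i = 0; i < CHUNK; i++)
            sum += data[me][i];
        printf("thread %ld: %g\n", me, sum);
        return 0;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], 0, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], 0);
        return 0;
    }

Note the assumption buried in there: the data each thread wants has to
already be where that thread runs, which is exactly what I'm about to
argue doesn't happen.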

I don't believe that you can take any significant number of shared memory
programs and make that model work. It's a nice idea but the reality is
that wherever you want the data to be, it's typically someplace else.
This is _inherent_ in shared memory programming. Let's see if I can
back up that statement:

Let's say we eliminate the programs which aren't really shared memory
programs - parallel make, for example. While I'm sure that you could
have all the processes mmap the same makefile and claim that that is
shared memory programming, that's more than a little silly, right?
Am I right in assuming that you don't mean silly stuff like that?
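For concreteness, the "silly" case is nothing more than this (a sketch
of mine; file name chosen arbitrarily, error handling omitted):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        struct stat st;
        int fd = open("Makefile", O_RDONLY);
        char *mk;

        fstat(fd, &st);
        /* every process maps the same file: "shared memory" in name
         * only - nobody ever stores to it, so there is no
         * communication and no coherence traffic to speak of */
        mk = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        /* ... parse the makefile through mk ... */
        munmap(mk, st.st_size);
        close(fd);
        return 0;
    }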

OK, so that leaves us with programs that are actually sharing memory
because they need to - they are communicating with each other through
the memory rather than communicating through messages. A classic example
of this is your basic simulation, where each process (thread if you
must) is working on some subset of the space. Let's say it is a
three-dimensional simulation, so each process has a subcube of the
larger cube.

This is a nasty problem space because each process' forward progress
depends not only on the results of its local computation but also on
the results of the computation of its neighbors, since if the cube right
next to you is blowing up, that's going to have some effect on your cube.
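To pin that down, here's a stripped-down sketch of the pattern (1-D
instead of 3-D to keep it short; all the names and sizes are mine):

    #define N 1024                 /* cells in this process's chunk */

    /* in a real DSM run these edge cells live in the neighbors'
     * memory; here they're just globals standing in for that */
    double left_edge, right_edge;
    double cur[N], next[N];

    void step(void)
    {
        int i;

        /* the boundary cells need the neighbors' results: under DSM
         * each of these loads can cost a full page transfer */
        next[0]   = (left_edge + cur[0] + cur[1]) / 3.0;
        next[N-1] = (cur[N-2] + cur[N-1] + right_edge) / 3.0;

        /* the interior is purely local */
        for (i = 1; i < N - 1; i++)
            next[i] = (cur[i-1] + cur[i] + cur[i+1]) / 3.0;

        /* ... then swap cur/next and wait until the neighbors have
         * published their new edges before stepping again ... */
    }

The remote loads happen on every single step, and you can't start the
next step until they complete, so there's nothing useful for this
thread to do while it waits.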

So contrast this model with the "executing another thread while it
waits" statement. The only way you can keep all the CPUs busy all
the time is by adding other computations - and it's obvious that this
computation will go much, much slower under a DSM, isn't it? If it's
not obvious, you haven't thought about it enough; it's definitely
obvious to anyone who's worked in this space. So you could run multiple
simulations to cover up the latency. But that's wacko too - the latencies
are so high that you'd need 100s or 1000s of other tasks to cover them
up. That's clearly not an option: the caches and memories would need to
be 100s or 1000s of times bigger, and the interconnect would need to
be accordingly faster.
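To put rough numbers on that (illustrative only, not measurements):
say a remote DSM page fetch costs on the order of a millisecond over
the network, and a task does on the order of a microsecond of useful
work per remote touch. Then:

    tasks needed ~= fetch latency / work per fetch
                 ~= 1 ms / 1 us
                  = ~1000 runnable tasks per CPU

which is where the 100s-to-1000s figure comes from, and why everything
else in the machine would have to scale up with it.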

So I just don't see it. If you want to study this area, the people who
have done the most work on it are the Tera folks; they have a design
where they _do_ run another thread in hardware on each cache line miss.
I personally think this is insane, but they keep managing to find more
funding.

-- 
---
Larry McVoy            	   lm@bitmover.com           http://www.bitmover.com/lm 
