Re: Process Migration on Linux - Impossible?

Larry McVoy (lm@cobaltmicro.com)
Tue, 30 Sep 1997 16:57:59 -0700


: I don't know that migration impacts performance.

In an SMP it totally screws up your cache utilization. Process migration
across machines is just a larger version of rescheduling on another CPU
in an SMP. There are literally hundreds of papers on cache affinity and
schedulers. They show that it is almost always a bad idea to move the
process, something which is intuitively obvious to hardware guys who have
memorized cache miss costs and know how long it takes to refill a cache.
Here's an example: suppose you have a 10 millisecond time slice, a 4MB
cache w/ only your data in it, a 32 byte cache line, and 500ns cache
miss penalty. If you move the process, it will take 131,072 cache misses
to refill that cache. That is 65,536 usecs, or about 7 time slices.
If your scheduler moves you every time slice, you lose big time; you'll
never get any work done out of the cache.
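If you want to check the arithmetic, here it is as a trivial C program.
The constants are the assumed numbers from above, not measurements from
any particular machine:

    /* Back-of-the-envelope cost of refilling a cold cache after a
     * migration, using the numbers assumed above. */
    #include <stdio.h>

    int main(void)
    {
        long cache_bytes = 4L * 1024 * 1024;  /* 4MB cache            */
        long line_bytes  = 32;                /* 32 byte cache line   */
        long miss_ns     = 500;               /* 500ns miss penalty   */
        long slice_us    = 10 * 1000;         /* 10 millisecond slice */

        long misses    = cache_bytes / line_bytes;   /* 131,072 misses */
        long refill_us = misses * miss_ns / 1000;    /* 65,536 usecs   */

        printf("%ld misses, %ld usecs, %.1f time slices\n",
               misses, refill_us, (double)refill_us / slice_us);
        return 0;
    }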

The math for process migration across machine boundaries is similar, except
it frequently gets worse - the cost of moving is in the 100s of milliseconds
and the cache is a file system page cache with even longer refill times.

The counter argument is that you don't do it very often. I agree.
Do it once at exec time and never again. If you need to do it again,
that's a checkpoint/restart and that can be done with very little kernel
assistance.
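To make the checkpoint/restart point concrete, here is a toy sketch of
doing it entirely at the application level - the program saves its own
state to a file and picks up from it on the next run, possibly on another
machine. This is an illustrative toy, not any shipping implementation;
systems like Condor that checkpoint at user level do far more work to
capture the whole address space:

    /* Minimal sketch of application-level checkpoint/restart.  No
     * kernel support beyond ordinary file I/O is required. */
    #include <stdio.h>

    struct state { long iteration; double partial_sum; };

    static int restore(struct state *s, const char *path)
    {
        FILE *f = fopen(path, "rb");
        if (f == NULL)
            return 0;               /* no checkpoint: cold start */
        if (fread(s, sizeof(*s), 1, f) != 1) {
            fclose(f);
            return 0;
        }
        fclose(f);
        return 1;
    }

    static void checkpoint(const struct state *s, const char *path)
    {
        FILE *f = fopen(path, "wb");
        fwrite(s, sizeof(*s), 1, f);
        fclose(f);
    }

    int main(void)
    {
        struct state s = { 0, 0.0 };
        restore(&s, "app.ckpt");
        for (; s.iteration < 1000000; s.iteration++) {
            if (s.iteration % 100000 == 0)
                checkpoint(&s, "app.ckpt");  /* restartable from here */
            s.partial_sum += 1.0 / (s.iteration + 1);
        }
        printf("sum = %f\n", s.partial_sum);
        return 0;
    }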

: As far as reliability goes,
: you can *increase* reliability if you can arrange to migrate off of machines
: that will be taken down for some reason, so there's at least one advantage to
: go along with whatever disadvantages you might find.

Checkpoint/restart solves this problem. And this isn't the reliability I
meant - I was referring to kernel reliability, which is put at risk each
time a new "feature" is added.

: Perhaps. Though I think what you say later about tightly-coupled clusters
: will apply to other scales and that eventually, migration will in fact be
: commonplace.

In the SMP world, it has already become commonplace and proven to be a
horrible mistake from a performance perspective. All tuned schedulers
do everything they possibly can to avoid moving you, including waiting
for the CPU. As the math above shows, it is frequently better to wait
for many time slices than to move. Remember, that example showed 7
time slices just to get back to the cache state you were in before.
That's a hefty performance difference.

: > Past experience has shown that process migration is costly in terms
: > of code required to make it work, time required to do it, and cache
: > utilization (both processor and file system).
: >
: > Since I despise nay-sayers that don't bother to offer a better answer,
: > here's my better answer: cluster small SMP machines. Allow full blown
: > "migration" from one CPU to another CPU on an SMP, but not across machine
: > boundaries. This gives some degree of dynamic load balancing, which is
: > the second order throughput term. Use static load balancing at exec()
: > time to get the first order term.
: >
: > I happen to believe that such a system would keep up with, and in some
: > cases outperform, large SMP systems. I have a fair amount of real world
: > experience from Sun and SGI SMP systems that suggests that this approach
: > is better.
: >
: > Here, we are discussing process migration across machine boundaries,
: > something that is certainly much more expensive than migration from one
: > CPU to another CPU within an SMP, right? It is safe to say, is it not,
: > that if you can't migrate well within an SMP then it is going to do
: > nothing but get worse if you try to migrate to a different machine.
: > I've worked on 128-512 processor systems at SGI with 100% cache coherent
: > memory and file systems (done in hardware, memory latency was 300-800ns
: > depending on where you were). We couldn't make page migration work well
: > on those machines.
: >
: > Let me say that again. A company with SGI's resources, with many full-time,
: > experienced engineers who would stack up with the best the research
: > community has to offer, couldn't get page migration to work. By "work"
: > I mean come up with a self tuning policy that results in better throughput
: > than just allocating the pages and leaving them where they were allocated
: > and leaving the process near them. /All/ attempts to improve performance
: > by using migration resulted in lower system throughput. The cost of moving
: > the process context and the pages outweighed any performance benefit.
: >
: > It would be easy for people to just say "Well, those SGI engineers are
: > stupid". Heck, I could say it, I didn't work on the migration stuff,
: > I thought it was a bad idea before they started working on it, so it
: > isn't like I have any skin in the idea, quite the opposite. But the
: > SGI engineers are top notch, at least in this area. I challenge the
: > minds out there to come up with a migration policy that actually
: > improves performance under any realistic workload, not some toy
: > benchmark. Show me a process migration system that makes TPC-C
: > run better. Or fortran jobs. Or make. Or web. Or NFS. Anything
: > that customers will pay money to get.
: >
: > The point is that yes, you can do it. But that is an academic point,
: > not anything that is actually useful. Not in my experience. I'm happy
: > to be proven wrong, but I'm unhappy with letting people go down a
: > proven rathole.
:
: Thanks for the lessons.

I'm a little disappointed that you didn't use your experience to either
counter or validate the analogy made above. It seems to me that you
might be able to shed some light here.

Let me ask again: if a $4 billion/year company can't get it right within
the confines of an SMP, and both industry practice and research have shown
that migration stinks on SMPs, what makes you encourage people to pursue
this line of thinking? Do you think process migration across machine
boundaries is somehow less expensive than a reschedule onto another CPU?
Or that the relative speeds of migration, file system cache misses, etc.,
are different for machine-to-machine versus CPU-to-CPU?

Another way to put it: if you have remote exec() and you have 4 way
SMP nodes, under what circumstances would you ever want to migrate a
running process? And what data or insight exists that shows this to be
a good idea?
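By contrast, static placement at exec() time is almost trivial. Here's
a toy sketch; the node names and load numbers are made up, and a real
system would get loads from the cluster and rexec()/rsh the job to the
chosen node:

    /* Sketch of static load balancing at exec() time: pick the least
     * loaded node once, run the job there, and never move it. */
    #include <stdio.h>

    struct node { const char *name; double load; };

    static const char *pick_node(const struct node *nodes, int n)
    {
        int i, best = 0;
        for (i = 1; i < n; i++)
            if (nodes[i].load < nodes[best].load)
                best = i;
        return nodes[best].name;
    }

    int main(void)
    {
        struct node cluster[] = {
            { "node0", 3.2 }, { "node1", 0.4 }, { "node2", 1.7 },
        };
        /* A real system would now remote-exec the job on this node. */
        printf("exec on %s\n", pick_node(cluster, 3));
        return 0;
    }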

I'm sorry to be so negative about this idea. But I've seen things like
this pushed before and watched bright young people go follow these ideas,
only to find that they aren't good ideas after all. Seems like there
are more fruitful areas that we could push them towards, doesn't it?

--lm