Re: Core dumps & restarting

David S. Miller (davem@caip.rutgers.edu)
Tue, 29 Oct 1996 00:16:39 -0500 (EST)


From: Systemkennung Linux <linux@mailhost.uni-koblenz.de>
Date: Tue, 29 Oct 1996 04:59:00 +0100 (MET)

The big problem with freezing processes or machine state and restoring
it later is that part of the context gets lost, for example non-local
network connections. Some code dies because time suddenly warps. And
where do you position file pointers when restoring a process? This is
usually trivial when the file hasn't changed since the snapshot was
taken, but can get very hairy otherwise.
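
To make the file-pointer case concrete, here is a rough Python sketch of
the "trivial" path only: record the offset along with the file's size and
modification time, and refuse to restore if either has changed. The names
FileSnapshot, snapshot_file and restore_file are made up for illustration,
and a size/mtime comparison is just one assumed way to detect a change.

```python
import os
from dataclasses import dataclass


@dataclass
class FileSnapshot:
    path: str       # where the file lives
    offset: int     # file pointer position at snapshot time
    size: int       # file size at snapshot time
    mtime_ns: int   # modification time at snapshot time


def snapshot_file(f):
    """Record enough state to reposition the file pointer later."""
    st = os.fstat(f.fileno())
    return FileSnapshot(f.name, f.tell(), st.st_size, st.st_mtime_ns)


def restore_file(snap):
    """Reopen the file and seek back, but only if it looks unchanged."""
    f = open(snap.path, "rb")
    st = os.fstat(f.fileno())
    if st.st_size != snap.size or st.st_mtime_ns != snap.mtime_ns:
        # The hairy case: the sketch gives up rather than guess an offset.
        f.close()
        raise RuntimeError("file changed since snapshot: " + snap.path)
    f.seek(snap.offset)
    return f
```

The hairy case is deliberately left as an error here; what the right
offset is after the file has changed depends on the application.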

This depends upon how short you can get "later" and how much state you
can fully save. Behold...

We can already do things (generically Unix-speaking) like dump a
complete core image of RAM onto a disk when we punt, and we have the
technology for multiple-initiator SCSI configurations and for making
that work.

Why not dump the core RAM image to another "machine", drop
reservations on all the SCSI devices you are talking to, and then tell
that machine "mount my disks, assume my IP addresses, and act like me,
because I'm going down". It can work with something like a 3 minute
max takeover time if you do it right. If the panicking machine can
come back up cleanly, the transfer of the core image can happen again,
SCSI device ownership can be given back, the IP interfaces set back up,
and you are _still_ back in operation. You can get it so good that it
only looks like the network is saturated to your users ;-)
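
The order of the handoff described above can be sketched as a toy Python
model. Nothing here touches real hardware: Node and take_over are
hypothetical names, the "image" is just a byte string, and the standby is
assumed to be a pure spare with no disks or addresses of its own.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    ram_image: bytes = b""
    scsi_reservations: set = field(default_factory=set)
    ip_addresses: set = field(default_factory=set)


def take_over(failing, standby):
    """Move image, disks and addresses from one node to the other."""
    steps = []

    # 1. Ship the core image while the failing node can still send it.
    standby.ram_image = failing.ram_image
    steps.append("copy core image")

    # 2. Drop reservations so the other initiator may claim the devices.
    disks = set(failing.scsi_reservations)
    failing.scsi_reservations.clear()
    steps.append("drop SCSI reservations")

    # 3. The standby takes the disks.
    standby.scsi_reservations |= disks
    steps.append("mount disks on standby")

    # 4. The standby assumes the addresses and starts acting like the
    #    failing node.
    addrs = set(failing.ip_addresses)
    failing.ip_addresses.clear()
    standby.ip_addresses |= addrs
    steps.append("assume IP addresses")

    return steps
```

Giving everything back once the original machine is healthy is the same
call with the arguments swapped, e.g. take_over(standby, primary).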

This idea is only useful once you hit the point where you can get the
machine's running image cloned and operational somewhere else within
the timeouts for various things like TCP connections etc. And at that
point the delay is about the same as you'd see if the machine were
swapping a lot or otherwise heavily loaded.
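
The timeout argument boils down to a simple comparison, sketched below
in Python. The function name and the timeout figures are placeholders
invented for illustration, not measured or standard values; only the
180 second takeover time comes from the "3 minute max" mentioned above.

```python
def connections_that_survive(takeover_seconds, timeouts):
    """Return the names whose timeout outlasts the takeover, sorted."""
    return sorted(
        name for name, limit in timeouts.items()
        if limit > takeover_seconds
    )


# Placeholder timeouts in seconds (assumed values, for illustration only).
EXAMPLE_TIMEOUTS = {
    "long_lived_tcp": 600.0,
    "interactive_login": 240.0,
    "short_rpc": 60.0,
}

# A 3 minute takeover: anything with a shorter timeout is lost.
survivors = connections_that_survive(180.0, EXAMPLE_TIMEOUTS)
```

With these made-up numbers the two longer timeouts survive and the
short one does not; shrink the takeover time and more of them make it.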

Yes, I've been thinking about stuff like this.

David S. Miller
davem@caip.rutgers.edu