Re: Linux kernel and disaster recovery.

Richard B. Johnson (root@analogic.com)
Wed, 18 Jun 1997 10:22:18 -0400 (EDT)


On Wed, 18 Jun 1997, Peter Benie wrote:

> TIGRANA@dstiuk.ccmail.compuserve.com writes ("Linux kernel and disaster recovery."):
> > What if the Linux server itself crashes? If it is under UPS and
> > there was some clever kernel module that would be able to somehow
> > save the state of all (or specific) running processes and write to a
> > separate disk partition and then after reboot to be able to restore
> > the "memory dump" from the partition into memory thus revitalising
> > all those running processes that would be very nice.
>
> The term you are looking for is "checkpointing". You take a snapshot
> of a process every so often and when the machine is rebooted after a
> crash, you can restore the process to the state of the snapshot.
> Alternatively, you can stop a process and start it again later
> (perhaps on a different machine).
>
> > Of course, I understand that the network sockets will be lost but it
> > is fine because with the scheme described above one simply
> > reattaches to the sessions using the UNIX domain socket and resumes
> > it.
[SNIPPED]

If kernel driver code were written so that it, too, was checkpointed,
then it would be possible to restore a running kernel and all the user
tasks. VAX/VMS has had this capability since Version 6.0.

Basically, the boot process commences as usual, getting all the
hardware interfaces up and running. Then each driver's buffers and
internal software state machine are overlaid with the previously
saved image. Next, the rest of the kernel is overlaid with the saved
memory image. Finally, a "return" is made from the previous checkpoint
trap and the machine runs as it was running before.

VAX/VMS uses this for a "fast boot" option. The snapshot is taken
when the system is up with all its normal "System" tasks running.

Then when you "fast boot", the machine will be quickly restored to
this saved state.

__BUT__ The problem is that you don't want to restore the machine to
its exact saved state at the moment it crashed. It will immediately
crash again! Most crashes are the result of the CPU executing garbage
because of either a hardware error or a programming error. You need to
restore the state the machine had before it started executing garbage,
and you don't know when that was.

In the days when it took 30 to 40 minutes to boot a VAX, it was useful
to save the state of a perfectly-running machine so that it could be
quickly re-booted within a minute or two.

Now we can boot the most complex machine in 30 seconds or so. It really
doesn't make much sense to save this "freshly booted" state. Instead,
servers and database engines should be written to quickly recover by
doing internal checkpointing at regular intervals. If you pull the plug
and reboot the machine, these programs should "know" how to completely
recover in a very short period of time. The problems cited about sockets
and file descriptors are not really problems: the checkpointing routines
save all the information necessary to reestablish logical connections,
including any security considerations. That is just part of the complete
solution, and it is application specific.
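
As a rough illustration of application-level checkpointing at regular
intervals, here is a minimal sketch in Python. Everything in it is an
assumption made for the example: the names save_checkpoint,
load_checkpoint and process_jobs, the JSON file, and the layout of the
state dictionary are hypothetical, not taken from any real server. The
idea is only that the program periodically writes its own state to a
temporary file, forces it to disk, and renames it into place, so that
after a power loss it finds either the old checkpoint or the new one,
never a half-written file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically write `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to disk before renaming
        os.replace(tmp_path, path)  # atomic rename: old file or new, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process_jobs(jobs, path, interval=2):
    """Sum `jobs` in order, checkpointing every `interval` items.

    On startup the function resumes from the last checkpoint, so work done
    after that checkpoint is simply redone.
    """
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    for i in range(state["next_index"], len(jobs)):
        state["total"] += jobs[i]
        state["next_index"] = i + 1
        if state["next_index"] % interval == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final checkpoint once all work is done
    return state["total"]
```

A real server would also record whatever it needs to reestablish its
logical connections (peer identities, session tokens and so on) in the
same state; that part is application specific and is left out here.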

Checkpointing of database engines generally forces a designer to produce
a superior product because more discipline must be used during code
development. When a programmer has to think about how to unroll and
redo something that could be terminated at any instant, the result is
usually "lean and mean" code.
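
To make the "unroll and redo" idea concrete, here is a small sketch of
one common way to do it, a redo log with commit markers. This is my own
illustration under stated assumptions, not how any particular database
engine works: the functions log_append, transfer and recover, the
one-JSON-record-per-line log format, and the crash_before_commit flag
(which only simulates the plug being pulled) are all hypothetical.
Changes are logged first; on recovery, transactions that reached their
commit marker are redone and anything left incomplete is discarded.

```python
import json
import os


def log_append(log_path, record):
    """Append one JSON record per line and force it to disk."""
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())


def transfer(log_path, txid, src, dst, amount, crash_before_commit=False):
    """Log both halves of a transfer, then a commit marker."""
    log_append(log_path, {"tx": txid, "op": "add", "key": src, "delta": -amount})
    log_append(log_path, {"tx": txid, "op": "add", "key": dst, "delta": amount})
    if crash_before_commit:
        return  # simulates termination in the middle of the transaction
    log_append(log_path, {"tx": txid, "op": "commit"})


def recover(log_path, initial):
    """Rebuild state: redo committed transactions, discard the rest."""
    balances = dict(initial)
    pending = {}  # txid -> list of (key, delta) not yet committed
    if not os.path.exists(log_path):
        return balances
    with open(log_path) as log:
        for line in log:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break  # torn final write: stop reading at the damaged record
            if rec["op"] == "commit":
                for key, delta in pending.pop(rec["tx"], []):
                    balances[key] = balances.get(key, 0) + delta
            else:
                pending.setdefault(rec["tx"], []).append((rec["key"], rec["delta"]))
    return balances  # whatever is still in `pending` was never committed
```

Because nothing is applied until its commit marker is seen, a transfer
interrupted halfway never leaves money debited from one account without
being credited to the other.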

Cheers,
DJ
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Richard B. Johnson
Analogic Corporation
Email : rjohnson@analogic.com, johnson@analogic.com
Penguin : Linux version 2.1.42 on an i586 machine (66.15 BogoMips).
Warning : It's hard to stay on the trailing edge of technology.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-