Re: Core dumps & restarting

Ian Main (imain@vcc.bc.ca)
Tue, 5 Nov 1996 13:36:07 -0800 (PST)


On 5 Nov 1996, Kai Henningsen wrote:

> davem@caip.rutgers.edu (David S. Miller) wrote on 29.10.96 in <199610290516.AAA13856@caip.rutgers.edu>:
>
> > Why not dump the core ram image to another "machine", drop
> > reservations on all the SCSI devices you are talking to, and then tell
> > the machine "mount my disks, assume my ip addresses, and act like me,
> > because I'm going down". It can work with something like a 3 minute
> > max takeover time if you do it right. If the panic'ing machine can
> > come back up cleanly, the transfer of core image can happen again,
> > scsi device ownership given back, ip interfaces set back up, and you
> > are _still_ back in operation. You can get it so good that it only
> > looks like the network is saturated to your users ;-)

Hmm.. just a crazy thought.. What if you were to have 2 machines with
the same IP, on the same subnet, but have only one of them send
_anything_ back in response? The 2 machines would have identical
hardware and an identical installation. If all connections to the
master (the one that will respond) were duplicated on the secondary, and
the secondary made no attempt to respond at the physical layer in any
way, then the secondary would naturally operate identically.

Note that the applications and services on the secondary would all
think they were the ones responding, and it works only because the 2
machines are identical in every way.
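
To make the idea a bit more concrete, here is a toy sketch in Python. It
is only a simulation, not real networking: the Replica class, its counter
"state" and the reply format are all made up for illustration, and it
assumes request handling is fully deterministic, which real services
often are not.

```python
class Replica:
    """One of two identical machines that both see every incoming request."""

    def __init__(self, is_primary):
        self.is_primary = is_primary
        self.counter = 0  # hypothetical stand-in for all server state

    def handle(self, request):
        # Both machines do exactly the same work on the same input...
        self.counter += len(request)
        reply = f"ack {self.counter}"
        # ...but only the primary lets the reply out onto the wire.
        return reply if self.is_primary else None

    def promote(self):
        """Called on the secondary when the primary stops."""
        self.is_primary = True


primary = Replica(is_primary=True)
secondary = Replica(is_primary=False)

for request in ["hello", "world!"]:
    sent = primary.handle(request)       # this reply reaches the client
    dropped = secondary.handle(request)  # None: computed, then discarded

# The primary dies; the secondary already holds the same state.
secondary.promote()
next_reply = secondary.handle("again")
```

Because the secondary processed the same requests, its state matches the
primary's at the moment of takeover, so the next reply continues the
sequence as if nothing had happened.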

A UDP "heartbeat" (stolen from a previous message :) ) could be used to
let the secondary take over the connections if the primary stopped.
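
A rough sketch of what the heartbeat could look like, again in Python.
The port number, the 3 second timeout, the "alive" payload and the
take_over callback are all assumptions for the sake of the example;
nothing here is from an existing tool. The timeout check is kept in its
own small class so it can be tried without a network.

```python
import socket
import time

HEARTBEAT_PORT = 9999   # hypothetical port
TIMEOUT_SECONDS = 3.0   # hypothetical: longer silence means the primary is gone


class HeartbeatMonitor:
    """Remembers when the primary was last heard from."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_seen = now

    def beat(self, now):
        self.last_seen = now

    def primary_alive(self, now):
        return (now - self.last_seen) <= self.timeout


def send_heartbeats(peer_addr, interval=1.0):
    """Run on the primary: send one small datagram per interval, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        sock.sendto(b"alive", (peer_addr, HEARTBEAT_PORT))
        time.sleep(interval)


def watch_primary(take_over):
    """Run on the secondary: call take_over() once the primary goes quiet."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(0.5)  # wake up regularly to re-check the deadline
    monitor = HeartbeatMonitor(TIMEOUT_SECONDS, time.monotonic())
    while monitor.primary_alive(time.monotonic()):
        try:
            sock.recvfrom(16)
            monitor.beat(time.monotonic())
        except socket.timeout:
            pass  # nothing arrived this round
    take_over()
```

UDP can drop packets, so the timeout should cover several missed beats
rather than just one; otherwise a single lost datagram would trigger a
takeover while the primary is still running.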

Hopefully I have described my idea well enough for you to understand it,
and I'm sure it'd have some problems, but I just thought I'd share it
with you anyway :)

Ian

>
> Sounds like what you *really* want is the thing Novell calls SFTIII. I've
> never seen it myself, but I gather the idea is to have (usually) two
> machines with identical hardware (and preferably a *very fast* dedicated
> network connecting them), acting to the outside world like a single
> machine. One of them dies, your clients never notice.
>
> It seems to work something like this: you have a lower level OS part on
> both of them that deals with the hardware, and you have a higher level OS
> that keeps itself synchronized via the internal net connection and acts
> like a single OS to the net and any user processes. Keeping the disks in
> synch reduces to simple mirroring in this scenario, btw. (Of course, it
> probably needs to be able to change ethernet hardware addresses to make
> the interface on one server look *exactly* like the interface on the other
> to the clients.)
>
> (SFTIII seems traditionally to have all sorts of convolutions connected to
> where you load stuff: in the lower level, individual parts, or in the
> common part. But that may be because NetWare doesn't have a kernel/user
> separation.)
>
> If you can do something like that, you can probably also do a distributed
> OS with very similar code.
>
> MfG Kai
>