Re: [PATCH 00/16] DRBD: a block device for HA clusters

From: James Bottomley
Date: Tue May 05 2009 - 13:06:02 EST


On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > When you do asynchronous replication, how do you ensure that implicit
> > > > > write-after-write dependencies in the stream of writes you get from
> > > > > the file system above, are not violated on the secondary ?
> > > >
> [...]
> > > > The way nbd does it (in the updated tools) is to use DIRECT_IO and
> > > > fsync.
> > >
> [...]
> > I think you'll find the dio/fsync method above actually does solve all
> > of these issues (mainly because it enforces the semantics from top to
> > bottom in the stack). I agree one could use more elaborate semantics
> > like you do for drbd, but since the simple ones worked efficiently for
> > md/nbd, there didn't seem to be much point.
> >
>
> Do I get it right that you enforce the exact same write order on the
> secondary node as the stream of writes was coming in on the primary?

Um, yes ... that's the textbook way of doing replication: write-order
preservation.

> Using either DIRECT_IO or fsync() calls ?

Yes.
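
Roughly, the receiving side amounts to something like the sketch below
(illustrative only, not the actual nbd-server code; use_directio,
open_backing_dev() and apply_write() are made-up names): either the
backing device is opened with O_DIRECT, or each write is fsync()ed
before the next request is taken off the wire, so the on-disk order on
the secondary matches the arrival order.

/*
 * Illustrative sketch only -- not the actual nbd-server code.
 * Each incoming write is applied (and made stable) before the next
 * request is taken off the network, so the on-disk order on the
 * secondary matches the arrival order.
 */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

static int use_directio = 1;	/* hypothetical knob; the tool variants differ */

static int open_backing_dev(const char *path)
{
	return open(path, O_RDWR | (use_directio ? O_DIRECT : 0));
}

static int apply_write(int devfd, const void *buf, size_t len, off_t off)
{
	/* with O_DIRECT, buf/len/off must be suitably aligned */
	if (pwrite(devfd, buf, len, off) != (ssize_t)len)
		return -1;
	/* without O_DIRECT, flush before acking / taking the next write */
	if (!use_directio && fsync(devfd) < 0)
		return -1;
	return 0;
}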

> Is DIRECT_IO/fsync() enabled by default ?

I'd have to look at the tools (and, unfortunately, there are many
variants), but it was certainly true in the variant I used. However, the
current main use case of md/nbd is as a secondary transaction log to allow
rollback anyway, so the incoming network stream is stored on the device
in write order and the problem doesn't arise.
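
(The log variant needs even less machinery -- again only a sketch, with
an invented record header and no short-read or endianness handling: each
request is appended to the log in exactly the order it comes off the
socket and made stable before it is acknowledged, so replay order equals
arrival order.)

/*
 * Sketch of the transaction-log case (record format invented for
 * illustration): requests are appended to the log strictly in arrival
 * order, so a later replay/rollback walks them in the order the
 * primary issued them.
 */
#include <stdint.h>
#include <unistd.h>

struct log_rec {
	uint64_t offset;	/* target offset on the replicated device */
	uint32_t len;		/* payload length */
} __attribute__((packed));

static int log_one_request(int logfd, int sockfd)
{
	struct log_rec rec;
	char buf[4096];		/* toy-sized payload buffer */

	if (read(sockfd, &rec, sizeof(rec)) != (ssize_t)sizeof(rec))
		return -1;
	if (rec.len > sizeof(buf) ||
	    read(sockfd, buf, rec.len) != (ssize_t)rec.len)
		return -1;

	/* append header + payload, then make it stable before acking */
	if (write(logfd, &rec, sizeof(rec)) != (ssize_t)sizeof(rec))
		return -1;
	if (write(logfd, buf, rec.len) != (ssize_t)rec.len)
		return -1;
	return fsync(logfd);
}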

I also think you're not quite looking at the important case: if you
think about it, the real necessity for the ordered domain is the
network, not so much the actual secondary server. The reason is that
it's very hard to find a failure case where the write order on the
secondary from the network tap to disk actually matters (as long as the
writes went into the network tap in order). The standard failure is of
the primary, not the secondary, so the network stream stops and so does
the secondary's writing: as long as we guarantee to stop at a consistent
point in flight, everything works. If the secondary fails while the
primary is still up, that's just a standard replay to bring the
secondary back into replication, so the issue doesn't arise there
either.

The case where it does matter is failure of the primary followed almost
immediately by failure of the secondary before the in-flight network
stream completes; there the issue is guaranteeing that the secondary can
still be brought back up consistently. However, this is an incredibly
rare failure scenario given the tight race timings.

James

