Re: [PATCH 00/16] DRBD: a block device for HA clusters

From: Philipp Reisner
Date: Tue May 05 2009 - 17:44:43 EST


On Tuesday 05 May 2009 19:05:46, James Bottomley wrote:
> On Tue, 2009-05-05 at 17:56 +0200, Philipp Reisner wrote:
> > On Tuesday 05 May 2009 16:09:45 James Bottomley wrote:
> > > On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > > > > When you do asynchronous replication, how do you ensure that
> > > > > > implicit write-after-write dependencies in the stream of writes
> > > > > > you get from the file system above, are not violated on the
> > > > > > secondary ?
> >
> > [...]
> >
> > > > > The way nbd does it (in the updated tools is to use DIRECT_IO and
> > > > > fsync).
> >
> > [...]
> >
> > > I think you'll find the dio/fsync method above actually does solve all
> > > of these issues (mainly because it enforces the semantics from top to
> > > bottom in the stack). I agree one could use more elaborate semantics
> > > like you do for drbd, but since the simple ones worked efficiently for
> > > md/nbd, there didn't seem to be much point.
> >
> > Do I get it right, that you enforce the exact same write order on the
> > secondary node as the stream of writes was coming in on the primary?
>
> Um, yes ... that's the text book way of doing replication: write order
> preservation.
>
> > Using either DIRECT_IO or fsync() calls ?
>
> Yes.
>
> > Is DIRECT_IO/fsync() enabled by default ?
>
> I'd have to look at the tools (and, unfortunately, there are many
> variants) but it was certainly true in the variant I used.

[...]
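For concreteness, the dio/fsync scheme described above amounts to draining each write before issuing the next one. The helper below is only a user-space sketch of that idea under my own naming, not the actual nbd tools code; O_DIRECT is omitted because it needs aligned buffers, and fsync() alone is enough to show the ordering property:

```python
import os

def apply_in_order(path, writes):
    """Replay a stream of (offset, data) records strictly in arrival
    order, making each write durable before the next one is issued.
    No later write can reach the disk before an earlier one, so all
    write-after-write dependencies are trivially preserved."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for offset, data in writes:
            os.pwrite(fd, data, offset)
            os.fsync(fd)  # drain: write N is stable before write N+1 starts
    finally:
        os.close(fd)
```

The cost is obvious: every record pays a full synchronous flush, which is exactly the IO draining discussed below.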

My experience is that enforcing the exact same write order as on the primary
by means of IO draining kills performance. Of course, things are changing in
a world where everybody uses a RAID controller with a gigabyte of battery-
backed RAM, but there are certainly embedded users who run this replication
technology on top of plain hard disks.

What I want to point out is that DRBD has the capability to allow limited
reordering on the secondary, to achieve the highest possible performance
while still maintaining these implicit write-after-write dependencies.
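The limited-reordering idea can be illustrated with epochs separated by barriers: writes within one epoch carry no ordering requirement against each other, while an implicit write-after-write dependency always crosses a barrier. DRBD implements this with write barriers in its replication protocol; the helper below is only a user-space sketch of the epoch concept, with names of my own choosing:

```python
import os

def apply_with_epochs(path, records):
    """Replay a replication stream that is split into epochs by
    ("barrier",) records. Writes inside one epoch may be issued in
    any order (sorting by offset stands in for an elevator), but an
    epoch is started only after the previous one is safely on disk.
    Two dependent writes are never in the same epoch, so reordering
    within an epoch cannot violate a write-after-write dependency."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    epoch = []
    try:
        for rec in records:
            if rec[0] == "write":
                _, offset, data = rec
                epoch.append((offset, data))
            elif rec[0] == "barrier":
                for offset, data in sorted(epoch):  # any order is legal here
                    os.pwrite(fd, data, offset)
                os.fsync(fd)  # epoch boundary: durable before the next epoch
                epoch = []
        for offset, data in sorted(epoch):  # flush a trailing epoch, too
            os.pwrite(fd, data, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Only one fsync() per epoch is needed instead of one per write, which is where the performance comes back.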

> I also think you're not quite looking at the important case: if you
> think about it, the real necessity for the ordered domain is the
> network, not so much the actual secondary server. The reason is that
> it's very hard to find a failure case where the write order on the
> secondary from the network tap to disk actually matters (as long as the
> flight into the network tap was in order). The standard failure is of
> the primary, not the secondary, so the network stream stops and so does
> the secondary writing: as long as we guarantee to stop at a consistent
> point in flight, everything works. If the secondary fails while the
> primary is still up, that's just a standard replay to bring the
> secondary back into replication, so the issue doesn't arise there
> either.

A common power failure, hitting both nodes at once, is a possible failure
scenario. Since we aim for an HA system, we cannot ignore it. No user will
buy: "Well, in most scenarios we do it correctly; in the unlikely case of a
common power failure, where you also lose your former primary, you might end
up with a secondary that has the last write but not the write before it!"

Correctness before efficiency!

But I will stop this discussion now. Proving that DRBD handles some details
better than the md/nbd approach is pointless now that we have agreed that
DRBD can get merged as a driver. We will focus on the necessary code
cleanups.

-Phil

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/