[PATCH 00/16] DRBD: a block device for HA clusters

From: Philipp Reisner
Date: Thu Apr 30 2009 - 07:28:26 EST


Hi,

This is a repost of DRBD, to keep you updated about the ongoing
cleanups and improvements.

Patch set attached. Git tree available:
git pull git://git.drbd.org/linux-2.6-drbd.git drbd

We are looking for reviews!

Description

DRBD is a shared-nothing, synchronously replicated block device. It
is designed to serve as a building block for high availability
clusters and in this context, is a "drop-in" replacement for shared
storage. Simplistically, you could see it as a network RAID 1.

Although I use the "RAID1+NBD" metaphor myself, recent discussions
showed that one needs to understand the differences as well.
Here are just two examples:

1) Think of a two-node HA cluster. Node A is active ('primary' in DRBD
speak); it has the filesystem mounted and the application running. Node B is
in standby mode ('secondary' in DRBD speak).

We lose network connectivity; the primary node continues to run, but the
secondary no longer gets updates.

Then we have a complete power failure, and both nodes are down. Then they
power up the data center again, but at first they get only the power
circuit of node B up and running again.

Should node B offer the service right now?
(DRBD has configurable policies for that.)

Later on they manage to get node A up and running again. Now let's assume
node B was chosen to be the new primary node. What needs to be done?

Modifications on B since it became primary need to be resynced to A.
Modifications on A since it lost contact to B need to be taken out.

DRBD does that.

How do you fit that into a RAID1+NBD model? NBD is just a block
transport; it does not offer the ability to exchange dirty bitmaps or
data generation identifiers, nor does the RAID1 code have a concept of
that.
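
To make the first example concrete, here is a minimal sketch in Python of
the bookkeeping involved. It is an illustration only, not DRBD's
implementation: the function name blocks_to_resync and the representation
of a dirty bitmap as a list of booleans are assumptions made for this
example.

```python
def blocks_to_resync(dirty_on_a, dirty_on_b):
    """Return the block numbers that must be copied from B (new primary) to A.

    dirty_on_a[i] is True if A wrote block i after it lost contact to B.
    dirty_on_b[i] is True if B wrote block i after it became primary.
    Both lists are hypothetical stand-ins for per-node dirty bitmaps.
    """
    if len(dirty_on_a) != len(dirty_on_b):
        raise ValueError("bitmaps must cover the same number of blocks")
    # A block is out of sync if either side touched it: B's changes have to
    # reach A, and A's changes have to be overwritten with B's data.
    return [i for i, (a, b) in enumerate(zip(dirty_on_a, dirty_on_b)) if a or b]


dirty_on_a = [False, True, False, False, True, False]
dirty_on_b = [False, False, True, False, True, False]
resync = blocks_to_resync(dirty_on_a, dirty_on_b)
print(resync)  # [1, 2, 4]
```

The point of the sketch is that both nodes have to contribute their
bitmap before the resync can start, which a pure block transport has no
way to express.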

2) When using DRBD over low-bandwidth links and a resync has to run,
DRBD offers the option to do a "checksum based resync". Similar to rsync,
it at first exchanges only a checksum, and transmits the whole data
block only if the checksums differ.

That again is something that does not fit into the concepts of NBD or RAID1.
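
The checksum based resync can be sketched roughly as follows. This is a
simplified model in Python, not DRBD code: the block size, the use of
SHA-1 and the names digest and checksum_resync are assumptions chosen for
the example, and both "nodes" are plain in-memory lists.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def digest(block):
    # SHA-1 is an arbitrary choice here; any strong checksum would do.
    return hashlib.sha1(block).digest()


def checksum_resync(source, target):
    """Bring target in line with source; return the number of blocks sent.

    source and target are lists of bytes objects of equal length, standing
    in for the block devices on the two nodes.
    """
    sent = 0
    for i, src_block in enumerate(source):
        # Step 1: only the checksums are compared, so only a few bytes
        # per block would have to cross the slow link.
        if digest(src_block) != digest(target[i]):
            # Step 2: the checksums differ, so the whole block is sent.
            target[i] = src_block
            sent += 1
    return sent


source = [bytes([n]) * BLOCK_SIZE for n in range(8)]
target = list(source)
target[3] = b"\xff" * BLOCK_SIZE  # one block is out of date on the target
sent = checksum_resync(source, target)
print(sent, target == source)  # 1 True
```

Only one of the eight blocks is transmitted in full; for the other seven
the checksum alone is enough to establish that they are already in sync.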

DRBD can also be used in dual-Primary mode (device writable on both
nodes), which means it can exhibit shared disk semantics in a
shared-nothing cluster. Needless to say, on top of dual-Primary
DRBD a cluster file system is necessary to maintain
cache coherency.

More background on this can be found in this paper:
http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf

Beyond that, DRBD addresses various issues of cluster partitioning,
which the MD/NBD stack, to the best of our knowledge, does not
solve. The above-mentioned paper goes into some detail about that as
well.

DRBD can operate in synchronous mode or in asynchronous mode. I want
to point out that we guarantee not to violate a single possible
write-after-write dependency when writing on the standby node. More on
that can be found in this paper:
http://www.drbd.org/fileadmin/drbd/publications/drbd_lk9.pdf

Last but not least, DRBD offers background resynchronisation and keeps
an on-disk representation of the dirty bitmap up-to-date. A reasonable
tradeoff between the number of updates and resyncing more than needed
is implemented with the activity log.
More on that:
http://www.drbd.org/fileadmin/drbd/publications/drbd-activity-logging_v6.pdf
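
The tradeoff behind the activity log can be illustrated with a toy model
in Python. This is a sketch under stated assumptions, not the algorithm
from the paper: it assumes a bounded, least-recently-used set of "hot"
extents, one on-disk metadata update whenever an extent enters the set,
and a resync of every extent in the set after a crash. The class name
ActivityLog and all of its members are hypothetical.

```python
from collections import OrderedDict


class ActivityLog:
    """Toy model: a bounded LRU set of extents that may have in-flight writes."""

    def __init__(self, max_extents):
        if max_extents < 1:
            raise ValueError("max_extents must be at least 1")
        self.max_extents = max_extents
        self.hot = OrderedDict()   # extent number -> None, oldest first
        self.metadata_writes = 0   # simulated on-disk metadata updates

    def write(self, extent):
        if extent in self.hot:
            # Extent is already marked hot: no metadata update needed.
            self.hot.move_to_end(extent)
            return
        if len(self.hot) >= self.max_extents:
            # Evict the least recently used extent to make room.
            self.hot.popitem(last=False)
        self.hot[extent] = None
        self.metadata_writes += 1

    def extents_to_resync_after_crash(self):
        # Everything that was hot at crash time is treated as out of sync.
        return sorted(self.hot)


writes = [0, 0, 1, 0, 1, 2, 0, 3, 3, 0]
small = ActivityLog(max_extents=1)
large = ActivityLog(max_extents=4)
for extent in writes:
    small.write(extent)
    large.write(extent)
print(small.metadata_writes, small.extents_to_resync_after_crash())  # 8 [0]
print(large.metadata_writes, large.extents_to_resync_after_crash())  # 4 [0, 1, 2, 3]
```

In this model the small log needs twice as many metadata updates for the
same write pattern, while the large log marks four times as many extents
for resync after a crash.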

Changes since 2009-04-10

* Cleanup: Removed all CamelCase
* Cleanup: Replaced DRBD's own tracing stuff with regular tracepoints
* Cleanup: Removed ERR/INFO/ALERT ... macros, using dev_err/dev_info/... now
* Cleanup: Minor stuff, as suggested in feedback on LKML
* DRBD: Bitmap compression feature was finalised
* DRBD: new disable_sendpage parameter

Changes since the post on 2009-03-30, all triggered by reviews

* Improvements to Makefile and Kconfig
* Simplified definitions of bm_flags' bitnumbers
* Removed debugging aid

Changes since the post on 2009-03-23, from drbd-mainline

* Updated to the final drbd-8.3.1 code
* Optionally run-length encode bitmap transfers
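
For readers unfamiliar with the idea, run-length encoding a bitmap can be
sketched as below. This is a generic Python illustration and makes no
claim about DRBD's actual encoding or wire format; the function names
rle_encode and rle_decode and the (first_bit, runs) representation are
assumptions for the example.

```python
def rle_encode(bits):
    """Encode a list of 0/1 values as (first_bit, list of run lengths)."""
    if not bits:
        return 0, []
    runs = []
    current, length = bits[0], 0
    for bit in bits:
        if bit == current:
            length += 1
        else:
            # The run of 'current' ends here; start counting the next run.
            runs.append(length)
            current, length = bit, 1
    runs.append(length)
    return bits[0], runs


def rle_decode(first_bit, runs):
    """Inverse of rle_encode: runs alternate between the two bit values."""
    bits = []
    bit = first_bit
    for length in runs:
        bits.extend([bit] * length)
        bit ^= 1
    return bits


bitmap = [0] * 10 + [1] * 3 + [0] * 7
first_bit, runs = rle_encode(bitmap)
print(first_bit, runs)  # 0 [10, 3, 7]
```

A mostly clean bitmap with a few dirty regions collapses into a handful
of run lengths, which is what makes this attractive for bitmap transfers.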

Changes since the post on 2009-03-23, triggered by reviews

* Using the latest proc_create() now
* Moved the allocation of md_io_tmpp to attach/detach out of drbd_md_sync_page_io()
* Removed the mode selection comments for emacs
* Removed DRBD_ratelimit()

cheers,
Phil