Re: clustered MD

From: David Teigland
Date: Wed Jun 10 2015 - 13:05:47 EST


On Wed, Jun 10, 2015 at 11:23:25AM -0500, Goldwyn Rodrigues wrote:
> To start with, the goal of (basic) MD RAID1 is to keep the two
> mirrored devices consistent _all_ of the time. In case of a device
> failure, it should degrade the array, marking the failed device so
> that it can be (hot)removed/replaced. Now, take the same concepts to
> multiple nodes using the same MD-RAID1 device..

"multiple nodes using the same MD-RAID1 device" concurrently!? That's a
crucial piece information that really frames the entire topic. That needs
to be your very first point defining the purpose of this.

How would you use the same MD-RAID1 device concurrently on multiple nodes
without a cluster file system? Does this imply that your work is only
useful for the tiny segment of people who could use MD-RAID1 under a
cluster file system? There was a previous implementation of this in user
space called "cmirror", built on dm, which turned out to be quite useless,
and is being deprecated. Did you talk to cluster file system developers
and users to find out if this is worth doing? Or are you just hoping it
turns out to be worthwhile? That might be answered by the examples of
successful real-world usage that I asked about. We don't want to be
tied down with the long-term maintenance of something that isn't worth
it.


> >What's different about disks being on SAN that breaks data consistency vs
> >disks being locally attached? Where did the dlm come into the picture?
>
> There are multiple nodes using the same shared device. Different
> nodes would be writing their own data to the shared device, possibly
> using a shared filesystem such as ocfs2 on top of it. Each node
> maintains a bitmap to co-ordinate syncs between the two devices of
> the RAID. Since there are two devices, writes to the two devices can
> complete at different times and must be co-ordinated.

Thank you, this is the kind of technical detail that I'm looking for.
Separate bitmaps for each node sound like a much better design than the
cmirror design, which used a single shared bitmap (I argued for using a
single bitmap when cmirror was being designed).
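
To check that I follow the per-node bitmap idea, here is a toy
userspace sketch of how I picture it. Every name in it is invented;
none of this is the md on-disk format or API:

#include <stdint.h>
#include <stdio.h>

#define CHUNKS 64

static uint64_t dirty;             /* this node's write-intent bits */
static char leg[2][CHUNKS];        /* stand-ins for the two mirror legs */

/* Hypothetical stand-in for submitting a write to one mirror leg. */
static int write_leg(int l, int chunk, char data)
{
        leg[l][chunk] = data;
        return 1;                  /* pretend the I/O completed */
}

static void mirrored_write(int chunk, char data)
{
        int ok0, ok1;

        dirty |= 1ULL << chunk;    /* persist the bit before the data */
        ok0 = write_leg(0, chunk, data);  /* the two legs can finish */
        ok1 = write_leg(1, chunk, data);  /* at different times */
        if (ok0 && ok1)
                dirty &= ~(1ULL << chunk);  /* legs agree again */
        /* else leave the bit set; resync copies this chunk later */
}

int main(void)
{
        mirrored_write(3, 'a');
        printf("dirty bits after write: 0x%llx\n",
               (unsigned long long)dirty);
        return 0;
}

The property I'm assuming matters is that the bit is made persistent
before either data write is issued, and cleared only once both legs
have completed.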

Given that the cluster file system does locking to prevent concurrent
writes to the same blocks, you shouldn't need any locking in raid1 for
that. Could you elaborate on exactly when inter-node locking is needed,
i.e. which specific steps need to be coordinated?
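
To make the question concrete, here is my guess at the operations that
do need cluster-wide coordination, sketched with made-up lock helpers
rather than the real dlm calls:

#include <stdio.h>

/* Made-up cluster lock helpers; real code would go through the dlm. */
static void cluster_lock(const char *res)   { printf("lock   %s\n", res); }
static void cluster_unlock(const char *res) { printf("unlock %s\n", res); }

/*
 * Ordinary data writes should need no raid1 locking at all: the
 * cluster file system above already serializes writers of a block.
 */

/* Guess 1: marking a leg faulty must be agreed cluster-wide, so that
 * no node serves another read from that leg afterwards. */
static void fail_leg(int l)
{
        cluster_lock("md-metadata");
        printf("superblock: leg %d faulty, array degraded\n", l);
        cluster_unlock("md-metadata");
}

/* Guess 2: replaying a dead node's bitmap must have a single owner. */
static void recover_node(int node)
{
        cluster_lock("md-recovery");
        printf("resync chunks marked in node %d's bitmap\n", node);
        cluster_unlock("md-recovery");
}

int main(void)
{
        fail_leg(1);
        recover_node(2);
        return 0;
}

If the real list is longer or shorter than that, it's exactly the kind
of detail I'd like to see spelled out.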


> >>Device failure can be partial. Say, only node 1 sees that one of
> >>the devices has failed (link break). You need to "tell" the other
> >>nodes not to use the device and that the array is degraded.
> >
> >Why?
>
> Data consistency. A node that still "sees" the device as working,
> after it has failed on another node, will read stale data.

I still don't understand, but I suspect this will become clear from other
examples.
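
In the meantime, here is my best guess at the scenario you mean, as a
toy two-node sketch; everything in it is invented, so correct me where
it's wrong:

#include <stdio.h>

static char leg[2];              /* one block, mirrored on two legs */
static int  leg_ok[2][2] = { {1, 1}, {1, 1} };  /* [node][leg] local view */

static void node_write(int node, char data)
{
        /* each node writes only to the legs *it* thinks are healthy */
        for (int l = 0; l < 2; l++)
                if (leg_ok[node][l])
                        leg[l] = data;
}

static char node_read(int node)
{
        /* read from the first leg this node thinks is healthy */
        for (int l = 0; l < 2; l++)
                if (leg_ok[node][l])
                        return leg[l];
        return '?';
}

int main(void)
{
        node_write(0, 'a');      /* both legs hold 'a' */
        leg_ok[0][0] = 0;        /* node0 loses its link to leg 0 ... */
        node_write(0, 'b');      /* ... so its write lands only on leg 1 */
        /* node1 never heard that leg 0 failed and still reads from it: */
        printf("node1 reads '%c' (stale, should be 'b')\n", node_read(1));
        return 0;
}

If that's the failure you mean, the interesting part is how the "leg 0
has failed" message reaches node1 before its next read.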


> Different nodes will be writing to different
> blocks. So, if a node fails, whatever that node had not yet synced
> between the two devices must be completed by the node performing
> recovery. You need to provide a consistent view to all nodes.

This is getting closer to the kind of detail we need, but it's not quite
there yet. I think a full-blown example is probably required, e.g. in
terms of specific reads and writes:

1. node1 writes to block X
2. node2 ...

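For the recovery half specifically, here is a sketch of what I imagine
the surviving node doing with the dead node's leftover bitmap
(hypothetical names again):

#include <stdint.h>
#include <stdio.h>

#define CHUNKS 64

static char leg[2][CHUNKS];      /* the two mirror legs */

/* Chunks the dead node had marked in-flight but never cleared: the
 * two legs may disagree there, so copy leg 0 over leg 1 to make the
 * array consistent again before anyone trusts either leg. */
static void resync_from_bitmap(uint64_t dead_dirty)
{
        for (int c = 0; c < CHUNKS; c++) {
                if (dead_dirty & (1ULL << c)) {
                        leg[1][c] = leg[0][c];  /* pick one leg as winner */
                        printf("resynced chunk %d\n", c);
                }
        }
}

int main(void)
{
        leg[0][5] = 'x';             /* dead node wrote leg 0 ... */
        leg[1][5] = 0;               /* ... but died before leg 1 */
        resync_from_bitmap(1ULL << 5); /* bit 5 left set in its bitmap */
        return 0;
}
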

> Also, may I point you to linux/Documentation/md-cluster.txt?

That looks like it will be very helpful when I get to the point of
reviewing the implementation.
