Re: Q: enterprise-readiness of MD-RAID (md: kicking non-fresh dm-14 from array!)

From: NeilBrown
Date: Sun May 20 2012 - 20:35:58 EST



Hi,
I'd just like to say up front that I don't think "enterprise" means anything
from a technical perspective.
Maybe it means "more willing to spend on insurance" or "more bureaucracy
here", but I don't think either of those are really relevant here.

We all value our data, and we all want to minimise costs. Different people
will resolve the tension in different ways and buy different hardware, but
md should work with all of them.

So let's just talk about "data readiness" - is md ready for you to trust your
data to - whatever that data is.

On Wed, 16 May 2012 14:29:56 +0200 "Ulrich Windl"
<Ulrich.Windl@xxxxxxxxxxxxxxxxxxxx> wrote:

> Hi!
>
> I have been using disk mirroring with HP-UX and LVM in an enterprise environment for about 20 years. Not too long ago I started to use disk mirroring with Linux and MD-RAID.
>
> Unfortunately I found a lot of bugs (e.g. mdadm being unable to set up the correct bitmaps) and inefficiencies. Recently I found out that some of our RAID1 arrays are not mirrored any more, and even during boot the kernel does not try to resynchronize them.
>

Did you report them? Were they fixed?
Which release of linux and mdadm were you using?
There will always be bugs. If you want to minimise bugs, your best bet is to
pay some distro that does more testing and stabilisation.

If your system is configured properly, then you should get email every day
when any array is not fully synchronised.
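
For reference, a minimal sketch of that configuration (the mail address is
just a placeholder; the config path is /etc/mdadm.conf on SUSE and
/etc/mdadm/mdadm.conf on Debian-style systems):

  # /etc/mdadm.conf - where alert mail should go
  MAILADDR root@example.com

  # run the monitor as a daemon; it mails on Fail/DegradedArray events
  mdadm --monitor --scan --daemonise --delay=1800

  # or a one-shot pass from cron for a daily "is everything in sync" check
  mdadm --monitor --scan --oneshot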


> The message reads to me like "I found out that one of the disks has obsolete data on it; let's throw it out from the RAID". Naturally my expectations were that the kernel would resynchronize the stale disk blocks.

You are misinterpreting the message.
What it really says is "This device looks like it was ejected from the array,
presumably because it reported an error. I don't know if you want to trust a
device that has produced errors so I'm not even going to try including it
into the array. You might want to do something about that".

The "something" might "find out why it produced an error and fix it" or
"replace it" or "just add it anyway, I don't care if the disk is a bit dodgy".


>
> <6>[ 15.248125] md: md0 stopped.
> <6>[ 15.249075] md: bind<dm-14>
> <6>[ 15.249290] md: bind<dm-16>
> <4>[ 15.249409] md: kicking non-fresh dm-14 from array!
> <6>[ 15.249525] md: unbind<dm-14>
> <6>[ 15.293560] md: export_rdev(dm-14)
> <6>[ 15.296317] md: raid1 personality registered for level 1
> <6>[ 15.296814] raid1: raid set md0 active with 1 out of 2 mirrors
> <6>[ 15.325348] md0: bitmap initialized from disk: read 8/8 pages, set 97446 bits
> <6>[ 15.325461] created bitmap (126 pages) for device md0
> <6>[ 15.325781] md0: detected capacity change from 0 to 537944588288
>
> On another occasion we had the case that after a hard reset (from cluster) one of our biggest RAIDs (several hundred GB) was resynchronized fully, even though it had a bitmap. After a little reading I got the impression that MD-RAID1 always copies disk0 to disk1 if there are mismatches. My expectation was that the more recent disk would be copied to the outdated disk. Note that even if writes to both disks are issued to the queues simultaneously, it's not clear (especially with SAN storage and after a reset situation) which of the disks got the write done first.

I am surprised that a full sync happened when a bitmap was present. I would
need details to help you understand what actually happened and why.
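
The sort of details that help are the superblock and bitmap state of each
member, e.g. (a sketch; substitute your real member devices):

  mdadm --detail /dev/md0              # array state, bitmap presence
  mdadm --examine /dev/dm-14           # per-member superblock, event count
  mdadm --examine-bitmap /dev/dm-14    # what the write-intent bitmap recorded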

You seem to be saying that you expect md/raid1 to copy the newer data even
though you know it is not possible to know which is the newer data. ....

There is a very common misunderstanding here. Between the time when you start
writing to a device and the time when that write reports that it is
complete, there is no "correct" or "best" value for the data in the target
block. Both the old and the new are equally "good". Any filesystem or other
client of the storage must be able to correctly handle either value being
returned by subsequent reads after a crash, and I believe they do.
There is no credible reason to prefer the "newer" data.
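
If you want to see how far two raid1 copies have drifted after a crash, the
usual tool is a scrub through the md sysfs interface (a sketch, using md0 as
an example; for raid1 a "repair" writes the content of the first device over
the others wherever they differ):

  echo check  > /sys/block/md0/md/sync_action   # read and compare all copies
  cat /sys/block/md0/md/mismatch_cnt            # sectors that differed
  echo repair > /sys/block/md0/md/sync_action   # reconcile the copies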


>
> My latest experience was with SLES11 SP1 which may not have the latest code bits.

It is a little old, but not very. It should be fairly stable. If you have
problems with SLES11-SP1 and have a maintenance contract with SUSE, I suggest
you log an issue.

>
> If anybody wants to share his/her wisdom on that (the enterprise-readiness of MD-RAID), please reply to the CC: as well, as I'm not subscribed to the kernel list.

Yes, there are bugs from time to time, but if you manage your arrays sensibly
and have regular alerts configured with "mdadm --monitor", all should be well.
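
If you want to be sure the alerting actually works, mdadm can send a test
message for each array it finds (a quick sketch):

  mdadm --monitor --scan --oneshot --test   # mails a TestMessage per array
  mdadm --detail --scan                     # quick overview of all arrays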

>
> BTW: I made some performance comparisons between our SAN storage ("hardware") and MD-RAID ("software") regarding RAID levels:
>
> hardware seq. read: RAID0=100%, RAID1=67%, RAID5=71%, RAID6=72%
> hardware seq. write: RAID0=100%, RAID1=67%, RAID5=64%, RAID6=42%
>
> software seq. read: RAID0=100%, RAID1=44%, RAID5=36%, RAID6=not done
> software seq. write: RAID0=100%, RAID1=48%, RAID5=19%, RAID6=not done
>
> Note: I was using two independent SAN storage units for the RAID1 tests; for the higher levels I had to reuse one of those SAN storage units.
>
> Measuring LVM overhead I found a penalty of 27% when reading, but a 48% boost for writing. I never quite understood why ;-)
> Comparing I/O schedulers "cfq" with "noop", I found that the latter improved throughput by about 10% to 25%.
>
> Now if you combine "cfq", MD-RAID5 and LVM, you'll see that Linux is very effective at taking your performance away ;-)

cfq is probably not a good choice for a SAN. noop is definitely best there.
RAID5 is obviously slower than native access, but also safer.
I cannot comment on LVM.
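
For reference, switching the scheduler is a runtime sysfs toggle (sdX is a
placeholder; to make it persistent use your distro's mechanism, e.g. the
elevator= boot parameter on SLES11):

  cat /sys/block/sdX/queue/scheduler          # current scheduler in brackets
  echo noop > /sys/block/sdX/queue/scheduler  # switch this device to noop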

In summary: md works for many people. If it does not work for you, I am
sorry.
If you have specific issues or questions, I suggest you report them with
details, and you may well get detailed answers.

NeilBrown
