Point of thread: two problems, described in detail below. One, NCQ in Linux causes failures when the drives are used in a RAID configuration; and two, something about how Linux interacts with the drives generates lots of reported errors, yet when I run the WD tools on the disks, they show no errors at all.
If anyone has, or would like me to run, any debugging/patches/etc. on this system, feel free to suggest/send me things to try out. After I put the VRs in a test system, I left NCQ enabled and made a 10-disk RAID5 to see how fast I could get it to fail. As a disk benchmark/stress test I ran bonnie++, shown below:
Having NCQ enabled makes it fail very easily, so for the next test I will repeat this one with NCQ disabled. Then I want to re-run the test with RAID6.
bonnie++ -d /r1/test -s 1000G -m p63 -n 16:100000:16:64
$ df -h
/dev/md3 2.5T 5.5M 2.5T 1% /r1
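For anyone who wants to reproduce the setup, it could be recreated along these lines. This is only a sketch that prints the commands rather than running them; the device names, the ext3 filesystem, and the mdadm defaults are my assumptions, not stated in this post:

```shell
#!/bin/sh
# Dry-run sketch of recreating the test array: print the commands that
# would build a 10-disk RAID5 md device, mount it at /r1, and run the
# same bonnie++ stress test. Device names and ext3 are assumptions.
run() { echo "$@"; }   # change 'echo "$@"' to '"$@"' to really execute

run mdadm --create /dev/md3 --level=5 --raid-devices=10 /dev/sd[b-k]
run mkfs -t ext3 /dev/md3
run mkdir -p /r1
run mount /dev/md3 /r1
run bonnie++ -d /r1/test -s 1000G -m p63 -n 16:100000:16:64
```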
And the results? Two disk "failures" according to md/Linux within a few hours as shown below:
Note: the NCQ-related errors are the ones I keep talking about. If you use
NCQ and Linux in a RAID environment with WD drives, well, good luck.
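For anyone who wants to try the same workaround, NCQ can be switched off per drive from Linux by forcing the SATA queue depth down to 1 via sysfs. A minimal sketch, assuming the member disks are sda/sdb/sdc (adjust the list to your array; writing to sysfs needs root):

```shell
#!/bin/sh
# Sketch: disable NCQ on each md member disk by setting its SATA queue
# depth to 1 via sysfs (depth 1 effectively turns NCQ off).
set_qd() {
    p=/sys/block/$1/device/queue_depth
    if [ -w "$p" ]; then
        echo 1 > "$p" && echo "$1: NCQ off (queue_depth=1)"
    else
        echo "$1: cannot write $p (need root, or disk absent)"
    fi
}

# Example disk list; replace with the actual array members.
for d in sda sdb sdc; do set_qd "$d"; done
```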
Two disks failed out of the RAID5, and I currently cannot even 'see' one of the drives with smartctl; I will reboot the host and check sde again.
After a reboot, the drive comes up with no errors, which really makes one wonder where/what the bugs are. There are two I can see:
1. NCQ issue on at least WD drives in Linux in SW md/RAID
2. The VelociRaptors/other disks report all kinds of sector errors etc., but when you use WD's 11.x disk tools program and run all of its tests, it says the disks have no problems whatsoever! The SMART statistics confirm this. TLER is currently enabled on all disks for the duration of these tests.
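Since TLER came up: on drives that support SCT Error Recovery Control, smartctl can query or set it per disk. A sketch that only prints the commands it would run; the 7.0-second limit (smartctl takes deciseconds, so 70,70 for read/write) and the device names are my assumptions:

```shell
#!/bin/sh
# Sketch: set TLER (SCT Error Recovery Control) to 7.0 s for reads and
# writes on each member disk. Printed only, so nothing is touched.
tler() { echo "smartctl -l scterc,70,70 $1"; }  # remove the echo to actually run it

# Example device list; replace with the actual array members.
for d in /dev/sda /dev/sdb /dev/sde; do
    tler "$d"
done
```

Running plain `smartctl -l scterc /dev/sdX` (no values) just reports the current setting, which is an easy way to confirm TLER really is on for all disks.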