Re: possible nfsv3 write corruption

From: Pallissard, Matthew
Date: Fri Feb 28 2020 - 11:38:40 EST


I'll just bump this once before letting it slip into the ether.

Matt Pallissard

On 2020-02-27T08:28:43, Pallissard, Matthew wrote:
>
> Forgive me if this is the wrong list.
>
> Ok, I'm seeing very infrequent data corruption on write that appears to be limited to NFSv3 async mounts. I have not tested NFSv4 yet. I _think_ I've narrowed it down to kernels in the range 5.1.4 <= X < 5.5.0 (possibly earlier than 5.1.4). Some users reported random data corruption; a bit of testing shows that it's reproducible and the corruption is nearly identical every time.
>
> I'd like to get to the bottom of this so I can guarantee that a kernel upgrade will resolve the issue.
>
> What winds up happening is that roughly every several hundred GiB written, the first half of one 64-bit segment comes back corrupted. My test writes a few GiB, alternating between 64-bit words of all `0`s and all `1`s, then reads the file back and checks the contents. Re-reading the file shows the same corruption, so it's corrupted on write, not on read. Here is some example output from the test; a stripped-down sketch of the test itself follows the output.
>
> > 2020-02-14 11:04:34 crit found mis-match on word segment 11911168 / 33554432!
> > 2020-02-14 11:04:34 crit found mis-match on byte 7, 188 != 255
> > 2020-02-14 11:04:34 crit found mis-match on byte 6, 0 != 255
> > 2020-02-14 11:04:34 crit found mis-match on byte 5, 16 != 255
> > 2020-02-14 11:04:34 crit found mis-match on byte 4, 128 != 255
> > 2020-02-14 11:04:34 crit 1011110000000000000100001000000011111111111111111111111111111111
>
> > 2020-02-14 13:38:11 crit found mis-match on word segment 1982464 / 33554432!
> > 2020-02-14 13:38:11 crit found mis-match on byte 7, 188 != 255
> > 2020-02-14 13:38:11 crit found mis-match on byte 6, 0 != 255
> > 2020-02-14 13:38:11 crit found mis-match on byte 5, 16 != 255
> > 2020-02-14 13:38:11 crit found mis-match on byte 4, 128 != 255
> > 2020-02-14 13:38:11 crit 1011110000000000000100001000000011111111111111111111111111111111
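>
> For reference, here is a stripped-down sketch of the kind of test I'm running. This is not the exact program; the path, chunk size, and word count below are illustrative only, and the real test does more logging.
>
>     #include <stdio.h>
>     #include <stdint.h>
>     #include <unistd.h>
>
>     #define NWORDS (1UL << 25)   /* illustrative word count, not the real test's size */
>     #define CHUNK  4096          /* 64-bit words per read/write call */
>
>     int main(int argc, char **argv)
>     {
>         const char *path = argc > 1 ? argv[1] : "testfile";  /* hypothetical path */
>         uint64_t buf[CHUNK];
>         size_t w, i;
>         FILE *fp;
>
>         /* write pass: alternate 64-bit words of all 0s and all 1s */
>         unlink(path);
>         fp = fopen(path, "w");
>         if (!fp) { perror("fopen"); return 1; }
>         for (w = 0; w < NWORDS; w += CHUNK) {
>             for (i = 0; i < CHUNK; i++)
>                 buf[i] = ((w + i) & 1) ? UINT64_MAX : 0;
>             if (fwrite(buf, sizeof(uint64_t), CHUNK, fp) != CHUNK) { perror("fwrite"); return 1; }
>         }
>         fclose(fp);
>
>         /* read pass: verify every word and report any mismatched bytes */
>         fp = fopen(path, "r");
>         if (!fp) { perror("fopen"); return 1; }
>         for (w = 0; w < NWORDS; w += CHUNK) {
>             if (fread(buf, sizeof(uint64_t), CHUNK, fp) != CHUNK) { perror("fread"); return 1; }
>             for (i = 0; i < CHUNK; i++) {
>                 uint64_t expect = ((w + i) & 1) ? UINT64_MAX : 0;
>                 if (buf[i] == expect)
>                     continue;
>                 fprintf(stderr, "found mis-match on word segment %zu / %lu!\n", w + i, NWORDS);
>                 for (int b = 7; b >= 0; b--) {
>                     unsigned got  = (unsigned)((buf[i]  >> (8 * b)) & 0xff);
>                     unsigned want = (unsigned)((expect >> (8 * b)) & 0xff);
>                     if (got != want)
>                         fprintf(stderr, "found mis-match on byte %d, %u != %u\n", b, got, want);
>                 }
>             }
>         }
>         fclose(fp);
>         return 0;
>     }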
>
>
> Knowns:
>
> * does not appear to happen on the CentOS/EL 3.10-series kernel
>
> * does not appear to happen on a 5.5-series kernel
> * I'm re-running all my tests now to confirm this.
>
> * not hardware-dependent
>
> * not processor-dependent
> * I tested three different Intel processors
>
> * appears to only happen on NFS v3 async mounts
> * local disk and `-o sync` NFS v3 mounts have been tested and do not show the corruption
>
> * It happens on random 64-bit segments
>
> * It's *always* the same 4 bytes (7, 6, 5, 4) that are corrupted
>
> * While the corrupted byte values are often identical between occurrences, they are not always
> * the same corruption pattern can appear on separate computers.
>
> * It's *always* on words that were written as all `1`s <- this is the part I find most interesting
>
> * explicitly calling `fflush` and `sync` (or not) has no effect on the results (see the snippet after this list)
>
> * usually takes ~80-2000 GiB of writes to reproduce; occasionally more or less than that, but rarely
> * I've been writing 2 GiB files
> * some runs never hit the corruption at all.
>
> * I've yet to see more than one corrupted segment in a file.
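>
> The "explicitly call `fflush` and `sync`" variant mentioned above just adds something like the following right before the fclose() at the end of the write pass in the sketch earlier; with or without it, the results were the same:
>
>     fflush(fp);   /* push the stdio buffer down to the kernel */
>     sync();       /* then ask the kernel to flush dirty data (sync(2), from unistd.h) */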
>
>
> A little bit about the build/run environments:
>
> software
> * CentOS 7
> * CentOS glibc 2.17
> * clang 9 / lld
>
> hardware
> * Dell PowerEdge R620
> * Dell PowerEdge C6320
> * Dell PowerEdge C6420
> * Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
> * Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
> * Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
>
> * I compiled the test locally on every box, and I also ran every compiled binary on every box; neither seemed to affect the results.
> * I don't have a tcpdump of this yet. I'm hoping to get that started before the end of the week.
> * I read and write to the same file every time, unlinking it before writing again
> * I have not tried dropping the cache between any of the steps.
> * I have engaged our storage vendor to see what they have to say. They're pretty good at getting useful metrics and insight, so if there is anything I should have them gather server-side, please let me know.
>
>
> If anyone has any insight, or additional testing I can perform, I would *greatly* appreciate it. I would be thrilled if this turned out to be some dumb configuration option or other operational mistake on my end.
>
>
> Thank you for your time.
>
> Matt Pallissard