Re: NFS Data CORRUPTION Between Linux and SunOS 5.5.1

Ben McCann (bmccann@indusriver.com)
Sun, 16 Aug 1998 14:11:26 -0400


Bill Hawes wrote:
>
> Ben McCann wrote:
>
> > 1. The corruption always begins on a 4096 byte aligned offset
> > in the file (i.e. on a page boundary).
> >
> > 2. 1, 2, or 3 bytes of ZERO are written at the beginning of the page
> > and the rest of the page is SHIFTED by that amount. (When we first
> > saw this we thought a SCSI controller was failing on the Sun
> > server but we've not had any problems with data written via
> > NFS to this Sun from a bunch of WinNT boxes we have here. And,
> > as I said earlier, 2.1.84 works fine).
>
> When you find the corruption in a particular page, is the following page also
> corrupted by having the data shifted over? Or are the extra bytes at the
> beginning of the page dropped from the end?

In every case but one, the last 1, 2, or 3 bytes of the smashed page were
gone (i.e. shifted off the end of this 4096 byte 'shift register')
and the next page was perfectly intact. There was one exception, when
TWO adjacent pages were shifted TOGETHER. But that only happened once
and I'm only 97% certain I actually saw it.
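
For anyone who wants to check a file for this pattern mechanically, here is a
minimal sketch, assuming a known-good copy of the file is available to compare
against. The function names and the 3-byte shift limit are illustrative
assumptions taken from the symptoms described above, not from any existing tool.

```python
PAGE_SIZE = 4096  # page size reported in this thread


def classify_page(good: bytes, bad: bytes, max_shift: int = 3) -> int:
    """Return 0 if the pages match, k (1..max_shift) if `bad` is `good`
    shifted right by k bytes behind k leading zero bytes (the last k bytes
    of `good` are lost), or -1 for any other kind of mismatch."""
    if good == bad:
        return 0
    for k in range(1, max_shift + 1):
        if bad[:k] == bytes(k) and bad[k:] == good[:len(good) - k]:
            return k
    return -1


def find_shifted_pages(good: bytes, bad: bytes, page_size: int = PAGE_SIZE):
    """Compare two file images page by page and return a list of
    (file_offset, shift) tuples, one per page that differs."""
    hits = []
    for off in range(0, min(len(good), len(bad)), page_size):
        k = classify_page(good[off:off + page_size], bad[off:off + page_size])
        if k != 0:
            hits.append((off, k))
    return hits
```

Reading both copies with open(path, "rb").read() and passing them to
find_shifted_pages() would list each damaged page and its shift. A shift of -1
marks a page that differs in some way other than the one described here.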

>
> > 3. The location of the smashed page or pages is random. The first
> > is usually 4 or 5 megabytes into the file (which is 11M long) but
> > occasionally it is only 56K into the file.
> >
> > 4. The number of corrupted blocks in a 11M file is small, like
> > 5 or 10.
> >
> > Hope this provides a clue. I couldn't fathom why the data was
> > SHIFTED because that implies the page was COPIED someplace.
> > How many places in the NFS logic COPY entire pages? Perhaps that
> > is a place to look.
>
> When you say that the corruption is random, does this mean that sometimes the
> file is written correctly? It would be very helpful if you could capture
> (e.g. tcpdump) two sessions writing the file, one with the corruption and one
> when it's correct.

The corruption is random only in that it smashes different pages on different
trials. The failure itself is essentially 100% repeatable.

I'll try to get some test runs with tcpdump next week. However, since I
can't get a write to the SunOS server to succeed, the two captures will have
to be NFS to a Linux server (good) versus NFS to the SunOS server (corrupted).
Is that information still usable?
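
One way such captures could be made easier to compare, assuming the test file
can be chosen freely, is to write a file in which every 16-byte record is the
ASCII form of its own offset. The file then contains no zero bytes at all, so
any inserted zeros or shifted records stand out in a hex dump. This is only a
sketch; the helper names and the record layout are hypothetical.

```python
RECORD = 16  # bytes per record; divides the 4096-byte page evenly


def make_pattern(size: int) -> bytes:
    """Build `size` bytes in which each 16-byte ASCII record holds its own
    file offset (15 zero-padded decimal digits plus a newline)."""
    records = (f"{off:015d}\n".encode("ascii") for off in range(0, size, RECORD))
    return b"".join(records)[:size]


def first_bad_record(data: bytes):
    """Return the offset of the first record that does not hold its own
    offset, or None if every complete record is intact."""
    for off in range(0, len(data) - RECORD + 1, RECORD):
        if data[off:off + RECORD] != f"{off:015d}\n".encode("ascii"):
            return off
    return None
```

For an 11M file like the one in these tests, make_pattern(11 * 1024 * 1024)
gives the data to write over NFS, and first_bad_record() run on the copy read
back from the server points at the first damaged record.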

-Ben McCann

-- 
Ben McCann                              Indus River Networks
                                        31 Nagog Park
                                        Acton, MA, 01720
email: bmccann@indusriver.com           web: www.indusriver.com 
phone: (978) 266-8140                   fax: (978) 266-8111
