NFS Client Problems with 2.2.12

Olaf Flebbe (o.flebbe@science-computing.de)
Wed, 03 Nov 1999 14:16:11 +0100


Hi,

We have a nasty problem with the Linux NFS client code in the late
2.2.1x kernels. It results in file corruption.

As far as I have investigated it, it is an AIX NFS server bug, but maybe
there is a way around it without losing the attribute cache. (I.e.
mounting with noac works around the bug.)

Now some more detailed information:

+ AIX NFS server, Linux 2.2.12 mounting a filesystem from AIX.

+ Compiling a special FORTRAN program with the Portland Group compiler
(commercial).

+ The resulting binary differs from a binary which is compiled
locally. (If you compare them, some patterns are shifted by one, but
the total length of the executable is the same.) The binary dumps
core, btw.

+ Mounting the filesystem with -o noac, the executable is O.K.

+ SGI (IRIX 6.2 and 6.5) as NFS Server: No Problem

+ Solaris may have a problem, too.
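
To locate the shifted patterns in the two binaries, a comparison along
the following lines could be used. This is only a rough sketch: the
function names diff_regions and shift_of are made up for illustration,
and I am assuming that "shifted by one" means one byte.

```python
def diff_regions(good: bytes, bad: bytes):
    """Return half-open (start, end) ranges where two equal-length inputs differ."""
    if len(good) != len(bad):
        raise ValueError("inputs must have the same length")
    regions = []
    start = None
    for i, (x, y) in enumerate(zip(good, bad)):
        if x != y:
            if start is None:
                start = i          # a differing run begins here
        elif start is not None:
            regions.append((start, i))
            start = None
    if start is not None:          # run extends to the end of the file
        regions.append((start, len(good)))
    return regions


def shift_of(good: bytes, bad: bytes, start: int, end: int):
    """Return +1 or -1 if bad[start:end] equals good moved by that many bytes.

    Assumption: the corruption is a one-byte shift. Returns None otherwise.
    """
    for shift in (1, -1):
        lo, hi = start + shift, end + shift
        if lo >= 0 and hi <= len(good) and bad[start:end] == good[lo:hi]:
            return shift
    return None
```

One would read both executables with open(name, "rb").read(), call
diff_regions() on them, and then shift_of() on each reported range.
Bytes that happen to coincide inside a shifted area split it into
several ranges, so the result is a hint, not a proof.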

Now some more details on the compile step:

+ Compiling the executable consists of several steps: one of them
generates assembler, and the last step is the linker ld.

+ The object file generated by the assembler is O.K., but the a.out
is not, i.e. it is the link step that fails.

+ If one starts by mounting the AIX filesystem and invokes only the
linker pass (the object was created before), one can _not_ trigger the
bug. Copying the object is not sufficient. Looks like a caching bug, but
on which side?

A deeper look into the link step:

+ Analysing the link step with strace, one can see that it repeats the
following sequence many times:
  write
  random seek
  read

It reads back portions which were written before, in order to backpatch
some code. So far O.K.

But if one compares the data which was written before with the data
read afterwards, sometimes there is a mismatch! There is an off-by-one
shift in it.

(For example: seek to 1024, write 1,2,3,4,5, do some other fancy writes
at other places, and then seek to 1024 again; it reads 2,3,4,5!)
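
The write / random seek / read pattern could be imitated without the
compiler by something like the sketch below. It is not what ld really
does; the function name, the chunk size and the other parameters are
my own assumptions, and I have no evidence that it triggers the bug.
The file is opened unbuffered so that each call maps to one system
call, as seen in strace.

```python
import random


def write_seek_read_check(path, rounds=200, file_size=64 * 1024,
                          chunk=16, seed=0):
    """Write chunks at random offsets and read earlier ones back.

    Returns a list of (offset, expected, actual) tuples, one for each
    read that did not return the bytes last written at that offset.
    """
    rng = random.Random(seed)
    expected = {}      # offset -> bytes most recently written there
    offsets = []       # every offset written so far (may repeat)
    mismatches = []
    # buffering=0: no user-space buffer between us and the kernel
    with open(path, "w+b", buffering=0) as f:
        f.truncate(file_size)
        for _ in range(rounds):
            # chunk-aligned offsets, so writes never partially overlap
            offset = rng.randrange(file_size // chunk) * chunk
            data = bytes(rng.randrange(256) for _ in range(chunk))
            f.seek(offset)
            f.write(data)
            expected[offset] = data
            offsets.append(offset)
            # read back some earlier write, like the backpatching reads
            check = rng.choice(offsets)
            f.seek(check)
            actual = f.read(chunk)
            if actual != expected[check]:
                mismatches.append((check, expected[check], actual))
    return mismatches
```

Called with a path on the NFS mount, an empty result means no mismatch
was seen. The reads may be satisfied from the client's cache, so a
clean run does not prove that the server is fine.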

Is it Linux or AIX???

+ Now I did a hexdump of the NFS traffic with tcpdump. (It is a little
bit hard to read, because one has to skip the RPC and other headers,
but one can find some patterns.)

I am quite sure that indeed the NFS traffic over the wire reads back the
wrong data.

Now I am at the end ;-( I see no chance of explaining this to AIX
support, so I did not try. Is someone willing to dig into it?
Unfortunately IBM is supporting 2.2.5(?) (on a Netfinity server, which
is present in this environment), but one can _not_ reproduce it with
that kernel.

Please answer with a CC, because I am not on the kernel-list any
more....

BTW: 2.3.xx did not help.

Olaf
PS: Let me say that I appreciate the new super-optimized NFS client code
very much, but unfortunately it seems to trigger NFS server bugs on
other platforms.

-- 
  Dr. Olaf Flebbe                            Phone +49 (0)7071-9457-32
  science + computing gmbh                     FAX +49 (0)7071-9457-27
  Hagellocher Weg 71
  D-72070 Tuebingen  Email: o.flebbe@science-computing.de

The amount of work to be done increases in proportion to the amount of work already completed.
