Re: access beyond end of device errors in 2.2.12/13pre12

Mark Hagger (mhagger@dera.gov.uk)
Thu, 30 Sep 1999 16:04:48 +0100


On Thu, 30 Sep 1999, Stephen C. Tweedie wrote:
> Hi,
>
> On Mon, 27 Sep 1999 12:29:12 +0100, Mark Hagger <mhagger@dera.gov.uk>
> said:
>
> > I've seen some reports of people getting "access beyond end of device"
> > file access errors in 2.2.12, I have just seen a large number of these
> > myself during a program run that was doing considerable file io.
>
> OK, I'd really like to know what sort of IO was going on. Is it heavy,
> sustained access to a single large file or is there a lot of open/close
> activity going on? Were you deleting data at the time? Allocating
> data? If we can get some feedback on the load patterns, that will help.
>
Glad to oblige.

The application was throwing large amounts of data in and out of memory. We're
talking about a machine with 512 Mbytes of RAM and 2 PII 450s, running 2
processes, each of which was regularly handling disk I/O of around 1200 Mbytes,
both reading and writing. More precisely, each process was reading about 600
Mbytes in during a matrix-vector multiply cycle, and then writing the data back
to part of the other 600 Mbyte block, which would involve both reading and
writing as blocks swapped in and out. The 600 Mbyte blocks were split over
about 6 files, all of which were open during the entire run, but there would
have been a reasonable amount of lseek-ing going on during the read/write stage,
as the blocks are accessed essentially randomly.
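
To make the pattern concrete, here is a rough sketch in C of the sort of access
the code does. It isn't the real program (the file name, block size, and loop
count are made up), just an illustration of the open-once, then
lseek/read/write-at-random-offsets cycle:

/*
 * Rough illustration only: a few hundred Mbytes per file, held open
 * for the whole run, with reads and writes at essentially random
 * 1 Mbyte-aligned offsets via lseek().
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK   (1024 * 1024)   /* 1 Mbyte chunk */
#define NBLOCKS 600             /* ~600 Mbytes of data */

int main(void)
{
    char *buf = malloc(BLOCK);
    int fd = open("datablock.0", O_RDWR | O_CREAT, 0644);
    long i;

    if (fd < 0 || buf == NULL) {
        perror("setup");
        return 1;
    }
    for (i = 0; i < 100000; i++) {
        off_t off = (off_t) (rand() % NBLOCKS) * BLOCK;

        /* read a block back in ... */
        lseek(fd, off, SEEK_SET);
        read(fd, buf, BLOCK);
        /* ... and write it (or fresh data) back out again */
        lseek(fd, off, SEEK_SET);
        write(fd, buf, BLOCK);
    }
    close(fd);
    free(buf);
    return 0;
}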

Apart from these 2 processes, there would have been very little other activity
on the machine, and certainly no other users logged on.

Typically it takes about 24 hours of disk thrashing to start generating the
problems. HOWEVER, this may just be that it takes 24 hours for the program run
to get to the point of accessing the parts of the file that cause the
problem.....!!

I don't know if it is relevant, but it occurred to me this morning that the data
block where the read/write happens is never explicitly set to anything at the
start; it is implicitly created by lseek'ing to various parts of the file, if
you see what I mean. As I understood it this is fine: I can "create" a big
file by opening it, lseeking to the "end" of the file, and writing a single
byte. The "actual" file will only have 1 byte in it, but if I try to read an
intermediate block then I get zeros. I've modified the code to actually
write some data to the big file at the start and am running the application now
to see if it makes any difference.
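
To be clear about what I mean, something along the lines of the following
(the file name and sizes are just examples, not the real code):

/*
 * Illustration of the "implicit" creation I mean: seek past the end
 * of a new file, write one byte, and the intermediate blocks read
 * back as zeros without ever having been written.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("bigfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* "create" a ~600 Mbyte file with a single 1-byte write */
    lseek(fd, 600L * 1024 * 1024 - 1, SEEK_SET);
    write(fd, "", 1);

    /* reading a never-written block in the middle gives zeros */
    lseek(fd, 300L * 1024 * 1024, SEEK_SET);
    read(fd, buf, sizeof(buf));
    printf("first byte of the hole = %d\n", buf[0]);

    close(fd);
    return 0;
}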

Finally, I should say that once these errors have started occurring, some of
the data block files are pretty badly corrupted, in that all I/O to them (e.g.
cp or mv) results in file input/output errors.

Although I can generate this problem fairly consistently, it isn't an easy
matter; as I say, I need to run the application (a multi-processor number
crunching job) for a long time before it occurs. I have tried simpler
programs in an attempt to reproduce the problem but failed.

I did notice that ll_rw_blk.c in drivers/block (the part of the kernel that
generates the error) had changed between kernel 2.2.5 (the last one that I'm
certain never generated this problem) and 2.2.12, although this too might be a
red herring....

Let me know if you want any additional info.

Mark
