Re: attempt to access beyond end of device

Gerard Roudier (groudier@club-internet.fr)
Mon, 20 Dec 1999 09:58:52 +0100 (MET)


Hi Robert,

On Sat, 18 Dec 1999, Robert Johannes wrote:

> SNIPPET####################################################
>
> ncr53c875-0-<0,0>: phase change 2-7 6@00fbc038 resid=1.
> attempt to access beyond end of device
> 08:04: rw=0, want=1414745929, limit=6197248
> attempt to access beyond end of device
> 08:04: rw=0, want=538976289, limit=6197248
> attempt to access beyond end of device
> 08:04: rw=0, want=538976289, limit=6197248
>
> SNIPPET####################################################
>
> That was just a snippet of the kind of errors I'm getting. Ok, I know
> that for those of you who have been following the above topic are probably
> getting tired of it, but I live with this situation everyday, and have
> some insights I would like to air, that I've not seen aired yet (unless I
> missed something; I know I did, I just don't know what).

> ncr53c875-0-<0,0>: phase change 2-7 6@00fbc038 resid=1.

This message means that the SCSI device switched from COMMAND phase to
MESSAGE IN phase after accepting only 5 bytes of a 6-byte command.
This most likely happens when the device is handed a malformed command.
BTW, the command bytes are the only data that the device can check for
bad format.
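
To make the reading of that line explicit, here is a minimal sketch that
splits the message into its fields. The field layout (old phase, new
phase, command length, address, residual) is assumed from the
explanation above, and the helper name explain_phase_change is
hypothetical, not something taken from the driver. The phase numbers are
the standard SCSI encoding of the MSG, C/D and I/O signals.

```python
import re

# SCSI bus phases as encoded by the MSG, C/D and I/O signals.
# Values 4 and 5 are reserved.
SCSI_PHASES = {
    0: "DATA OUT",
    1: "DATA IN",
    2: "COMMAND",
    3: "STATUS",
    6: "MESSAGE OUT",
    7: "MESSAGE IN",
}

# Assumed layout: "phase change OLD-NEW LENGTH@ADDRESS resid=REMAINING"
PHASE_CHANGE = re.compile(
    r"phase change (\d)-(\d) (\d+)@([0-9a-fA-F]+) resid=(\d+)"
)


def explain_phase_change(line):
    """Return the decoded fields of a phase-change message, or None."""
    match = PHASE_CHANGE.search(line)
    if match is None:
        return None
    old, new, length, _address, resid = match.groups()
    length, resid = int(length), int(resid)
    return {
        "from": SCSI_PHASES.get(int(old), "RESERVED"),
        "to": SCSI_PHASES.get(int(new), "RESERVED"),
        "expected_bytes": length,
        # Bytes the device accepted before it changed phase.
        "transferred_bytes": length - resid,
    }


info = explain_phase_change(
    "ncr53c875-0-<0,0>: phase change 2-7 6@00fbc038 resid=1."
)
print(info)
```

For the line quoted above this gives COMMAND to MESSAGE IN, with 5 of
the 6 command bytes transferred.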

> attempt to access beyond end of device
> 08:04: rw=0, want=1414745929, limit=6197248

This happens when the kernel checks the requested block number against
the size of the device (the limit= value). BTW, this is the only check
the kernel can make for bad data.

The above messages mean that the device or the kernel sometimes detects
bad data. But since these checks only compare values against some
limits, numerous errors of this kind may well have gone _undetected_,
and so silent data corruption has probably occurred.
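
A small sketch of why such a check catches only part of the damage. The
function beyond_end is a simplified stand-in for the kernel's
comparison, not the actual kernel code, and as_ascii is a hypothetical
helper that shows the four little-endian bytes of a value. Whether the
printed want= value is the block number itself or one past it is an
assumption here, so both are shown.

```python
import struct


def beyond_end(want, limit):
    """Simplified stand-in for the limit check: True if out of range."""
    return want > limit


def as_ascii(value):
    """Show a 32-bit value as its four little-endian bytes.

    Non-printable bytes are rendered as '.'.
    """
    raw = struct.pack("<I", value)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


LIMIT = 6197248  # limit= value from the log

for want in (1414745929, 538976289):
    print(
        want,
        hex(want),
        beyond_end(want, LIMIT),
        repr(as_ascii(want)),
        repr(as_ascii(want - 1)),  # assumption: want may be block + 1
    )

# A corrupted block number that happens to be below the limit passes
# the check unnoticed. The value 4242 is made up for illustration.
print(beyond_end(4242, LIMIT))
```

Both logged values consist entirely of printable ASCII bytes, which
hints that some other data ended up where block numbers were expected.
That is only a guess, though. The last line shows the real problem: any
wrong value that stays below the limit is accepted silently.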

> My system hardware is as follows: AMD k62 300, with Tekram DC390F scsi
> card (with seagate ST39173WC), matrox G200 AGP card, FIC PA-2013
> mother board with 1mb cache, DEC500 ethernet and 64mb memory. This
> hardware drives redhat 5.2, kernel 2.0.36.

> I've run kernels 2.0.36 through 2.2.13 with above configuration, and still
> gotten those errors and file corruption.

How do other O/Ses behave on this machine?

> I've been running redhat 5.2 on this system, with slightly varying
> hardware configuration, for just over a year now. I ONLY started getting
> the above errors and file corruption in late October '99, which is when I
> switched my vga hardware AND software, from using PCI S3_Virge/4mbram and
> XF86-3.3.3 to using matrox AGP G200/8mbram with XF86-3.3.5. One time I
> came back home, to find my system's boot sector wiped out, so I had to
> re-install the distribution (luckily, I had backed-up my system).

The change is very significant, isn't it ? :-)

I suggest you check the following points:

1) Fast video boards are good at cooking the chips all around them. ;)
Maybe you should check that the new video board does not prevent
proper cooling of the motherboard components.

2) The chipset used on the PA-2013 has been reported to cause problems
when using an ATI-128 on AGP and a DC 390F on PCI. The workaround
consists of disabling, in the BIOS setup, some optimisation involving
the cache. You should try disabling any feature that sounds like that.

> I've observed that I only get file corruption and the above errors when
> I'm running in X and the system is fairly busy, say, when I'm compiling
> something, or doing semi-intensive disk i/o. To test this theory, I
> embarked on compiling glibc-2.1.2 and the latest version of gcc. I
> compiled glibc while in console mode (no x running) without a single
> glitch. I rebooted the system (just to use a fresh system) so I can
> compile in X windows, and sure enough, about midway through the compile of
> glibc, I got filesystem corruption and the above errors. I did the same
> thing with gcc, and got similar results. Whenever I compile anything
> fairly large in X, I get the above errors.
>
> It is this observation that has led me to post to this list, at least to
> point out that I've a pattern here that is being caused by a specific
> combination of interactions; i.e, using x and doing fairly intensive disk
> i/o. Could it be the XF86-3.3.5 driver that's the cause of this problem?
> Could it be conflict between XF86-3.3.5 and the scsi driver in the kernel,
> conflict between the AGP and SCSI card.

IMO, the interactions that break things also involve the MB chipset, or
maybe the MB chipset is actually the cause of the problem.

> My observations have not been scientific, so I'm not claiming anything.
> I'm simply stating observations that I think might be helpful in narrowing
> down the problem. If this problem has been solved, please notify me of
> what and where the fix is.

I suggest you play with every option in the BIOS setup that looks like
caching optimization alchemy. Maybe one of them will do the trick.

Gérard.
