(cached?) files corruption.

QingLong (qinglong@Bolizm.ihep.su)
Wed, 11 Jun 1997 09:28:30 +0400 (MSD)


Hi, All!

This is a bug report and a call for help.

I've run into very strange (really weird) behaviour of the system.
The problem is so strange that I don't even know which developers' group
to contact with a bug report or a request for help. Let me describe it.

The first problems of this type appeared about 6 months ago,
after I had plugged in a new EIDE 3GB hard disk:
`Western Digital "Caviar 33100"' 6136 cyl, 16 hd, 63 sec

in addition to an EIDE 1.3GB hard disk:
`Seagate "Medalist"' 2477 cyl, 16 hd, 63 sec.

Both HDs are installed on the primary IDE interface. The bus is VLB and
the I/O card is an "EVLSIO-V2" (main chip on the card: PDC20230), made in
Taiwan. There is a standard IDE (ATAPI) CDROM drive on the secondary
interface.

At approximately the same time I installed the latest 2.1.* kernel
(I doubt this is the main cause of the problems, as switching to the
2.0.27 and 2.0.30 kernels hasn't solved them).

I had never had such problems (described below) before. Both the CD drive
and the old (1.3GB) HD worked well. The new configuration also worked well
for a while: the new HD was installed in November 1996 and the first
problems arose in February 1997 (the 2.1.27 kernel was running at the time).

Now the problem itself.
Frequently accessed large (more than 1MB) files suffer corruption.
The most frequent victim of such corruption is the GCC `cc1' binary
(which is located at /usr/lib/gcc-lib/i486-linux/2.7.2.1/cc1 and
is launched by gcc to do the actual compilation).

Comparing (`cmp -l') the corrupted binary against a fresh copy gives:
528122 373 377 (The difference is in one bit!)
528046 373 377
949014 373 377
950542 373 377
This set of differences is stable: every time I face the problem,
the corruption is restricted to this set, i.e. it is some arbitrary
subset of it.
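
For anyone who wants to check the "one bit" claim, here is a minimal
Python sketch that decodes the `cmp -l' lines quoted above. It relies on
`cmp -l' printing a 1-based decimal offset followed by the two differing
bytes in octal. The helper names (parse_cmp_l, flipped_bits) are made up
for this illustration only.

```python
# The four lines of `cmp -l' output quoted in this report.
CMP_OUTPUT = """\
528122 373 377
528046 373 377
949014 373 377
950542 373 377
"""


def parse_cmp_l(text):
    """Yield (offset, byte_a, byte_b) tuples from `cmp -l' output.

    The offset is decimal (1-based); both byte values are octal.
    """
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1], 8), int(fields[2], 8)


def flipped_bits(a, b):
    """Return the bit positions (0 = least significant) where a and b differ."""
    diff = a ^ b
    return [i for i in range(8) if diff & (1 << i)]


if __name__ == "__main__":
    for offset, a, b in parse_cmp_l(CMP_OUTPUT):
        print(f"offset {offset}: {a:#04x} vs {b:#04x}, "
              f"differing bits {flipped_bits(a, b)}")
```

Octal 373 is 0xfb and octal 377 is 0xff, so at each listed offset the
two bytes differ only in bit 2 (value 4).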

I've already tried the following (each time just after restoring `cc1'
from the GCC distribution .tar.gz):
    moving `cc1' to another place on the same partition,
    moving `cc1' to another partition on the same HD,
    `chattr +i' on it and on the appropriate directory,
all without any success... The aforementioned set remains constant,
i.e. both the offsets and the byte differences remain the same!
This makes me suspect that the problem is actually in the Linux cache
rather than on the HD.

Moreover, putting a fresh copy of cc1 into its standard place
(`cp cc1 /usr/lib/gcc-lib/i486-linux/2.7.2.1/cc1' or `cat cc1 > /usr/...')
often doesn't repair the binary, but (!!!) corrupts _both_ the binary _and_
the _fresh_ _copy_, which is located on another HD partition!
The corruption of the fresh copy is also restricted to the set mentioned
above. I.e. corruption takes place when a file is merely read
often enough (probably because it goes into the I/O cache?).

Sometimes (after waiting for a while) corrupted files repair themselves
automagically, without any manual intervention...
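
One way to catch both the corruption and the later "self-repair" in the
act would be to re-read a file in a loop and log every change of its
checksum. The following is only a rough Python sketch of that idea, not
something I have run on the affected machine; the function names
(file_digest, watch) and the default parameters are hypothetical.

```python
import hashlib
import sys
import time


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def watch(path, rounds=100, delay=0.0):
    """Re-read `path` `rounds` times and record every digest change.

    Returns a list of (round, digest) pairs. The first entry is always
    (0, initial digest), so a list of length 1 means the content looked
    the same on every read.
    """
    changes = []
    last = None
    for i in range(rounds):
        digest = file_digest(path)
        if digest != last:
            changes.append((i, digest))
            last = digest
        if delay:
            time.sleep(delay)
    return changes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed): python watch.py /path/to/file
    for round_no, digest in watch(sys.argv[1], rounds=1000, delay=0.1):
        print(round_no, digest)
```

Repeated reads like these would normally be served from the cache rather
than from the disk, so more than one entry in the output for a file that
nobody has written to would point the same way as the observations above.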

These weird problems aren't restricted to `cc1'; I've also seen them with
e.g. tar.gz archives.

The frequency varies a lot. Sometimes things run smoothly for a week or
even a fortnight. Sometimes such corruptions begin to happen every minute,
and I can't do (compile) anything for a long time. :(

Please help me solve this problem.
I'll be glad to answer your questions and give you any additional info.

Thank you very much!

QingLong.

PS. Please `Cc' your answers to me.
(I am not on the list itself; I am subscribed to the mailing list digest.)