cerberus errors on 2.4.19 (ide dma related)

From: Ed Sweetman (safemode@speakeasy.net)
Date: Sun Aug 18 2002 - 01:44:44 EST


(overview written in hindsight of writing email)
I ran all these tests on ide/host2/bus0/target0/lun0/part1. Whenever DMA
was enabled or disabled, it was done on both drives at the same time.
I do not know whether cerberus cares where it is run from, but the
program was on the drive being tested when run, and throughout this
email I assume it runs its drive tests primarily on the partition it
was started from. I see now that this is probably wrong; instead of
changing where I run the test, I should alternate which drive has DMA
enabled and which has it disabled, and process of elimination will show
just what kind of DMA bug I'm seeing (a rough sketch of that procedure
follows this overview).
(/overview)
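By process of elimination I mean something like the following (just a
sketch; /dev/hda and /dev/hde are stand-ins for the motherboard drive
and the Promise drive, not necessarily my actual device names):

    # Round 1: DMA on only for the motherboard drive
    hdparm -d1 /dev/hda
    hdparm -d0 /dev/hde
    # ...run cerberus, note which tests error...

    # Round 2: DMA on only for the Promise drive
    hdparm -d0 /dev/hda
    hdparm -d1 /dev/hde
    # ...run cerberus again and compare...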

I've been trying to track down why I seem to get disk corruption on my
hard drives, every time, after a good amount of usage. It's been
happening for a long time across a number of different kernel versions.
I believe this is because I stick to the same board manufacturer, Abit,
and to VIA chipsets.

I ran cerberus with DMA enabled at UDMA4 and at UDMA2. At UDMA4,
cerberus reports MEMORY errors and BBidehost2bus0target0lun0discN1
errors, but mostly MEMORY errors, before the kernel panics after a
minute or two. At UDMA2, cerberus reports no errors but still panics
after a minute or two. I ran cerberus a couple of times at each setting;
with UDMA4 it began erroring about 30 seconds into the test with MEMORY
errors.
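For the record, the modes were selected with hdparm's -X switch; if I
remember the encoding right, UDMA mode n is 64+n, so it was roughly
this, with /dev/hde standing in for the drive under test:

    hdparm -d1 -X68 /dev/hde    # DMA on, UDMA4 (X = 64 + 4)
    hdparm -d1 -X66 /dev/hde    # DMA on, UDMA2 (X = 64 + 2)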

I thought this could be RAM errors, so I ran memtest for a couple of
hours; nothing was reported as bad. I then thought my hardware could be
the problem, so I ran e2fsck -c on the partition I was running cerberus
on, with DMA disabled via hdparm -d0, and it completed with no errors
found. I then rebooted, enabled UDMA2, and the kernel panicked during
the same test after a few minutes.
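The no-DMA check was basically this, with /dev/hde and /dev/hde1 as
stand-ins for the drive and the partition under test (the partition
unmounted for the scan):

    hdparm -d0 /dev/hde      # DMA off for the drive
    e2fsck -c /dev/hde1      # -c does a read-only badblocks scan; came back clean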

The rest of this email is just information about my setup.

First off, my filesystems are set up as follows:

Swap and files are now all on my primary master IDE drive on the
motherboard IDE controller. Swap on my primary master on the Promise
controller seemed too problematic because of the corruption, but I'm
not sure whether the corruption I've seen is related only to the
Promise controller or is not controller specific. I'll have to run the
test on the Promise drive without swap there, and then run the test on
my primary motherboard drive, again without swap.
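Moving swap over was nothing fancy, roughly the following, with
/dev/hde2 as a stand-in for the old swap partition on the Promise drive
and /dev/hda2 for the new one on the motherboard drive (plus pointing
the swap line in /etc/fstab at the new partition):

    swapoff /dev/hde2     # stop swapping to the Promise drive
    mkswap /dev/hda2      # initialize swap on the motherboard drive
    swapon /dev/hda2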

cerberus version : 1.3.0pre4
dmesg info : http://signal-lost.homeip.net/lkml/dmesg
hdparm info : http://signal-lost.homeip.net/lkml/hdparm
pci info : http://signal-lost.homeip.net/lkml/lspci

Tests completed (in PIO mode) before I escaped out of the run:
http://signal-lost.homeip.net/lkml/tests_passed

Errors during the last test, which caused a kernel panic (UDMA2):
http://signal-lost.homeip.net/lkml/memory

Errors during the UDMA4 test (first test):
http://signal-lost.homeip.net/lkml/memory2
http://signal-lost.homeip.net/lkml/dmesg2
Various segfaults from badblocks in the BBidehost tests.

I ran memtest for an extensive amount of time after the first test
reported memory errors and got absolutely no errors (I wasn't using DMA
mode at the time either). Since these errors aren't produced when DMA
isn't used on my drives, I find it very unlikely that system RAM is the
cause of them. I'm going to rerun the test on my motherboard primary
drive after posting this, in case something happens and I hose
everything.
 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


