Re: 2.4.22pre8 hangs too (Re: 2.4.21-jam1 solid hangs)

From: Ville Herva
Date: Fri Aug 29 2003 - 15:08:34 EST


On Fri, Aug 29, 2003 at 01:35:25PM -0300, you [Marcelo Tosatti] wrote:
>
> So NMI and sysrq doesnt help. I suggest you a few things:
>
> Try to make the bug easy to reproduce. Force the Oracle dumps again and
> again to crash the box.

I happened to work towards that direction this morning (before I read your
mail). Taking the stance that this very probably had something to do with io
stress, I played around with several io loads. Eventually I found out that
fsx on scsi disk reliably caused the box to either lock up or the aic7xxx
driver to barf. What's more, it took under 15 minutes to trigger.

So I copied the rootfs and everything else from the scsi disk to the ide
disk (just barely had enough space), and took all the scsi disk partitions
away from fstab. After reboot, I have been unable to lock it up with fsx
(scsi disk is not accessed at all), but it will take several weeks before
I'm confident that the lock up is cured.

aic7xxx / scsi hw seems quite strong suspect for the lock ups. 2.2 possibly
worked because it has the older aic7xxx 5.x driver.

> Can you try it or its a production machine?

It is a sort-of-a production machine -- that's way I have been so wary on
trying different things. Sorry for that...

> BTW, can you describe this "Oracle dumps" in more detail? What do they do?
> Save lots of data to disk and thats all or ?

They dump the oracle data base to a backup file.

${ORAHOME}/bin/exp \
***/*** full=Y grants=Y \
file=${DMPDIR}/fullexp.dmp 1>${LOGDIR}/fulllog.`date '+%a'` 2>&1

So basically just heavy IO afaict.

> Hope we can trace this down.

I'm still not 100% sure that the aic7xxx brafs (see
http://lkml.org/lkml/2003/7/29/33 for an example) and the lockups are of
the same origin. But it seems at least 99.5% certain.

If aic7xxx/scsi is to blame, then is it the
- 2940 scsi adapter
- the disk
- the cabling or something (I've checked the termination)
- the motherboard (irq routing?)
- the aic7xxx driver?
- some other kernel issue?

The hw is:
Intel 815EEA2LU (i815 Chipset)
Celeron 1.3GHz (Tualatin)
Adaptec AHA-2940 / AIC-7871
- Disk (rootfs) SEAGATE Model: ST19171W Rev: 0024
- Tape Drive HP Model: C1537A Rev: L708
30GB IDE disk (scratch)


-- v --

v@xxxxxx

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/