IDE - New Raid - Automatic Thread Kill Problems

CJones (zerver@iname.com)
Tue, 28 Sep 1999 12:41:47 -0700


This is a kernel question, NOT a raid question. Please CC me on any
replies as I am not on linux-kernel.

I have been testing the new raid code for stability. I discovered some
serious problems when attempting to run raid on 486 systems using IDE
drives.

SCENARIO:
CPU = AMD 486 on the ALI chipset (IDE has no DMA)
I have four 25 GB IBM IDE drives (ide0 master/slave, ide1 master/slave).
I run mkraid to put all 4 drives into a single raid 5, resulting in a
75 GB /dev/md0.
I then try to run mke2fs /dev/md0.

I know that the raid 5 array is being synchronized while mke2fs is
running. (It's supposed to work, and it does. Read on.)

Occasionally, there will be a kernel oops immediately after calling
mkraid and before I can even call mke2fs. The oops results from a call to
address 0 generated by run_task_queue(&tq_disk). The md resync thread
has been killed in these cases.

If the oops does not happen and I get to run mke2fs, it is killed with a
signal 9 while writing the inode blocks. I can run mke2fs over and over
with it always being killed. Inserting a sleep(1) after writing each
inode block solves this problem.

I have used kernel 2.2.10, 2.2.11, 2.2.12 with only the latest raid
patches applied. All kernels seemed to act similarly.

I then commented out the two lines (in kernel/sched.c in the
do_process_times() subroutine) which kill threads when
resource[RLIMIT_CPU].rlim_max is exceeded.

Once this code was removed, all raid functions work as expected. The
system does not even seem to be overly degraded. I can mkraid and
mke2fs with no problems.
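
As a rough illustration of the kind of check described above, here is a
user-space model, not the kernel source. The struct, the function name and
the MODEL_HZ value of 100 ticks per second are assumptions made for the
sketch; only the comparison against rlim_max comes from the description in
this message.

```c
#include <stdbool.h>

/* Assumed tick rate for the model (ticks per second). */
#define MODEL_HZ 100L

/* Hypothetical stand-in for the per-task accounting state. */
struct model_task {
    long utime_ticks;   /* accumulated user-mode ticks */
    long stime_ticks;   /* accumulated system-mode ticks */
    long cpu_rlim_max;  /* hard RLIMIT_CPU limit, in seconds */
};

/*
 * Charge user and system ticks to the task, then report whether the
 * total CPU time in whole seconds now exceeds the hard limit. In the
 * situation described above, a true result corresponds to the task
 * being killed.
 */
static bool model_charge_and_check(struct model_task *t, long user, long system)
{
    t->utime_ticks += user;
    t->stime_ticks += system;
    long total_secs = (t->utime_ticks + t->stime_ticks) / MODEL_HZ;
    return total_secs > t->cpu_rlim_max;
}
```

In this model, a task whose limit is LONG_MAX cannot trip the check within
seconds, which is why kills within 10 seconds are surprising.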

I am hypothesizing that because the IDE has no DMA, all programmed I/O
is causing a single thread to take too much time and thus be killed.

HOWEVER!!

The tasks are killed within about 10 seconds and sometimes < 1 second.
[RLIMIT_CPU].rlim_max on 386 systems is initialized to LONG_MAX, and a
call to getrlimit shows it still set to 2gig. Shouldn't that mean the
task won't be killed until roughly 2 billion seconds of CPU time have
elapsed?
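
To put that limit in perspective, the sketch below reads the hard
RLIMIT_CPU value with getrlimit and converts a limit in seconds to years.
The helper names are hypothetical, and the year length of 365.25 days is an
assumption chosen for the arithmetic.

```c
#include <sys/resource.h>

/* Assumed average year length, used only for the conversion. */
#define SECONDS_PER_YEAR (365.25 * 24.0 * 60.0 * 60.0)

/* Convert a CPU-time limit expressed in seconds to years. */
static double limit_seconds_to_years(double seconds)
{
    return seconds / SECONDS_PER_YEAR;
}

/*
 * Read the hard RLIMIT_CPU limit of the calling process.
 * Returns 0 on success and -1 if getrlimit fails. *out_unlimited is
 * set to 1 when the limit is RLIM_INFINITY, otherwise to 0.
 */
static int read_cpu_hard_limit(rlim_t *out_seconds, int *out_unlimited)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CPU, &rl) != 0)
        return -1;
    *out_unlimited = (rl.rlim_max == RLIM_INFINITY);
    *out_seconds = rl.rlim_max;
    return 0;
}
```

Passing 2147483647.0 (LONG_MAX on a 32-bit system) to
limit_seconds_to_years gives a little over 68 years of CPU time.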

I have yet to find a reason why rlim_max was exceeded. Is it safe for
me to put printk statements into do_process_times, since it runs at
interrupt time?

Should there be throttling code in IDE and/or raid to prevent such
dominance?

Clay
