Re: high system cpu load during intense disk i/o

From: Dimitrios Apostolou
Date: Sun Aug 05 2007 - 12:05:44 EST


Hello again,

was my report so complicated? Perhaps I shouldn't have included so many
oprofile outputs. Anyway, if anyone wants to have a look, the most important
is two_discs_bad.txt oprofile output, attached on my original message. The
problem is 100% reproducible for me so I would appreciate if anyone told me
he has similar experiences.


Thanks,
Dimitris



On Friday 03 August 2007 19:03:09 Dimitrios Apostolou wrote:
> Hello list,
>
> I have a P3, 256MB RAM system with 3 IDE disks attached, 2 identical
> ones as hda and hdc (primary and secondary master), and the disc with
> the OS partitions as primary slave hdb. For more info please refer to
> the attached dmesg.txt. I attach several oprofile outputs that describe
> various circumstances referenced later. The script I used to get them is
> the attached script.sh.
>
> The problem was encountered when I started two processes doing heavy I/O
> on hda and hdc, "badblocks -v -w /dev/hda" and "badblocks -v -w
> /dev/hdc". At the beginning (two_discs.txt) everything was fine and
> vmstat reported more than 90% iowait CPU load. However, after a while
> (two_discs_bad.txt) that some cron jobs kicked in, the image changed
> completely: the cpu load was now about 60% system, and the rest was user
> cpu load possibly going to the simple cron jobs.
>
> Even though under normal circumstances (for example when running
> badblocks on only one disc (one_disc.txt)) the cron jobs finish almost
> instantaneously, this time they were simply never ending and every 10
> minutes or so more and more jobs were being added to the process table.
> One day later, the vmstat still reports about 60/40 system/user cpu load,
> all processes still run (hundreds of them), and the load average is huge!
>
> Another day later the OOM killer kicks in and kills various processes,
> however never touches any badblocks process. Indeed, manually suspending
> one badblocks process remedies the situation: within some seconds the
> process table is cleared from cron jobs, cpu usage is back to 2-3% user
> and ~90% iowait and the system is normally responsive again. This
> happens no matter which badblocks process I suspend, being hda or hdc.
>
> Any ideas about what could be wrong? I should note that the kernel is my
> distro's default. As the problem seems to be scheduler specific I didn't
> bother to compile a vanilla kernel, since the applied patches seem
> irrelevant:
>
> http://archlinux.org/packages/4197/
> http://cvs.archlinux.org/cgi-bin/viewcvs.cgi/kernels/kernel26/?cvsroot=Curr
>ent&only_with_tag=CURRENT
>
>
> Thank in advance,
> Dimitris
>
>
> P.S.1: Please CC me directly as I'm not subscribed
>
> P.S.2: Keep in mind that the problematic oprofile outputs probably refer
> to much longer time than 5 sec, since due to high load the script was
> taking long to complete.
>
> P.S.3: I couldn't find anywhere in kernel documentation that setting
> nmi_watchdog=0 is neccessary for oprofile to work correctly. However,
> Documentation/nmi_watchdog.txt mentions that oprofile should disable the
> nmi_watchdog automatically, which doesn't happen with the latest kernel.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/