Re: Idle loadavg of ~1, maybe MD related

From: Andrew Morton
Date: Sat Jan 05 2008 - 04:30:39 EST


On Sun, 30 Dec 2007 02:06:14 -0800 "Robin H. Johnson" <robbat2@xxxxxxxxxx> wrote:

> (Please CC me, I'm subbed to LKML).
>
> My G5, while running practically nothing (just sshd and some to watch the
> load), has a weird cycle of load averages. I think it might be related to MD,
> simply because that's the only thing that is clocking up cputime.

Usually this is caused by a process which is stuck in "D" state
(uninterruptible sleep) when it shouldn't be.

Check this by running `ps aux' or whatever.

> A full cycle lasts approximately 27 minutes.
> Mins Load
> 0-2 0.0-0.15 (stable, 0 level)
> 3-5 0.50, 0.80, 0.95 (fast increase)
> 6-21 0.95-1.10 (stable, 1 level)
> 22-24 0.9, 0.8, 0.1 (fast decrease, to 0 level)
> 25-27 0.2, 0.3, 0.15 (local maxima peak)
>
> Here's a graph of it, spanning 230 minutes:
> http://dev.gentoo.org/~robbat2/20071230-g5-loadavg-bug.png
>
> Processed data for the graph here:
> http://dev.gentoo.org/~robbat2/20071230-g5-loadavg-bug.txt
>
> For the entire 230 minute period, there was _no_ disk I/O.
> Not recorded by iostat, nor generated.
>
> # while true ; do uptime ; iostat -t 60 2 -N -d | tail -n15 ; done >/dev/shm/foo
>
> Example of single output pass for the above loop:
> 00:59:37 up 8:32, 2 users, load average: 0.02, 0.47, 0.66
> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
> sda 0.00 0.00 0.00 0 0
> sdb 0.00 0.00 0.00 0 0
> md1 0.00 0.00 0.00 0 0
> md0 0.00 0.00 0.00 0 0
> md2 0.00 0.00 0.00 0 0
> md3 0.00 0.00 0.00 0 0
> vg-usr 0.00 0.00 0.00 0 0
> vg-var 0.00 0.00 0.00 0 0
> vg-tmp 0.00 0.00 0.00 0 0
> vg-opt 0.00 0.00 0.00 0 0
> vg-home 0.00 0.00 0.00 0 0
> vg-usr_src 0.00 0.00 0.00 0 0
> vg-usr_portage 0.00 0.00 0.00 0 0
>
> This is basically 1842c7f2 from Linus's tree, my own stuff is config'd out with
> =n for the moment. And the problem does still occur in the main tree.
>
> Snippet from the head of 'top', sorting by cputime.
>
> top - 01:59:08 up 9:32, 2 users, load average: 1.04, 0.87, 0.70
> Tasks: 74 total, 1 running, 73 sleeping, 0 stopped, 0 zombie
> Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu1 : 0.0%us, 0.5%sy, 0.0%ni, 99.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu2 : 0.0%us, 0.2%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 12074480k total, 292520k used, 11781960k free, 76812k buffers
> Swap: 8388536k total, 0k used, 8388536k free, 144276k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 4635 root 15 -5 0 0 0 S 0 0.0 8:14.33 md3_raid1
> 3121 root 15 -5 0 0 0 S 1 0.0 3:15.12 md2_raid1
> 3098 root 15 -5 0 0 0 S 0 0.0 3:07.83 md1_raid1
> 3076 root 15 -5 0 0 0 S 0 0.0 0:45.82 md0_raid1
> 829 root 15 -5 0 0 0 D 0 0.0 0:01.85 kwindfarm
> 13 root 15 -5 0 0 0 S 0 0.0 0:01.41 ksoftirqd/3
> 18 root 15 -5 0 0 0 S 0 0.0 0:01.39 events/3
> 10 root 15 -5 0 0 0 S 0 0.0 0:00.94 ksoftirqd/2
> 1 root 20 0 1900 652 576 S 0 0.0 0:00.89 init
> 32086 root 20 0 9336 2696 2076 S 0 0.0 0:00.87 sshd

>From that I'd suspect that kwindfarm is being a bad citizen.

If a process is consistently stuck in D state, run

echo w > /proc/sysrq-trigger

then record the resulting dmesg output so we can see where it got stuck.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/