Re: serious MM problem with pre8,9 - vmstat stuck in D state

Mike Galbraith (mikeg@weiden.de)
Sun, 24 Jan 1999 07:36:55 +0100 (CET)


On Sat, 23 Jan 1999, David Mansfield wrote:

> ... actually, any process attempting to read from proc can stick in D
> state for considerable time (minutes??) during heavy swapping *and*
> context switching. pre7 was fine. My system is a UP PPro200, 96MB ram,
> IDE disks (3 of them) with swap partitions on each. Compiled under RedHat
> 5.1 (plus nec. upgrades) with gcc 2.7.2.3.

(please excuse the length of this)

Hi,

My setup is very similar: 4 SCSI drives with 3 swap partitions active
(with 4 active, I get too darn many timeouts/bus lockups).

Maybe this isn't the same thing, but maybe it is. What I see is
processes starving for cpu time. My test script is essentially..

    for i in 1 2 3 4 5 6 7 8 9 10 11 12
    do
        lmdd opat=1 bs=8m count=128 &
    done
    wait

..which sends my box 32M into swap.

vmstat 1 | tee xx on another terminal.

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 3540 69264 1656 4072 8 0 2 0 123 11 1 2 97
12 0 0 3528 67348 1712 4076 12 0 50 0 152 236 9 8 84
12 0 0 3528 39512 1712 4076 0 0 0 0 125 9 58 42 0
12 0 0 3528 1984 1712 4076 0 0 0 0 163 11 51 49 0
7 6 0 33344 792 1636 3040 3072 31652 1157 7939 5736 4081 52 17 31
8 4 0 32416 908 1636 1996 36364 24452 10515 6124 8469 7459 93 7 0
7 5 0 32308 1512 1636 1284 35560 22120 9738 5545 7658 6709 93 7 0
8 4 0 32424 1984 1636 928 12340 7404 3153 1852 2596 2317 94 6 0
8 5 0 32204 664 1648 2028 14844 6916 5682 1750 3187 2671 91 9 0
7 5 0 32128 928 1636 1696 47268 26204 12840 6585 9735 8337 93 7 0
7 5 0 31912 1108 1636 1300 10888 5660 2856 1415 2236 1878 92 8 0
7 5 0 32260 1084 1636 1672 19032 10300 5008 2578 3671 3045 93 7 0
7 5 0 32184 1528 1636 1152 32080 16264 8759 4081 6337 5426 93 7 0
7 5 0 32600 1896 1636 1200 33616 16840 8949 4218 6464 5314 94 6 0
7 5 0 32660 1892 1636 1264 31764 15680 9472 3929 6213 5182 93 7 0
[here the 7 jobs which were getting cpu time finish and the rest start.
it took much more than 14 secs to get here.. nothing snipped]
0 4 0 24240 45540 1636 6880 40464 12272 11152 3083 6493 5771 88 5 6
0 5 0 17932 38908 1640 7200 6600 0 1681 0 794 1026 15 5 80
4 2 0 4928 29344 1640 3760 9564 0 2391 1 1291 1543 97 3 0
5 0 0 4416 29188 1640 3404 156 0 39 0 130 60 97 3 0
6 0 0 4416 29172 1640 3420 0 0 16 0 108 13 97 3 0
5 0 0 4416 29172 1640 3420 0 0 0 0 104 13 97 3 0
5 0 0 4416 29172 1640 3420 0 0 0 0 104 12 97 3 0
5 0 0 4416 29172 1640 3420 0 0 0 0 104 13 97 3 0
6 0 0 4416 29072 1640 3500 0 0 75 0 112 18 97 3 0
[snip rest]

Set vmstat and tee realtime and repeat.

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 3840 68196 1728 5228 0 0 0 0 132 12 1 2 97
3 8 0 3840 67112 1732 5232 0 0 5 7 117 28 4 6 90
12 0 1 3840 44052 1740 5232 0 0 2 0 107 36 61 37 3
12 0 0 3840 20180 1740 5232 0 0 0 0 104 10 63 37 0
12 0 0 3516 1684 1636 3396 44 0 1035 0 137 56 59 41 0
9 7 1 28192 1452 1636 3592 536 24928 469 6234 4358 2886 84 16 0
[still large delay when heavy swapout starts, not exactly 1 sec interval
after megaswapout either. Only half of the jobs are getting cpu]
6 8 0 33200 1776 1640 1896 1112 5324 594 1339 1179 659 84 16 0
6 6 0 33524 1624 1636 2400 1304 1104 526 276 416 350 93 7 0
6 6 1 32420 1064 1636 1852 1876 212 598 53 293 318 93 7 0
6 6 0 32940 1116 1636 2320 1368 1524 343 383 485 382 89 11 0
6 6 0 32744 1604 1636 1636 2256 1480 564 370 549 471 92 8 0
6 6 0 32392 1032 1636 1856 1988 1144 497 286 416 402 95 5 0
[snip rest]

The rest goes as above. When done...

1024.00 MB in 192.24 secs, 5.33 MB/sec
1024.00 MB in 194.85 secs, 5.26 MB/sec
1024.00 MB in 193.96 secs, 5.28 MB/sec
1024.00 MB in 193.84 secs, 5.28 MB/sec
1024.00 MB in 194.57 secs, 5.26 MB/sec
1024.00 MB in 187.88 secs, 5.45 MB/sec
1024.00 MB in 352.11 secs, 2.91 MB/sec
1024.00 MB in 355.57 secs, 2.88 MB/sec
1024.00 MB in 365.19 secs, 2.80 MB/sec
1024.00 MB in 354.75 secs, 2.89 MB/sec
1024.00 MB in 359.50 secs, 2.85 MB/sec
1024.00 MB in 354.44 secs, 2.89 MB/sec

Run strace -T vmstat 1 2>yy and start the testvm script on the other
terminal. After the initial swapout flurry, when vmstat finally outputs,
stop it and run grep read yy | grep lmdd. [none snipped]

read(8, "962 (lmdd) R 961 961 163 1025 96"..., 1023) = 150 <0.000235>
read(8, "963 (lmdd) R 961 961 163 1025 96"..., 1023) = 147 <0.000235>
read(8, "964 (lmdd) R 961 961 163 1025 96"..., 1023) = 146 <0.000226>
read(8, "968 (lmdd) R 961 961 163 1025 96"..., 1023) = 149 <0.000234>
read(8, "969 (lmdd) R 961 961 163 1025 96"..., 1023) = 149 <0.000229>
read(8, "962 (lmdd) D 961 961 163 1025 96"..., 1023) = 151 <17.774210>
read(8, "963 (lmdd) D 961 961 163 1025 96"..., 1023) = 156 <3.274794>
read(8, "964 (lmdd) R 961 961 163 1025 96"..., 1023) = 152 <0.000247>
read(8, "965 (lmdd) R 961 961 163 1025 96"..., 1023) = 152 <0.000240>
read(8, "966 (lmdd) D 961 961 163 1025 96"..., 1023) = 154 <12.218780>
read(8, "967 (lmdd) D 961 961 163 1025 96"..., 1023) = 154 <30.641330>
read(8, "968 (lmdd) R 961 961 163 1025 96"..., 1023) = 153 <0.000240>
read(8, "969 (lmdd) R 961 961 163 1025 96"..., 1023) = 153 <0.000234>
read(8, "970 (lmdd) D 961 961 163 1025 96"..., 1023) = 153 <4.549036>
read(8, "971 (lmdd) D 961 961 163 1025 96"..., 1023) = 155 <6.701013>
read(8, "972 (lmdd) D 961 961 163 1025 96"..., 1023) = 155 <9.581525>
read(8, "973 (lmdd) R 961 961 163 1025 96"..., 1023) = 154 <0.000289>

Repeat with vmstat and strace set realtime.
[output of grep read yy | grep lmdd | sort -r +11, snipped due to length]

read(8, "1008 (lmdd) D 1004 1004 163 1025"..., 1023) = 173 <0.822696>
read(8, "1005 (lmdd) R 1004 1004 163 1025"..., 1023) = 173 <0.612336>
read(8, "1012 (lmdd) R 1004 1004 163 1025"..., 1023) = 169 <0.527522>
read(8, "1011 (lmdd) D 1004 1004 163 1025"..., 1023) = 170 <0.500054>
read(8, "1014 (lmdd) R 1004 1004 163 1025"..., 1023) = 168 <0.474706>
read(8, "1010 (lmdd) D 1004 1004 163 1025"..., 1023) = 172 <0.444925>
read(8, "1015 (lmdd) D 1004 1004 163 1025"..., 1023) = 172 <0.441973>
read(8, "1008 (lmdd) D 1004 1004 163 1025"..., 1023) = 173 <0.354971>
read(8, "1009 (lmdd) R 1004 1004 163 1025"..., 1023) = 171 <0.345651>
read(8, "1011 (lmdd) D 1004 1004 163 1025"..., 1023) = 175 <0.298617>
read(8, "1009 (lmdd) R 1004 1004 163 1025"..., 1023) = 173 <0.249967>
read(8, "1016 (lmdd) D 1004 1004 163 1025"..., 1023) = 173 <0.139786>
read(8, "1011 (lmdd) D 1004 1004 163 1025"..., 1023) = 175 <0.057358>
[many snipped]

> Doing this on kernels before pre8 works fine. In fact, I have been
> running such a test since 2.1.115 and the MM is flying with respect to
> swap perf. and interactivity while the test is running, except for this
> new wrinkle.

This is the part that makes me think maybe it's _not_ the same thing.
I've been seeing this to a greater or lesser degree for the entire round
of VM testing/modification. Heck, maybe it's normal? Performance is
very nice here also. My box is very happy with final (plus wchan fix)
except for the darn scsi bugaboos.

-Mike

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/