Re: 2.6.1 IO lockup on SMP systems

From: Andrew Morton
Date: Mon Feb 23 2004 - 17:31:08 EST


"Sergey S. Kostyliov" <rathamahata@xxxxxxx> wrote:
>
> > > OK, so everything is stuck trying to allocate memory. Perhaps you ran out
> > > of swap space, or some process has gone berserk allocating memory.
>
> The memory exhaustion is indeed possible for this box. I'll double check
> ulimit and /etc/security/limits.conf stuff. The only thing which worries
> me is that this box had been running for months without any problems with
> 2.4.23aa1.

It is conceivable that you have some application which runs OK on 2.4.x but
has some subtle bug which causes the app to go crazy on a 2.6 kernel
consuming lots of memory. Or there's a bug in the 2.6 kernel ;)

> I have added another 2GB to swap space (hope this gives enough time
> to find the memory hungry process(es)).
>
> > >
> > > How much memory does the machine have, and how much swap space?
> > >
> > # free
> >              total       used       free     shared    buffers     cached
> > Mem:       2073868    2067508       6360          0     232708     897828
> > -/+ buffers/cache:     936972    1136896
> > Swap:      1535976       5228    1530748
> >
> > > I suggest that you run a `vmstat 30' trace on a terminal somewhere, see
> > > what it says prior to the hangs.
> > OK. We'll try to get it next time.
>
> Here it is:
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  1  0 551920   8108 203744 933532    0    0     4    68 1214   426  5  1 92  2
>  0  0 551928   7140 203756 930316    0    0    17    61 1240   529  8  1 89  2
>  0  0 551976   5788 203772 928224    1    6   360   139 1297   317  7  2 83  8
>  0  0 551968   7588 203812 923504    0    0    19   125 1303   308  8  2 87  4
>  0  1 551976  10444 203892 914100    0    0    25   127 1433   438 10  3 85  3
>  0  0 551976   9220 204004 914804    0    0   123   126 1278   325  6  1 88  5
>  0  0 551976   8108 204044 912248    0    0    38    69 1279   291  6  1 91  2
>  0  1 551976  11828 204144 912320    1    0   135    94 1249   296  6  1 89  3
>  0  5 562204   3280 203952 157084    1  566   305   674 1281   313  6  4 73 17
>  0 18 598224   4276   1888  33356   91 2734   233  2761 1090   199  0  2  0 97
>  1 38 662520   2760   2104  30520  110 3721   261  3738 1161   831  1  2  0 97
> 10 41 699936   2772   1920  28716  123 2924   249  2946 1103  1273  0  3  0 97
>  0 39 748588   2956   1956  22668  160 3313   245  3331 1056  1047  0  2  0 98
>  0 38 796100   3108   1888  21348  321 3191   430  3206 1045  1002  0  2  0 97
>  4 43 844532   3308   1956  17644  518 3719   670  3733 1357   999  0  2  0 98
>  0 51 882596   2940   2052  13960  520 2796   705  2810 1048  1182  0  2  0 98
>  3 59 913392   2456   2048  10900 1013 2524  1308  2542 1144   601  0  2  0 98
>  5 71 937816   2760   2072   8584 1534 2681  1860  2702 1234   607  0  2  0 97

OK, so it's doing a lot of swapping and your swap utilisation is
continuously increasing. I would suspect an application or kernel memory
leak.
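
To put a number on how fast swap use is growing, something like the
following could be run over a saved vmstat log. This is only a sketch: the
helper name swpd_deltas is made up for illustration, and it assumes the
vmstat layout shown above, where swpd is the third column and is reported
in KB.

```shell
# Hypothetical helper: read raw `vmstat` output on stdin and print the
# change in the swpd column (field 3) between consecutive samples.
# Header lines are skipped because their first field is not a number.
swpd_deltas() {
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
        if (seen) print $3 - prev   # growth since the previous sample
        prev = $3
        seen = 1
    }'
}
```

With the trace saved to a file (say vmstat.log, an assumed name),
`swpd_deltas < vmstat.log` prints one number per interval. It expects the
raw vmstat output, not text carrying "> " quote prefixes.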

I suggest you keep that `vmstat 30' running all the time. When the machine
dies, take a look at the final 20 lines.

Also, run

	while true
	do
		cat /proc/meminfo
		sleep 10
	done

and record the output that it leaves behind when the machine locks up.
This should tell us whether it is an application or kernel memory leak, if
it is indeed a leak.
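
If the output should end up in a file with a timestamp on each sample, the
loop could be wrapped roughly as below. This is a sketch under assumptions:
the function name log_meminfo, its arguments and the optional source-file
argument are illustrative, not anything the kernel or procps provides.

```shell
# Hypothetical helper: log_meminfo FILE INTERVAL [COUNT] [SOURCE]
# Appends a timestamped copy of SOURCE (default /proc/meminfo) to FILE
# every INTERVAL seconds. COUNT limits the number of samples; 0 (the
# default) means keep going until killed.
log_meminfo() {
    logfile=$1
    interval=$2
    count=${3:-0}
    source_file=${4:-/proc/meminfo}
    n=0
    while [ "$count" -eq 0 ] || [ "$n" -lt "$count" ]; do
        {
            echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
            cat "$source_file"
        } >> "$logfile"
        n=$((n + 1))
        sleep "$interval"
    done
}
```

For example, `log_meminfo /var/tmp/meminfo.log 10 &` (the path is an
assumed one) samples every 10 seconds in the background. When reading the
log afterwards, one rough heuristic is to compare the last few samples
field by field and see which lines keep growing.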
