Re: Problem with dropping caches

From: Johannes Weiner
Date: Mon Jun 08 2009 - 19:32:21 EST


[adding Ccs]

On Thu, Jun 04, 2009 at 12:21:22AM -0600, Bruce Guenter wrote:
> Hello.
>
> I am having a problem with a system that appears to be spontaneously
> dropping large parts of its caches. The workload on this system is
> primarily I/O-bound (it's a mailbox server), and as such the loss of
> cache memory is causing severe performance degradation.
>
> For example, here is some output from vmstat 1:
>
> procs -----------memory----------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa
>  1  1      0  87164 683212 1069748    0    0   544   164  856  730  3  2 91  4
>  0  1      0  81124 689100 1069508    0    0  5880   104 1070  834  2  1 74 23
>  0  1      0  89956 691588 1057408    0    0  5288     0 1163  915  0  2 72 25
>  0  1      0 138020 690652 1012444    0    0  5724     0 1136  831  2  0 75 23
>  0  0      0 243384 690460  906500    0    0  4716     0 1282  844  0  2 61 36
>  0  1      0 294704 690152  854232    0    0  1108   428 1123 1093  2  2 81 15
>  0  0      0 285984 690380  854504    0    0   252     0  721  671  3  1 92  3
>  0  1      0 426844 690780  722408    0    0  3096  1748 1197  846  1  2 84 13
>  0  1      0 579684 691232  568344    0    0  4228   156 1300 1083  2  2 69 27
>  1  1      0 676312 691832  467244    0    0  5256     0 1072  741  0  2 75 23
>
> As far as I can tell from df and similar reporting, there are not
> hundreds of MB of files being deleted, which would produce similar
> behavior. It is not swapping, nor is memory actually leaking (since
> free memory + cache is nearly constant). All of the active programs run
> with small memory ulimits, so they are not consuming and then
> releasing hundreds of MB of memory.
>
> There are also intervals where the system is reading several MB per
> second but the caches do not grow significantly:
>
> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
>  0 35      0 960396 749544  62416    0    0  7396     0 1424 1154  2  2  0 96
>  1 34      0 963868 750536  62252    0    0  8596   208 1695 1463  4  3  0 93
>  0 38      0 967452 752800  62980    0    0  7176    20 1378  972  4  1  0 95
>  0 38      0 968100 751308  61400    0    0  7260   180 1423 1109  3  2  0 95
>  2 42      0 966252 751540  61872    0    0  8196     0 1404 1328  1  1  0 97
>  0 43      0 955440 751956  60520    0    0  8692     0 1846 1925  5  3  0 92
>  2 49      0 943644 752836  61412    0    0  9324   200 1783 1582  5  3  0 92
>  1 39      0 959368 751892  62104    0    0  7836    64 1874 1855  9  5  0 86
>
> This system has 2GB RAM and four 72GB drives in a 3Ware RAID10 array. The
> active filesystem is ext4 with the following mount options:
>
> noatime,nodiratime,data=journal
>
> The data=journal option comes from benchmarking I did a while back that
> indicated it was best for sync- and unlink-heavy workloads such as this
> one. I have remounted with data=ordered but that did not solve the
> problem.
>
> The kernel (as of now) is 2.6.29.4 compiled with gcc 3.4.6 on Gentoo.
>
> I also have another system, which is similarly configured but is using
> the ext3 filesystem. It does not exhibit this behavior, which leads me
> to suspect some difference between ext3 and ext4 is causing the problem.
> However, I have no other evidence to point a finger at ext4, and am at a
> loss as to what else to investigate.
>
> Has anybody else seen this behavior before? What other details can I
> investigate to figure out what is causing this problem? What other information
> would be useful to diagnose this?
>
> Thank you.
>
> --
> Bruce Guenter <bruce@xxxxxxxxxxxxxx> http://untroubled.org/
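
A minimal sketch (Python, not part of the original report) for checking
the quoted observation that free memory plus cache stays nearly constant
while the cache shrinks. The rows are copied from the first vmstat
listing above; the helper name parse_vmstat is hypothetical, and the
values are assumed to be in KiB, vmstat's default unit.

```python
# Column header and sample rows copied from the first vmstat listing.
VMSTAT = """\
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 87164 683212 1069748 0 0 544 164 856 730 3 2 91 4
0 1 0 81124 689100 1069508 0 0 5880 104 1070 834 2 1 74 23
0 1 0 89956 691588 1057408 0 0 5288 0 1163 915 0 2 72 25
0 1 0 138020 690652 1012444 0 0 5724 0 1136 831 2 0 75 23
0 0 0 243384 690460 906500 0 0 4716 0 1282 844 0 2 61 36
0 1 0 294704 690152 854232 0 0 1108 428 1123 1093 2 2 81 15
0 0 0 285984 690380 854504 0 0 252 0 721 671 3 1 92 3
0 1 0 426844 690780 722408 0 0 3096 1748 1197 846 1 2 84 13
0 1 0 579684 691232 568344 0 0 4228 156 1300 1083 2 2 69 27
1 1 0 676312 691832 467244 0 0 5256 0 1072 741 0 2 75 23
"""


def parse_vmstat(text):
    """Turn whitespace-separated vmstat rows into a list of dicts of ints.

    Hypothetical helper: expects the column-name line first, then data rows.
    """
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(v) for v in row))) for row in rows]


samples = parse_vmstat(VMSTAT)

# free + cache for each one-second sample
totals = [s["free"] + s["cache"] for s in samples]

# How far free + cache moves over the interval, versus how far cache drops
spread = max(totals) - min(totals)
cache_drop = samples[0]["cache"] - samples[-1]["cache"]

print("free+cache spread (KiB):", spread)
print("cache drop (KiB):       ", cache_drop)
```

On these ten samples the cache shrinks by roughly 588 MiB while
free + cache moves by only about 16 MiB, which is consistent with pages
leaving the cache and becoming free rather than being consumed elsewhere.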

