Re: System CPU increasing on idle 2.6.36

From: Simon Kirby
Date: Fri Jan 07 2011 - 19:52:17 EST


On Fri, Jan 07, 2011 at 10:05:39AM -0800, Mark Moseley wrote:

> NOTE: NFS/Kernel guys: I've left the 2.6.36.2 box still thrashing
> about in case there's something you'd like me to look at.
>
> Ok, both my 2.6.36.2 kernel box and my 2.6.35->2.6.36 bisect box (was
> bisected to ce7db282a3830f57f5b05ec48288c23a5c4d66d5 -- this is my
> first time doing bisect, so I'll preemptively apologize for doing

That's a merge commit, so that doesn't sound right (it doesn't contain
changes by itself). How were you running the bisect? Were you
definitely seeing a good case since v2.6.35? Could you post your
"git bisect log" so I can see if I can follow it?

> anything silly) both went berserk within 15 mins of each other, after
> an uptime of around 63 hours for 2.6.36.2 and 65 hours for the
> bisected box. The 2.6.36.2 one is still running with all the various
> flush-0:xx threads spinning wildly. The bisected box just keeled over
> and died, but is back up now. The only kernel messages logged are just
> of the "task kworker/4:1:359 blocked for more than 120 seconds"
> variety, all with _raw_spin_lock_irq at the top of the stack trace.
>
> Looking at slabinfo, specifically, this: stats=$(grep nfs_inode_cache
> /proc/slabinfo | awk '{ print $2, $3 }')
> the active_objs and num_objs both increase to over a million (these
> boxes are delivering mail to NFS-mounted mailboxes, so that's
> perfectly reasonable). On both boxes, looking at sar, things start to
> go awry around 10am today EST. At that time on the 2.6.36.2 box, the
> NFS numbers look like this:
>
> Fri Jan 7 09:58:00 2011: 1079433 1079650
> Fri Jan 7 09:59:00 2011: 1079632 1080300
> Fri Jan 7 10:00:00 2011: 1080196 1080300
> Fri Jan 7 10:01:01 2011: 1080599 1080716
> Fri Jan 7 10:02:01 2011: 1081074 1081288
>
> on the bisected, like this:
>
> Fri Jan 7 09:59:34 2011: 1162786 1165320
> Fri Jan 7 10:00:34 2011: 1163301 1165320
> Fri Jan 7 10:01:34 2011: 1164369 1165450
> Fri Jan 7 10:02:35 2011: 1164179 1165450
> Fri Jan 7 10:03:35 2011: 1165795 1166958
>
> When the bisected box finally died, the last numbers were:
>
> Fri Jan 7 10:40:33 2011: 1177156 1177202
> Fri Jan 7 10:42:21 2011: 1177157 1177306
> Fri Jan 7 10:44:55 2011: 1177201 1177324
> Fri Jan 7 10:45:55 2011: 1177746 1177826

Yeah, I was seeing load proportional to these, but I don't think their
growth is anything wrong. It just seems to be something that is taking
up CPU from flush proceses as they grow, and something which is causing a
thundering herd of reclaim with lots of spinlock contention.

Hmmm.. I tried to see if drop_caches would clear nfs_inode_cache and thus
the CPU usage. The top of slabtop before doing anything:

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1256343 1221799 97% 0.95K 38071 33 1218272K nfs_inode_cache

# echo 2 > /proc/sys/vm/drop_caches

this took about 40 seconds to execute, and perf top showed this for
pretty much the whole time:

samples pcnt function DSO
_______ _____ ________________________ ___________

1188.00 45.1% _raw_spin_lock [kernel]
797.00 30.2% rpcauth_cache_shrinker [kernel]
132.00 5.0% shrink_slab [kernel]
60.00 2.3% queue_io [kernel]
59.00 2.2% _cond_resched [kernel]
44.00 1.7% nfs_writepages [kernel]
35.00 1.3% writeback_single_inode [kernel]
34.00 1.3% writeback_sb_inodes [kernel]
23.00 0.9% do_writepages [kernel]
20.00 0.8% bit_waitqueue [kernel]
15.00 0.6% redirty_tail [kernel]
11.00 0.4% __iget [kernel]
11.00 0.4% __GI_strstr libc-2.7.so

slabtop then showed:

OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
1256211 531720 42% 0.95K 38067 33 1218144K nfs_inode_cache

and this doesn't change further with drop_caches=3 or 1. It's strange
that most of them were freed, but not all... We do actually have
RPC_CREDCACHE_DEFAULT_HASHBITS set to 10 at build time, which might
explain a bigger credcache and time to free that. Did you change your
auth_hashtable_size at all? Anyway, even with all of that, system CPU
doesn't change.

Ok, weird, Running "sync" actually caused "vmstat 1" to go from 25% CPU
spikes every 5 seconds with nfs_writepages in the "perf top" every 5
seconds to being idle! /proc/vmstat showed only nr_dirty 8 and
nr_writeback 0 before I ran "sync". Either that, or it was just good
timing.

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 6767892 9696 41088 0 0 5 180 15 70 62 7 29 1
0 0 0 6767892 9696 41088 0 0 0 0 1124 1978 0 0 100 0
0 0 0 6767892 9696 41088 0 0 0 0 1105 1967 0 0 100 0
0 0 0 6767892 9696 41088 0 0 0 0 1180 1984 0 2 98 0
2 0 0 6767892 9696 41088 0 0 0 0 2102 1968 0 22 78 0
0 0 0 6767768 9704 41084 0 0 0 52 1504 2022 0 8 91 1
0 0 0 6767768 9704 41084 0 0 0 0 1059 1965 0 0 100 0
0 0 0 6767768 9704 41088 0 0 0 0 1077 1982 0 0 100 0
0 0 0 6767768 9704 41088 0 0 0 0 1166 1972 1 2 97 0
0 0 0 6767768 9704 41088 0 0 0 0 1353 1990 0 6 94 0
0 0 0 6767644 9704 41100 0 0 0 0 2531 2050 0 27 72 0
0 0 0 6767644 9704 41100 0 0 0 0 1065 1974 0 0 100 0
1 0 0 6767644 9704 41100 0 0 0 0 1090 1967 0 0 100 0
0 0 0 6767644 9704 41100 0 0 0 0 1175 1971 0 2 98 0
0 0 0 6767644 9704 41100 0 0 0 0 1321 1967 0 6 94 0
1 0 0 6767644 9712 41100 0 0 0 52 1995 1995 0 18 81 1
0 0 0 6767644 9712 41100 0 0 0 0 1342 1964 0 6 94 0
0 0 0 6767644 9712 41100 0 0 0 0 1064 1974 0 0 100 0
0 0 0 6767644 9712 41100 0 0 0 0 1163 1968 0 2 98 0
0 0 0 6767644 9712 41100 0 0 0 0 1390 1976 1 6 94 0
0 0 0 6767768 9712 41112 0 0 0 0 1282 2018 0 1 99 0
0 0 0 6767768 9712 41112 0 0 0 0 2166 1972 0 24 76 0
0 0 0 6786260 9712 41112 0 0 0 0 1092 1989 0 0 100 0
0 0 0 6786260 9712 41112 0 0 0 0 1039 2003 0 2 98 0
0 0 0 6774588 9712 41112 0 0 0 0 1524 2015 3 9 89 0
0 1 0 6770124 9716 41112 0 0 0 48 1315 1990 3 0 87 11
1 0 0 6769628 9720 41112 0 0 0 4 2196 1966 0 20 79 0
0 0 0 6768364 9724 41108 0 0 36 0 1403 2031 1 5 94 1
0 0 0 6768388 9724 41144 0 0 0 0 1185 1969 0 2 98 0
0 0 0 6768388 9724 41144 0 0 0 0 1351 1975 0 6 94 0
0 0 0 6768140 9724 41156 0 0 0 0 1261 2020 0 1 99 0
0 0 0 6768140 9724 41156 0 0 0 0 1053 1973 0 0 100 0
1 0 0 6768140 9724 41156 0 0 0 0 2194 1971 1 26 73 0
0 0 0 6768140 9732 41148 0 0 0 52 1283 2088 0 4 96 0
2 0 0 6768140 9732 41156 0 0 0 0 1184 1965 0 2 98 0
0 0 0 6786368 9732 41156 0 0 0 0 1297 2000 0 4 97 0
0 0 0 6786384 9732 41156 0 0 0 0 998 1949 0 0 100 0
(stop vmstat 1, run sync, restart vmstat 1)
1 0 0 6765304 9868 41736 0 0 0 0 1239 1967 0 0 100 0
0 0 0 6765304 9868 41736 0 0 0 0 1174 2035 0 0 100 0
0 0 0 6765304 9868 41736 0 0 0 0 1084 1976 0 0 100 0
1 0 0 6765172 9868 41736 0 0 0 0 1100 1975 0 0 100 0
0 0 0 6765304 9868 41760 0 0 0 0 1501 1997 1 1 98 0
0 0 0 6765304 9868 41760 0 0 0 0 1092 1964 0 0 100 0
1 0 0 6765304 9868 41760 0 0 0 0 1096 1963 0 0 100 0
0 0 0 6765304 9868 41760 0 0 0 0 1083 1967 0 0 100 0
0 0 0 6765304 9876 41752 0 0 0 76 1101 1980 0 0 99 1
0 0 0 6765304 9876 41760 0 0 0 4 1199 2097 0 2 98 0
0 0 0 6765304 9876 41760 0 0 0 0 1055 1964 0 0 100 0
0 0 0 6765304 9876 41760 0 0 0 0 1064 1977 0 0 100 0
0 0 0 6765304 9876 41760 0 0 0 0 1073 1968 0 0 100 0
0 0 0 6765304 9876 41760 0 0 0 0 1080 1976 0 0 100 0
0 0 0 6765156 9884 41772 0 0 0 20 1329 2034 1 1 98 1
0 0 0 6765180 9884 41780 0 0 0 0 1097 1971 0 1 99 0
0 0 0 6765180 9884 41780 0 0 0 0 1073 1960 0 0 100 0
0 0 0 6765180 9884 41780 0 0 0 0 1079 1976 0 0 100 0
0 0 0 6765172 9884 41780 0 0 0 0 1120 1980 0 0 100 0
0 0 0 6765180 9892 41772 0 0 0 60 1097 1982 0 0 100 0
0 0 0 6765180 9892 41780 0 0 0 0 1067 1969 0 0 100 0
0 0 0 6765180 9892 41780 0 0 0 0 1063 1973 0 0 100 0
0 0 0 6765180 9892 41780 0 0 0 0 1064 1969 0 1 100 0

Hmmm. I'll watch it and see if it climbs back up again. That would mean
that for now, echo 2 > /proc/sys/vm/drop_caches ; sync is a workaround
for the problem we're seeing with system CPU slowly creeping up since
2.6.36. I have no idea if this affects the sudden flush process spinlock
contention and attempted writeback without much progress during normal
load, but it certainly changes things here.

http://0x.ca/sim/ref/2.6.37/cpu_after_drop_caches_2_and_sync.png

Simon-
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/