Re: Unusually high system CPU usage with recent kernels
From: Paul E. McKenney
Date: Mon Sep 09 2013 - 16:44:25 EST
On Mon, Sep 09, 2013 at 09:47:37PM +0200, Tibor Billes wrote:
> > From: Paul E. McKenney Sent: 09/08/13 08:43 PM
> > On Sun, Sep 08, 2013 at 07:22:45PM +0200, Tibor Billes wrote:
> > > > From: Paul E. McKenney Sent: 09/07/13 02:23 AM
> > > > On Tue, Sep 03, 2013 at 03:16:07PM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Sep 03, 2013 at 11:11:01PM +0200, Tibor Billes wrote:
> > > > > > > From: Paul E. McKenney Sent: 08/30/13 03:24 AM
> > > > > > > On Tue, Aug 27, 2013 at 10:05:42PM +0200, Tibor Billes wrote:
> > > > > > > > From: Paul E. McKenney Sent: 08/26/13 06:28 AM
> > > > > > > > > Here is a patch that is more likely to help. I am testing it in parallel,
> > > > > > > > > but figured I should send you a sneak preview.
> > > > > > > >
> > > > > > > > I tried it, but I don't see any difference in overall performance. The dstat
> > > > > > > > also shows the same as before.
> > > > > > > >
> > > > > > > > But I did notice something. Occasionally there is an increase in userspace
> > > > > > > > CPU usage, interrupts and context switches are dropping, and it really gets
> > > > > > > > more work done (scons printed commands more frequently). I checked that
> > > > > > > > this behaviour is present without your patch, I just didn't notice this
> > > > > > > > before. Maybe you can make some sense out of it.
> > > > > > > >
> > > > > > > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > > > > > > > time |usr sys idl wai hiq siq| read writ| recv send| in out | int csw | used free| used buff cach free|majpf minpf alloc free
> > > > > > > > 27-08 20:51:53| 23 62 5 0 11 0| 0 0 | 0 0 | 0 0 |1274 3102k| 0 7934M| 549M 56.0M 491M 6698M| 0 28 156 159
> > > > > > > > 27-08 20:51:54| 24 64 1 0 11 0| 0 0 | 0 0 | 0 0 |1317 3165k| 0 7934M| 549M 56.0M 491M 6698M| 0 53 189 182
> > > > > > > > 27-08 20:51:55| 33 50 6 2 9 0| 192k 1832k| 0 0 | 0 0 |1371 2442k| 0 7934M| 544M 56.0M 492M 6702M| 0 30k 17k 17k
> > > > > > > > 27-08 20:51:56| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1313 3220k| 0 7934M| 544M 56.0M 492M 6701M| 0 21 272 232
> > > > > > > > 27-08 20:51:57| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1319 3226k| 0 7934M| 544M 56.0M 492M 6701M| 0 8 96 112
> > > > > > > > 27-08 20:51:58| 25 63 0 0 12 0| 0 0 | 0 0 | 0 0 |1317 3224k| 0 7934M| 544M 56.0M 492M 6701M| 0 12 145 141
> > > > > > > > 27-08 20:51:59| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1317 3223k| 0 7934M| 544M 56.0M 492M 6701M| 0 54 193 191
> > > > > > > > 27-08 20:52:00| 25 63 0 0 12 0| 0 24k| 0 0 | 0 0 |1336 3216k| 0 7934M| 544M 56.0M 492M 6701M| 0 36 161 172
> > > > > > > > 27-08 20:52:01| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1313 3225k| 0 7934M| 544M 56.0M 492M 6701M| 0 9 107 107
> > > > > > > > 27-08 20:52:02| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1327 3224k| 0 7934M| 545M 56.0M 492M 6701M| 0 13 193 200
> > > > > > > > 27-08 20:52:03| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1311 3226k| 0 7934M| 545M 56.0M 492M 6701M| 0 13 114 114
> > > > > > > > 27-08 20:52:04| 25 63 0 0 12 0| 0 0 | 0 0 | 0 0 |1331 3223k| 0 7934M| 544M 56.0M 492M 6701M| 0 41 190 178
> > > > > > > > 27-08 20:52:05| 24 64 0 0 12 0| 0 8192B| 0 0 | 0 0 |1315 3222k| 0 7934M| 544M 56.0M 492M 6701M| 0 30 123 122
> > > > > > > > 27-08 20:52:06| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1314 3223k| 0 7934M| 544M 56.0M 492M 6701M| 0 16 187 195
> > > > > > > > 27-08 20:52:07| 25 63 1 0 12 0|2212k 192k| 0 0 | 0 0 |1637 3194k| 0 7934M| 544M 56.2M 494M 6699M| 0 1363 2590 1947
> > > > > > > > 27-08 20:52:08| 17 33 18 26 6 0|3208k 0 | 0 0 | 0 0 |1351 1766k| 0 7934M| 561M 56.3M 499M 6678M| 4 10k 7620 2055
> > > > > > > > 27-08 20:52:09| 47 21 16 13 4 0|4332k 628k| 0 0 | 0 0 |1400 1081k| 0 7934M| 647M 56.3M 504M 6587M| 10 24k 25k 1151
> > > > > > > > 27-08 20:52:10| 36 34 13 11 6 0|2636k 2820k| 0 0 | 0 0 |1451 1737k| 0 7934M| 598M 56.3M 507M 6632M| 5 19k 16k 28k
> > > > > > > > 27-08 20:52:11| 46 17 10 25 3 0|4288k 536k| 0 0 | 0 0 |1386 868k| 0 7934M| 613M 56.3M 513M 6611M| 24 13k 8908 3616
> > > > > > > > 27-08 20:52:12| 53 33 5 4 5 0|4740k 3992k| 0 0 | 0 0 |1773 1464k| 0 7934M| 562M 56.6M 521M 6654M| 0 36k 29k 40k
> > > > > > > > 27-08 20:52:13| 60 34 0 1 6 0|4228k 1416k| 0 0 | 0 0 |1642 1670k| 0 7934M| 593M 56.6M 526M 6618M| 0 36k 26k 17k
> > > > > > > > 27-08 20:52:14| 53 37 1 3 7 0|3008k 1976k| 0 0 | 0 0 |1513 1972k| 0 7934M| 541M 56.6M 529M 6668M| 0 10k 9986 23k
> > > > > > > > 27-08 20:52:15| 55 34 1 4 6 0|3636k 1284k| 0 0 | 0 0 |1645 1688k| 0 7934M| 581M 56.6M 535M 6622M| 0 43k 30k 18k
> > > > > > > > 27-08 20:52:16| 57 30 5 2 5 0|4404k 2320k| 0 0 | 0 0 |1715 1489k| 0 7934M| 570M 56.6M 541M 6627M| 0 39k 24k 26k
> > > > > > > > 27-08 20:52:17| 50 35 3 7 6 0|2520k 1972k| 0 0 | 0 0 |1699 1598k| 0 7934M| 587M 56.6M 554M 6596M| 0 65k 40k 32k
> > > > > > > > 27-08 20:52:18| 52 40 2 1 7 0|1556k 1732k| 0 0 | 0 0 |1582 1865k| 0 7934M| 551M 56.6M 567M 6619M| 0 35k 26k 33k
> > > > > > > > 27-08 20:52:19| 26 62 0 0 12 0| 0 0 | 0 0 | 0 0 |1351 3240k| 0 7934M| 551M 56.6M 568M 6619M| 0 86 440 214
> > > > > > > > 27-08 20:52:20| 26 63 0 0 11 0| 0 108k| 0 0 | 0 0 |1392 3162k| 0 7934M| 555M 56.6M 560M 6623M| 0 1801 1490 2672
> > > > > > > > 27-08 20:52:21| 25 63 0 0 12 0| 0 0 | 0 0 | 0 0 |1332 3198k| 0 7934M| 555M 56.6M 560M 6623M| 0 50 245 255
> > > > > > > > 27-08 20:52:22| 25 63 1 0 12 0| 0 0 | 0 0 | 0 0 |1350 3220k| 0 7934M| 556M 56.6M 560M 6622M| 0 755 544 286
> > > > > > > > 27-08 20:52:23| 27 62 1 0 11 0| 0 272k| 0 0 | 0 0 |1323 3092k| 0 7934M| 551M 56.6M 558M 6628M| 0 341 1464 3085
> > > > > > > > 27-08 20:52:24| 25 63 0 0 12 0| 0 0 | 0 0 | 0 0 |1334 3197k| 0 7934M| 551M 56.6M 559M 6627M| 0 63 514 273
> > > > > > > > 27-08 20:52:25| 25 63 0 0 12 0| 0 40k| 0 0 | 0 0 |1329 3243k| 0 7934M| 546M 56.6M 558M 6633M| 0 321 160 1679
> > > > > > > > 27-08 20:52:26| 39 50 1 1 9 0| 48k 644k| 0 0 | 0 0 |1500 2556k| 0 7934M| 552M 56.6M 560M 6625M| 0 30k 14k 12k
> > > > > > > > 27-08 20:52:27| 26 62 1 0 11 0| 0 192k| 0 0 | 0 0 |1380 3152k| 0 7934M| 553M 56.6M 560M 6624M| 0 2370 808 718
> > > > > > > > 27-08 20:52:28| 23 55 12 0 10 0| 0 0 | 0 0 | 0 0 |1247 2993k| 0 7934M| 553M 56.6M 561M 6624M| 0 1060 428 241
> > > > > > > > 27-08 20:52:29| 25 63 1 0 11 0| 0 0 | 0 0 | 0 0 |1318 3142k| 0 7934M| 554M 56.6M 561M 6623M| 0 663 442 198
> > > > > > > > 27-08 20:52:30| 25 64 0 0 12 0| 0 100k| 0 0 | 0 0 |1316 3212k| 0 7934M| 554M 56.6M 561M 6623M| 0 42 187 186
> > > > > > > > 27-08 20:52:31| 24 64 0 0 12 0| 0 4096B| 0 0 | 0 0 |1309 3222k| 0 7934M| 554M 56.6M 561M 6623M| 0 9 85 85
> > > > > > > > 27-08 20:52:32| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1297 3219k| 0 7934M| 554M 56.6M 561M 6623M| 0 23 95 84
> > > > > > > > 27-08 20:52:33| 25 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1339 3101k| 0 7934M| 555M 56.6M 557M 6625M| 0 923 1566 2176
> > > > > > > > 27-08 20:52:34| 25 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1333 3095k| 0 7934M| 555M 56.6M 559M 6623M| 0 114 701 195
> > > > > > > > 27-08 20:52:35| 26 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1342 3036k| 0 7934M| 557M 56.6M 568M 6613M| 0 957 3225 516
> > > > > > > > 27-08 20:52:36| 26 60 2 0 11 0| 0 28k| 0 0 | 0 0 |1340 3091k| 0 7934M| 555M 56.6M 568M 6614M| 0 5304 5422 5609
> > > > > > > > 27-08 20:52:37| 25 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1348 3260k| 0 7934M| 556M 56.6M 565M 6617M| 0 30 156 1073
> > > > > > > > 27-08 20:52:38| 24 64 0 0 12 0| 0 0 | 0 0 | 0 0 |1314 3211k| 0 7934M| 556M 56.6M 549M 6633M| 0 11 105 4285
> > > > > > > > 27-08 20:52:39| 26 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1353 3031k| 0 7934M| 558M 56.6M 559M 6620M| 0 847 3866 357
> > > > > > > > 27-08 20:52:40| 26 63 0 0 12 0| 0 0 | 0 0 | 0 0 |1309 3135k| 0 7934M| 569M 56.6M 566M 6602M| 0 3940 5727 1288
> > > > > > >
> > > > > > > Interesting. The number of context switches drops during the time
> > > > > > > that throughput improves. It would be good to find out what task(s)
> > > > > > > are doing all of the context switches. One way to find the pids of the
> > > > > > > top 20 context-switching tasks should be something like this:
> > > > > > >
> > > > > > > grep ctxt /proc/*/status | sort -k2nr | head -20
> > > > > > >
> > > > > > > You could use any number of methods to map back to the command.
> > > > > > > When generating my last patch, I was assuming that ksoftirqd would be
> > > > > > > the big offender. Of course, if it is something else, I should be
> > > > > > > taking a different approach.
> > > > > >
> > > > > > I did some measuring of these context switches. I used the above command
> > > > > > line to get the number of context switches before compiling and about 40
> > > > > > seconds later.
> > > > > >
> > > > > > The good case is:
> > > > > > file before after process name
> > > > > > /proc/10/status:voluntary_ctxt_switches: 38108 54327 [rcu_sched]
> > > > > > /proc/19/status:voluntary_ctxt_switches: 31022 38682 [ksoftirqd/2]
> > > > > > /proc/14/status:voluntary_ctxt_switches: 12947 16914 [ksoftirqd/1]
> > > > > > /proc/3/status:voluntary_ctxt_switches: 11061 14885 [ksoftirqd/0]
> > > > > > /proc/194/status:voluntary_ctxt_switches: 10044 11660 [kworker/2:1H]
> > > > > > /proc/1603/status:voluntary_ctxt_switches: 7626 9593 /usr/bin/X
> > > > > > /proc/24/status:voluntary_ctxt_switches: 6415 9571 [ksoftirqd/3]
> > > > > > /proc/20/status:voluntary_ctxt_switches: 5317 8879 [kworker/2:0]
> > > > > >
> > > > > > The bad case is:
> > > > > > file before after process name
> > > > > > /proc/3/status:voluntary_ctxt_switches: 82411972 98542227 [ksoftirqd/0]
> > > > > > /proc/23/status:voluntary_ctxt_switches: 79592040 94206823 [ksoftirqd/3]
> > > > > > /proc/13/status:voluntary_ctxt_switches: 79053870 93654161 [ksoftirqd/1]
> > > > > > /proc/18/status:voluntary_ctxt_switches: 78136449 92288688 [ksoftirqd/2]
> > > > > > /proc/2293/status:nonvoluntary_ctxt_switches: 29038881 29038881 mate-panel
> > > > > > /proc/2308/status:nonvoluntary_ctxt_switches: 26607661 26607661 nm-applet
> > > > > > /proc/2317/status:nonvoluntary_ctxt_switches: 15494474 15494474 /usr/lib/polkit-mate/polkit-mate-authentication-agent-1
> > > > > > /proc/2148/status:nonvoluntary_ctxt_switches: 13763674 13763674 x-session-manager
> > > > > > /proc/2985/status:nonvoluntary_ctxt_switches: 12062706 python /usr/bin/scons -j4
> > > > > > /proc/2323/status:nonvoluntary_ctxt_switches: 11581510 11581510 mate-volume-control-applet
> > > > > > /proc/2353/status:nonvoluntary_ctxt_switches: 9213436 9213436 mate-power-manager
> > > > > > /proc/2305/status:nonvoluntary_ctxt_switches: 8328471 8328471 /usr/lib/matecomponent/matecomponent-activation-server
> > > > > > /proc/3041/status:nonvoluntary_ctxt_switches: 7638312 sh -c "/opt/arm-2011.03/bin/arm-none-eabi-gcc" [...]
> > > > > > /proc/3055/status:nonvoluntary_ctxt_switches: 4843977 sh -c "/opt/arm-2011.03/bin/arm-none-eabi-gcc" [...]
> > > > > > /proc/2253/status:nonvoluntary_ctxt_switches: 4449918 4449918 mate-keyring-daemon
> > > > >
> > > > > Thank you for the info!
> > > > >
> > > > > > The processes for the tables were partly collected by hand so they may
> > > > > > not contain every relevant process. But what I see is that there are a
> > > > > > lot more context switches in the bad case, quite many of these are non
> > > > > > voluntary. The context switch increase is pretty much everywhere, not
> > > > > > just ksoftirqd. I suspect the Mate related processes did their context
> > > > > > switches during bootup/login, which is consistent with the bootup/login
> > > > > > procedure being significantly slower (just like the compilation).
> > > > >
> > > > > Also, the non-ksoftirq context switches are nonvoluntary, which likely
> > > > > means that some of their context switches were preemptions by ksoftirqd.
> > > > >
> > > > > OK, time to look at whatever else might be causing this...
> > > >
> > > > I am still unable to reproduce this, so am reduced to shooting in the
> > > > dark. The attached patch could in theory help. If it doesn't, would
> > > > you be willing to do a bit of RCU event tracing?
> > > >
> > > > Thanx, Paul
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > rcu: Throttle invoke_rcu_core() invocations due to non-lazy callbacks
> > > >
> > > > If a non-lazy callback arrives on a CPU that has previously gone idle
> > > > with no non-lazy callbacks, invoke_rcu_core() forces the RCU core to
> > > > run. However, it does not update the conditions, which could result
> > > > in several closely spaced invocations of the RCU core, which in turn
> > > > could result in an excessively high context-switch rate and resulting
> > > > high overhead.
> > > >
> > > > This commit therefore updates the ->all_lazy and ->nonlazy_posted_snap
> > > > fields to prevent closely spaced invocations.
> > > >
> > > > Reported-by: Tibor Billes <tbilles@xxxxxxx>
> > > > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
> > > >
> > > > diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
> > > > index dc09fb0..9d445ab 100644
> > > > --- a/kernel/rcutree_plugin.h
> > > > +++ b/kernel/rcutree_plugin.h
> > > > @@ -1745,6 +1745,8 @@ static void rcu_prepare_for_idle(int cpu)
> > > > */
> > > > if (rdtp->all_lazy &&
> > > > rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) {
> > > > + rdtp->all_lazy = false;
> > > > + rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
> > > > invoke_rcu_core();
> > > > return;
> > > > }
> > >
> > > Good news Paul, the above patch did solve this issue :) I see no extra
> > > context switches, no extra CPU usage and no extra compile time.
> >
> > Woo-hoo!!! ;-)
> >
> > May I add your Tested-by to the fix?
>
> Yes, please :)
Done! ;-)
> > > Any idea why couldn't you reproduce this? Why did it only hit my system?
> >
> > Timing, maybe? Another question is "Why didn't lots of people complain
> > about this?" It would be good to find out because it is quite possible
> > that there is some other bug that this patch is masking -- or even just
> > making less probable.
>
> Good point!
>
> > If you are interested, please build one of the old kernels but with
> > CONFIG_RCU_TRACE=y. Then run something like the following as root
> > concurrently with your workload:
> >
> > sleep 10
> > echo 1 > /sys/kernel/debug/tracing/events/rcu/enable
> > sleep 0.01
> > echo 0 > /sys/kernel/debug/tracing/events/rcu/enable
> > cat /sys/kernel/debug/tracing/trace > /tmp/trace
> >
> > Send me the /tmp/trace file, which will probably be a few megabytes in
> > size, so please compress it before sending. ;-) A good compression
> > tool should be able to shrink it by a factor of 20 or thereabouts.
>
> Ok, I did that. Twice! The first is with commit
> 910ee45db2f4837c8440e770474758493ab94bf7, which was the first bad commit
> according to the bisection I did initially. Second with the current
> mainline 3.11. I have little idea of what the fields and lines mean in
> the RCU trace files, so I'm not going to guess if they are essentially
> the same or not, but it may provide more information to you. Both files
> were created by using a shell script containing the commands you
> suggested.
So traces both correspond to bad cases, correct? They are both quite
impressive -- looks like you have quite the interrupt rate going on there!
Almost looks like interrupts are somehow getting enabled on the path
to/from idle.
Could you please also send along a trace with the fix applied?
> I'm not sure about LKML policies about attaching not-so-small files to
> emails, so I've dropped LKML from the CC list. Please CC the mailing
> list in your reply.
Done!
Another approach is to post the traces on the web and send the URL to
LKML. But whatever works for you is fine by me.
Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/