Re: Unusually high system CPU usage with recent kernels
From: Paul E. McKenney
Date: Sat Aug 24 2013 - 17:07:22 EST
On Sat, Aug 24, 2013 at 09:59:45PM +0200, Tibor Billes wrote:
> From: Paul E. McKenney Sent: 08/22/13 12:09 AM
> > On Wed, Aug 21, 2013 at 11:05:51PM +0200, Tibor Billes wrote:
> > > > From: Paul E. McKenney Sent: 08/21/13 09:12 PM
> > > > On Wed, Aug 21, 2013 at 08:14:46PM +0200, Tibor Billes wrote:
> > > > > > From: Paul E. McKenney Sent: 08/20/13 11:43 PM
> > > > > > On Tue, Aug 20, 2013 at 10:52:26PM +0200, Tibor Billes wrote:
> > > > > > > > From: Paul E. McKenney Sent: 08/20/13 04:53 PM
> > > > > > > > On Tue, Aug 20, 2013 at 08:01:28AM +0200, Tibor Billes wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I was using the 3.9.7 stable release and tried to upgrade to the 3.10.x series.
> > > > > > > > > The 3.10.x series was showing unusually high (>75%) system CPU usage in some
> > > > > > > > > situations, making things really slow. The latest stable I tried is 3.10.7.
> > > > > > > > > I also tried 3.11-rc5, they both show this behaviour. This behaviour doesn't
> > > > > > > > > show up when the system is idling, only when doing some CPU intensive work,
> > > > > > > > > like compiling with multiple threads. Compiling with only one thread seems not
> > > > > > > > > to trigger this behaviour.
> > > > > > > > >
> > > > > > > > > To be more precise I did a `perf record -a` while compiling a large C++ program
> > > > > > > > > with scons using 4 threads, the result is appended at the end of this email.
> > > > > > > >
> > > > > > > > New one on me! You are running a mainstream system (x86_64), so I am
> > > > > > > > surprised no one else noticed.
> > > > > > > >
> > > > > > > > Could you please send along your .config file?
> > > > > > >
> > > > > > > Here it is
> > > > > >
> > > > > > Interesting. I don't see RCU stuff all that high on the list, but
> > > > > > the items I do see lead me to suspect RCU_FAST_NO_HZ, which has some
> > > > > > relevance to the otherwise inexplicable group of commits you located
> > > > > > with your bisection. Could you please rerun with CONFIG_RCU_FAST_NO_HZ=n?
> > > > > >
> > > > > > If that helps, there are some things I could try.
> > > > >
> > > > > It did help. I didn't notice anything unusual when running with CONFIG_RCU_FAST_NO_HZ=n.
> > > >
> > > > Interesting. Thank you for trying this -- and we at least have a
> > > > short-term workaround for this problem. I will put a patch together
> > > > for further investigation.
> > >
> > > I don't specifically need this config option so I'm fine without it in
> > > the long term, but I guess it's not supposed to behave like that.
> >
> > OK, good, we have a long-term workload for your specific case,
> > even better. ;-)
> >
> > But yes, there are situations where RCU_FAST_NO_HZ needs to work
> > a bit better. I hope you will bear with me with a bit more
> > testing...
> >
> > > > In the meantime, could you please tell me how you were measuring
> > > > performance for your kernel builds? Wall-clock time required to complete
> > > > one build? Number of builds completed per unit time? Something else?
> > >
> > > Actually, I wasn't all this sophisticated. I have a system monitor
> > > applet on my top panel (using MATE, Linux Mint), four little graphs,
> > > one of which shows CPU usage. Different colors indicate different kind
> > > of CPU usage. Blue shows user space usage, red shows system usage, and
> > > two more for nice and iowait. During a normal compile it's almost
> > > completely filled with blue user space CPU usage, only the top few
> > > pixels show some iowait and system usage. With CONFIG_RCU_FAST_NO_HZ
> > > set, about 3/4 of the graph was red system CPU usage, the rest was
> > > blue. So I just looked for a pile of red on my graphs when I tested
> > > different kernel builds. But also compile speed was horrible I couldn't
> > > wait for the build to finish. Even the UI got unresponsive.
> >
> > We have been having problems with CPU accounting, but this one looks
> > quite real.
> >
> > > Now I did some measuring. In the normal case a compile finished in 36
> > > seconds, compiled 315 object files. Here are some output lines from
> > > dstat -tasm --vm during compile:
> > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > > time |usr sys idl wai hiq siq| read writ| recv send| in out | int csw | used free| used buff cach free|majpf minpf alloc free
> > > 21-08 21:48:05| 91 8 2 0 0 0| 0 5852k| 0 0 | 0 0 |1413 1772 | 0 7934M| 581M 58.0M 602M 6553M| 0 71k 46k 54k
> > > 21-08 21:48:06| 93 6 1 0 0 0| 0 2064k| 137B 131B| 0 0 |1356 1650 | 0 7934M| 649M 58.0M 604M 6483M| 0 72k 47k 28k
> > > 21-08 21:48:07| 86 11 4 0 0 0| 0 5872k| 0 0 | 0 0 |2000 2991 | 0 7934M| 577M 58.0M 627M 6531M| 0 99k 67k 79k
> > > 21-08 21:48:08| 87 9 3 0 0 0| 0 2840k| 0 0 | 0 0 |2558 4164 | 0 7934M| 597M 58.0M 632M 6507M| 0 96k 57k 51k
> > > 21-08 21:48:09| 93 7 1 0 0 0| 0 3032k| 0 0 | 0 0 |1329 1512 | 0 7934M| 641M 58.0M 626M 6469M| 0 61k 48k 39k
> > > 21-08 21:48:10| 93 6 0 0 0 0| 0 4984k| 0 0 | 0 0 |1160 1146 | 0 7934M| 572M 58.0M 628M 6536M| 0 50k 40k 57k
> > > 21-08 21:48:11| 86 9 6 0 0 0| 0 2520k| 0 0 | 0 0 |2947 4760 | 0 7934M| 605M 58.0M 631M 6500M| 0 103k 55k 45k
> > > 21-08 21:48:12| 90 8 2 0 0 0| 0 2840k| 0 0 | 0 0 |2674 4179 | 0 7934M| 671M 58.0M 635M 6431M| 0 84k 59k 42k
> > > 21-08 21:48:13| 90 9 1 0 0 0| 0 4656k| 0 0 | 0 0 |1223 1410 | 0 7934M| 643M 58.0M 638M 6455M| 0 90k 62k 68k
> > > 21-08 21:48:14| 91 8 1 0 0 0| 0 3572k| 0 0 | 0 0 |1432 1828 | 0 7934M| 647M 58.0M 641M 6447M| 0 81k 59k 57k
> > > 21-08 21:48:15| 91 8 1 0 0 0| 0 5116k| 116B 0 | 0 0 |1194 1295 | 0 7934M| 605M 58.0M 644M 6487M| 0 69k 54k 64k
> > > 21-08 21:48:16| 87 10 3 0 0 0| 0 5140k| 0 0 | 0 0 |1761 2586 | 0 7934M| 584M 58.0M 650M 6502M| 0 105k 64k 68k
> > >
> > > The abnormal case compiled only 182 object file in 6 and a half minutes,
> > > then I stopped it. The same dstat output for this case:
> > > ----system---- ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- ----swap--- ------memory-usage----- -----virtual-memory----
> > > time |usr sys idl wai hiq siq| read writ| recv send| in out | int csw | used free| used buff cach free|majpf minpf alloc free
> > > 21-08 22:10:49| 27 62 0 0 11 0| 0 0 | 210B 0 | 0 0 |1414 3137k| 0 7934M| 531M 57.6M 595M 6611M| 0 1628 1250 322
> > > 21-08 22:10:50| 25 60 4 0 11 0| 0 88k| 126B 0 | 0 0 |1337 3110k| 0 7934M| 531M 57.6M 595M 6611M| 0 91 128 115
> > > 21-08 22:10:51| 26 63 0 0 11 0| 0 184k| 294B 0 | 0 0 |1411 3147k| 0 7934M| 531M 57.6M 595M 6611M| 0 1485 814 815
> > > 21-08 22:10:52| 26 63 0 0 11 0| 0 0 | 437B 239B| 0 0 |1355 3160k| 0 7934M| 531M 57.6M 595M 6611M| 0 24 94 97
> > > 21-08 22:10:53| 26 63 0 0 11 0| 0 0 | 168B 0 | 0 0 |1397 3155k| 0 7934M| 531M 57.6M 595M 6611M| 0 479 285 273
> > > 21-08 22:10:54| 26 63 0 0 11 0| 0 4096B| 396B 324B| 0 0 |1346 3154k| 0 7934M| 531M 57.6M 595M 6611M| 0 27 145 145
> > > 21-08 22:10:55| 26 63 0 0 11 0| 0 60k| 0 0 | 0 0 |1353 3148k| 0 7934M| 531M 57.6M 595M 6610M| 0 93 117 36
> > > 21-08 22:10:56| 26 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1341 3172k| 0 7934M| 531M 57.6M 595M 6610M| 0 158 87 74
> > > 21-08 22:10:57| 26 62 1 0 11 0| 0 0 | 42B 60B| 0 0 |1332 3162k| 0 7934M| 531M 57.6M 595M 6610M| 0 56 82 78
> > > 21-08 22:10:58| 26 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1334 3178k| 0 7934M| 531M 57.6M 595M 6610M| 0 26 56 56
> > > 21-08 22:10:59| 26 63 0 0 11 0| 0 0 | 0 0 | 0 0 |1336 3179k| 0 7934M| 531M 57.6M 595M 6610M| 0 3 33 32
> > > 21-08 22:11:00| 26 63 0 0 11 0| 0 24k| 90B 108B| 0 0 |1347 3172k| 0 7934M| 531M 57.6M 595M 6610M| 0 41 73 71
> > >
> > > I have four logical cores so 25% makes up 1 core. I don't know if the ~26% user CPU usage has anthing to do with this fact or just coincidence. The rest is ~63% system and ~11% hardware interrupt. Do these support what you suspect?
> >
> > The massive increase in context switches does come as a bit of a surprise!
> > It does rule out my initial suspicion of lock contention, but then again
> > the fact that you have only four CPUs made that pretty unlikely to begin
> > with.
> >
> > 2.4k average context switches in the good case for the full run vs. 3,156k
> > for about half of a run in the bad case. That is an increase of more
> > than three orders of magnitude!
> >
> > Yow!!!
> >
> > Page faults are actually -higher- in the good case. You have about 6.5GB
> > free in both cases, so you are not running out of memory. Lots more disk
> > writes in the good case, perhaps consistent with its getting more done.
> > Networking is negligible in both cases.
> >
> > Lots of hardware interrupts in the bad case as well. Would you be willing
> > to take a look at /proc/interrupts before and after to see which one you
> > are getting hit with? (Or whatever interrupt tracking tool you prefer.)
>
> Here are the results.
>
> Good case before:
> CPU0 CPU1 CPU2 CPU3
> 0: 17 0 0 0 IO-APIC-edge timer
> 1: 356 1 68 4 IO-APIC-edge i8042
> 8: 0 0 1 0 IO-APIC-edge rtc0
> 9: 330 14 449 71 IO-APIC-fasteoi acpi
> 12: 10 108 269 2696 IO-APIC-edge i8042
> 16: 36 10 111 2 IO-APIC-fasteoi ehci_hcd:usb1
> 17: 20 3 25 4 IO-APIC-fasteoi mmc0
> 21: 3 0 34 0 IO-APIC-fasteoi ehci_hcd:usb2
> 40: 0 1 12 11 PCI-MSI-edge mei_me
> 41: 10617 173 9959 292 PCI-MSI-edge ahci
> 42: 862 11 186 26 PCI-MSI-edge xhci_hcd
> 43: 107 77 27 102 PCI-MSI-edge i915
> 44: 5322 20 434 22 PCI-MSI-edge iwlwifi
> 45: 180 0 183 86 PCI-MSI-edge snd_hda_intel
> 46: 0 3 0 0 PCI-MSI-edge eth0
> NMI: 1 0 0 0 Non-maskable interrupts
> LOC: 16312 15177 10840 8995 Local timer interrupts
> SPU: 0 0 0 0 Spurious interrupts
> PMI: 1 0 0 0 Performance monitoring interrupts
> IWI: 1160 523 1031 481 IRQ work interrupts
> RTR: 3 0 0 0 APIC ICR read retries
> RES: 14976 16135 9973 10784 Rescheduling interrupts
> CAL: 482 457 151 370 Function call interrupts
> TLB: 70 106 352 230 TLB shootdowns
> TRM: 0 0 0 0 Thermal event interrupts
> THR: 0 0 0 0 Threshold APIC interrupts
> MCE: 0 0 0 0 Machine check exceptions
> MCP: 2 2 2 2 Machine check polls
> ERR: 0
> MIS: 0
>
> Good case after:
> CPU0 CPU1 CPU2 CPU3
> 0: 17 0 0 0 IO-APIC-edge timer
> 1: 367 1 81 4 IO-APIC-edge i8042
> 8: 0 0 1 0 IO-APIC-edge rtc0
> 9: 478 14 460 71 IO-APIC-fasteoi acpi
> 12: 10 108 269 2696 IO-APIC-edge i8042
> 16: 36 10 111 2 IO-APIC-fasteoi ehci_hcd:usb1
> 17: 20 3 25 4 IO-APIC-fasteoi mmc0
> 21: 3 0 34 0 IO-APIC-fasteoi ehci_hcd:usb2
> 40: 0 1 12 11 PCI-MSI-edge mei_me
> 41: 16888 173 9959 292 PCI-MSI-edge ahci
> 42: 1102 11 186 26 PCI-MSI-edge xhci_hcd
> 43: 107 132 27 136 PCI-MSI-edge i915
> 44: 6943 20 434 22 PCI-MSI-edge iwlwifi
> 45: 180 0 183 86 PCI-MSI-edge snd_hda_intel
> 46: 0 3 0 0 PCI-MSI-edge eth0
> NMI: 4 3 3 3 Non-maskable interrupts
> LOC: 26845 24780 19025 17746 Local timer interrupts
> SPU: 0 0 0 0 Spurious interrupts
> PMI: 4 3 3 3 Performance monitoring interrupts
> IWI: 1637 751 1287 695 IRQ work interrupts
> RTR: 3 0 0 0 APIC ICR read retries
> RES: 26511 26673 18791 20194 Rescheduling interrupts
> CAL: 510 480 151 370 Function call interrupts
> TLB: 361 292 575 461 TLB shootdowns
> TRM: 0 0 0 0 Thermal event interrupts
> THR: 0 0 0 0 Threshold APIC interrupts
> MCE: 0 0 0 0 Machine check exceptions
> MCP: 2 2 2 2 Machine check polls
> ERR: 0
> MIS: 0
>
> Bad case before:
> CPU0 CPU1 CPU2 CPU3
> 0: 17 0 0 0 IO-APIC-edge timer
> 1: 172 3 78 3 IO-APIC-edge i8042
> 8: 0 1 0 0 IO-APIC-edge rtc0
> 9: 1200 148 395 81 IO-APIC-fasteoi acpi
> 12: 1625 2 348 10 IO-APIC-edge i8042
> 16: 26 23 115 0 IO-APIC-fasteoi ehci_hcd:usb1
> 17: 16 3 12 21 IO-APIC-fasteoi mmc0
> 21: 2 2 33 0 IO-APIC-fasteoi ehci_hcd:usb2
> 40: 0 0 14 10 PCI-MSI-edge mei_me
> 41: 15776 374 8497 687 PCI-MSI-edge ahci
> 42: 1297 829 115 24 PCI-MSI-edge xhci_hcd
> 43: 103 149 9 212 PCI-MSI-edge i915
> 44: 13151 101 511 91 PCI-MSI-edge iwlwifi
> 45: 153 159 0 122 PCI-MSI-edge snd_hda_intel
> 46: 0 1 1 0 PCI-MSI-edge eth0
> NMI: 32 31 31 31 Non-maskable interrupts
> LOC: 82504 82732 74172 75985 Local timer interrupts
> SPU: 0 0 0 0 Spurious interrupts
> PMI: 32 31 31 31 Performance monitoring interrupts
> IWI: 17816 16278 13833 13282 IRQ work interrupts
> RTR: 3 0 0 0 APIC ICR read retries
> RES: 18784 21084 13313 12946 Rescheduling interrupts
> CAL: 393 422 306 356 Function call interrupts
> TLB: 231 176 235 191 TLB shootdowns
> TRM: 0 0 0 0 Thermal event interrupts
> THR: 0 0 0 0 Threshold APIC interrupts
> MCE: 0 0 0 0 Machine check exceptions
> MCP: 3 3 3 3 Machine check polls
> ERR: 0
> MIS: 0
>
> Bad case after:
> CPU0 CPU1 CPU2 CPU3
> 0: 17 0 0 0 IO-APIC-edge timer
> 1: 415 3 85 3 IO-APIC-edge i8042
> 8: 0 1 0 0 IO-APIC-edge rtc0
> 9: 1277 148 428 81 IO-APIC-fasteoi acpi
> 12: 1625 2 348 10 IO-APIC-edge i8042
> 16: 26 23 115 0 IO-APIC-fasteoi ehci_hcd:usb1
> 17: 16 3 12 21 IO-APIC-fasteoi mmc0
> 21: 2 2 33 0 IO-APIC-fasteoi ehci_hcd:usb2
> 40: 0 0 14 10 PCI-MSI-edge mei_me
> 41: 17814 374 8497 687 PCI-MSI-edge ahci
> 42: 1567 829 115 24 PCI-MSI-edge xhci_hcd
> 43: 103 177 9 242 PCI-MSI-edge i915
> 44: 14956 101 511 91 PCI-MSI-edge iwlwifi
> 45: 153 159 0 122 PCI-MSI-edge snd_hda_intel
> 46: 0 1 1 0 PCI-MSI-edge eth0
> NMI: 36 35 34 34 Non-maskable interrupts
> LOC: 92429 92708 81714 84071 Local timer interrupts
> SPU: 0 0 0 0 Spurious interrupts
> PMI: 36 35 34 34 Performance monitoring interrupts
> IWI: 22594 19658 17439 14257 IRQ work interrupts
> RTR: 3 0 0 0 APIC ICR read retries
> RES: 21491 24670 14618 14569 Rescheduling interrupts
> CAL: 441 439 306 356 Function call interrupts
> TLB: 232 181 274 465 TLB shootdowns
> TRM: 0 0 0 0 Thermal event interrupts
> THR: 0 0 0 0 Threshold APIC interrupts
> MCE: 0 0 0 0 Machine check exceptions
> MCP: 3 3 3 3 Machine check polls
> ERR: 0
> MIS: 0
Lots more local timer interrupts, which is consistent with the higher
time in interrupt handlers for the bad case.
> > One hypothesis is that your workload and configuration are interacting
> > with RCU_FAST_NO_HZ to force very large numbers of RCU grace periods.
> > Could you please check for this by building with CONFIG_RCU_TRACE=y,
> > mounting debugfs somewhere and dumping rcu/rcu_sched/rcugp before and
> > after each run?
>
> Good case before:
> completed=8756 gpnum=8757 age=0 max=21
> after:
> completed=14686 gpnum=14687 age=0 max=21
>
> Bad case before:
> completed=22970 gpnum=22971 age=0 max=21
> after:
> completed=26110 gpnum=26111 age=0 max=21
In the good case, (14686-8756)/40=148.25 grace periods per second, which
is a fast but reasonable rate given your HZ=250. Not a large difference
in the number of grace periods, but extrapolating for the longer runtime,
maybe ten times as much. But not much change in grace-period rate per
unit time.
> The test scenario was the following in both cases (mixed english and pseudo-bash):
> reboot, login, start terminal
> cd project
> rm -r build
> cat /proc/interrupts >> somefile ; cat /sys/kernel/debug/rcu/rcu_sched/rcugp >> somefile
> scons -j4
> wait ~40 sec (good case finished, Ctrl-C in bad case)
> cat /proc/interrupts >> somefile ; cat /sys/kernel/debug/rcu/rcu_sched/rcugp >> somefile
>
> I stopped the build in the bad case after about the same time the good
> case finished, so the extra interrupts and RCU grace periods because of the
> longer runtime don't fake the results.
That procedure works for me, thank you for laying it out carefully.
I believe I see what is going on and how to fix it, though it may take
me a bit to work things through and get a good patch.
Thank you very much for your testing efforts!
Thanx, Paul
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/