Performance of 'forky' workloads

From: Richard Purdie
Date: Mon Dec 07 2015 - 08:34:55 EST


This may be a silly question, however it's puzzling me and I don't
have an answer I'm happy with, so I'm going to ask.

I have a 'stupid' workload which does a lot of forking, basically
pathologically stupid configure scripts. It's easy to replicate:

$ wget http://ftp.gnu.org/pub/gnu/gettext/gettext-0.19.4.tar.xz
$ tar -xf gettext-0.19.4.tar.xz
$ cd gettext-0.19.4/
$ time ./configure

I'm picking gettext simply because it shows the problem "nicely". I
work on build systems and, whether I like it or not, we have to deal
with configure scripts. Running this in the same directory again does
use some cached values, but not enough to really change the problem
I'm illustrating.

I see something like:

gl_cv_func_sleep_works=yes gl_cv_header_working_fcntl_h=yes taskset -c 0,36 time ./configure

15.02user 2.65system 0:34.76elapsed 50%CPU (0avgtext+0avgdata 34692maxresident)k
119464inputs+109816outputs (369major+8195438minor)pagefaults 0swaps
[cold caches]

14.87user 2.78system 0:32.58elapsed 54%CPU (0avgtext+0avgdata 34700maxresident)k
0inputs+109816outputs (0major+8196962minor)pagefaults 0swaps
[hot caches]

What is puzzling me is that 34.76 != 15.02 + 2.65. A little bit of
fuzziness in the accounting I could understand, fine, but 50% missing?
Where is that time being spent?
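
To put numbers on it: (15.02 + 2.65) / 34.76 is roughly 51%, which is
exactly the %CPU figure time prints, so about half of the wall-clock
time is spent neither in userspace nor in the kernel on behalf of this
job. A rough first cut at quantifying the gap would be something like
the following (those events are among perf stat's defaults, exact
names may vary a little with perf version):

$ perf stat -e task-clock,context-switches,cpu-migrations,page-faults \
      taskset -c 0,36 ./configure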

I have done a few bits of tuning already. I've discovered that cache
thrashing was horrid with the intel_pstate governor in powersave mode,
doubling the time it took. In performance mode the cores perform
equally, the process doesn't appear to get migrated around, and hence
it works better. It's slightly faster if I also give it access to the
HT sibling of the core I pin it to, hence the taskset above is optimal
for my system.
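
For anyone wanting to reproduce the setup, the governor switch and the
HT sibling lookup are roughly as follows (the 0,36 pairing is specific
to my box; thread_siblings_list shows the sibling of a given core):

$ echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
$ taskset -c 0,36 time ./configure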

I've also patched out the "sleep 1" calls in the configure scripts
(there are 5 of them); before I did this the time was more like 40s.
This is also why the cached entries are on the configure command line
above: they disable a couple of tests which sleep. We can tweak the
build system macros to make those disappear, but my question stands.
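
For reference, the sleeps are easy enough to locate, and the cache
values can also be pre-seeded via autoconf's config.site mechanism
rather than on the command line; a sketch, untested beyond this
package:

$ grep -rn 'sleep 1' --include=configure .
$ cat > config.site << 'EOF'
gl_cv_func_sleep_works=yes
gl_cv_header_working_fcntl_h=yes
EOF
$ CONFIG_SITE=$PWD/config.site ./configure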

Even with all my tweaks there is still around 50% of the wall-clock
time disappearing, and my CPUs are allegedly idling a lot of the time.

I have tried looking at what perf shows for active CPU usage, for
scheduler switching and also for lock analysis. I didn't find anything
obvious: it is spending a lot of time around pipe locks, but equally
perf's own locks distorted the lock stats badly, as perf is heavy
around new process creation. The sched wait times show switching
around do_wait() for the most part. Running this load under a kernel
with LOCK_STAT enabled is around 10s slower.
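
For completeness, the perf runs I mean were roughly along these lines
(exact sub-commands and options vary a little between perf versions,
and perf lock needs a kernel with the lock events available):

$ perf record -g taskset -c 0,36 ./configure && perf report
$ perf sched record taskset -c 0,36 ./configure && perf sched latency --sort max
$ perf lock record taskset -c 0,36 ./configure && perf lock report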

I have also tried putting the whole thing on tmpfs and pointing TMPDIR
at a tmpfs, all with no gain; it doesn't appear to be IO bound.
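
In case the detail matters, the tmpfs variant amounted to something
like this (mount point and size are arbitrary choices):

$ sudo mount -t tmpfs -o size=4g tmpfs /mnt/build
$ export TMPDIR=/mnt/build
$ cp -a gettext-0.19.4 /mnt/build/ && cd /mnt/build/gettext-0.19.4
$ time ./configure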

I even had a look around the scheduler for fun: I tried changing
SCHED_FEAT(START_DEBIT, true) to SCHED_FEAT(START_DEBIT, false) and
flipping the sysctl sched_child_runs_first entry, but neither seemed
to help.
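
For reference, with CONFIG_SCHED_DEBUG both of those can also be
flipped at runtime rather than by rebuilding the kernel, roughly:

$ echo NO_START_DEBIT | sudo tee /sys/kernel/debug/sched_features
$ sudo sysctl kernel.sched_child_runs_first=1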

So my question is simply: where is this spending its time? How can I
show with some kind of performance measurement where it goes?
Obviously that might lead to the question "What can I do about it?",
but one step at a time.

Cheers,

Richard

