Re: [GIT PULL] adaptive spinning mutexes

From: Ingo Molnar
Date: Wed Jan 14 2009 - 18:24:29 EST



* Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:

> > I also checked Fedora and it has SCHED_DEBUG=y
> > in its kernel rpms.
>
> If all distros set SCHED_DEBUG=y then fine.

95% of the distros and significant majority of the lkml traffic.

And no, we dont generally dont provide knobs for essential performance
features of core Linux kernel primitives - so the existence of SPIN_OWNER
in /sys/debug/sched_features is an exception already.

We dont have any knob to switch ticket spinlocks to old-style spinlocks.
We dont have any knob to switch the page allocator from LIFO to FIFO. We
dont have any knob to turn off the coalescing of vmas in the MM. We dont
have any knob to turn the mmap_sem from an rwsem to a mutex to a spinlock.

Why? Beacause such design and implementation details are what make Linux
Linux, and we stand by those decisions for better or worse. And we do try
to eliminate as many 'worse' situations as possible, but we dont provide
knobs galore. We offer flexibility in our willingness to fix any genuine
performance issues in our source code.

The thing is that apps tend to gravitate towards solutions with the least
short-term cost. If a super important enterprise app can solve their
performance problem by either redesigning their broken code, or by turning
off a feature we have in the kernel in their install scripts (which we
made so easy to tune via a stable sysctl), guess which variant they will
chose? Even if they hurt all other apps in the process.

> > note that there's also a performance issue here: we generally _dont
> > want_ a debug sysctl overhead in the mutex code or in any fastpath for
> > that matter. So making it depend on SCHED_DEBUG is useful.
> >
> > sched_feat() features get optimized out at build time when SCHED_DEBUG
> > is disabled. So it gives us the best of two worlds: the utility of
> > sysctls in the SCHED_DEBUG=y, and they get compiled out in the
> > !SCHED_DEBUG case.
>
> I'm not detecting here a sufficient appreciation of the number of
> sched-related regressions we've seen in recent years, nor of the
> difficulty encountered in diagnosing and fixing them. Let alone the
> difficulty getting those fixes propagated out a *long* time after the
> regression was added.

The bugzilla you just dug out in another thread does not seem to apply, so
i'm not sure what you are referring to.

Regarding historic tendencies, we have numbers like:

[v2.6.14] [v2.6.29]

Semaphores | Mutexes
----------------------------------------------
| no-spin spin
|
[tmpfs] ops/sec: 50713 | 291038 392865 (+34.9%)
[ext3] ops/sec: 45214 | 283291 435674 (+53.7%)

10x performance improvement on ext3, compared to 2.6.14.

I'm sure there will be other numbers that go down - but the thing is,
we've _never_ been good at finding the worst-possible workload cases
during development.

> You're taking a whizzy new feature which drastically changes a critical
> core kernel feature and jamming it into mainline with a vestigial amount
> of testing coverage without giving sufficient care and thought to the
> practical lessons which we have learned from doing this in the past.

If you look at the whole existence of /sys/debug/sched_feature you'll see
how careful we've been about performance regressions. We made it a
sched_feat() exactly because if a change goes wrong and becomes a step
backwards then it's a oneliner to turn it default-off.

We made use of that facility in the past and we have a number of debug
knobs there right now:

# cat /debug/sched_features
NEW_FAIR_SLEEPERS NORMALIZED_SLEEPER WAKEUP_PREEMPT START_DEBIT
AFFINE_WAKEUPS CACHE_HOT_BUDDY SYNC_WAKEUPS NO_HRTICK NO_DOUBLE_TICK
ASYM_GRAN LB_BIAS LB_WAKEUP_UPDATE ASYM_EFF_LOAD NO_WAKEUP_OVERLAP
LAST_BUDDY OWNER_SPIN

All of those ~16 scheduler knobs were done out of caution, to make sure
that if we change some scheduling aspect there's a convenient way to debug
performance or interactivity regressions, without forcing people into
bisection and/or reboots, etc.

> This is a highly risky change. It's not that the probability of failure
> is high - the problem is that the *cost* of the improbable failure is
> high. We should seek to minimize that cost.

It never mattered much to the efficiency of finding performance
regressions whether a feature sat tight for 4 kernel releases in -mm or
went upstream in a week. It _does_ matter to stability - but not
significantly to performance.

What matteres most to getting performance right is testing exposure and
variance, not length of the integration period. Easy revertability helps
too - and that is a given here - it's literally a oneliner to disable it.
See that oneliner below.

Ingo

Index: linux/kernel/sched_features.h
===================================================================
--- linux.orig/kernel/sched_features.h
+++ linux/kernel/sched_features.h
@@ -13,4 +13,4 @@ SCHED_FEAT(LB_WAKEUP_UPDATE, 1)
SCHED_FEAT(ASYM_EFF_LOAD, 1)
SCHED_FEAT(WAKEUP_OVERLAP, 0)
SCHED_FEAT(LAST_BUDDY, 1)
-SCHED_FEAT(OWNER_SPIN, 1)
+SCHED_FEAT(OWNER_SPIN, 0)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/