Re: The performance and behaviour of the anti-fragmentation related patches
From: Andrew Morton
Date: Fri Mar 02 2007 - 19:55:37 EST
On Fri, 2 Mar 2007 16:33:19 -0800
William Lee Irwin III <wli@xxxxxxxxxxxxxx> wrote:
> On Fri, Mar 02, 2007 at 02:22:56PM -0800, Andrew Morton wrote:
> > Opterons seem to be particularly prone to lock starvation where a cacheline
> > gets captured in a single package for ever.
>
> AIUI that phenomenon is universal to NUMA. Maybe it's time we
> reexamined our locking algorithms in the light of fairness
> considerations.
>
It's also a multicore thing; IIRC Kiran was seeing it on Intel CPUs.
I expect the phenomenon would be observable on a number of locks in the
kernel, given the appropriate workload. We just hit it first on lru_lock.
I'd have thought that increasing SWAP_CLUSTER_MAX by two or four orders of
magnitude would plug it, simply by decreasing the acquisition frequency,
but I think Kiran fiddled with that to no effect.
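
(Rough back-of-the-envelope illustration, not the vmscan.c code: SWAP_CLUSTER_MAX
just bounds how many pages are handled per lru_lock round trip, so for a fixed
amount of scanning the acquisition count scales as pages / batch. The 32 below
is, if I remember right, the current SWAP_CLUSTER_MAX; the rest is standalone C.)

	/* Scanning N pages in batches of B costs roughly N/B lock
	 * acquisitions, so growing B by two orders of magnitude should
	 * cut the lru_lock acquisition rate by roughly 100x. */
	#include <stdio.h>

	int main(void)
	{
		const unsigned long nr_pages = 1UL << 20;           /* pages to scan */
		const unsigned long batch[] = { 32, 3200, 320000 }; /* batch sizes */

		for (int i = 0; i < 3; i++)
			printf("batch %-6lu -> %lu lock acquisitions\n",
			       batch[i],
			       (nr_pages + batch[i] - 1) / batch[i]);
		return 0;
	}
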
See below for Linus's thoughts, forwarded without permission..
Begin forwarded message:
Date: Mon, 22 Jan 2007 13:49:02 -0800 (PST)
From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
To: Andrew Morton <akpm@xxxxxxxx>
Cc: Nick Piggin <nickpiggin@xxxxxxxxxxxx>, Ravikiran G Thirumalai <kiran@xxxxxxxxxxxx>
Subject: Re: High lock spin time for zone->lru_lock under extreme conditions
On Mon, 22 Jan 2007, Andrew Morton wrote:
>
> Please review the whole thread sometime. I think we're pretty screwed, and
> the problem will only become worse as more cores get rolled out and I don't
> know what to do about it apart from whining to Intel, but that won't fix
> anything.
I think people need to realize that spinlocks are always going to be
unfair, and *extremely* so under some conditions. And yes, multi-core
brought those conditions home to roost for some people (two or more cores
much closer to each other than others, and able to basically ping-pong the
spinlock to each other, with nobody else ever able to get it).
There's only a few possible solutions:
- use the much slower semaphores, which actually try to do fairness.
- if you cannot sleep, introduce a separate "fair spinlock" type. It's
going to be appreciably slower (and will possibly have a bigger memory
footprint) than a regular spinlock, though. But it's certainly a
possible thing to do.
- make sure no lock that you care about ever has high enough contention
to matter. NOTE! back-off etc simply will not help. This is not a
back-off issue. Back-off helps keep down coherency traffic, but it
doesn't help fairness.
If somebody wants to play with fair spinlocks, go wild. I looked at it at
one point, and it was not wonderful. It's pretty complicated to do, and
the best way I could come up with was literally a list of waiting CPU's
(but you only need one static list entry per CPU). I didn't bother to
think a whole lot about it.
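
(To make the "list of waiting CPUs" idea concrete, here is a minimal userspace
sketch of an MCS-style queued lock, which is one way to build such a thing:
each waiter spins on its own node and ownership is handed over in FIFO order.
The names and the C11 atomics are illustrative only; a kernel version would use
one static per-CPU node and cpu_relax() in the spin loops.)

	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	struct mcs_node {
		_Atomic(struct mcs_node *) next;
		atomic_bool locked;              /* true while we must wait */
	};

	struct mcs_lock {
		_Atomic(struct mcs_node *) tail; /* NULL when uncontended */
	};

	static void mcs_lock_acquire(struct mcs_lock *lock, struct mcs_node *self)
	{
		struct mcs_node *prev;

		atomic_store(&self->next, NULL);
		atomic_store(&self->locked, true);

		/* Join the tail of the waiter queue. */
		prev = atomic_exchange(&lock->tail, self);
		if (!prev)
			return;                  /* lock was free; we own it */

		/* Link behind the previous waiter, spin on our own flag. */
		atomic_store(&prev->next, self);
		while (atomic_load(&self->locked))
			;                        /* cpu_relax() in a kernel version */
	}

	static void mcs_lock_release(struct mcs_lock *lock, struct mcs_node *self)
	{
		struct mcs_node *next = atomic_load(&self->next);

		if (!next) {
			/* No visible successor: try to swing tail back to NULL. */
			struct mcs_node *expected = self;

			if (atomic_compare_exchange_strong(&lock->tail, &expected, NULL))
				return;
			/* A successor is enqueueing; wait for it to link in. */
			while (!(next = atomic_load(&self->next)))
				;
		}
		atomic_store(&next->locked, false);  /* hand off in FIFO order */
	}
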
The "never enough contention" is the real solution. For example, anything
that drops and just re-takes the lock again (which some paths do for
latency reduction) won't do squat. The same CPU that dropped the lock will
basically always be able to retake it (and multi-core just means that is
even more true, with the lock staying within one die even if some other
core can get it).
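
(For anyone who wants to see this on their own hardware, here is a standalone
sketch, not kernel code: a bare test-and-set lock that each thread drops and
immediately retakes. On a multicore or NUMA box the per-thread acquisition
counts often come out heavily skewed; how skewed depends entirely on the
machine. Build with cc -O2 -pthread.)

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <unistd.h>

	#define NTHREADS 4

	static atomic_flag lock = ATOMIC_FLAG_INIT;
	static atomic_bool stop;
	static long counts[NTHREADS];

	static void *worker(void *arg)
	{
		long id = (long)arg;

		while (!atomic_load(&stop)) {
			/* Plain test-and-set spin: no queue, no fairness. */
			while (atomic_flag_test_and_set_explicit(&lock,
							memory_order_acquire))
				;
			counts[id]++;
			atomic_flag_clear_explicit(&lock, memory_order_release);
			/* Drop and immediately retake: the releasing core
			 * still owns the cacheline and usually wins again. */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t[NTHREADS];

		for (long i = 0; i < NTHREADS; i++)
			pthread_create(&t[i], NULL, worker, (void *)i);
		sleep(2);
		atomic_store(&stop, true);
		for (int i = 0; i < NTHREADS; i++) {
			pthread_join(t[i], NULL);
			printf("thread %d: %ld acquisitions\n", i, counts[i]);
		}
		return 0;
	}
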
Of course, "never enough contention" may not be possible for all locks.
Which is why a "fair spinlock" may be the solution - use it for the few
locks that care (and the VM locks could easily be it).
What CANNOT work: timeouts. A watchdog won't work. If you have workloads
with enough contention, once you have enough CPU's, there's no upper bound
on one of the cores not being able to get the lock.
On the other hand, what CAN work is: not caring. If it's ok to not be
fair, and it only happens under extreme load, then "we don't care" is a
perfectly fine option.
In the "it could work" corner, I used to hope that cache coherency
protocols in hw would do some kind of fairness thing, but I've come to the
conclusion that it's just too hard. It's hard enough for software, it's
likely really painful for hw too. So not only does hw generally not do it
today (although certain platforms are better at it than others), I don't
really expect this to change.
If anything, we'll see more of it, since multicore is one thing that makes
things worse (as does multiple levels of caching - NUMA machines tend to
have this problem even without multi-core, simply because they don't have
a shared bus, which happens to hide many cases).
I'm personally in the "don't care" camp, until somebody shows a real-life
workload. I'd often prefer to disable a watchdog if that's the biggest
problem, for example. But if there's a real load that shows this as a real
problem...
Linus