Re: [PATCH] cpuset semaphore depth check optimize

From: Paul Jackson
Date: Mon Sep 12 2005 - 09:56:25 EST

Next message: John W. Linville: "[patch 2.6.13 1/6] swiotlb: move from arch/ia64/lib to lib"
Previous message: John W. Linville: "[patch 2.6.13 2/6] swiotlb: cleanup some code duplication cruft"
In reply to: Paul Jackson: "Re: [PATCH] cpuset semaphore depth check optimize"
Next in thread: Roman Zippel: "Re: [PATCH] cpuset semaphore depth check optimize"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Andrew wrote:
> Better, but still hacky. The rest of the kernel manages to avoid the need
> for nestable semaphores by getting the locking design sorted out. Can
> cpusets do that sometime?

Short answer:
=============
Yup - absolutely a hack. So far it's working better than what I had
before. I have no plans to replace it. I'm open to suggestions.

Long answer:
============
I had this without such a hack for a year, with the code that required
calling refresh_mems() after getting cpuset_sem in the top level
cpuset code, before any path that might get to __alloc_pages().
It was *the* primary source of serious bugs in the cpuset code this
last year. You've been lucky enough to only see a couple of them.

I've got paths in __alloc_pages() and below that acquire cpuset_sem,
but only rarely. I am not aware of any other global kernel lock
that routinely shows up both _above_ and _below_ __alloc_pages on
the stack. Normally such would be a big problem for numa scaling.
This one avoids the scaling problem by only being on rare paths when
called within __alloc_pages. But that means I can't test for bugs;
I have to code so it is obviously right in the first place (not a
bad idea in general ;).

I have *never* actually recreated such a deadlock myself, on a stock
kernel. I've seen exactly one live kernel backtrace showing such
a deadlock from a customer site in my life, and reliable reports of
a few more. I've never seen such a backtrace or heard evidence of a
such a deadlock on an internal test machine running a stock kernel.

I've spent days trying to concoct test scenarios that would trip these
deadlocks on a stock kernel, with no success. The specific low memory
conditions required, along with other timing races, are too difficult
(so far anyway) to recreate on demand.

But for one time, everyone of these bugs (double tripping on
cpuset_sem) was found, and fixed, solely by inspection. And the bugs
kept coming.

The final straw came when I took a vacation, came back, forgot how my
scheme worked and started writing totally broken code and stupid
comments, based on my newly acquired misunderstanding.

Something had to give.

It was time to cast elegance aside and find a locking scheme that even
a stupid person could understand and get right.

This hack has two serious advantages:
1) It's as obvious as a big old scar on one's face. In just a
couple days, it has generated more informed discussion than the
old one ever did, by readers who had no trouble seeing what it
did, commenting without hesitation on alternatives. The rare
comments I got on the previous scheme came only after hours of
head scratching, and with some uncertainty.
2) It works. Every variation of this I've coded has, so far as I
know, been rock solid. The discussion has only been over style
and data space issues. It looks like it will continue to work
if per chance I add more cpuset features that allocate memory
somewhere in the code. The only rule is that every up/down on
cpuset_sem has to be done by the wrappers, which is why I have
insisted on having the wrapper calls look like regular calls
to "up(&cpuset_sem)" and "down(&cpuset_sem)", so this convention
would be rather obvious.

"If you're going to hack, at least do it in plain sight."

If someone would like to recommend a better scheme, I'm all ears.
I am well aware that there are others in the kernel who know locking
better than I do. And I am well aware that the experts in such stuff
don't end up coding hacks such as this.

Barring some inspiration however, probably requiring the help of input
from others, it is unlikely that I will be posting a patch to remove
this hack anytime soon.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@xxxxxxx> 1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: John W. Linville: "[patch 2.6.13 1/6] swiotlb: move from arch/ia64/lib to lib"
Previous message: John W. Linville: "[patch 2.6.13 2/6] swiotlb: cleanup some code duplication cruft"
In reply to: Paul Jackson: "Re: [PATCH] cpuset semaphore depth check optimize"
Next in thread: Roman Zippel: "Re: [PATCH] cpuset semaphore depth check optimize"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]