On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> On Tue, Jul 22, 2014 at 10:39:17AM +0200, Thomas Gleixner wrote:
> > On Tue, 22 Jul 2014, Peter Zijlstra wrote:
> > > Anyway, there is one big fail in the entire futex stack that we 'need'
> > > to sort some day and that is NUMA. Some people (again database people)
> > > explicitly do not use futexes and instead use sysvsem because of this.
> > >
> > > The problem with numa futexes is that because they're vaddr based there
> > > is no (persistent) node information. You always end up having to fall
> > > back to looking in all nodes before you can guarantee there is no
> > > matching futex.
> > >
> > > One way to achieve it is by extending the futex value to include a node
> > > number, but that's obviously a complete ABI break. Then again, it should
> > > be pretty straight fwd, since the node number doesn't need to be part of
> > > the actual atomic update part, just part of the userspace storage.
> >
> > So you want per node hash buckets, right? Fair enough, but how do you
> > make sure, that no thread/process on a different node is fiddling with
> > that "node bound" futex as well?
>
> You don't and that should work just as well, just slower. But since the
> node id is in the futex 'value' we'll always end up in the right
> node-hash, even if its a remote one.
>
> So yes, per node hashes, and a persistent futex->node map.

Which works fine as long as you only have the futex_q on the stack of
the blocked task. If user space is lying to you, then you just end up
with a bunch of threads sleeping forever. Who cares?

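To make the per-node hash idea from the quoted proposal concrete, here is a rough userspace sketch. It is not kernel code and not a proposed ABI: struct numa_futex, NR_NODES, BUCKETS_PER_NODE, node_hash and hash_numa_futex() are all hypothetical names made up for illustration. It only assumes what was said above: the node number sits next to the futex word in user memory and is not part of the atomic update.

```c
#include <stddef.h>
#include <stdint.h>

/* All names and sizes below are hypothetical, for illustration only. */
#define NR_NODES          4
#define BUCKETS_PER_NODE  256

/*
 * Assumed userspace layout: the node id is stored next to the futex
 * word, not inside it, so the atomic update still touches only 'val'.
 */
struct numa_futex {
	uint32_t val;	/* the word that is updated atomically */
	uint32_t node;	/* persistent futex->node hint */
};

/* Stand-in for a hash bucket; waiters would queue here. */
struct bucket {
	int nr_waiters;
};

static struct bucket node_hash[NR_NODES][BUCKETS_PER_NODE];

/*
 * Pick the bucket from the node stored with the futex, not from the
 * node the caller runs on. A waker on a remote node therefore lands
 * in the same bucket as the waiter, just with a slower access.
 */
static struct bucket *hash_numa_futex(const struct numa_futex *f)
{
	/* The node value comes from user space, so never trust its range. */
	uint32_t node = f->node % NR_NODES;
	uintptr_t key = (uintptr_t)&f->val;
	size_t idx = (key >> 2) % BUCKETS_PER_NODE;

	return &node_hash[node][idx];
}
```

If user space changes the node field between wait and wake, the two sides hash to different buckets and the waiter is never found, which is the "threads sleeping forever" case: harmless as long as the only state is the queue entry on the blocked task's stack.
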
But if you create independent kernel state, which we have with
pi_state and which you need for fine-grained locking and further
spinning fun, you open up another can of worms. Simply because this
would enable rogue user space to create multiple instances of the
kernel internal state. I can predict the CVEs resulting from that
even without using a crystal ball.
Thanks,
tglx