Re: [patch V2 2/7] futex: Hash private futexes per process

From: Sebastian Andrzej Siewior
Date: Mon May 30 2016 - 09:38:08 EST

On 05/30/2016 02:06 PM, Peter Zijlstra wrote:
>> We
>> never rehash the hash. To rehash the hash on runtime we would need an
>> empty futex hash and some locking in the fast path.
> Nah; you can do lockless rehashing. Little tricky, but entirely doable.
> At worst we'd need an rcu_read_lock(), but with a little extra work we
> can extend an existing preempt_disable section to cover the hash lookup
> if it isn't already covered.

you also need to ensure that all new users use the new hash while all
old users keep using the old one until you can switch them over. And it
is likely we would miss a corner case, because it is futex after all.

And still, do we need all that? You need an upper limit on how often /
up to which size you resize. If you want to avoid collisions at all
costs, just go for the max value if you think you need it, so there is
no need to resize. And if the task does not pre-allocate, why care at
all?

>> And rehash seems
>> not to be required since we tried to come up with a sane default value
>> and the user/RT task can set it to the current max value.
> I'm not sure this statement is true; given that the hash function
> reduces the entire address space down to a few bits you're guaranteed to
> have collisions at some point. A good hash will not have more than given
> by the birthday paradox, but you cannot have less.
> So while the sizing can certainly reduce the chance on collisions,
> nothing guarantees you will not have them, quite the opposite.

Yes, I did not try to discuss the hash collision away. From RT's point
of view the problem I am aware of is the following scenario:
Task #1 is pinned to CPU1, task #2 is pinned to CPU2, but both share
the same bucket. Task #2 got a wakeup and should run but is blocked on
the bucket lock - otherwise it could run. Usually it would PI-boost
task #1, but task #1 itself got preempted by another task, and since
task #1's prio is lower it can't boost its way free towards the lock,
so CPU #2 may remain idle for some time.

The same thing can happen within a task if you take my story from above
and replace "task" with "thread". Completely understood.

>> So back to when does the collision happen. Since glibc visits the
>> kernel only on contention we might learn about the collision when it is
>> too late. We could have a lock operation by thread1 followed by lock
>> operation by thread2 on different uaddr resulting in the same bucket.
>> In this case we learn about this once the spin_lock() operation blocks.
> Yep, so if this is two ondemand futexes hitting the same bucket, we
> don't care. But if this is an ondemand futex hitting a prealloc one, we
> do care.
> And yes, this is late, but its the only possible time. And notification
> is better than silent weird behaviour.

We don't distinguish right now between "the user asked to preallocate
the hash" and "we created this hash because we had a locking job in the
kernel". The latter is the behaviour we used to have, and the former is
something like "this new preallocate thingy does not always work".

>> Also marking a bucket as taken (on contention) might give false
>> positive results since we know nothing about lock's lifetime (i.e. the
>> lock might have been free()ed).
>> But if I may bring some ideas from v1. In v1 we had "tickets / IDs" for
>> the futex per thread. In v2 we don't have them anymore. We still have
>> the "private" futex hash buckets but per process this time.
>> We could introduce the "tickets / IDs" back and make them process wide.
>> We could hide them in pthread_mutex_init() and pthread_mutex_destroy()
>> since their IDs are no longer thread unique. I think I had something in
>> glibc's pthread variable where we could store 16bit if I split another
>> 32bit variable.
> Not sure you need a ticket (and I never got around to reading v1), a

so v1 in short. To lock you do:
futex(uaddr, FUTEX_LOCK_PI | PRIVATE, …)

instead of uaddr, we used tickets and called the mode "attached":
futex(ticket, FUTEX_LOCK_PI | PRIVATE | ATTACHED, …)

and the ticket was the index into our array where we kept the hash
buckets. That means each user-space futex has its own hash bucket
without contention. But since the tickets were unique per thread, it
was hard to handle. Keeping them the same process-wide should make
things simple again.

> simple twin to PREALLOCATE_HASH to call on mutex_destroy would be
> sufficient I think. It could free the hash bucket.

You do PREALLOCATE_HASH _once_ at startup, not for every mutex. But
yes, you would mark each mutex as gone in mutex_destroy().

So this would "work": you would find a collision during a slot lookup.
You could increase the coverage by adding another marker in
pthread_mutex_init() (well, except for static initializers). That means
we need glibc changes for this to work.
And how do you suggest reporting such a collision? A trace point, a
file in proc? WARN_ON_ONCE() would not result in a CVE but is not that
pretty or informative. Not sure if glibc is all for a return code which
is not part of POSIX.

Also I'm not sure what to do with this information unless it happens
during development. That is why I suggest the above, which is
guaranteed hash-clash free :)

>> That would be guaranteed collision free and hidden in glibc.
> I'm still not seeing how you can guarantee anything with it; ondemand
> local hashing can still collide, right?

Since we have a process-local hash table for private futexes, yes,
collisions can happen.

> Or are you thinking two local hashtables, one for ondemand things and
> one for prealloc/free type thingies?

I think one hash table for private futexes as we have now, and if
people still complain about collisions, get the dedicated "struct
futex_hash_bucket" for each pthread_mutex_t which we had in v1 and then
use the returned ticket for every lock operation. But this time keep
the ticket number the same within a process.
This would guarantee that no collision happens within a task, but it
requires changes in glibc. So the entry barrier is a little higher, but
we could merge this (more or less) as is and add the collision-free
extension on top if there is demand for it.

>> My point was during development / testing to figure out which initial
>> value is sane / reasonable. Start your APP with 32 slot. Collisions.
>> Again with 128 slots. Oh better.
> Ah, right, but given the above and ASLR, you cannot exhaustively test
> this.

correct. You could disable ASLR, but security-wise you might not want
to do that (but then, if everything runs as root anyway, disabling ASLR
is the next logical step :) ).