Re: [patch V2 2/7] futex: Hash private futexes per process

From: Sebastian Andrzej Siewior
Date: Mon May 30 2016 - 07:09:41 EST

On 05/30/2016 10:58 AM, Peter Zijlstra wrote:
> On Fri, May 27, 2016 at 07:10:01PM +0200, Sebastian Andrzej Siewior wrote:
>> On 2016-05-19 14:24:06 [+0200], Peter Zijlstra wrote:
>>> On Thu, May 05, 2016 at 08:44:04PM -0000, Thomas Gleixner wrote:
>>>> +static struct futex_hash_bucket *hash_futex(union futex_key *key)
>>>> +{
>>>> + struct mm_struct *mm = current->mm;
>>>> + unsigned int slot;
>>>> +
>>>> + /*
>>>> + * Futexes which use the per process hash have the lower bits cleared
>>>> + */
>>>> + if (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED))
>>>> + return hash_global_futex(key);
>>>> +
>>>> + slot = hash_long(key->private.address, mm->futex_hash.hash_bits);
>>>> + return &mm->futex_hash.hash[slot];
>>> Do we want the option to WARN if we get collisions in this per-process
>>> hash?
>>> Because afaiu there is no guarantee what so ever this doesn't happen,
>>> and collisions here can create the very same priority inversions as are
>>> possible in the global hash.
>>> Less likely etc.. more contained since its only the threads of the one
>>> process that get tangled up, but still possible.
>> Since the collision is contained the same process it is less dramatic.
> Right, but can still cause significant malfunction inside the process.
> So its not something to completely ignore. If your room sized CNC
> machine gets the priorities of the logging thread and the motor control
> thread confused bad things could happen.

>> But how do you want to warn the user? A trace-event would be handy to
>> dump the uaddr and slot.
> So I think there's a number of cases:
> - PREALLOC_HASH finds a taken bucket; in this case we can simply return
> an error.
> - PREALLOC_HASH succeeds, but an on demand hash later hits the same
> bucket. This is harder; we could maybe mark all buckets taken by
> PREALLOC_HASH and allow for a signal when this collision hits. Dunno.

PREALLOC_HASH happens once before any (contended) lock operation. We
never rehash the hash. To rehash the hash on runtime we would need an
empty futex hash and some locking in the fast path. And rehash seems
not to be required since we tried to come up with a sane default value
and the user/RT task can set it to the current max value.

So back to when does the collision happen. Since glibc visits the
kernel only on contention we might learn about the collision when it is
too late. We could have a lock operation by thread1 followed by lock
operation by thread2 on different uaddr resulting in the same bucket.
In this case we learn about this once the spin_lock() operation blocks.

Also marking a bucket as taken (on contention) might give false
positive results since we know nothing about lock's lifetime (i.e. the
lock might have been free()ed).

But if I may bring some ideas from v1. In v1 we had "tickets / IDs" for
the futex per thread. In v2 we don't have them anymore. We still have
the "private" futex hash buckets but per process this time.
We could introduce the "tickets / IDs" back and make them process wide.
We could hide them in pthread_mutex_init() and pthread_mutex_destroy()
since their IDs are no longer thread unique. I think I had something in
glibc's pthread variable where we could store 16bit if I split another
32bit variable.

That would be guaranteed collision free and hidden in glibc. But it
would take some time to get it used since it does no longer work out of
the box by updating the kernel. We could also add it later if people
scream for it since we can't change the behavior of "PRIVATE" futex
(uaddr vs ticket number).

>> But due to ASLR
>> the same application might result in a different behaviour on each run.
> Yeah, ASLR makes this all somewhat non deterministic, which is why you
> really don't want a silent collision for your PREALLOC_HASH buckets.
> Because once every 100 runs it does weird,..

I think that you might learn about your collision too late and there
is nothing you can do about it. Logging in a logfile or restart the
application after 5 minutes of runtime? Not to mention the locks which
are contended in an error situation.

A futex op returning the kernel's hash value might be sane. You would
expose some implementation detail but the application could check its
mutex for collision before starting (doing the real work). On collision
it would have to restart and hope for the best if the collision from
But since you don't know about all mutexes, like those which are part
of library, you wouldn't never be 100% collision free here. So it is
probably a bad idea.

>> However, it might be good for a indication about the size of the private
>> hashâ
> Yeah, now if online resize wasn't such a pain ;-)

My point was during development / testing to figure out which initial
value is sane / reasonable. Start your APP with 32 slot. Collisions.
Again with 128 slots. Oh better.