Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

From: Waiman Long
Date: Thu Sep 22 2016 - 22:44:54 EST

On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
On Tue, 6 Sep 2016, Waiman Long wrote:
+enum futex_type {
+ TYPE_PI = 0,
Please introduce the futex_type magic and the related changes to the pi
code in a seperate patch so it can be verified independently.

It's sad that one has to explain that to you over and over ....

I didn't break it out because the changes to the PI code was pretty small. I will break it out in the next version.

@@ -836,10 +859,10 @@ static void put_futex_state(struct futex_state *state)

- * If state->owner is NULL, the owner is most probably dying
- * and has cleaned up the futex state already
+ * If state->owner is NULL and the type is TYPE_PI, the owner
+ * is most probably dying and has cleaned up the state already
- if (state->owner) {
+ if (state->owner&& (state->type == TYPE_PI)) {
@@ -847,6 +870,11 @@ static void put_futex_state(struct futex_state *state)
rt_mutex_proxy_unlock(&state->pi_mutex, state->owner);

+ /*
+ * Dequeue it from the HB futex state list.
+ */
+ list_del_init(&state->hb_list);
The comment above this list_del() is really pointless. I can see that from
the code itself.

Aside of that: Why do you need seperate list heads? You explain the
seperate list somewhere in that big comment below, but it should be
explained at the point where you add it to the state and the hash bucket.

Sure. Will fix the comment.

if (current->pi_state_cache)
else {
@@ -919,13 +947,24 @@ void exit_pi_state_list(struct task_struct *curr)

- WARN_ON(pi_state->owner != curr);
+ if (pi_state->type == TYPE_PI) {
+ WARN_ON(pi_state->owner != curr);
+ pi_state->owner = NULL;
+ }
- pi_state->owner = NULL;

- rt_mutex_unlock(&pi_state->pi_mutex);
+ if (pi_state->type == TYPE_PI)
lacks curly braces

Yes, you are right.

+ rt_mutex_unlock(&pi_state->pi_mutex);
+ else if (pi_state->type == TYPE_TO) {
+ /*
+ * Need to wakeup the mutex owner.
+ */
Another completely useless comment. Because you tell what you do, but not

Will elaborate on why the wakeup here.

+ WARN_ON(!pi_state->owner);
+ if (pi_state->owner)
+ wake_up_process(pi_state->owner);
And what handles or sanity checks the state->hb_list ???

The exit_pi_state_list() function doesn't need to deal with state->hb_list. The hb_list is used to locate the futex state, but the futex owner doesn't have a reference to the futex state. So it won't need to decrement it and potentially free it.

+ * Try to lock the userspace futex word (0 => vpid).
+ *
+ * Return: 1 if lock acquired or an error happens, 0 if not.
+ * The status code will be 0 if no error, or< 0 if an error happens.
+ * *puval will contain the latest futex value when trylock fails.
+ *
+ * The waiter flag, if set, will make it ignore the FUTEX_WAITERS bit.
+ * The HB spinlock should NOT be held while calling this function. A
+ * successful lock acquisition will clear the waiter and died bits.
+ */
+static inline int futex_trylock_to(u32 __user *uaddr, u32 vpid, u32 *puval,
+ const bool waiter, int *status)
+ u32 uval;
+ *status = 0;
+ if (unlikely(get_user(uval, uaddr)))
+ goto efault;
+ *puval = uval;
+ if (waiter ? (uval& FUTEX_TID_MASK) : uval)
+ return 0; /* Trylock fails */
Please do not use tail comments. They are hard to parse.

OK, will move the comment up.

+ if (unlikely(futex_atomic_cmpxchg_inatomic(puval, uaddr, uval, vpid)))
+ goto efault;
+ return *puval == uval;
+ *status = -EFAULT;
+ return 1;
Do we really need another variant of cmpxchg and why do you need that extra
status? What's wrong in doing the magic in the return value?

This is not another variant of cmpxchg. It is the cmpxchg used by cmpxchg_futex_value_locked(). The only difference is that page fault was disabled with the locked version. I call futex_atomic_cmpxchg_inatomic() directly because it is called without the HB spinlock. So I don't need to disable page fault. I will add a separate patch to introduce the helper function cmpxchg_futex_value_unlocked() to indicate that the cmpxchg is done without lock.

I will remove the extra status parameter.

+ * Spinning threshold for futex word without setting FUTEX_WAITERS.
+ */
+#define FUTEX_SPIN_THRESHOLD (1<< 10)
That number is pulled out of thin air or what is the rationale for chosing

It is kind of arbitrary. I want a value large enough to encourage lock stealing, while not too large that it may take too long to get the lock. Will elaborate more about the choice in the comment.

+ * Spin on the futex word while the futex owner is active. Otherwise, set
+ * the FUTEX_WAITERS bit and go to sleep. As we take a reference to the futex
+ * owner's task structure, we don't need to use RCU to ensure that the task
+ * structure is valid. The function will directly grab the lock if the
+ * owner is dying or the pid is invalid. That should take care of the problem
+ * of dead lock owners unless the pid wraps around and the preceived owner is
+ * not the real owner.
+ *
+ * Return: 0 if futex acquired,< 0 if an error happens.
If you document functions then please follow the style of this file which
uses kerneldoc comments.

Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and my focus was to make it work. I haven't spent much time in cleaning up the comment. I will do that in the next version.

+ */
+static int futex_spin_on_owner(u32 __user *uaddr, u32 vpid,
+ struct futex_state *state)
+ int ret, loop = FUTEX_SPIN_THRESHOLD;
+ u32 uval, curval;
+ u32 opid = 0; /* Futex owner task ID */
+ struct task_struct *otask = NULL; /* Futex owner task struct */
Please use understandable variable names instead of this horrible tail
comments. What's wrong with using owner and owner_pid?

Yes, I will use more descriptive variable name.

+ bool on_owner_list = false;
+ preempt_disable();
+ WRITE_ONCE(state->owner, current);
+ for (;; loop--) {
+ if (futex_trylock_to(uaddr, vpid,&uval, true,&ret))
+ break;
And here you are. What the hell is wrong with:

ret = futex_trylock(uaddr, vpid,&uval, true);
if (ret< 0)

Nothing is wrong. It's just way simpler to read and understand....

Will do that.

+ if ((uval& FUTEX_TID_MASK) != opid) {
+ /*
+ * Get the new task structure
+ */
+ if (otask)
+ put_task_struct(otask);
+ WARN_ON_ONCE(on_owner_list);
+ opid = uval& FUTEX_TID_MASK;
+ otask = futex_find_get_task(opid);
+ }
Can you pretty please use split out functions for this otask handling? This
for loop does not fit on a single screen and can't be understood in one go.

Yes, this is the largest function in the new code, I will try to add helper functions to make it easier to understand.

+ if (unlikely(!otask || (otask->flags& PF_EXITING) ||
+ (uval& FUTEX_OWNER_DIED))) {
+ /*
+ * PID invalid or exiting/dead task, try to grab
+ * the lock now.
+ */
+ ret = futex_atomic_cmpxchg_inatomic(&curval,
+ uaddr, uval, vpid);
+ if (unlikely(ret))
+ goto efault;
+ if (curval != uval)
+ continue; /* Futex value changed */
This code flow is completely non parseable. Stop this and put proper
comments above decisions which explain why and not what.

I will spend more time cleaning up the comment and streamline the code.

+ pr_info("futex_spin_on_owner: pid %d grabs futex from pid %d (%s)!\n",
+ vpid, opid, otask ? "dying" : "invalid");
Really useful information to confuse sysadmins. A proper comment explaining
the issue and the implications and how we deal with it would have been the
right thing to do...

Yes, the message may be too cryptic. I will make it more clear in the next version.

+ break;
+ }
+ while ((loop<= 0)&& !(uval& FUTEX_WAITERS)) {
Groan. This is more than horrible to read.

if (loop<= 0) {
if (futex_set_waiters_bit(....))
goto fault;

If at all. This loop<= 0 thing is utterly confusing.

I will improve this code.

+ /*
+ * Need to set the FUTEX_WAITERS bit.
+ */
+ if (futex_atomic_cmpxchg_inatomic(&curval, uaddr, uval,
+ uval | FUTEX_WAITERS))
+ goto efault;
+ if (curval == uval) {
+ uval |= FUTEX_WAITERS;
+ break;
I had to look five times to figure out to which loop this break belongs. I
really wonder how you manage to keep track of this mess.

Because I wrote it :-)

In the v2 patch, the FUTEX_WAITERS bit setting was put into a helper function which should clarify the code. I will add a few more to simplify the main loop.

+ }
+ uval = curval;
+ }
+ if (!(uval& ~FUTEX_TID_MASK))
+ continue; /* Do trylock again */
Gah. I can see that you go over to start and do trylock, but WHY ? I know
it, but for the casual reader it's fcking non obvious.

+ if (need_resched()) {
+ __set_current_state(TASK_RUNNING);
+ schedule_preempt_disabled();
+ continue;
+ }
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ break;
+ }
+ /*
+ * If the owner isn't active, we need to go to sleep after
+ * making sure that the FUTEX_WAITERS bit is set. We also
+ * need to put the futex state into the futex owner's
+ * pi_state_list to prevent deadlock when the owner dies.
+ */
+ if (!otask->on_cpu) {
+ if (!(uval& FUTEX_WAITERS)) {
+ loop = 0;
+ continue;
This is completely fucked, really.

if (owner->on_cpu) {

ret = futex_sleep();
if (ret< 0)
goto fault;
if (ret == FUTEX_AQUIRED)

Your attempt to resue code which is in the loop above is just making this
completely unreadable. Hell no! Futexes are complex enough already. We
really can do without obfuscation. This not the obfuscated C-code contest.

As said above, I will add more helper functions and streamline the code to make it more readable.

+ if (futex_trylock_to(uaddr, vpid,&uval, false,&ret))
+ /* Lock acquired or an error happens */
+ return ret;
This lacks curly braces. See:

Got it.

+ /*
+ * Detect deadlocks.
+ */
+ if (unlikely(((uval& FUTEX_TID_MASK) == vpid) ||
+ should_fail_futex(true)))
+ return -EDEADLK;
+ if (refill_futex_state_cache())
+ return -ENOMEM;
+ futex_set_timer(time,&to,&timeout, flags, current->timer_slack_ns);
+ ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
+ if (unlikely(ret))
+ goto out;
+ hb = hash_futex(&key);
+ spin_lock(&hb->lock);
Why are you using the global hash for this? As we have shown the global
hash has horrible performance due to cross node access and potential hash
bucket lock contention of unrelated processes. If we add something new
which is performance optimized then we pretty please make it use a seperate
process private storage. I don't see any point in making this optimized for
process shared futexes.

I used the hash lock for managing the futex state objects only. I don't need it for other purpose. It is true that if there is a hash collision that another wait-wake futex is using the same hash. It may cause performance problem. I think I will add another spinlock in the hash bucket just for TO futexes. In this way, a collision with wait-wake futex won't cause unexpected spinlock contention, though a bit of extra hash bucket cachline contention may still happen.

BTW, running my microbenchmark with wait-wake futex, over 90% of the CPU time were in the spinlock contention:

96.23% futex_test [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
- queued_spin_lock_slowpath
- _raw_spin_lock
+ 51.81% futex_wake
+ 48.08% futex_wait_setup

For the TO futexes, the perf profile was:

97.33% 0.97% futex_test [kernel.kallsyms] [k] futex_lock_to
92.45% 92.45% futex_test [kernel.kallsyms] [k] osq_lock
1.65% 1.65% futex_test [kernel.kallsyms] [k]
0.83% 0.83% futex_test [kernel.kallsyms] [k] __get_user_4

The %cpu time in spinlock was about 0.5%. So spinlock contention wasn't an issue for the TO futexes.

+ /*
+ * Locate the futex state by looking up the futex state list in the
+ * hash bucket. If it isn't found, create a new one and put it into
+ * the list.
+ */
+ state = lookup_futex_state(hb,&key, true);
+ /*
+ * We don't need to hold the HB lock after looking up the futex state
+ * as we have incremented the reference count.
+ */
+ spin_unlock(&hb->lock);
+ BUG_ON(!state);
+ /*
+ * Acquiring the serialization mutex.
+ */
+ if (state->type != TYPE_TO) {
+ ret = -EINVAL;
+ } else {
+ if (to)
+ hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS);
+ ret = mutex_lock_interruptible(&state->mutex);
+ }
+ if (unlikely(ret))
+ /*
+ * We got a signal or has some other error, need to abort
+ * the lock operation and return.
+ */
+ goto out_put_state_key;
+ /*
+ * As the mutex owner, we can now spin on the futex word as well as
+ * the active-ness of the futex owner.
+ */
+ ret = futex_spin_on_owner(uaddr, vpid, state);
So if futex_spin_on_owner() faults, then you return -EFAULT to user space
w/o trying to page in the stuff? That's just wrong.

I am not so sure about the proper fault handling in futex. However, futex_atomic_cmpxchg_inatomic() was doing cmpxchg without disabling page fault. So does that mean if I get a fault, it is probably other problem and not because of a lock of page fault?

+static int futex_unlock_to(u32 __user *uaddr, unsigned int flags)
+ u32 uval, pid, vpid = task_pid_vnr(current);
+ union futex_key key = FUTEX_KEY_INIT;
+ struct futex_hash_bucket *hb;
+ struct futex_state *state;
+ struct task_struct *owner;
+ int ret;
+ if (get_user(uval, uaddr))
+ return -EFAULT;
+ /*
+ * If there is a new lock owner, we can exit now.
+ * If uval is 0, another task may have acquired and release the
+ * lock in the mean time, so we should also exit.
If there is a new lock owner or the lock has been released? Voodoo magic or
what? How can user space take over the lock when the current task owns it?

That's just broken. The only reason to get here is that user space called
in because it was not able to release the lock atomically, i.e. because the
waiters bit was set. If anything fiddled with the uval then returning 0 is
beyond stupid.

I had changed it in the v2 patch to not allow that and return error if that happens.

+ */
+ pid = uval& FUTEX_TID_MASK;
+ if ((pid&& (pid != vpid)) || !uval)
+ return 0;
Crap. It's legit to call here even if the waiters bit is not set.

Again, it was changed in the v2 patch to return error in this case.

+ /*
+ * If the TID isn't cleared in the userspace, clear it here to
+ * encourage faster lock transfer to the mutex owner.
What do you encourage here? How is the mutex transfer accelerated?

It was removed in the v2 patch because of the need to do lock handoff.

+ */
+ if (pid == vpid) {
+ futex_atomic_cmpxchg_inatomic(&uval, uaddr, uval,
+ uval& ~FUTEX_TID_MASK);
+ WARN_ON((uval& FUTEX_TID_MASK) != vpid);
+ }
+ ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
+ if (ret)
+ return ret;
+ hb = hash_futex(&key);
+ spin_lock(&hb->lock);
+ /*
+ * Check the hash bucket only for matching futex state.
+ */
+ state = lookup_futex_state(hb,&key, false);
+ if (!state)
+ goto out_unlock;
+ if (state->type != TYPE_TO) {
+ ret = -EINVAL;
+ goto out_unlock;
+ }
+ /*
+ * We don't need to do the wakeup if the futex isn't equal to
+ * FUTEX_WAITERS as it implies that someone else has taken over
+ * the futex.
+ */
+ if (get_user(uval, uaddr)) {
+ ret = -EFAULT;
+ goto out_unlock;
+ }
+ owner = READ_ONCE(state->owner);
+ if ((uval == FUTEX_WAITERS)&& owner)
+ ret = wake_up_process(owner);
What guarantees that owner cannot exit between reading owner state and
calling wake_up_process?

I used the wake_q function in the v2 patch which increment the reference count.

Dammit. If you read the source in this file carefully, then you notice that
it contains a lot of documentation about life time rules, protection and

If you think, that it's the reviewers problem to figure that out from your
w/o proper comments in place, then you are completely on the wrong track.

I'm not even trying to think about the concept itself as long as this is
presented as a pile of unpenetrable clusterfuck.

I am sorry that patch wasn't as well-documented as it should. I will spend more time cleaning up the comments and streamline the code before sending out the next v3 version.

Thanks for the review.