Re: [RFC PATCH 3/4] futex: Throughput-optimized (TO) futexes

From: Thomas Gleixner
Date: Fri Sep 23 2016 - 06:19:29 EST

On Thu, 22 Sep 2016, Waiman Long wrote:
> On 09/22/2016 09:32 AM, Thomas Gleixner wrote:
> > > + WARN_ON(!pi_state->owner);
> > > + if (pi_state->owner)
> > > + wake_up_process(pi_state->owner);
> > And what handles or sanity checks the state->hb_list ???
> >
> The exit_pi_state_list() function doesn't need to deal with state->hb_list.
> The hb_list is used to locate the futex state, but the futex owner doesn't
> have a reference to the futex state. So it won't need to decrement it and
> potentially free it.

Ok. Though that wants comments as well.

> > > +/*
> > > + * Spinning threshold for futex word without setting FUTEX_WAITERS.
> > > + */
> > > +#define FUTEX_SPIN_THRESHOLD (1<< 10)
> > That number is pulled out of thin air or what is the rationale for chosing
> > 1024?
> It is kind of arbitrary. I want a value large enough to encourage lock
> stealing, while not too large that it may take too long to get the lock. Will
> elaborate more about the choice in the comment.

I'm not really happy about these heuristics. The chosen value fits a
particular machine and scenario. Now try to do the same on some slow ARM
SoC and the 1024 loops are going to hurt badly.

> Sorry, I forgot to use the kerneldoc notation. It is an RFC patch, and my
> focus was to make it work. I haven't spent much time in cleaning up the
> comment. I will do that in the next version.

RFC is not an excuse for a unreadable mess. This stuff is complex enough
that it should be in your own interest to make it well documented and
readable. If it takes more time for a reviewer to distangle the code jungle
than spending time on thinking about the concept, verifying the lifetime
and serialization rules, etc. then there is something very wrong. And it's
wasting my time and yours. Not that I care much about your time waste :)

When Sebastian and myself worked on the attached futexes we spent more time
tidying up the code and writing documentation than on actually coding,
debugging and optimizing it.

Futexes have a gazillion of corner cases and pitfalls and your stuff even
if it seems to be simpler has the same issues.

> > > + WARN_ON_ONCE(!(uval& FUTEX_TID_MASK));
> > > +
> > > + if ((uval& FUTEX_TID_MASK) != opid) {
> > > + /*
> > > + * Get the new task structure
> > > + */
> > > + if (otask)
> > > + put_task_struct(otask);
> > > +
> > > + WARN_ON_ONCE(on_owner_list);
> > > + opid = uval& FUTEX_TID_MASK;
> > > + otask = futex_find_get_task(opid);
> > > + }
> > Can you pretty please use split out functions for this otask handling? This
> > for loop does not fit on a single screen and can't be understood in one go.
> Yes, this is the largest function in the new code, I will try to add helper
> functions to make it easier to understand.

You should have done that in the first place. Because even when you develop
and experiment it's insane to try to follow a loop with several nest levels
which occupies _two_ screens.

It's your choice how you develop, but offering something like this for
review is just an affront.

> > > + pr_info("futex_spin_on_owner: pid %d grabs futex from
> > > pid %d (%s)!\n",
> > > + vpid, opid, otask ? "dying" : "invalid");
> > Really useful information to confuse sysadmins. A proper comment explaining
> > the issue and the implications and how we deal with it would have been the
> > right thing to do...
> Yes, the message may be too cryptic. I will make it more clear in the next
> version.

No, this message is not too cryptic. It's entirely useless and it needs not
to be made more clear. It needs to be replaced by a comment explaining the
issue. This is a legit thing to happen and it has absolutely no value to
spam dmesg that something happened which is expected and dealt with.

> > > + ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
> > > + if (unlikely(ret))
> > > + goto out;
> > > +
> > > + hb = hash_futex(&key);
> > > + spin_lock(&hb->lock);
> > Why are you using the global hash for this? As we have shown the global
> > hash has horrible performance due to cross node access and potential hash
> > bucket lock contention of unrelated processes. If we add something new
> > which is performance optimized then we pretty please make it use a seperate
> > process private storage. I don't see any point in making this optimized for
> > process shared futexes.
> I used the hash lock for managing the futex state objects only. I don't need
> it for other purpose. It is true that if there is a hash collision that
> another wait-wake futex is using the same hash. It may cause performance
> problem. I think I will add another spinlock in the hash bucket just for TO
> futexes. In this way, a collision with wait-wake futex won't cause unexpected
> spinlock contention, though a bit of extra hash bucket cachline contention may
> still happen.

The real question is whether we want to use the global hash at all. We've
shown that using a process private hash has massive advantages and I don't
see any point in providing support for this optimized futex for anything
else than process private. process shared is insanely slow anyway.

> BTW, running my microbenchmark with wait-wake futex, over 90% of the CPU time
> were in the spinlock contention:
> 96.23% futex_test [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
> - queued_spin_lock_slowpath
> - _raw_spin_lock
> + 51.81% futex_wake
> + 48.08% futex_wait_setup

That's not really helpful if you don't tell me what the microbenchmark
actually does. If it's a gazillion of threads whacking on the same futex,
then yes, this is expected. If there are different futexes then it's
important to know whether part of this is contention is hash collision.

We've shown that hash collisons happen and they cause a massive slowdown.

> For the TO futexes, the perf profile was:
> 97.33% 0.97% futex_test [kernel.kallsyms] [k] futex_lock_to
> 92.45% 92.45% futex_test [kernel.kallsyms] [k] osq_lock
> 1.65% 1.65% futex_test [kernel.kallsyms] [k]
> 0.83% 0.83% futex_test [kernel.kallsyms] [k] __get_user_4
> The %cpu time in spinlock was about 0.5%. So spinlock contention wasn't an
> issue for the TO futexes.

Well. You traded the spinlock contention with burning cpu cycles in
osq_lock. Your comparison metrics simply suck.

> > > + ret = futex_spin_on_owner(uaddr, vpid, state);
> > So if futex_spin_on_owner() faults, then you return -EFAULT to user space
> > w/o trying to page in the stuff? That's just wrong.
> I am not so sure about the proper fault handling in futex. However,
> futex_atomic_cmpxchg_inatomic() was doing cmpxchg without disabling page
> fault. So does that mean if I get a fault, it is probably other problem and
> not because of a lock of page fault?

Sigh. You do that inside a preempt disabled region. So if you really get a
page fault then this will just explode because the fault handler will
either trip over might_sleep() or end up calling schedule() with preemption
disabled. IOW, you must disable pagefaults and you have to deal with the
fallout yourself. Rename cmpxchg_futex_value_locked() to
cmpxchg_futex_value_inatomic() and use it.

About 90% of the futex code is about handling corner cases and sanity
checking user space values. Even with the simplified scheme of your
optimized futexes it's not going to be any different.