[PATCH] futex: Support smaller futexes of one byte or two byte size.

From: Malte Skarupke
Date: Fri Dec 20 2019 - 14:10:03 EST


Hello again,

here are two patches for sized futexes. Option 1 is the same behavior as my
first patch, just with the fixes suggested above. Option 2 requires a size
flag for WAKE and REQUEUE.

The reason for sending two changelists is that after thinking about it for a
while, and after implementing both options, I realized that I do care which
option we pick. I still prefer option 1. I would also be OK with going with
option 2, but let me try one final time to convince you with three reasons:

1. The consistent API conversation. There were two reasons voiced for this:
a) In a private email Thomas Gleixner voiced that just having two different
calls, WAIT and WAKE, where one takes the size and the other does not,
would be inconsistent. I would like to think about this in terms of another
popular API where one takes a size and the other does not: malloc() and
free(). Over the years many allocator implementers would have hoped that
free() also takes a size because it can save a few instructions, and the
user usually knows what they are freeing and could pass in the size. But
what if there was a version of free() that doesn't need the size, but it
still takes a size argument and only uses it to verify that the correct
size was passed in: If you pass an incorrect size it errors out and doesn't
actually free the memory. Would that be a better interface?

Because that's essentially what I'm implementing here: WAKE doesn't need
the size, and the only thing it uses the size for is to verify that it is
correct, and if it's not, WAKE errors out and doesn't actually wake. I
don't think that that makes it a better API.

b) On this mailing list Peter Zijlstra voiced that my implementation does not
implement the MONITOR/MWAIT pattern. This wasn't a problem when there was
only one size of futex, but it is once you can have overlapping futexes of
different sizes. I can see that it might be nice to implement this pattern,
but I chose to not do it mainly because mutexes and condition variables
don't actually need it. They just need the same semantics we had before,
except for smaller types. There is no need to handle overlapping futexes,
so I propose a different mental model: Comparisons+hashtable lookup. Under
that mental model only an exact match on the address between WAIT and WAKE
will do anything, because otherwise you won't find anything in the
hashtable. And under that mental model the size only applies to the
comparison. That's what I implemented and what I prefer. Option 1 is a more
internally consistent implementation in that mental model.

2. Precedent. When Windows added support for futexes, they supported different
sizes of futexes from the beginning. In their API WaitOnAddress() is the
equivalent to our FUTEX_WAIT, and WakeByAddressSingle() and
WakeByAddressAll() are the equivalents to our FUTEX_WAKE. In the Windows
API WaitOnAddress() takes a size, but the WakeByAddress calls do not take a
size. Obviously there may be reasons to do things differently from how
Windows does it, but it does show that at least some other programmers came
to the same conclusions that I did.

3. Odd behavior. I made one change compared to what I wrote earlier in the
discussion. Earlier I said that calling WAIT with two different sizes on
the same address should probably not be allowed. The reason for that is
that it can lead to unexpected behavior. If you do a 8 bit WAIT followed by
a 16 bit WAIT on the same address, then call 8 bit WAKE on the same address
twice, the first WAKE call will work and return 1, but the second WAKE call
will return -EINVAL because of a size mismatch.

I thought that that's just a bit too weird so I said that I wouldn't allow
WAITs with different sizes. The reason why I changed my mind on that is
that I would have to change what WAIT does. Instead of just inserting
itself into a linked list, it would have to iterate through that linked
list to check if anybody else already went to sleep with a different size.
So a O(1) operation would turn into a O(n) operation. Since there can be
condition variables where many threads sleep on the same futex, this would
actually be fairly common. So I thought that that would be an unacceptable
change in behavior. (and a similar change would have to be made to REQUEUE)
So this does mean that in the version where WAKE verifies the size, you can
get weird behavior if WAIT was called with two different sizes. (or maybe
if somebody wasn't consistent with their usage of the flags in REQUEUE) I
don't think this is a huge concern because I don't expect people to have
different sized futexes on the same address, but it is a weird API quirk.
If this does cause bugs, they should also be fairly obvious and easy to
fix. But they would be bugs that you couldn't have with option 1.

So now that I have tried to convince you one final time, you're free to choose
whichever option that you like. And obviously if you have more feedback for
the code, I would also implement those fixes. Also if you want a patch without
the "option 1" or "option 2" text in the description, I can provide that.

Thanks,

Malte