Sorry for replying outside the thread tree; I only just subscribed to the list, and the post I am replying to had already been sent.
André Almeida wrote:
** "And what's about FUTEX_64?"
By supporting 64 bit futexes, the kernel structure for futex would
need to have a 64 bit field for the value, and that could defeat one of
the purposes of having different sized futexes in the first place:
supporting smaller ones to decrease memory usage. This might be
something that could be disabled for 32bit archs (and even for
CONFIG_BASE_SMALL).
Which use case would benefit for FUTEX_64? Does it worth the trade-offs?
I strongly believe that 64-bit futexes must be supported. I have a few use cases in mind:
1. Cooperative robust futexes.
I have a real-world case where multiple processes need to communicate via shared memory and synchronize via a futex. The processes run under a supervisor parent process, which can detect termination of its children and also has access to the shared memory. To keep the communication reasonably safe if one of the child processes crashes, the futex currently contains a portion of the pid of the process that locked it. The parent supervisor is then able to tell that the crashed child was holding the futex locked, mark the futex as "broken" and notify any other threads blocked on it.
Given that a pid can be up to 32 bits in size, and we also need some bits in the futex to implement its logic (i.e. at least "locked" and "broken" bits, some bits for the ABA counter, etc.), the pid can be truncated and the above logic may break. In the real application, only 15 bits are left for the pid, which is already less than the actual pid range on the system.
Note: We're not using the proper pthread robust mutexes because we also need a condition variable, and condition variables contain a non-robust mutex internally, which basically nullifies robustness. One could argue for fixing pthread instead, but I view that as a more difficult task, as the pthread interface is standardized. We would rather use the futex directly anyway because of the greater flexibility and lower performance overhead.
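To illustrate the truncation problem, here is a minimal sketch of such a futex word. Only the 15-bit pid field comes from the description above; the bit positions, the names and the 64-bit variant are assumptions made up for this example, not the layout of the real application.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 32-bit layout: bit 0 = locked, bit 1 = broken,
// bits 2..16 = low 15 bits of the owner pid, bits 17..31 = ABA counter.
constexpr std::uint32_t kLockedBit = 1u << 0;
constexpr std::uint32_t kBrokenBit = 1u << 1;
constexpr unsigned kPidShift = 2;
constexpr unsigned kPidBits = 15;
constexpr std::uint32_t kPidFieldMask = (1u << kPidBits) - 1u;
constexpr unsigned kAbaShift = kPidShift + kPidBits;

// Build a "locked by pid" word; the pid is silently truncated to 15 bits.
constexpr std::uint32_t make_locked(std::uint32_t pid, std::uint32_t aba) {
    return kLockedBit | ((pid & kPidFieldMask) << kPidShift) | (aba << kAbaShift);
}

// Extract the (truncated) owner pid stored in the word.
constexpr std::uint32_t owner_pid_bits(std::uint32_t word) {
    return (word >> kPidShift) & kPidFieldMask;
}

// Supervisor-side check: does the word look like it is held by `pid`?
constexpr bool looks_held_by(std::uint32_t word, std::uint32_t pid) {
    return (word & kLockedBit) != 0 && owner_pid_bits(word) == (pid & kPidFieldMask);
}

// Hypothetical 64-bit layout: the full 32-bit pid lives in the upper half.
constexpr std::uint64_t make_locked64(std::uint32_t pid) {
    return (static_cast<std::uint64_t>(pid) << 32) | kLockedBit;
}
constexpr std::uint32_t owner_pid64(std::uint64_t word) {
    return static_cast<std::uint32_t>(word >> 32);
}

// Two distinct pids that differ only above bit 14 are indistinguishable
// in the 32-bit word, but not in the 64-bit one.
inline void demo_truncation() {
    const std::uint32_t alive = 1000u, crashed = 1000u + (1u << kPidBits);
    assert(looks_held_by(make_locked(alive, 0), crashed));      // false positive
    assert(owner_pid64(make_locked64(alive)) != crashed);       // no collision
}
```

With the 32-bit word the supervisor could mark a futex as broken while its real owner is still alive; with a 64-bit word there is room for the whole pid next to the flag bits.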
2. Parity with WaitOnAddress[1] on Windows.
WaitOnAddress is explicitly documented to support 8-byte states, and its interface allows for further extension. I'm not a Wine developer, but I would guess that having matching 8-byte futex support would be useful there.
Besides Wine, having a 64-bit futex would be important for std::atomic[2] and Boost.Atomic in C++, which support waiting and notifying operations (for std::atomic, introduced in C++20). Waiting and notifying operations are normally implemented using the futex API on Linux and WaitOnAddress on Windows, and can be emulated with a process-wide global mutex pool if such an API is unavailable for a given atomic size on the target platform. This means that waiting and notifying operations on 64-bit atomics on Linux currently must be implemented with a lock and therefore cannot be used in process-shared memory, while there is no such limitation on Windows.
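For readers unfamiliar with the fallback, here is a rough sketch of the mutex pool emulation. It is not the actual code of any standard library or of Boost.Atomic; the names (WaitSlot, slot_for, wait64, notify_all64) and the pool size are made up for illustration.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>

namespace sketch {

struct WaitSlot {
    std::mutex mtx;
    std::condition_variable cv;
};

// Process-wide pool, indexed by a hash of the atomic's address. It lives in
// this process's private memory, so a second process that maps the same
// shared memory has its own, unrelated pool and is never woken up.
inline WaitSlot& slot_for(const void* addr) {
    static WaitSlot pool[64];
    const std::uintptr_t key = reinterpret_cast<std::uintptr_t>(addr);
    return pool[(key >> 3) % 64];
}

// Block while the atomic still holds `old`.
inline void wait64(const std::atomic<std::uint64_t>& obj, std::uint64_t old) {
    WaitSlot& s = slot_for(&obj);
    std::unique_lock<std::mutex> lock(s.mtx);
    while (obj.load(std::memory_order_acquire) == old)
        s.cv.wait(lock);
}

// Wake all waiters; the caller is expected to have stored the new value first.
inline void notify_all64(const std::atomic<std::uint64_t>& obj) {
    WaitSlot& s = slot_for(&obj);
    // Acquire and release the lock so that a waiter which has checked the
    // value but not yet gone to sleep cannot miss this notification.
    { std::lock_guard<std::mutex> lock(s.mtx); }
    s.cv.notify_all();
}

} // namespace sketch
```

A waiter calls sketch::wait64(value, expected), and the thread that changes the value calls sketch::notify_all64(value) after the store. This works between threads of one process, but not across processes, which is exactly the limitation described above.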
I'm not sure how much memory is saved by not having 64-bit state in the kernel futex structures, but this doesn't look like a huge deal on modern systems - server, desktop or mobile. It may matter for extremely low-memory embedded systems, and for those targets the support could be disabled with a switch. In fact, such systems would probably not support 64-bit atomics anyway. For any other targets I would prefer the 64-bit futex to be available by default.
My main issue with 64-bit being optional, though, is that applications and libraries like Boost.Atomic would like (or even require) to know whether the feature is available at compile time rather than at run time. std::atomic, for example, is supposed to be a thin abstraction over atomic instructions and OS primitives like futex, so performing runtime detection of the features available in the kernel would be detrimental there. I'm not sure if this is possible in the current kernel infrastructure, but it would be best if the lack of 64-bit futex support in the kernel were detectable through kernel headers (e.g. by a macro for 64-bit futexes not being defined or something like that), which means the headers must be generated at kernel configuration time.
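As a sketch of what compile-time detection could look like on the library side: HYPOTHETICAL_FUTEX_64 below is a made-up macro name standing in for whatever the kernel headers might define, and the backend types are placeholders, not existing code.

```cpp
#include <cstdint>

// HYPOTHETICAL_FUTEX_64 is an invented name; no such macro is assumed to
// exist in current kernel headers. A real library would include the kernel
// futex header first and test whatever macro it ends up providing.
#if defined(HYPOTHETICAL_FUTEX_64)
constexpr bool has_native_wait64 = true;
#else
constexpr bool has_native_wait64 = false;
#endif

// Backend selection happens entirely at compile time, so no runtime probing
// and no extra branch ends up in the wait/notify path.
template <bool Native>
struct wait64_backend_impl;

template <>
struct wait64_backend_impl<true> {
    // Would call the 64-bit futex directly.
    static constexpr bool process_shared() { return true; }
};

template <>
struct wait64_backend_impl<false> {
    // Would fall back to a process-local lock pool.
    static constexpr bool process_shared() { return false; }
};

using wait64_backend = wait64_backend_impl<has_native_wait64>;

static_assert(wait64_backend::process_shared() == has_native_wait64,
              "64-bit waits are process-shared only with a native futex");
```

If the macro were absent only because of the kernel configuration, the library would select the fallback without any runtime check.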
[1]: https://docs.microsoft.com/en-us/windows/win32/api/synchapi/nf-synchapi-waitonaddress
[2]: https://en.cppreference.com/w/cpp/atomic/atomic