Re: [RFC 0/4] futex request support

From: Andres Freund
Date: Thu Jun 03 2021 - 14:59:51 EST


On 2021-06-01 15:58:25 +0100, Pavel Begunkov wrote:
> Should be interesting for a bunch of people, so we should first
> outline API and capabilities it should give. As I almost never
> had to deal with futexes myself, would especially love to hear
> use case, what might be lacking and other blind spots.

I did chat with Jens about how useful futex support would be in io_uring, so I
should outline our / my needs. I'm off work this week though, so I don't think
I'll have much time to experiment.

For postgres's AIO support (which I am working on) there are two, largely
independent, use-cases for desiring futex support in io_uring.

The first is the ability to wait for locks (queued r/w locks, blocking
implemented via futexes) and IO at the same time, within one task. Quickly and
efficiently processing IO completions can improve whole-system latency and
throughput substantially in some cases (journalling, indexes and other
high-contention areas - which often have a low queue depth). This is true
*especially* when there also is lock contention, which tends to make efficient
IO scheduling harder.

The second use case is the ability to efficiently wait in several tasks for
one IO to be processed. The prototypical example here is group commit/journal
flush, where each task can only continue once the journal flush has
completed. Typically one of waiters has to do a small amount of work with the
completion (updating a few shared memory variables) before the other waiters
can be released. It is hard to implement this efficiently and race-free with
io_uring right now without adding locking around *waiting* on the completion
side (instead of just consumption of completions). One cannot just wait on the
io_uring, because of a) the obvious race that another process could reap all
completions between check and wait b) there is no good way to wake up other
waiters once the userspace portion of IO completion is through.

All answers for postgres:

> 1) Do we need PI?

Not right now.

Not related to io_uring: I do wish there were a lower overhead (and lower
guarantees) version of PI futexes. Not for correctness reasons, but
performance. Granting the waiter's timeslice to the lock holder would improve
common contention scenarios with more runnable tasks than cores.

> 2) Do we need requeue? Anything else?

I can see requeue being useful, but I haven't thought it through fully.

Do the wake/wait ops as you have them right now support bitsets?

> 3) How hot waits are? May be done fully async avoiding io-wq, but
> apparently requires more changes in futex code.

The waits can be quite hot, most prominently on low latency storage, but not


Andres Freund