Re: futex+io_uring: futex_q::task can maybe be dangling (but is not actually accessed, so it's fine)

From: Jens Axboe
Date: Fri Jan 10 2025 - 22:36:51 EST


On 1/10/25 3:26 PM, Jann Horn wrote:
> Hi!
>
> I think there is some brittle interaction between futex and io_uring;
> but to be clear, I don't think that there is actually a bug here.
>
> In io_uring, when a IORING_OP_FUTEX_WAIT SQE is submitted with
> IOSQE_ASYNC, an io_uring worker thread can queue up futex waiters via
> the following path:
>
> ret_from_fork -> io_wq_worker -> io_worker_handle_work ->
> io_wq_submit_work[called as ->do_work] -> io_issue_sqe ->
> io_futex_wait[called as .issue] -> futex_queue -> __futex_queue
>
> futex_q instances normally describe synchronously waiting tasks, and
> __futex_queue() records the identity of the calling task (which is
> normally the waiter) in futex_q::task. But io_uring waits on futexes
> asynchronously instead; from io_uring's perspective, a pending futex
> wait is not tied to the task that called into futex_queue(), it is
> just tied to the userspace task on behalf of which the io_uring worker
> is acting (I think). So when a futex wait operation is started by an
> io_uring worker task, I think that worker task could go away while the
> futex_q is still queued up on the futex, and so I think we can end up
> with a futex_q whose "task" member points to a freed task_struct.
>
> The good part is that (from what I understand) that "task" member is
> only used for two purposes:
>
> 1. futexes that are either created through the normal futex syscalls
> use futex_wake_mark as their .wake callback, which needs the task
> pointer to know which task should be woken.
> 2. PI futexes use it for priority inheritance magic (and AFAICS there
> is no way for io_uring to interface with PI futexes)
>
> I'm not sure what is the best thing to do is here - maybe the current
> situation is fine, and I should just send a patch that adds a comment
> describing this to the definition of the "task" member? Or maybe it
> would be better for robustness to ensure that the "task" member is
> NULLed out in those cases, though that would probably make the
> generated machine code a little bit more ugly? (Or maybe I totally
> misunderstand what's going on and there isn't actually a dangling
> pointer...)

Good find. And yes the io-wq task could go away, if there's more of
them.

While it isn't an issue, dangling pointers make me nervous. We could do
something like the (totally untested) below patch, where io_uring just
uses __futex_queue() and make __futex_queue() take a task_struct as
well. Other callers pass in 'current'.

It's quite possible that it'd be safe to just use __futex_queue() and
clear ->task after the queue, but if we pass in NULL from the get-go,
then there's definitely not a risk of anything being dangling.


diff --git a/io_uring/futex.c b/io_uring/futex.c
index 30139cc150f2..985ad10cc6d5 100644
--- a/io_uring/futex.c
+++ b/io_uring/futex.c
@@ -338,7 +338,12 @@ int io_futex_wait(struct io_kiocb *req, unsigned int issue_flags)
hlist_add_head(&req->hash_node, &ctx->futex_list);
io_ring_submit_unlock(ctx, issue_flags);

- futex_queue(&ifd->q, hb);
+ /*
+ * Don't pass in current as the task associated with the
+ * futex queue, that only makes sense for sync syscalls.
+ */
+ __futex_queue(&ifd->q, hb, NULL);
+ spin_unlock(&hb->lock);
return IOU_ISSUE_SKIP_COMPLETE;
}

diff --git a/kernel/futex/core.c b/kernel/futex/core.c
index ebdd76b4ecbb..3db8567f5a44 100644
--- a/kernel/futex/core.c
+++ b/kernel/futex/core.c
@@ -532,7 +532,8 @@ void futex_q_unlock(struct futex_hash_bucket *hb)
futex_hb_waiters_dec(hb);
}

-void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
+void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb,
+ struct task_struct *task)
{
int prio;

@@ -548,7 +549,7 @@ void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)

plist_node_init(&q->list, prio);
plist_add(&q->list, &hb->chain);
- q->task = current;
+ q->task = task;
}

/**
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 99b32e728c4a..2d3f1d53c854 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -285,7 +285,8 @@ static inline int futex_get_value_locked(u32 *dest, u32 __user *from)
}

extern void __futex_unqueue(struct futex_q *q);
-extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb);
+extern void __futex_queue(struct futex_q *q, struct futex_hash_bucket *hb,
+ struct task_struct *task);
extern int futex_unqueue(struct futex_q *q);

/**
@@ -303,7 +304,7 @@ extern int futex_unqueue(struct futex_q *q);
static inline void futex_queue(struct futex_q *q, struct futex_hash_bucket *hb)
__releases(&hb->lock)
{
- __futex_queue(q, hb);
+ __futex_queue(q, hb, current);
spin_unlock(&hb->lock);
}

diff --git a/kernel/futex/pi.c b/kernel/futex/pi.c
index d62cca5ed8f4..635c7d5d4222 100644
--- a/kernel/futex/pi.c
+++ b/kernel/futex/pi.c
@@ -982,7 +982,7 @@ int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int tryl
/*
* Only actually queue now that the atomic ops are done:
*/
- __futex_queue(&q, hb);
+ __futex_queue(&q, hb, current);

if (trylock) {
ret = rt_mutex_futex_trylock(&q.pi_state->pi_mutex);

--
Jens Axboe