Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?

From: Nicholas Piggin
Date: Sun Jun 28 2020 - 23:30:34 EST


Excerpts from Linus Torvalds's message of June 28, 2020 3:39 pm:
> On Fri, Jun 26, 2020 at 8:43 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>
>> I ended up with something like the below.. but it is too warm to think
>> properly.
>>
>> I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
>> all that even less.
>
> Ugh.
>
> I think I have a much simpler approach, actually.
>
> So the *problem* is purely around that
>
> if (behavior == EXCLUSIVE) {
> if (!test_and_set_bit_lock(bit_nr, &page->flags))
> break;
> } else ..
>
> and in particular it is purely *after* that test_and_set_bit_lock()
> case. We have two cases:
>
> (a) *If* we get the lock there, we're good, and all done, and we have
> the lock. We don't care about any other wakeups, because they'll be
> stale anyway (the thing that released the lock that we just took) and
> because we got the lock, no other exclusive waiters should be woken up
> anyway (and we will in turn wake up any waiters when we release it)
>
> (b) we did *not* get the lock, because somebody else got it and is
> about to immediately unlock again. And that _future_ wakeup that they
> send might get lost because it might end up targeting us (but we might
> just exit and not care).
>
> Agreed?
>
> So the only case that really matters is that we didn't get the lock,
> but we must *not* be woken up afterwards.
>
> So how about the attached trivial two-liner? We solve the problem by
> simply marking ourselves TASK_RUNNING, which means that we won't be
> counted as an exclusive wakeup.
>
> Ok, so the "one" line to do that is that is actually two lines:
>
> __set_current_state(TASK_RUNNING);
> smp_mb__before_atomic();
>
> and there's four lines of comments to go with it, but it really is
> very simple: if we do that before we do the test_and_set_bit_lock(),
> no wakeups will be lost, because we won't be sleeping for that wakeup.
>
> I'm not entirely happy about that "smp_mb__before_atomic()". I think
> it's right in practice that test_and_set_bit_lock() (when it actually
> does a write) has at LEAST atomic seqmantics, so I think it's good.
> But it's not pretty.
>
> But I don't want to use a heavy
>
> set_current_state(TASK_RUNNING);
> if (!test_and_set_bit_lock(bit_nr, &page->flags)) ..
>
> sequence, because at least on x86, that test_and_set_bit_lock()
> already has a memory barrier in it, so the extra memory barrier from
> set_current_state() is all kinds of pointless.
>
> Hmm?

Wow good catch. Does bit_is_set even have to be true? If it's woken due
to a signal, it may still be on the waitqueue right? I think your patch
works, but it looks like a pretty standard variant of "don't lose
wakeups" bug.

prepare_to_wait_event() has a pretty good pattern (and comment), I would
favour using that (test the signal when inserting on the waitqueue).

@@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
for (;;) {
spin_lock_irq(&q->lock);

+ if (signal_pending_state(state, current)) {
+ /* Must not lose an exclusive wake up, see
+ * prepare_to_wait_event comment */
+ list_del_init(&wait->entry);
+ spin_unlock_irq(&q->lock);
+ ret = -EINTR;
+ break;
+ }
+
if (likely(list_empty(&wait->entry))) {
__add_wait_queue_entry_tail(q, wait);
SetPageWaiters(page);
@@ -1157,11 +1166,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
break;
}

- if (signal_pending_state(state, current)) {
- ret = -EINTR;
- break;
- }
-
if (behavior == DROP) {
/*
* We can no longer safely access page->flags:

- mutex_lock_common does the signal check under its wait queue lock.

- rwsem looks like it does it backward and checks if it was woken if it
got a signal and tries to handle it that way (hopefully okay, I prefer
the standard pattern).

- futex unqueues and tests for wakeup before testing signal.

Etc. And it's not even exclusive to signals of course, those are just
the typical asynchronous thing that would wake us without removing from
the wait queue. Bit of a shame there is no standard pattern to follow
though.

I wonder how you could improve that. finish_wait could WARN_ON an
exclusive waiter being found on the queue?

@@ -377,6 +377,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
* the list).
*/
if (!list_empty_careful(&wq_entry->entry)) {
+ WARN_ON(wq_entry->flags & WQ_FLAG_EXCLUSIVE);
spin_lock_irqsave(&wq_head->lock, flags);
list_del_init(&wq_entry->entry);
spin_unlock_irqrestore(&wq_head->lock, flags);

That doesn't catch a limited count of wake ups, maybe if you passed in
success value to finish_wait, it could warn in case a failure has
WQ_FLAG_WOKEN. That doesn't help things that invent their own waitqueues
mind you. I wonder if we could do some kind of generic annotations for
anyone implementing wait queues to call, which could have debug checks
implemented?

Thanks,
Nick