Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?

From: Nicholas Piggin
Date: Mon Jun 29 2020 - 15:14:24 EST


Excerpts from Nicholas Piggin's message of June 29, 2020 1:28 pm:
> Excerpts from Linus Torvalds's message of June 28, 2020 3:39 pm:
>> On Fri, Jun 26, 2020 at 8:43 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>
>>> I ended up with something like the below.. but it is too warm to think
>>> properly.
>>>
>>> I don't particularly like WQ_FLAG_PAGEWAITERS, but I liked open-coding
>>> all that even less.
>>
>> Ugh.
>>
>> I think I have a much simpler approach, actually.
>>
>> So the *problem* is purely around that
>>
>> if (behavior == EXCLUSIVE) {
>> if (!test_and_set_bit_lock(bit_nr, &page->flags))
>> break;
>> } else ..
>>
>> and in particular it is purely *after* that test_and_set_bit_lock()
>> case. We have two cases:
>>
>> (a) *If* we get the lock there, we're good, and all done, and we have
>> the lock. We don't care about any other wakeups, because they'll be
>> stale anyway (the thing that released the lock that we just took) and
>> because we got the lock, no other exclusive waiters should be woken up
>> anyway (and we will in turn wake up any waiters when we release it)
>>
>> (b) we did *not* get the lock, because somebody else got it and is
>> about to immediately unlock again. And that _future_ wakeup that they
>> send might get lost because it might end up targeting us (but we might
>> just exit and not care).
>>
>> Agreed?
>>
>> So the only case that really matters is that we didn't get the lock,
>> but we must *not* be woken up afterwards.
>>
>> So how about the attached trivial two-liner? We solve the problem by
>> simply marking ourselves TASK_RUNNING, which means that we won't be
>> counted as an exclusive wakeup.
>>
>> Ok, so the "one" line to do that is that is actually two lines:
>>
>> __set_current_state(TASK_RUNNING);
>> smp_mb__before_atomic();
>>
>> and there's four lines of comments to go with it, but it really is
>> very simple: if we do that before we do the test_and_set_bit_lock(),
>> no wakeups will be lost, because we won't be sleeping for that wakeup.
>>
>> I'm not entirely happy about that "smp_mb__before_atomic()". I think
>> it's right in practice that test_and_set_bit_lock() (when it actually
>> does a write) has at LEAST atomic seqmantics, so I think it's good.
>> But it's not pretty.
>>
>> But I don't want to use a heavy
>>
>> set_current_state(TASK_RUNNING);
>> if (!test_and_set_bit_lock(bit_nr, &page->flags)) ..
>>
>> sequence, because at least on x86, that test_and_set_bit_lock()
>> already has a memory barrier in it, so the extra memory barrier from
>> set_current_state() is all kinds of pointless.
>>
>> Hmm?
>
> Wow good catch. Does bit_is_set even have to be true? If it's woken due
> to a signal, it may still be on the waitqueue right?

No, ignore this part (which you explained well it was just a thinko,
and your patch of course would not have worked if this was the case):
the exclusive wake up doesn't get lost if schedule() was called because
state goes back to running regardless of what woke it.

I still prefer if it can be changed to the below fix though.

> works, but it looks like a pretty standard variant of "don't lose
> wakeups" bug.
>
> prepare_to_wait_event() has a pretty good pattern (and comment), I would
> favour using that (test the signal when inserting on the waitqueue).
>
> @@ -1133,6 +1133,15 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> for (;;) {
> spin_lock_irq(&q->lock);
>
> + if (signal_pending_state(state, current)) {
> + /* Must not lose an exclusive wake up, see
> + * prepare_to_wait_event comment */
> + list_del_init(&wait->entry);
> + spin_unlock_irq(&q->lock);> + ret = -EINTR;
> + break;
> + }
> +
> if (likely(list_empty(&wait->entry))) {
> __add_wait_queue_entry_tail(q, wait);
> SetPageWaiters(page);
> @@ -1157,11 +1166,6 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> break;
> }
>
> - if (signal_pending_state(state, current)) {
> - ret = -EINTR;
> - break;
> - }
> -
> if (behavior == DROP) {
> /*
> * We can no longer safely access page->flags:
>
> - mutex_lock_common does the signal check under its wait queue lock.
>
> - rwsem looks like it does it backward and checks if it was woken if it
> got a signal and tries to handle it that way (hopefully okay, I prefer
> the standard pattern).
>
> - futex unqueues and tests for wakeup before testing signal.
>
> Etc. And it's not even exclusive to signals of course, those are just
> the typical asynchronous thing that would wake us without removing from
> the wait queue. Bit of a shame there is no standard pattern to follow
> though.
>
> I wonder how you could improve that. finish_wait could WARN_ON an
> exclusive waiter being found on the queue?
>
> @@ -377,6 +377,7 @@ void finish_wait(struct wait_queue_head *wq_head, struct wait_queue_entry *wq_en
> * the list).
> */
> if (!list_empty_careful(&wq_entry->entry)) {
> + WARN_ON(wq_entry->flags & WQ_FLAG_EXCLUSIVE);
> spin_lock_irqsave(&wq_head->lock, flags);
> list_del_init(&wq_entry->entry);
> spin_unlock_irqrestore(&wq_head->lock, flags);
>
> That doesn't catch a limited count of wake ups, maybe if you passed in
> success value to finish_wait, it could warn in case a failure has
> WQ_FLAG_WOKEN. That doesn't help things that invent their own waitqueues
> mind you. I wonder if we could do some kind of generic annotations for
> anyone implementing wait queues to call, which could have debug checks
> implemented?
>
> Thanks,
> Nick
>
>