Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?

From: Linus Torvalds
Date: Tue Jun 30 2020 - 14:02:47 EST


On Tue, Jun 30, 2020 at 4:51 AM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> On 06/30, Oleg Nesterov wrote:
> >
> > may be someting like this
>
> or this ...

Guys, I'd suggest we either

(a) do the minimal ugly one-liner workaround

or

(b) do something *much* more radical.

What would that "radical" thing be? Just move the "behavior" bit into
the "struct wait_page_key" itself, and handle it at wakeup time!

Right now we have

- wake_page_function() basically tests that it's the right bit and
page, and that the bit has actually cleared

- then it calls "autoremove_wake_function()" to wake things up and
remove you from the wait list if that wqoke you up.

I would suggest that if we want to clean this up, we do

- add "behavior" to "struct wait_page_key"

- now, for the "exclusive" case, instead of doing

if (test_bit(key->bit_nr, &key->page->flags))
return -1;

the wake_page_function() would actually do

if (test_and_set_bit(key->bit_nr, &key->page->flags))
return -1;
/*
* Careful, careful: we need to make sure that the *last*
* thing that touches 'wq_entry' is setting WQ_FLAG_WOKEN
*/
list_del_init(&wq_entry->entry);
default_wake_function(wq_entry, mode, sync, key);
smp_store_release(&wq_entry->flags, WQ_FLAG_WOKEN);

/* No point in waking up anybody else */
return -1;

while for the SHARED and DROP cases it would do

if (test_bit(key->bit_nr, &key->page->flags))
return -1;
list_del_init(&wq_entry->entry);
default_wake_function(wq_entry, mode, sync, key);
smp_store_release(&wq_entry->flags, WQ_FLAG_WOKEN);

return 0;

and now the actual _waiting_ side can basically just do

/* All the wait-queue setup is done outside the loop */
wait_page.behavior = behavior;

spin_lock_irq(&q->lock);
.. test the big again here, do the rigth thing for mode, exit if done ..
__add_wait_queue_entry_tail(q, wait);
SetPageWaiters(page);
set_current_state(state);
spin_unlock_irq(&q->lock);

for (;;) {
set_current_state(state);

/*
* Have we already been woken and are all done?
* We don't even need to remove ourselves from the
* wait-queues, that's been done for us.
*/
if (wait->flags & WQ_FLAG_WOKEN)
return 0;

if (signal_pending_state(state, current))
break;

io_schedule();
}
/* We got interrupted by a signal? */
finish_wait(q, wait);

/* But maybe we raced with being woken and are all done? */
if (wait->flags & WQ_FLAG_WOKEN)
return 0;

return -EINTR;

NOTE! The above doesn't do the thrashing/delayacct handling, and that
was intentional for clarity, because I actually think that should be
done in an outer function that just calls this actual "do the
lock/wait"

Look, doesn't that seem *much* simpler and clearer? Having that flag
in the wait-queue (that stays around even after it's been removed from
the list) shows that the state we're testing for - or the state we
wanted - is already ours.

Note that we have a "woken_wake_function()" that does that
WQ_FLAG_WOKEN thing, but I don't think it's careful about that final
WQ_FLAG_WOKEN bit setting, and the above function cares about that
(notice how it checks WQ_FLAG_WOKEN without ever taking the spinlock
that protects the other members of it, so the stack space can get
deallocated immediately once it sees that flag.

The above code has been written in my MUA, it may be entirely broken,
but it _feels_ right to me.

And yes, it's a lot bigger change than my one-liner. But honestly,
this is the kind of thing that the whole "struct wait_page_key" thing
is all about and exists for.

What do people think? I think it clarifies the logic and makes things
potentially much clearer. In fact, now the actual _waiting_ code never
looks at the struct page at all, so it can drop the reference to the
page early, outside the loop, adn that "drop" thing doesn't even have
to be visible on the waker side either (the wakeup logic is the same
as SHARED)

Hmm?

Linus