Re: wait_on_page_bit_common(TASK_KILLABLE, EXCLUSIVE) can miss wakeup?

From: Linus Torvalds
Date: Tue Jun 30 2020 - 14:57:58 EST


On Tue, Jun 30, 2020 at 11:36 AM Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
>
> I swear, I too thought about changing wake_page_function() but got lost ;)
>
> The very fact it returns -1 if test_bit() is true suggests it can do more.

wake functions can indeed do more.

You can return -1 to say "stop traversing the list, nobody else on
this waitlist is relevant" (and that's separate from the "I did the
exclusive one".

Returning "0" is a success, but implies it shouldn't count as an
exclusive wakeup (generally because "try_to_wake_up()" didn't return
1, so the entry on the list wasn't actually sleeping and might already
have passed the critical region).

And returning "1" means that you _actually_ woke something up, so that
the exclusive counting triggers.

So that 0-vs-1 is very much exactly because of the race with "I'm an
exclusive waiter, and if an event comes in while I'm _actively_
waiting for it, then I promise I will take it". And the difference is
exactly about whether try_to_wake_up() actually changed the state of
the process or not.

But in addition to these "should I continue and/or count exclusive
waiters", wake functions can _do_ things too. The common one is
"autoremove", which is basically "if you actually woke something up,
then remove it from the list" (so that one generally goes along with a
successful try_to_wake_up() and returning 1.

Doing that can help scalability enormously, because if you stay on the
list and only remove yourself in the final "finish_waikt()", then if
there are lots of waiters and lots of wakeup events, the wakups will
traverse your entry over and over and over again. So autoremove is
generally a good thing.

An autoremove() thing can also be used at finish_wait() time to decide
whether you were actually woken up by that list or not (like the
WQ_FLAG_WOKEN, just implicitly). Remember: you can be on _many_ wait
lists, and you can be woken up by non-waitlist events like signals etc
too, so you can use WQ_FLAG_WOKEN (if you use woken_wake_function) or
the "am I still on the list" (if you use autoremove) to decide that
_that_ particular wakeup source triggered.

But some code wants to _stay_ on the list, because maybe they want to
be notified about multiple things and don't want to re-insert
themselves onto the list in between. That's the default behavior if
you don't do anything at all and just use the normal wake queue init
things. It's not generally the best thing to do, but it _can_ be
useful, and I think it's the default one purely for historical reasons
- ie that's how they all used to work, and the "autoremove" behavior
came in because of the scalability concerns.

But the waiting side is the one that decides which wakeup function it
wants for the wakeup event. So you can have the "please wake me up and
remove me from the wait list when you do". _and_ then add your own
logic as well, if you want to.

Yes, our wait queues are complicated. They are very powerful, though,
and they have been designed to easily avoid races many different ways
(and to also very naturally work with waiting for multiple very
different events, where select and poll are the obvious cases, but
even a single "read()" function _could_ decide to use multiple
wait-queues if there's another wait-queue for exceptional
circumstances, for example).

Linus