Re: [RFC PATCH] mm: silence soft lockups from unlock_page

From: Michal Hocko
Date: Mon Aug 03 2020 - 09:14:58 EST


Hi,
sorry for being late in discussion (was offline or very busy with other
stuff).

I hope I got it right and this is the latest version of your patches. Btw.
do you still think that increasing PAGE_WAIT_TABLE_BITS is reasonable.
In the meantime I have learned that the customer suffering from the
issue is very unlikely to reboot the machine anytime soon or even
willing to test a non-trivial patch. We do not own any machine which
exhibit this problem unfortunately. So it is quite unlikely I can
help with testing.

Also does it make sense to put this into mmotm tree for a while to get a
larger testing coverage?

I didn't get to review these patch and it will take some time to do that
because this is really subtle area.

Thanks for helping out and sorry that I cannot really help much more
now.

On Fri 24-07-20 16:25:56, Linus Torvalds wrote:
> From 0bccb60841cc52a9aa6e9cc6b7eff59d1983e8fa Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Date: Thu, 23 Jul 2020 10:16:49 -0700
> Subject: [PATCH 1/2] mm: rewrite wait_on_page_bit_common() logic
>
> It turns out that wait_on_page_bit_common() had several problems,
> ranging from just unfair behavioe due to re-queueing at the end of the
> wait queue when re-trying, and an outright bug that could result in
> missed wakeups (but probably never happened in practice).
>
> This rewrites the whole logic to avoid both issues, by simply moving the
> logic to check (and possibly take) the bit lock into the wakeup path
> instead.
>
> That makes everything much more straightforward, and means that we never
> need to re-queue the wait entry: if we get woken up, we'll be notified
> through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
> and everything will have been done for us.
>
> Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbTs8ELqKFt9B3pH85a8fGD73w@xxxxxxxxxxxxxx/
> Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
> Reported-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> Reported-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxx>
> Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> ---
> mm/filemap.c | 132 +++++++++++++++++++++++++++++++++------------------
> 1 file changed, 85 insertions(+), 47 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 385759c4ce4b..8c3d3e233d37 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1002,6 +1002,7 @@ struct wait_page_queue {
>
> static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
> {
> + int ret;
> struct wait_page_key *key = arg;
> struct wait_page_queue *wait_page
> = container_of(wait, struct wait_page_queue, wait);
> @@ -1014,17 +1015,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
> return 0;
>
> /*
> - * Stop walking if it's locked.
> - * Is this safe if put_and_wait_on_page_locked() is in use?
> - * Yes: the waker must hold a reference to this page, and if PG_locked
> - * has now already been set by another task, that task must also hold
> - * a reference to the *same usage* of this page; so there is no need
> - * to walk on to wake even the put_and_wait_on_page_locked() callers.
> + * If it's an exclusive wait, we get the bit for it, and
> + * stop walking if we can't.
> + *
> + * If it's a non-exclusive wait, then the fact that this
> + * wake function was called means that the bit already
> + * was cleared, and we don't care if somebody then
> + * re-took it.
> */
> - if (test_bit(key->bit_nr, &key->page->flags))
> - return -1;
> + ret = 0;
> + if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> + if (test_and_set_bit(key->bit_nr, &key->page->flags))
> + return -1;
> + ret = 1;
> + }
> + wait->flags |= WQ_FLAG_WOKEN;
>
> - return autoremove_wake_function(wait, mode, sync, key);
> + wake_up_state(wait->private, mode);
> +
> + /*
> + * Ok, we have successfully done what we're waiting for,
> + * and we can unconditionally remove the wait entry.
> + *
> + * Note that this has to be the absolute last thing we do,
> + * since after list_del_init(&wait->entry) the wait entry
> + * might be de-allocated and the process might even have
> + * exited.
> + *
> + * We _really_ should have a "list_del_init_careful()" to
> + * properly pair with the unlocked "list_empty_careful()"
> + * in finish_wait().
> + */
> + smp_mb();
> + list_del_init(&wait->entry);
> + return ret;
> }
>
> static void wake_up_page_bit(struct page *page, int bit_nr)
> @@ -1103,16 +1127,31 @@ enum behavior {
> */
> };
>
> +/*
> + * Attempt to check (or get) the page bit, and mark the
> + * waiter woken if successful.
> + */
> +static inline bool trylock_page_bit_common(struct page *page, int bit_nr,
> + struct wait_queue_entry *wait)
> +{
> + if (wait->flags & WQ_FLAG_EXCLUSIVE) {
> + if (test_and_set_bit(bit_nr, &page->flags))
> + return false;
> + } else if (test_bit(bit_nr, &page->flags))
> + return false;
> +
> + wait->flags |= WQ_FLAG_WOKEN;
> + return true;
> +}
> +
> static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> struct page *page, int bit_nr, int state, enum behavior behavior)
> {
> struct wait_page_queue wait_page;
> wait_queue_entry_t *wait = &wait_page.wait;
> - bool bit_is_set;
> bool thrashing = false;
> bool delayacct = false;
> unsigned long pflags;
> - int ret = 0;
>
> if (bit_nr == PG_locked &&
> !PageUptodate(page) && PageWorkingset(page)) {
> @@ -1130,48 +1169,47 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> wait_page.page = page;
> wait_page.bit_nr = bit_nr;
>
> - for (;;) {
> - spin_lock_irq(&q->lock);
> + /*
> + * Do one last check whether we can get the
> + * page bit synchronously.
> + *
> + * Do the SetPageWaiters() marking before that
> + * to let any waker we _just_ missed know they
> + * need to wake us up (otherwise they'll never
> + * even go to the slow case that looks at the
> + * page queue), and add ourselves to the wait
> + * queue if we need to sleep.
> + *
> + * This part needs to be done under the queue
> + * lock to avoid races.
> + */
> + spin_lock_irq(&q->lock);
> + SetPageWaiters(page);
> + if (!trylock_page_bit_common(page, bit_nr, wait))
> + __add_wait_queue_entry_tail(q, wait);
> + spin_unlock_irq(&q->lock);
>
> - if (likely(list_empty(&wait->entry))) {
> - __add_wait_queue_entry_tail(q, wait);
> - SetPageWaiters(page);
> - }
> + /*
> + * From now on, all the logic will be based on
> + * the WQ_FLAG_WOKEN flag, and the and the page
> + * bit testing (and setting) will be - or has
> + * already been - done by the wake function.
> + *
> + * We can drop our reference to the page.
> + */
> + if (behavior == DROP)
> + put_page(page);
>
> + for (;;) {
> set_current_state(state);
>
> - spin_unlock_irq(&q->lock);
> -
> - bit_is_set = test_bit(bit_nr, &page->flags);
> - if (behavior == DROP)
> - put_page(page);
> -
> - if (likely(bit_is_set))
> - io_schedule();
> -
> - if (behavior == EXCLUSIVE) {
> - if (!test_and_set_bit_lock(bit_nr, &page->flags))
> - break;
> - } else if (behavior == SHARED) {
> - if (!test_bit(bit_nr, &page->flags))
> - break;
> - }
> -
> - if (signal_pending_state(state, current)) {
> - ret = -EINTR;
> + if (signal_pending_state(state, current))
> break;
> - }
>
> - if (behavior == DROP) {
> - /*
> - * We can no longer safely access page->flags:
> - * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
> - * there is a risk of waiting forever on a page reused
> - * for something that keeps it locked indefinitely.
> - * But best check for -EINTR above before breaking.
> - */
> + if (wait->flags & WQ_FLAG_WOKEN)
> break;
> - }
> +
> + io_schedule();
> }
>
> finish_wait(q, wait);
> @@ -1190,7 +1228,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
> * bother with signals either.
> */
>
> - return ret;
> + return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
> }
>
> void wait_on_page_bit(struct page *page, int bit_nr)
> --
> 2.28.0.rc0.3.g1e25d3a62f
>

> From 93f0263b9b952a1c449cec56a6aadf6320e821f9 Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Date: Thu, 23 Jul 2020 12:33:41 -0700
> Subject: [PATCH 2/2] list: add "list_del_init_careful()" to go with
> "list_empty_careful()"
>
> That gives us ordering guarantees around the pair.
>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> ---
> include/linux/list.h | 20 +++++++++++++++++++-
> kernel/sched/wait.c | 2 +-
> mm/filemap.c | 7 +------
> 3 files changed, 21 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/list.h b/include/linux/list.h
> index aff44d34f4e4..0d0d17a10d25 100644
> --- a/include/linux/list.h
> +++ b/include/linux/list.h
> @@ -282,6 +282,24 @@ static inline int list_empty(const struct list_head *head)
> return READ_ONCE(head->next) == head;
> }
>
> +/**
> + * list_del_init_careful - deletes entry from list and reinitialize it.
> + * @entry: the element to delete from the list.
> + *
> + * This is the same as list_del_init(), except designed to be used
> + * together with list_empty_careful() in a way to guarantee ordering
> + * of other memory operations.
> + *
> + * Any memory operations done before a list_del_init_careful() are
> + * guaranteed to be visible after a list_empty_careful() test.
> + */
> +static inline void list_del_init_careful(struct list_head *entry)
> +{
> + __list_del_entry(entry);
> + entry->prev = entry;
> + smp_store_release(&entry->next, entry);
> +}
> +
> /**
> * list_empty_careful - tests whether a list is empty and not being modified
> * @head: the list to test
> @@ -297,7 +315,7 @@ static inline int list_empty(const struct list_head *head)
> */
> static inline int list_empty_careful(const struct list_head *head)
> {
> - struct list_head *next = head->next;
> + struct list_head *next = smp_load_acquire(&head->next);
> return (next == head) && (next == head->prev);
> }
>
> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
> index ba059fbfc53a..01f5d3020589 100644
> --- a/kernel/sched/wait.c
> +++ b/kernel/sched/wait.c
> @@ -389,7 +389,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
> int ret = default_wake_function(wq_entry, mode, sync, key);
>
> if (ret)
> - list_del_init(&wq_entry->entry);
> + list_del_init_careful(&wq_entry->entry);
>
> return ret;
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8c3d3e233d37..991503bbf922 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1041,13 +1041,8 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
> * since after list_del_init(&wait->entry) the wait entry
> * might be de-allocated and the process might even have
> * exited.
> - *
> - * We _really_ should have a "list_del_init_careful()" to
> - * properly pair with the unlocked "list_empty_careful()"
> - * in finish_wait().
> */
> - smp_mb();
> - list_del_init(&wait->entry);
> + list_del_init_careful(&wait->entry);
> return ret;
> }
>
> --
> 2.28.0.rc0.3.g1e25d3a62f
>


--
Michal Hocko
SUSE Labs