Re: [PATCH] mm, page_alloc: Do not wake kswapd with zone lock held

From: Vlastimil Babka
Date: Fri Jan 04 2019 - 03:18:48 EST


On 1/3/19 11:57 PM, Mel Gorman wrote:
> syzbot reported the following regression in the latest merge window
> and it was confirmed by Qian Cai that a similar bug was visible from a
> different context.
>
> ======================================================
> WARNING: possible circular locking dependency detected
> 4.20.0+ #297 Not tainted
> ------------------------------------------------------
> syz-executor0/8529 is trying to acquire lock:
> 000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
> __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
>
> but task is already holding lock:
> 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
> include/linux/spinlock.h:329 [inline]
> 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
> mm/page_alloc.c:2548 [inline]
> 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
> mm/page_alloc.c:3021 [inline]
> 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
> mm/page_alloc.c:3050 [inline]
> 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
> mm/page_alloc.c:3072 [inline]
> 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
> get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491
>
> It appears to be a false positive in that the only way the lock
> ordering should be inverted is if kswapd is waking itself and the
> wakeup allocates debugging objects which should already be allocated
> if it's kswapd doing the waking. Nevertheless, the possibility exists
> and so it's best to avoid the problem.
>
> This patch flags a zone as needing a kswapd using the, surprisingly,
> unused zone flag field. The flag is read without the lock held to
> do the wakeup. It's possible that the flag setting context is not
> the same as the flag clearing context or for small races to occur.
> However, each race possibility is harmless and there is no visible
> degredation in fragmentation treatment.
>
> While zone->flag could have continued to be unused, there is potential
> for moving some existing fields into the flags field instead. Particularly
> read-mostly ones like zone->initialized and zone->contiguous.
>
> Reported-by: syzbot+93d94a001cfbce9e60e1@xxxxxxxxxxxxxxxxxxxxxxxxx
> Tested-by: Qian Cai <cai@xxxxxx>
> Signed-off-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an
external fragmentation event occurs")
Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

> ---
> include/linux/mmzone.h | 6 ++++++
> mm/page_alloc.c | 8 +++++++-
> 2 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cc4a507d7ca4..842f9189537b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -520,6 +520,12 @@ enum pgdat_flags {
> PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */
> };
>
> +enum zone_flags {
> + ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
> + * Cleared when kswapd is woken.
> + */
> +};
> +
> static inline unsigned long zone_managed_pages(struct zone *zone)
> {
> return (unsigned long)atomic_long_read(&zone->managed_pages);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cde5dac6229a..d295c9bc01a8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2214,7 +2214,7 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> */
> boost_watermark(zone);
> if (alloc_flags & ALLOC_KSWAPD)
> - wakeup_kswapd(zone, 0, 0, zone_idx(zone));
> + set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
>
> /* We are not allowed to try stealing from the whole block */
> if (!whole_block)
> @@ -3102,6 +3102,12 @@ struct page *rmqueue(struct zone *preferred_zone,
> local_irq_restore(flags);
>
> out:
> + /* Separate test+clear to avoid unnecessary atomics */
> + if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
> + clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
> + wakeup_kswapd(zone, 0, 0, zone_idx(zone));
> + }
> +
> VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
> return page;
>
>