Re: [PATCH] mm/hugetlb: Defer freeing of huge pages if in non-task context
From: Andrew Morton
Date: Mon Dec 16 2019 - 16:51:15 EST
On Mon, 16 Dec 2019 13:27:39 -0500 Waiman Long <longman@xxxxxxxxxx> wrote:
> The following lockdep splat was observed when a certain hugetlbfs test
> was run:
>
> ...
>
> Both the hugetlb_lock and the subpool lock can be acquired in
> free_huge_page(). One way to solve the problem is to make both locks
> irq-safe. An alternative is to defer the freeing to a workqueue job.
>
> This patch implements the deferred freeing by adding a
> free_hpage_workfn() work function to do the actual freeing. The
> free_huge_page() call in a non-task context saves the page to be freed
> in the hpage_freelist linked list in a lockless manner.
>
> The generic workqueue is used to process the work, but a dedicated
> workqueue can be used instead if it is desirable to have the huge page
> freed ASAP.
>
> ...
>
> @@ -1199,6 +1199,73 @@ void free_huge_page(struct page *page)
> spin_unlock(&hugetlb_lock);
> }
>
> +/*
> + * As free_huge_page() can be called from a non-task context, we have
> + * to defer the actual freeing in a workqueue to prevent potential
> + * hugetlb_lock deadlock.
> + *
> + * free_hpage_workfn() locklessly retrieves the linked list of pages to
> + * be freed and frees them one-by-one. As the page->mapping pointer is
> + * going to be cleared in __free_huge_page() anyway, it is reused as the
> + * next pointer of a singly linked list of huge pages to be freed.
> + */
> +#define NEXT_PENDING ((struct page *)-1)
> +static struct page *hpage_freelist;
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> + struct page *curr, *next;
> + int cnt = 0;
> +
> + do {
> + curr = xchg(&hpage_freelist, NULL);
> + if (!curr)
> + break;
> +
> + while (curr) {
> + next = (struct page *)READ_ONCE(curr->mapping);
> + if (next == NEXT_PENDING) {
> + cpu_relax();
> + continue;
> + }
> + __free_huge_page(curr);
> + curr = next;
> + cnt++;
> + }
> + } while (!READ_ONCE(hpage_freelist));
> +
> + if (!cnt)
> + return;
> + pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> +}
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +void free_huge_page(struct page *page)
> +{
> + /*
> + * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
> + */
> + if (!in_task()) {
> + struct page *next;
> +
> + page->mapping = (struct address_space *)NEXT_PENDING;
> + next = xchg(&hpage_freelist, page);
> + WRITE_ONCE(page->mapping, (struct address_space *)next);
The NEXT_PENDING stuff could do with some commenting, I think. It's
reasonably obvious, but not obvious enough. For example, why does the
second write to page->mapping use WRITE_ONCE() but the first does not?
Please spell out the design, fully.
> + schedule_work(&free_hpage_work);
> + return;
> + }
> +
> + /*
> + * Racing may prevent some deferred huge pages in hpage_freelist
> + * from being freed. Check here and call schedule_work() if that
> + * is the case.
> + */
> + if (unlikely(hpage_freelist && !work_pending(&free_hpage_work)))
> + schedule_work(&free_hpage_work);
> +
> + __free_huge_page(page);
> +}
> +
> static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
> {
> INIT_LIST_HEAD(&page->lru);
Otherwise it looks OK to me. Deferring freeing in this way is
generally lame and gives rise to concerns about memory exhaustion in
strange situations, and to concerns about various memory accounting
stats being logically wrong for short periods. But we already do this
in (too) many places, so fingers crossed :(