Re: [PATCH] hugetlbfs: Disable IRQ when taking hugetlb_lock in set_max_huge_pages()

From: Waiman Long
Date: Mon Dec 09 2019 - 21:37:10 EST


On 12/9/19 7:46 PM, Waiman Long wrote:
> On 12/9/19 11:49 AM, Matthew Wilcox wrote:
>> On Mon, Dec 09, 2019 at 11:01:50AM -0500, Waiman Long wrote:
>>> [ 613.245273] Call Trace:
>>> [ 613.256273] <IRQ>
>>> [ 613.265273] dump_stack+0x9a/0xf0
>>> [ 613.281273] mark_lock+0xd0c/0x12f0
>>> [ 613.341273] __lock_acquire+0x146b/0x48c0
>>> [ 613.401273] lock_acquire+0x14f/0x3b0
>>> [ 613.440273] _raw_spin_lock+0x30/0x70
>>> [ 613.477273] free_huge_page+0x36f/0xaa0
>>> [ 613.495273] bio_check_pages_dirty+0x2fc/0x5c0
>> Oh, this is fun. So we kicked off IO to a hugepage, then truncated or
>> otherwise caused the page to come free. Then the IO finished and did the
>> final put on the page ... from interrupt context. Splat. Not something
>> that's going to happen often, but can happen if a process dies during
>> IO or due to a userspace bug.
>>
>> Maybe we should schedule work to do the actual freeing of the page
>> instead of this rather large patch. It doesn't seem like a case we need
>> to optimise for.
> I think that may be a good idea to try it out.

It turns out using workqueue is more complex that I originally thought.
I currently come up with the following untested changes to do that:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..629ac000318b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct
page *pag
ÂÂÂÂÂÂÂ page[2].mapping = NULL;
Â}
Â
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
Â{
ÂÂÂÂÂÂÂ /*
ÂÂÂÂÂÂÂÂ * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,82 @@ void free_huge_page(struct page *page)
ÂÂÂÂÂÂÂ spin_unlock(&hugetlb_lock);
Â}
Â
+/*
+ * As free_huge_page() can be called from softIRQ context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ */
+struct hpage_to_free {
+ÂÂÂÂÂÂ struct pageÂÂÂÂÂÂÂÂÂÂÂÂ *page;
+ÂÂÂÂÂÂ struct hpage_to_freeÂÂÂ *next;
+};
+static struct hpage_to_free *hpage_freelist;
+#define NEXT_PENDINGÂÂ ((struct hpage_to_free *)-1)
+
+/*
+ * This work function locklessly retrieves the pages to be freed and
+ * frees them one-by-one.
+ */
+static void free_huge_page_workfn(struct work_struct *work)
+{
+ÂÂÂÂÂÂ struct hpage_to_free *curr, *next;
+
+recheck:
+ÂÂÂÂÂÂ curr = xchg(&hpage_freelist, NULL);
+ÂÂÂÂÂÂ if (!curr)
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ return;
+
+ÂÂÂÂÂÂ while (curr) {
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __free_huge_page(curr->page);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ next = READ_ONCE(curr->next);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ while (next == NEXT_PENDING) {
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ cpu_relax();
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ next = READ_ONCE(curr->next);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ }
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ kfree(curr);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ curr = next;
+ÂÂÂÂÂÂ }
+ÂÂÂÂÂÂ if (work && READ_ONCE(hpage_freelist))
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ goto recheck;
+}
+static DECLARE_WORK(free_huge_page_work, free_huge_page_workfn);
+
+void free_huge_page(struct page *page)
+{
+ÂÂÂÂÂÂ /*
+ÂÂÂÂÂÂÂ * Defer freeing in softIRQ context to avoid hugetlb_lock deadlock.
+ÂÂÂÂÂÂÂ */
+ÂÂÂÂÂÂ if (in_serving_softirq()) {
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct hpage_to_free *item, *next;
+
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /*
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * We are in serious trouble if kmalloc fails. In this
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ * case, we hope for the best and do the freeing now.
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ */
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ item = kmalloc(sizeof(*item), GFP_KERNEL);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ if (WARN_ON_ONCE(!item))
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ goto free_page_now;
+
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ item->page = page;
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ item->next = NEXT_PENDING;
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ next = xchg(&hpage_freelist, item);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ WRITE_ONCE(item->next, next);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ schedule_work(&free_huge_page_work);
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ return;
+ÂÂÂÂÂÂ }
+
+ÂÂÂÂÂÂ /*
+ÂÂÂÂÂÂÂ * Racing may prevent some of deferred huge pages in hpage_freelist
+ÂÂÂÂÂÂÂ * from being freed. Check here and call schedule_work() if that
+ÂÂÂÂÂÂÂ * is the case.
+ÂÂÂÂÂÂÂ */
+ÂÂÂÂÂÂ if (hpage_freelist && !work_pending(&free_huge_page_work))
+ÂÂÂÂÂÂÂÂÂÂÂÂÂÂ schedule_work(&free_huge_page_work);
+
+free_page_now:
+ÂÂÂÂÂÂ __free_huge_page(page);
+}
+
Âstatic void prep_new_huge_page(struct hstate *h, struct page *page, int
nid)
Â{
ÂÂÂÂÂÂÂ INIT_LIST_HEAD(&page->lru);

-----------------------------------------------------------------------------------------------------

I think it may be simpler and less risky to use spin_lock_bh() as you
have suggested. Also, the above code will not be good enough if more
lock taking functions are being called from softIRQ context in the future.

So what do you think?

Cheers,
Longman

Cheers,
Longman