Re: [PATCH] mm: fix hard lockup in __split_huge_page

From: Matthew Wilcox
Date: Mon Jun 17 2024 - 23:21:14 EST


On Tue, Jun 18, 2024 at 10:09:26AM +0800, zhaoyang.huang wrote:
> A hard lockup[2] has been reported, which appears to be caused by
> recursive acquisition of lruvec->lru_lock[1] within __split_huge_page.
>
> [1]
> static void __split_huge_page(struct page *page, struct list_head *list,
>         pgoff_t end, unsigned int new_order)
> {
>     /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
>     //1st lock here
>     lruvec = folio_lruvec_lock(folio);
>
>     for (i = nr - new_nr; i >= new_nr; i -= new_nr) {
>         __split_huge_page_tail(folio, i, lruvec, list, new_order);
>         /* Some pages can be beyond EOF: drop them from page cache */
>         if (head[i].index >= end) {
>             folio_put(tail);
>               __page_cache_release
>                 //2nd lock here
>                 folio_lruvec_relock_irqsave
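
For anyone reading along: the report boils down to one CPU taking the
same non-recursive spinlock twice, with lru_lock held across the tail
loop while the release path tries to take it again. A minimal userspace
analogue of that self-deadlock pattern, using plain pthreads with
invented names rather than the kernel APIs:

#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t lru_lock;     /* stands in for lruvec->lru_lock */

/* stands in for the release path that relocks the lru list */
static void release_path(void)
{
        /*
         * Second acquisition by the same thread: a pthread spinlock is
         * not recursive, so this spins forever - the same shape as the
         * reported hard lockup.
         */
        pthread_spin_lock(&lru_lock);
        pthread_spin_unlock(&lru_lock);
}

int main(void)
{
        pthread_spin_init(&lru_lock, PTHREAD_PROCESS_PRIVATE);

        pthread_spin_lock(&lru_lock);   /* first acquisition, as in the split loop */
        release_path();                 /* never returns */
        pthread_spin_unlock(&lru_lock);

        printf("not reached\n");
        return 0;
}

Built with cc -pthread, this hangs at the second pthread_spin_lock(),
which is the userspace version of the CPU spinning on lru_lock in the
report.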

Why doesn't lockdep catch this?

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9859aa4f7553..ea504df46d3b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2925,7 +2925,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>                  folio_account_cleaned(tail,
>                      inode_to_wb(folio->mapping->host));
>              __filemap_remove_folio(tail, NULL);
> +            unlock_page_lruvec(lruvec);
>              folio_put(tail);
> +            folio_lruvec_lock(folio);

Why is it safe to drop & reacquire this lock? Is there nothing we need
to revalidate?
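
To make that second question concrete: once the lock is dropped inside
the loop, anything derived under it is potentially stale by the time it
is retaken. A rough userspace sketch of that window (pthreads again,
all names invented; not a claim about what the lruvec code allows):

#include <pthread.h>
#include <stdlib.h>

struct node {
        struct node *next;
        int data;
};

/* stands in for lruvec->lru_lock; protects the list rooted at 'head' */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct node *head;

/* stands in for folio_put(): must not be called with list_lock held */
static void put_ref(struct node *n)
{
        pthread_mutex_lock(&list_lock);
        /* ... may unlink 'n' or its neighbours ... */
        pthread_mutex_unlock(&list_lock);
}

static void walk(void)
{
        pthread_mutex_lock(&list_lock);
        for (struct node *n = head; n; n = n->next) {
                pthread_mutex_unlock(&list_lock);
                put_ref(n);                     /* unlocked window */
                pthread_mutex_lock(&list_lock);
                /*
                 * This is the step the question is about: after
                 * retaking the lock, 'n', 'n->next' and even 'head'
                 * may no longer mean what they did before the window,
                 * so the walk has to revalidate before relying on
                 * any of them.
                 */
        }
        pthread_mutex_unlock(&list_lock);
}

int main(void)
{
        head = calloc(1, sizeof(*head));
        walk();
        free(head);
        return 0;
}

In the sketch the revalidation is only a comment; for the real patch,
why (or whether) it can be skipped is what the changelog would need to
spell out.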