[RFC] futex: prevent endless loop on s390x with emulated hugepages

From: Vlastimil Babka
Date: Thu Sep 24 2015 - 11:06:16 EST


Yong Sun has reported the LTP futex_wake04 test to hang a s390x with our
kernel based on 3.12. This is reproducible on upstream 4.1.8 as well. 4.2+
is OK thanks to removal of emulated hugepages, but we should do something
about the stable kernels here.

The LTP test is a regression test for commit 13d60f4b6a ("futex: Take
hugepages into account when generating futex_key"), but it turns out that it's
sufficient to just attempt to wait for a single futex on a tail page of a
hugetlbfs page:

==== BEGIN REPRODUCER ====

static struct timespec to = {.tv_sec = 1, .tv_nsec = 0};

int main(int argc, char *argv[])
{
void *addr;
int hpsz = 1024*1024;
int pgsz = 4096;
int * futex1;

addr = mmap(NULL, hpsz, PROT_WRITE | PROT_READ,
MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

if (addr == MAP_FAILED) {
perror("mmap()");
return 1;
}

futex1 = (int *)((char *)addr + pgsz);
*futex1 = 0;

syscall(SYS_futex, futex1, FUTEX_WAIT, *futex1, &to, 0, 0);

return 0;
}

==== END REPRODUCER ===

The problem is an endless loop in get_futex_key() when
CONFIG_TRANSPARENT_HUGEPAGE is enabled and the s390x machine has emulated
hugepages. The code tries to serialize against __split_huge_page_splitting(),
but __get_user_pages_fast() fails on the hugetlbfs tail page. This happens
because pmd_large() is false for emulated hugepages, so the code will proceed
into gup_pte_range() and fail page_cache_get_speculative() through failing
get_page_unless_zero() as the tail page count is zero. Failing __gup_fast is
supposed to be temporary due to a race, so get_futex_key() will try again
endlessly.

This attempt for a fix is a bandaid solution and probably incomplete.
Hopefully something better will emerge from the discussion. Fully fixing
emulated hugepages just for stable backports is unlikely due to them being
removed. Also THP refcounting redesign should soon remove the trickery from
get_futex_key().

This patch relies on the fact that s390x with emulated hugepages returns false
in has_transparent_hugepage(), so we don't need to do the serialization
trickery and just use the code for !CONFIG_TRANSPARENT_HUGEPAGE. We just need
an extra variable to cache the result of has_transparent_hugepage(), which is
__init and potentially expensive on some architectures.

However, __get_user_pages_fast() is still broken. The get_user_pages_fast()
wrapper will hide this in the common case. The other user of the __ variant
is kvm, which is mentioned as the reason for removal of emulated hugepages.
The call of page_cache_get_speculative() looks also broken in this scenario
on debug builds because of VM_BUG_ON_PAGE(PageTail(page), page). With
CONFIG_TINY_RCU enabled, there's plain atomic_inc(&page->_count) which also
probably shouldn't happen for a tail page...

Not-yet-signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
Reported-by: Yong Sun <yosun@xxxxxxxx>
Cc: Zhang Yi <wetpzy@xxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Dominik Dingel <dingel@xxxxxxxxxxxxxxxxxx>
Cc: Martin Schwidefsky <schwidefsky@xxxxxxxxxx>
Cc: Heiko Carstens <heiko.carstens@xxxxxxxxxx>
Cc: Christian Borntraeger <borntraeger@xxxxxxxxxx>
---
include/linux/huge_mm.h | 1 +
kernel/futex.c | 10 ++++++++--
mm/huge_memory.c | 4 ++++
3 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f10b20f..5dbaca3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
#define transparent_hugepage_debug_cow() 0
#endif /* CONFIG_DEBUG_VM */

+extern bool transparent_hugepage_available;
extern unsigned long transparent_hugepage_flags;
extern int split_huge_page_to_list(struct page *page, struct list_head *list);
static inline int split_huge_page(struct page *page)
diff --git a/kernel/futex.c b/kernel/futex.c
index c4a182f..be9cd1c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -443,6 +443,9 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
err = 0;

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (!transparent_hugepage_available)
+ goto no_thp;
+
page_head = page;
if (unlikely(PageTail(page))) {
put_page(page);
@@ -470,14 +473,17 @@ get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
goto again;
}
}
-#else
+ goto lockpage;
+#endif
+
+no_thp:
page_head = compound_head(page);
if (page != page_head) {
get_page(page_head);
put_page(page);
}
-#endif

+lockpage:
lock_page(page_head);

/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4..6aea047 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -36,6 +36,7 @@
* Defrag is invoked by khugepaged hugepage allocations and by page faults
* for all hugepage allocations.
*/
+bool transparent_hugepage_available __read_mostly = false;
unsigned long transparent_hugepage_flags __read_mostly =
#ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
(1<<TRANSPARENT_HUGEPAGE_FLAG)|
@@ -626,10 +627,13 @@ static int __init hugepage_init(void)
struct kobject *hugepage_kobj;

if (!has_transparent_hugepage()) {
+ transparent_hugepage_available = false;
transparent_hugepage_flags = 0;
return -EINVAL;
}

+ transparent_hugepage_available = true;
+
err = hugepage_init_sysfs(&hugepage_kobj);
if (err)
goto err_sysfs;
--
2.5.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/