Re: [RFC] futex: prevent endless loop on s390x with emulated hugepages

From: Vlastimil Babka
Date: Tue Oct 13 2015 - 09:51:31 EST


On 10/13/2015 01:48 PM, Vlastimil Babka wrote:
On 09/28/2015 01:49 PM, Martin Schwidefsky wrote:
On Thu, 24 Sep 2015 17:05:48 +0200
Vlastimil Babka <vbabka@xxxxxxx> wrote:

[...]

However, __get_user_pages_fast() is still broken. The get_user_pages_fast()
wrapper will hide this in the common case. The other user of the __ variant
is kvm, which is mentioned as the reason for removal of emulated hugepages.
The call of page_cache_get_speculative() looks also broken in this scenario
on debug builds because of VM_BUG_ON_PAGE(PageTail(page), page). With
CONFIG_TINY_RCU enabled, there's plain atomic_inc(&page->_count) which also
probably shouldn't happen for a tail page...

It boils down to __get_user_pages_fast being broken for emulated large pages,
doesn't it? My preferred fix would be to get __get_user_page_fast to work
in this case.

I agree, but didn't know enough of the architecture to attempt such fix
:) Thanks!

For 3.12 a patch would look like this (needs more testing
though):

FWIW it works for me in the particular LTP test, but as you said, it
needs more testing and breaking stable would suck.

I'm trying to break the patch on 3.12 with trinity, let's see...
Tried also to review it, although it's unlikely I'll catch some s390x-specific gotchas. For example, can't say what the effect of _SEGMENT_ENTRY_CO removal will be - before, the bit was set for non-emulated hugepages, and now the same bit is set for emulated ones?
Or if pmd_bad() was also broken before, and now isn't? But otherwise the change seems OK, besides some nitpick below.

[...]

@@ -103,7 +104,7 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
unsigned long end, int write, struct page **pages, int *nr)
{
unsigned long next;
- pmd_t *pmdp, pmd;
+ pmd_t *pmdp, pmd, pmd_orig;

pmdp = (pmd_t *) pudp;
#ifdef CONFIG_64BIT
@@ -112,7 +113,7 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
pmdp += pmd_index(addr);
#endif
do {
- pmd = *pmdp;
+ pmd = pmd_orig = *pmdp;
barrier();
next = pmd_addr_end(addr, end);
/*
@@ -127,8 +128,9 @@ static inline int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr,
if (pmd_none(pmd) || pmd_trans_splitting(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
- if (!gup_huge_pmd(pmdp, pmd, addr, next,
- write, pages, nr))
+ if (!gup_huge_pmd(pmdp, pmd_orig,
+ pmd_swlarge_deref(pmd),
+ addr, next, write, pages, nr))
return 0;
} else if (!gup_pte_range(pmdp, pmd, addr, next,
write, pages, nr))

The "pmd" variable isn't changed anywhere in this loop after the initial assignment, so the extra "pmd_orig" variable isn't needed.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/