Re: PowerPC: massive "scheduling while atomic" reports
From: Thomas Gleixner
Date: Mon Sep 14 2015 - 18:06:13 EST
On Thu, 10 Sep 2015, Juergen Borleis wrote:
Please CC lkml on bug reports for RT.
> When running the system at least every other boot this kernel spits out
> massive "scheduling while atomic" reports.
I doubt that this only happens on every other boot. This is a
systematic failure.
> Anyone with an idea what's going wrong here? I already tried with some
> debug options enabled but the highly optimised code confuses me where and
> what the code does.
Let's look at the confusing problem.
> [c3ba1cb0] [c0352d64] rt_spin_lock+0x34/0x64
> [c3ba1cc0] [c007f144] __lru_cache_add+0x30/0x10c
> [c3ba1cd0] [c0092064] handle_mm_fault+0xbb8/0x158c
> [c3ab7dd0] [c0352d5c] rt_spin_lock+0x2c/0x64 (unreliable)
> [c3ab7de0] [c0090b28] copy_page_range+0x154/0x478
> [c3ab7e60] [c0015530] copy_process.part.62+0xb84/0x1204
> [c3be7e80] [c0352d5c] rt_spin_lock+0x2c/0x64 (unreliable)
> [c3be7e90] [c0092028] handle_mm_fault+0xb7c/0x158c
> [c3be7f00] [c000ea78] do_page_fault+0x33c/0x550
> [c3be7f40] [c000dda4] handle_page_fault+0xc/0x80
> [c2cd5bc0] [c0352d64] rt_spin_lock+0x34/0x64
> [c2cd5bd0] [c007a068] get_page_from_freelist+0x148/0x6cc
> [c2cd5c50] [c007a710] __alloc_pages_nodemask+0x124/0x5f0
> [c2cd5cc0] [c007abf8] __get_free_pages+0x1c/0x50
> [c2cd5cd0] [c008f958] __tlb_remove_page+0x6c/0xcc
> [c2cd5ce0] [c009064c] unmap_single_vma+0x2e0/0x430
> [c2cd5d60] [c0090e98] unmap_vmas+0x4c/0x5c
> [c3b5fdc0] [c0352d5c] rt_spin_lock+0x2c/0x64 (unreliable)
> [c3b5fdd0] [c008dbac] follow_page_mask+0xa8/0x388
> [c3b5fe00] [c008e000] __get_user_pages.part.26+0x174/0x358
> [c3b5fe60] [c00adad8] copy_strings+0x158/0x2a0
> [c3b5fdb0] [c0352d5c] rt_spin_lock+0x2c/0x64 (unreliable)
> [c3b5fdc0] [c0090508] unmap_single_vma+0x19c/0x430
> [c3b5fe40] [c0090e98] unmap_vmas+0x4c/0x5c
All of these:
- call rt_spin_lock with preemption disabled
- are related to mm functions
So something in the mm code is causing that issue. The interesting
parts are the call chains which lead to rt_spin_lock.
#1
> [c3ba1cb0] [c0352d64] rt_spin_lock+0x34/0x64
> [c3ba1cc0] [c007f144] __lru_cache_add+0x30/0x10c
> [c3ba1cd0] [c0092064] handle_mm_fault+0xbb8/0x158c
__lru_cache_add is called via a wrapper from handle_mm_fault()
# git grep -n lru_cache mm/memory.c
mm/memory.c:2116: lru_cache_add_active_or_unevictable(new_page, vma);
mm/memory.c:2575: lru_cache_add_active_or_unevictable(page, vma);
mm/memory.c:2717: lru_cache_add_active_or_unevictable(page, vma);
mm/memory.c:3008: lru_cache_add_active_or_unevictable(new_page, vma);
Not very helpful at first glance, so let's look at the next one:
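#2
> [c3ab7dd0] [c0352d5c] rt_spin_lock+0x2c/0x64 (unreliable)
> [c3ab7de0] [c0090b28] copy_page_range+0x154/0x478
> [c3ab7e60] [c0015530] copy_process.part.62+0xb84/0x1204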
copy_page_range() looks pretty innocent unless you follow the do {}
while loop:
  copy_pud_range
    copy_pmd_range
      copy_pte_range
That one fiddles with two spinlocks:
spinlock_t *src_ptl, *dst_ptl;
And one of them seems to be taken in
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
That's defined in include/linux/mm.h:
#define pte_alloc_map_lock(mm, pmd, address, ptlp)	\
	((unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, NULL,	\
						    pmd, address))?	\
		NULL: pte_offset_map_lock(mm, pmd, address, ptlp))
pte_offset_map_lock() does:
#define pte_offset_map_lock(mm, pmd, address, ptlp)	\
({							\
	spinlock_t *__ptl = pte_lockptr(mm, pmd);	\
	pte_t *__pte = pte_offset_map(pmd, address);	\
	*(ptlp) = __ptl;				\
	spin_lock(__ptl);				\
	__pte;						\
})
Let's look at the lru_cache_add_active_or_unevictable() call sites once
more. They all have a very similar construct:
spinlock_t *ptl = NULL;
...
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
So we found a commonality. Let's look at pte_offset_map(), which is
defined in arch/powerpc/include/asm/pgtable-ppc32.h:
#define pte_offset_map(dir, addr)		\
	((pte_t *) kmap_atomic(pmd_page(*(dir))) + pte_index(addr))
So we need to look at kmap_atomic(), which is defined in
include/linux/highmem.h:
static inline void *kmap_atomic(struct page *page)
{
	preempt_disable();
	pagefault_disable();
	return page_address(page);
}
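
To make the chain explicit, here is a minimal userspace sketch of what the macros above add up to. It is a model, not kernel code: every model_* name is made up for illustration, pagefault_disable() is left out, and the model pessimistically counts every lock acquisition with a non-zero preempt count as a report, while the real kernel only complains when the lock is contended and it actually has to schedule.

```c
#include <assert.h>

/* Userspace model only. All model_* names are hypothetical. */
static int model_preempt_count;		/* stands in for preempt_count() */
static int model_atomic_reports;	/* "scheduling while atomic" reports */

static void model_preempt_disable(void)
{
	model_preempt_count++;
}

static void model_preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* Models the !HIGHMEM kmap_atomic(): no mapping work, just the count. */
static void model_kmap_atomic(void)
{
	model_preempt_disable();
}

static void model_kunmap_atomic(void)
{
	model_preempt_enable();
}

/*
 * Models a spinlock on RT, which is a sleeping lock: taking it may have
 * to schedule, so taking it in an atomic section is counted as a report.
 */
static void model_rt_spin_lock(void)
{
	if (model_preempt_count != 0)
		model_atomic_reports++;
}

/*
 * Models pte_offset_map_lock(): the map (and with it preempt_disable())
 * comes first, the lock second. Returns the number of new reports.
 */
static int model_pte_offset_map_lock(void)
{
	int before = model_atomic_reports;

	model_kmap_atomic();
	model_rt_spin_lock();
	return model_atomic_reports - before;
}
```

Calling model_pte_offset_map_lock() once yields one report, and the count only drops back to zero after the matching model_kunmap_atomic().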
Now that's weird. Why is that not exploding on x86_32?
Because it's conditional:
#ifndef ARCH_HAS_KMAP
Hmm, no. ARCH_HAS_KMAP is only defined by PARISC. But it's also
conditional on:
#ifdef CONFIG_HIGHMEM
which is usually enabled on x86_32.
So now if you look at the changes to the highmem implementation on x86
and ARM, you'll notice that there is:
-	preempt_disable();
+	preempt_disable_nort();
 	pagefault_disable();
We never converted PPC to the RT-safe variant of highmem kmaps, so
CONFIG_HIGHMEM is disabled on RT_FULL for PPC and it has to use the
!HIGHMEM variant.
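
For readers who have not met the _nort helpers: the sketch below models them under the assumption that "nort" means "not on RT", i.e. the helper behaves like preempt_disable()/preempt_enable() on a non-RT kernel and leaves the preempt count alone on RT_FULL. In the kernel that choice is made at build time; the sketch uses a runtime flag only so that both cases can be exercised. All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only. All sketch_* names are hypothetical. */
static int sketch_count;	/* stands in for preempt_count() */

/* Assumption: on RT the _nort variant does not touch the count. */
static void sketch_preempt_disable_nort(bool rt_full)
{
	if (!rt_full)
		sketch_count++;
}

static void sketch_preempt_enable_nort(bool rt_full)
{
	if (!rt_full) {
		assert(sketch_count > 0);
		sketch_count--;
	}
}

/* Returns the preempt count as seen between the disable/enable pair. */
static int sketch_count_inside_kmap(bool rt_full)
{
	int inside;

	sketch_preempt_disable_nort(rt_full);
	inside = sketch_count;
	sketch_preempt_enable_nort(rt_full);
	return inside;
}
```

With rt_full set the section stays preemptible (count 0), so a sleeping lock taken inside it is fine. Without it the count is 1, exactly as with plain preempt_disable().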
Looking at older RT kernels, we never had to deal with that
preempt_disable() in the !HIGHMEM variant. Simply because that did not
exist. It got introduced via the mainline patchset which decouples
pagefault disable from preemption disable. That patchset is a generic
variant of the changes which we had in RT for a long time.
4.1-rt simply overlooked that preempt_disable/enable pair in the
!HIGHMEM variant of k[un]map_atomic. Fix is below.
If you encounter such a 'confusing' problem the next time, then look
out for commonalities, AKA patterns. 99% of all problems can be
decoded via patterns. And if you look at the other call chains you'll
find more instances of those pte_*_lock() calls, which all end up in
kmap_atomic().
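
The pattern hunt can also be done mechanically. The following is a rough userspace sketch, not an existing tool: it pulls the symbol name out of each backtrace line in the format quoted above and collects the frame that follows every rt_spin_lock frame, i.e. its direct caller. The helper names trace_symbol() and lock_callers() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SYM_MAX 64

/*
 * Extract the symbol from a line such as
 *   "> [c3ba1cc0] [c007f144] __lru_cache_add+0x30/0x10c"
 * Returns true if a symbol was found.
 */
static bool trace_symbol(const char *line, char sym[SYM_MAX])
{
	const char *p;

	assert(line != NULL);
	p = strchr(line, '[');		/* skip any "> " quoting */
	if (!p)
		return false;
	/* Two bracketed hex fields, then the symbol up to '+' or a blank. */
	return sscanf(p, "[%*[^]]] [%*[^]]] %63[^+ \n]", sym) == 1;
}

/*
 * Store the direct caller of lock_fn for every lock_fn frame found,
 * up to max entries. Returns the number of callers stored.
 */
static size_t lock_callers(const char *const lines[], size_t n,
			   const char *lock_fn,
			   char callers[][SYM_MAX], size_t max)
{
	char sym[SYM_MAX];
	size_t found = 0;

	for (size_t i = 0; i + 1 < n && found < max; i++) {
		if (trace_symbol(lines[i], sym) &&
		    strcmp(sym, lock_fn) == 0 &&
		    trace_symbol(lines[i + 1], callers[found]))
			found++;
	}
	return found;
}
```

Fed with the traces from the report, this lists __lru_cache_add, copy_page_range, handle_mm_fault and so on as callers, which is the "all mm functions" observation from the top of this mail.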
Thanks,
tglx
------------->
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -66,7 +66,7 @@ static inline void kunmap(struct page *page)
 
 static inline void *kmap_atomic(struct page *page)
 {
-	preempt_disable();
+	preempt_disable_nort();
 	pagefault_disable();
 	return page_address(page);
 }
@@ -75,7 +75,7 @@ static inline void *kmap_atomic(struct page *page)
 static inline void __kunmap_atomic(void *addr)
 {
 	pagefault_enable();
-	preempt_enable();
+	preempt_enable_nort();
 }
 
 #define kmap_atomic_pfn(pfn)	kmap_atomic(pfn_to_page(pfn))