Re: [PATCH 2/2] mm/page_alloc: DEBUG_VM checks for free_list placement of CMA and RESERVE pages

From: Vlastimil Babka
Date: Mon May 12 2014 - 04:28:39 EST


On 05/08/2014 07:54 AM, Joonsoo Kim wrote:
On Wed, May 07, 2014 at 04:59:07PM +0200, Vlastimil Babka wrote:
On 05/07/2014 03:33 AM, Minchan Kim wrote:
On Mon, May 05, 2014 at 05:50:46PM +0200, Vlastimil Babka wrote:
On 05/05/2014 04:36 PM, Sasha Levin wrote:
On 05/02/2014 08:08 AM, Vlastimil Babka wrote:
On 04/30/2014 11:46 PM, Sasha Levin wrote:
On 04/03/2014 11:40 AM, Vlastimil Babka wrote:
For the MIGRATE_RESERVE pages, it is important they do not get misplaced
on free_list of other migratetype, otherwise the whole MIGRATE_RESERVE
pageblock might be changed to other migratetype in try_to_steal_freepages().
For MIGRATE_CMA, the pages also must not go to a different free_list, otherwise
they could get allocated as unmovable and result in CMA failure.

This is ensured by setting the freepage_migratetype appropriately when placing
pages on pcp lists, and using the information when releasing them back to
free_list. It is also assumed that CMA and RESERVE pageblocks are created only
in the init phase. This patch adds DEBUG_VM checks to catch any regressions
introduced for this invariant.

Cc: Yong-Taek Lee <ytk.lee@xxxxxxxxxxx>
Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@xxxxxxxxxxx>
Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Minchan Kim <minchan@xxxxxxxxxx>
Cc: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
Cc: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Rik van Riel <riel@xxxxxxxxxx>
Cc: Michal Nazarewicz <mina86@xxxxxxxxxx>
Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>

Two issues with this patch.

First:

[ 3446.320082] kernel BUG at mm/page_alloc.c:1197!
[ 3446.320082] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 3446.320082] Dumping ftrace buffer:
[ 3446.320082] (ftrace buffer empty)
[ 3446.320082] Modules linked in:
[ 3446.320082] CPU: 1 PID: 8923 Comm: trinity-c42 Not tainted 3.15.0-rc3-next-20140429-sasha-00015-g7c7e0a7-dirty #427
[ 3446.320082] task: ffff88053e208000 ti: ffff88053e246000 task.ti: ffff88053e246000
[ 3446.320082] RIP: get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
[ 3446.320082] RSP: 0018:ffff88053e247778 EFLAGS: 00010002
[ 3446.320082] RAX: 0000000000000003 RBX: ffffea0000f40000 RCX: 0000000000000008
[ 3446.320082] RDX: 0000000000000002 RSI: 0000000000000003 RDI: 00000000000000a0
[ 3446.320082] RBP: ffff88053e247868 R08: 0000000000000007 R09: 0000000000000000
[ 3446.320082] R10: ffff88006ffcef00 R11: 0000000000000000 R12: 0000000000000014
[ 3446.335888] R13: ffffea000115ffe0 R14: ffffea000115ffe0 R15: 0000000000000000
[ 3446.335888] FS: 00007f8c9f059700(0000) GS:ffff88006ec00000(0000) knlGS:0000000000000000
[ 3446.335888] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3446.335888] CR2: 0000000002cbc048 CR3: 000000054cdb4000 CR4: 00000000000006a0
[ 3446.335888] DR0: 00000000006de000 DR1: 00000000006de000 DR2: 0000000000000000
[ 3446.335888] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000602
[ 3446.335888] Stack:
[ 3446.335888] ffff88053e247798 ffff88006eddc0b8 0000000000000016 0000000000000000
[ 3446.335888] ffff88006ffd2068 ffff88006ffdb008 0000000100000000 0000000000000000
[ 3446.335888] ffff88006ffdb000 0000000000000000 0000000000000003 0000000000000001
[ 3446.335888] Call Trace:
[ 3446.335888] __alloc_pages_nodemask (mm/page_alloc.c:2731)
[ 3446.335888] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 3446.335888] alloc_pages_vma (include/linux/mempolicy.h:76 mm/mempolicy.c:1998)
[ 3446.335888] ? shmem_alloc_page (mm/shmem.c:881)
[ 3446.335888] ? kvm_clock_read (arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 3446.335888] shmem_alloc_page (mm/shmem.c:881)
[ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 3446.335888] ? find_get_entry (mm/filemap.c:979)
[ 3446.335888] ? find_get_entry (mm/filemap.c:940)
[ 3446.335888] ? find_lock_entry (mm/filemap.c:1024)
[ 3446.335888] shmem_getpage_gfp (mm/shmem.c:1130)
[ 3446.335888] ? sched_clock_local (kernel/sched/clock.c:214)
[ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
[ 3446.335888] shmem_fault (mm/shmem.c:1237)
[ 3446.335888] ? do_read_fault.isra.42 (mm/memory.c:3523)
[ 3446.335888] __do_fault (mm/memory.c:3344)
[ 3446.335888] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 3446.335888] do_read_fault.isra.42 (mm/memory.c:3524)
[ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
[ 3446.335888] ? get_parent_ip (kernel/sched/core.c:2485)
[ 3446.335888] __handle_mm_fault (mm/memory.c:3662 mm/memory.c:3823 mm/memory.c:3950)
[ 3446.335888] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 3446.335888] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 3446.335888] handle_mm_fault (mm/memory.c:3973)
[ 3446.335888] __get_user_pages (mm/memory.c:1863)
[ 3446.335888] ? preempt_count_sub (kernel/sched/core.c:2541)
[ 3446.335888] __mlock_vma_pages_range (mm/mlock.c:255)
[ 3446.335888] __mm_populate (mm/mlock.c:711)
[ 3446.335888] vm_mmap_pgoff (include/linux/mm.h:1841 mm/util.c:402)
[ 3446.335888] SyS_mmap_pgoff (mm/mmap.c:1378)
[ 3446.335888] ? syscall_trace_enter (include/linux/context_tracking.h:27 arch/x86/kernel/ptrace.c:1461)
[ 3446.335888] ia32_do_call (arch/x86/ia32/ia32entry.S:430)
[ 3446.335888] Code: 00 66 0f 1f 44 00 00 ba 02 00 00 00 31 f6 48 89 c7 e8 c1 c3 ff ff 48 8b 53 10 83 f8 03 74 08 83 f8 04 75 13 0f 1f 00 39 d0 74 0c <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 45 85 ff 75 15 49 8b 55 00
[ 3446.335888] RIP get_page_from_freelist (mm/page_alloc.c:1197 mm/page_alloc.c:1548 mm/page_alloc.c:2036)
[ 3446.335888] RSP <ffff88053e247778>
Hey, that's not an issue, that means the patch works as intended :) And
I believe it's not a bug introduced by PATCH 1/2.

So, according to my decodecode reading, RAX is the results of
get_pageblock_migratetype() and it's MIGRATE_RESERVE. RDX is the result
of get_freepage_migratetype() and it's MIGRATE_UNMOVABLE. The
freepage_migratetype has just been set either by __rmqueue_smallest() or
__rmqueue_fallback(), according to the free_list the page has been taken
from. So this looks like a page from MIGRATE_RESERVE pageblock found on
the !MIGRATE_RESERVE free_list, which is exactly what the patch intends
to catch.

I think there are two possible explanations.

1) the pageblock is genuinely MIGRATE_RESERVE and it was misplaced by
mistake. I think it wasn't in free_pcppages_bulk() as there's the same
VM_BUG_ON which would supposedly trigger at the moment of displacing. In
theory it's possible that there's a race through __free_pages_ok() ->
free_one_page() where the get_pageblock_migratetype() in
__free_pages_ok() would race with set_pageblock_migratetype() and result
in bogus value. But nobody should be calling set_pageblock_migratetype()
on a MIGRATE_RESERVE pageblock.

2) the pageblock was marked as MIGRATE_RESERVE due to a race between
set_pageblock_migratetype() and set_pageblock_skip(). The latter is
currently not serialized by zone->lock, nor it uses atomic bit set. So
it may result in lost updates in a racing set_pageblock_migratetype(). I
think a well-placed race when changing pageblock from MIGRATE_MOVABLE to
MIGRATE_RECLAIMABLE could result in MIGRATE_RESERVE value. Similar races
have been already observed to be a problem where frequent changing
to/from MIGRATE_ISOLATE is involved, and I did a patch series to address
this, but it was not complete and I postponed it after Mel's changes
that remove the racy for-cycles completely. So it might be that his
"[PATCH 08/17] mm: page_alloc: Use word-based accesses for get/set
pageblock bitmaps" already solves this bug (but maybe only on certain
architectures where you don't need atomic operations). You might try
that patch if you can reproduce this bug frequently enough?

I've tried that patch, but still see the same BUG_ON.

Oh damn, I've realized that my assumptions about MIGRATE_RESERVE
pageblocks being created only on zone init time were wrong.
setup_zone_migrate_reserve() is called also from the handler of
min_free_kbytes sysctl... does trinity try to change that while
running?
The function will change MOVABLE pageblocks to RESERVE and try to
move all free pages to the RESERVE free_list, but of course pages on
pcplists will remain MOVABLE and may trigger the VM_BUG_ON. You
triggered the bug with page on MOVABLE free_list (in the first reply
I said its UNMOVABLE by mistake) so this might be good explanation
if trinity changes min_free_kbytes.

Furthermore, I think there's a problem that
setup_zone_migrate_reserve() operates on pageblocks, but as MAX_ODER
is higher than pageblock_order, RESERVE pages might be merged with
buddies of different migratetype and end up on their free_list. That
seems to me like a flaw in the design of reserves, but perhaps
others won't think it's serious enough to fix?

I wanna know who want MIGRATE_RESERVE. On my previous testing, one
pageblock for MIGRATE_RESERVE is merged with buddies of different
migratetype during boot-up and never come back again. But my system works
well. :)

If it is really useful feature, we can fix the situation by aligning
reserve size and pfn to MAX_ORDER_NR_PAGES.

Yes, I plan to do that.

And, IMHO, it isn't reasonable to increase, decrease and change
MIGRATE_RESERVE pageblock by the handler of min_free_kbytes sysctl,
because new pageblock may have no free pages to reserve. I think that
it is better to prevent sysctl handler from changing MIGRATE_RESERVE
pageblock, after initialization.

This dynamic allocation could be more aggressive by doing the same stuff as memory isolation. In any case, I think that if it cannot guarantee the pageblock to be free, there is little benefit from trying to make sure no pages from the pageblock are strayed on pcplists and later misplaced in the free_list. The only danger of misplacement is page stealing code changing MIGRATE_RESERVE pageblock to something else. This can be easily avoided directly in the page stealing code by checking the pageblock migratetype before trying to change it. This has very little extra overhead, in a path that's not hot to begin with.

In conclusion, if MIGRATE_RESERVE is useful enough to maintain, to fix
above problem and keep this patch is preferable to me.

I think it's useful in general, but not so critical to make the min_free_kbytes handler provide 100% guarantees. Presumably one uses the handler only during early boot time when there's enough free memory?
So that leaves the debug patch only for CMA which I won't push anymore to mainline, but feel free to adapt it for mm only.


So in the end this VM_DEBUG check probably cannot work anymore for
MIGRATE_RESERVE, only for CMA. I'm not sure if it's worth keeping it
only for CMA, what are the CMA guys' opinions on that?

I really don't want it. That was I didn't add my Acked-by at that time.
For a long time, I never wanted to add more overhead hot path due to
CMA unless it's really critical. It's same to this.
Although such debug patch helps to notice something goes wrong for CMA,
more information would be helpful to know why CMA failed because
there are another potential reasons to fail CMA allocation.

One of the idea about that is to store alloc trace into somewhere(ex,
naive idea is page description like page-owner) and then we could investigate
what's the owner of that page so we could know why we fail to migrate it out.
With that, we would figure out how on earth such page is allocated from CMA
and it would be more helpful rather just VM_BUG_ON notice.

The whole point is I'd like to avoid adding more overhead to hot path for
rare case although it's debugging feature.

OK. I'm not that concerned with VM_DEBUG overhead as it's intended for
testing, not production. But as you say the patch is not that useful
without the MIGRATE_RESERVE part, so Andrew could you please drop the
patch (mm-page_alloc-debug_vm-checks-for-free_list-placement-of-cma-and-reserve-pages.patch)?

I also think that VM_DEBUG overhead isn't problem because of same
reason from Vlastimil.

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/