Re: [PATCH v9] mm: compaction: handle incorrect MIGRATE_UNMOVABLE typepageblocks

From: Bartlomiej Zolnierkiewicz
Date: Wed Jun 06 2012 - 06:07:50 EST


On Tuesday 05 June 2012 04:38:53 Minchan Kim wrote:
> On 06/05/2012 10:59 AM, Minchan Kim wrote:
>
> > On 06/05/2012 05:22 AM, KOSAKI Motohiro wrote:
> >
> >>> +/*
> >>> + * Returns true if MIGRATE_UNMOVABLE pageblock can be successfully
> >>> + * converted to MIGRATE_MOVABLE type, false otherwise.
> >>> + */
> >>> +static bool can_rescue_unmovable_pageblock(struct page *page, bool
> >>> locked)
> >>> +{
> >>> + unsigned long pfn, start_pfn, end_pfn;
> >>> + struct page *start_page, *end_page, *cursor_page;
> >>> +
> >>> + pfn = page_to_pfn(page);
> >>> + start_pfn = pfn& ~(pageblock_nr_pages - 1);
> >>> + end_pfn = start_pfn + pageblock_nr_pages - 1;
> >>> +
> >>> + start_page = pfn_to_page(start_pfn);
> >>> + end_page = pfn_to_page(end_pfn);
> >>> +
> >>> + for (cursor_page = start_page, pfn = start_pfn; cursor_page<=
> >>> end_page;
> >>> + pfn++, cursor_page++) {
> >>> + struct zone *zone = page_zone(start_page);
> >>> + unsigned long flags;
> >>> +
> >>> + if (!pfn_valid_within(pfn))
> >>> + continue;
> >>> +
> >>> + /* Do not deal with pageblocks that overlap zones */
> >>> + if (page_zone(cursor_page) != zone)
> >>> + return false;
> >>> +
> >>> + if (!locked)
> >>> + spin_lock_irqsave(&zone->lock, flags);
> >>> +
> >>> + if (PageBuddy(cursor_page)) {
> >>> + int order = page_order(cursor_page);
> >>>
> >>> -/* Returns true if the page is within a block suitable for migration
> >>> to */
> >>> -static bool suitable_migration_target(struct page *page)
> >>> + pfn += (1<< order) - 1;
> >>> + cursor_page += (1<< order) - 1;
> >>> +
> >>> + if (!locked)
> >>> + spin_unlock_irqrestore(&zone->lock, flags);
> >>> + continue;
> >>> + } else if (page_count(cursor_page) == 0 ||
> >>> + PageLRU(cursor_page)) {
> >>> + if (!locked)
> >>> + spin_unlock_irqrestore(&zone->lock, flags);
> >>> + continue;
> >>> + }
> >>> +
> >>> + if (!locked)
> >>> + spin_unlock_irqrestore(&zone->lock, flags);
> >>> +
> >>> + return false;
> >>> + }
> >>> +
> >>> + return true;
> >>> +}
> >>
> >> Minchan, are you interest this patch? If yes, can you please rewrite it?
> >
> >
> > Can do it but I want to give credit to Bartlomiej.
> > Bartlomiej, if you like my patch, could you resend it as formal patch after you do broad testing?

Sure.

> >> This one are
> >> not fixed our pointed issue and can_rescue_unmovable_pageblock() still
> >> has plenty bugs.
> >> We can't ack it.
> >>
> >
> >
> > Frankly speaking, I don't want to merge it without any data which prove it's really good for real practice.
> >
> > When the patch firstly was submitted, it wasn't complicated so I was okay at that time but it has been complicated
> > than my expectation. So if Andrew might pass the decision to me, I'm totally NACK if author doesn't provide
> > any real data or VOC of some client.

I found this issue by accident while testing compaction code so unfortunately
I don't have any real bugreport to back it up. It is just a corner case which
is more likely to happen in situation where there is rather small number of
pageblocks and quite heavy kernel memory allocation/freeing activity in
kernel going on. I would presume that the issue can happen on some embedded
configurations but they are not your typical machine and it is not likely
to see a real bugreport for it.

I'm also quite unhappy with the increasing complexity of what seemed as
a quite simple fix initially and I tend to agree that the patch may stay
out-of-tree until there is a more proven need for it (maybe with documenting
the issue in the code for the time being). Still, I would like to have
all outstanding issues fixed so I can merge the patch locally (and to -mm
if Andrew agrees) and just wait to see if the patch is ever needed in
practice.

I've attached the code that I use to trigger the issue at the bottom of this
mail so people can reproduce the problem and see for themselves whether it
is worth to fix it or not.

> > 1) Any comment?
> >
> > Anyway, I fixed some bugs and clean up something I found during review.

Thanks for doing this.

> > Minor thing.
> > 1. change smt_result naming - I never like such long non-consistent naming. How about this?
> > 2. fix can_rescue_unmovable_pageblock
> > 2.1 pfn valid check for page_zone
> >
> > Major thing.
> >
> > 2.2 add lru_lock for stablizing PageLRU
> > If we don't hold lru_lock, there is possibility that unmovable(non-LRU) page can put in movable pageblock.
> > It can make compaction/CMA's regression. But there is a concern about deadlock between lru_lock and lock.
> > As I look the code, I can't find allocation trial with holding lru_lock so it might be safe(but not sure,
> > I didn't test it. It need more careful review/testing) but it makes new locking dependency(not sure, too.
> > We already made such rule but I didn't know that until now ;-) ) Why I thought so is we can allocate
> > GFP_ATOMIC with holding lru_lock, logically which might be crazy idea.
> >
> > 2.3 remove zone->lock in first phase.
> > We do rescue unmovable pageblock by 2-phase. In first-phase, we just peek pages so we don't need locking.
> > If we see non-stablizing value, it would be caught by 2-phase with needed lock or
> > can_rescue_unmovable_pageblock can return out of loop by stale page_order(cursor_page).
> > It couldn't make unmovable pageblock to movable but we can do it next time, again.
> > It's not critical.
> >
> > 2) Any comment?
> >
> > Now I can't inline the code so sorry but attach patch.
> > It's not a formal patch/never tested.
> >
>
>
> Attached patch has a BUG in can_rescue_unmovable_pageblock.
> Resend. I hope it is fixed.

@@ -399,10 +399,14 @@
} else if (page_count(cursor_page) == 0) {
continue;
} else if (PageLRU(cursor_page)) {
- if (!lru_locked && need_lrulock) {
+ if (!need_lrulock)
+ continue;
+ else if (lru_locked)
+ continue;
+ else {
spin_lock(&zone->lru_lock);
lru_locked = true;
- if (PageLRU(cursor_page))
+ if (PageLRU(page))
continue;
}
}

Could you please explain why do we need to check page and not cursor_page
here?

Best regards,
--
Bartlomiej Zolnierkiewicz
Samsung Poland R&D Center


My test case (on 512 MiB machine):
* insmod alloc_frag.ko
* run ./alloc_app and push it to background with Ctrl-Z
* rmmod alloc_frag.ko
* insmod alloc_test.ko

---
alloc_app.c | 22 ++++++++++++++++++++++
mm/Kconfig | 8 ++++++++
mm/Makefile | 3 +++
mm/alloc_frag.c | 35 +++++++++++++++++++++++++++++++++++
mm/alloc_test.c | 40 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 108 insertions(+)

Index: b/alloc_app.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/alloc_app.c 2012-04-06 11:49:23.789380700 +0200
@@ -0,0 +1,22 @@
+
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+
+#define ALLOC_NR_PAGES 60000
+
+int main(void)
+{
+ void *alloc_app_pages[ALLOC_NR_PAGES];
+ int i;
+
+ for (i = 0; i < ALLOC_NR_PAGES; i++) {
+ alloc_app_pages[i] = malloc(4096);
+ if (alloc_app_pages[i])
+ memset(alloc_app_pages[i], 'z', 4096);
+ }
+
+ getchar();
+
+ return 0;
+}
Index: b/mm/Kconfig
===================================================================
--- a/mm/Kconfig 2012-04-05 18:40:36.000000000 +0200
+++ b/mm/Kconfig 2012-04-06 11:49:23.789380700 +0200
@@ -379,3 +379,11 @@
in a negligible performance hit.

If unsure, say Y to enable cleancache
+
+config ALLOC_FRAG
+ tristate "alloc frag"
+ help
+
+config ALLOC_TEST
+ tristate "alloc test"
+ help
Index: b/mm/Makefile
===================================================================
--- a/mm/Makefile 2012-04-05 18:40:36.000000000 +0200
+++ b/mm/Makefile 2012-04-06 11:49:23.789380700 +0200
@@ -16,6 +16,9 @@
$(mmu-y)
obj-y += init-mm.o

+obj-$(CONFIG_ALLOC_FRAG) += alloc_frag.o
+obj-$(CONFIG_ALLOC_TEST) += alloc_test.o
+
ifdef CONFIG_NO_BOOTMEM
obj-y += nobootmem.o
else
Index: b/mm/alloc_frag.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/mm/alloc_frag.c 2012-04-06 11:52:43.761439914 +0200
@@ -0,0 +1,35 @@
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#define ALLOC_NR_PAGES 120000
+static struct page *alloc_frag_pages[ALLOC_NR_PAGES];
+
+static int __init alloc_frag_init(void)
+{
+ int i;
+
+ for (i = 0; i < ALLOC_NR_PAGES; i++)
+ alloc_frag_pages[i] = alloc_pages(GFP_KERNEL, 0);
+
+ for (i = 0; i < ALLOC_NR_PAGES; i += 2) {
+ if (alloc_frag_pages[i])
+ __free_pages(alloc_frag_pages[i], 0);
+ }
+
+ return 0;
+}
+module_init(alloc_frag_init);
+
+static void __exit alloc_frag_exit(void)
+{
+ int i;
+
+ for (i = 1; i < ALLOC_NR_PAGES; i += 2)
+ if (alloc_frag_pages[i])
+ __free_pages(alloc_frag_pages[i], 0);
+}
+module_exit(alloc_frag_exit);
+
+MODULE_LICENSE("GPL");
Index: b/mm/alloc_test.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/mm/alloc_test.c 2012-04-06 11:49:23.789380700 +0200
@@ -0,0 +1,40 @@
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+
+#define ALLOC_NR_PAGES 120000
+static struct page *alloc_test_pages[ALLOC_NR_PAGES];
+
+static int __init alloc_test_init(void)
+{
+ int i;
+
+ printk("trying order-9 allocs..\n");
+ for (i = 0; i < 100; i++) {
+ alloc_test_pages[i] = alloc_pages(GFP_KERNEL, 9);
+ if (alloc_test_pages[i])
+ printk("\ttry %d succes\n", i);
+ else {
+ printk("\ttry %d failure\n", i);
+ break;
+ }
+ }
+
+ return 0;
+}
+module_init(alloc_test_init);
+
+static void __exit alloc_test_exit(void)
+{
+ int i;
+
+ for (i = 0; i < 100; i++) {
+ if (alloc_test_pages[i])
+ __free_pages(alloc_test_pages[i], 9);
+ }
+
+}
+module_exit(alloc_test_exit);
+
+MODULE_LICENSE("GPL");
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/