Re: Early test: hangs in mm/compact.c w. Linus's 12d7aacab56e9ef185c

From: Vlastimil Babka
Date: Mon Nov 10 2014 - 02:53:47 EST


On 11/10/2014 07:07 AM, Joonsoo Kim wrote:
On Sat, Nov 08, 2014 at 11:18:37PM +0100, Vlastimil Babka wrote:
On 11/08/2014 02:11 PM, P. Christeas wrote:

Hi,

I think I finally found the cause by staring into the code... CCing
people from all 4 separate threads I know about this issue.
The problem with finding the cause was that the first report I got from
Markus was about isolate_freepages_block() overhead, and later Norbert
reported that reverting a patch for isolate_freepages* helped. But the
problem seems to be that although the loop in isolate_migratepages exits
because the scanners almost meet (they are within same pageblock), they
don't truly meet, therefore compact_finished() decides to continue, but
isolate_migratepages() exits immediately... boom! But indeed e14c720efdd7
made this situation possible, as free scaner pfn can now point to a
middle of pageblock.

Indeed.


So I hope the attached patch will fix the soft-lockup issues in
compact_zone. Please apply on 3.18-rc3 or later without any other reverts,
and test. It probably won't help Markus and his isolate_freepages_block()
overhead though...

Yes, I found this bug too, but, it can't explain
isolate_freepages_block() overhead. Anyway, I can't find another bug
related to isolate_freepages_block(). :/

Thanks for checking.

Thanks,
Vlastimil

------8<------
>From fbf8eb0bcd2897090312e23da6a31bad9cc6b337 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@xxxxxxx>
Date: Sat, 8 Nov 2014 22:20:43 +0100
Subject: [PATCH] mm, compaction: prevent endless loop in migrate scanner

---
mm/compaction.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ec74cf0..1b7a1be 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1029,8 +1029,12 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
}

acct_isolated(zone, cc);
- /* Record where migration scanner will be restarted */
- cc->migrate_pfn = low_pfn;
+ /*
+ * Record where migration scanner will be restarted. If we end up in
+ * the same pageblock as the free scanner, make the scanners fully
+ * meet so that compact_finished() terminates compaction.
+ */
+ cc->migrate_pfn = (end_pfn <= cc->free_pfn) ? low_pfn : cc->free_pfn;

return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
}

IMHO, proper fix is not to change this logic, but, to change decision
logic in compact_finished() and in compact_zone(). Maybe helper
function would be good for readability.

OK but I would think that to fix 3.18 ASAP and not introduce more regressions, go with the patch above first, as it is the minimal fix and people already test it. Then we can implement your suggestion later as a cleanup for 3.19?

Vlastimil

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/