Re: KSWAPD Algorithm - 100% CPU

From: Nick Piggin
Date: Thu Dec 04 2008 - 09:02:29 EST


On Wed, Dec 03, 2008 at 06:20:46PM +0900, KOSAKI Motohiro wrote:
> (CC to Nick Piggin and Andrew Morton.)
>
> Hi
>
> First, could you post a program that reproduces the problem?
> If nobody can reproduce it, fixing it is difficult.
>
> Obviously, any patch needs to be validated against a reproducer.
>
>
> > Hi All,
> > Description:
> > I encountered a weird problem with kswapd:
> > it runs in what looks like an infinite loop, trying to swap until order 10 of zone
> > highmem is OK, while zone highmem (as I understand it) has nothing to do
> > with contiguous memory (because there is no 1-1 mapping). This means
> > kswapd will keep trying to balance order 10 of zone highmem
> > forever (or until someone releases a very large chunk of highmem).
> > Can anyone please explain to me the algorithm of kswapd and why it tries
> > to balance order 10 of zone highmem?
>
> Second, I'd like to talk about the background of kswapd and its algorithm.
>
> First, kswapd balancing was introduced by the following commit.
>
> --------------------------------------------------------
> commit 6cbd719443491404f63f9ff79ead9eba256511ee
> Author: akpm <akpm>
> Date: Fri Mar 12 16:24:40 2004 +0000
>
> [PATCH] kswapd: fix lumpy page reclaim
>
> As kswapd is now scanning zones in the highmem->normal->dma direction it can
> get into competition with the page allocator: kswapd keep on trying to free
> pages from highmem, then kswapd moves onto lowmem. By the time kswapd has
> done proportional scanning in lowmem, someone has come in and allocated a few
> pages from highmem. So kswapd goes back and frees some highmem, then some
> lowmem again. But nobody has allocated any lowmem yet. So we keep on and on
> scanning lowmem in response to highmem page allocations.
>
> With a simple `dd' on a 1G box we get:
>
> r  b  swpd    free  buff   cache  si  so  bi     bo    in   cs  us  sy  wa  id
> 0  3     0   59340  4628  922348   0   0   4  28188  1072  808   0  10  46  44
> 0  3     0   29932  4660  951760   0   0   0  30752  1078  441   1   6  30  64
> 0  3     0   57568  4556  924052   0   0   0  30748  1075  478   0   8  43  49
> 0  3     0   29664  4584  952176   0   0   0  30752  1075  472   0   6  34  60
> 0  3     0    5304  4620  976280   0   0   4  40484  1073  456   1   7  52  41
> 0  3     0  104856  4508  877112   0   0   0  18452  1074   97   0   7  67  26
> 0  3     0   70768  4540  911488   0   0   0  35876  1078  746   0   7  34  59
> 1  2     0   42544  4568  939680   0   0   0  21524  1073  556   0   5  43  51
> 0  3     0    5520  4608  976428   0   0   4  37924  1076  836   0   7  41  51
> 0  2     0    4848  4632  976812   0   0  32  12308  1092   94   0   1  33  66
>
> Simple fix: go back to scanning the zones in the dma->normal->highmem
> direction so we meet the page allocator in the middle somewhere.
>
> r  b  swpd    free  buff   cache  si  so  bi     bo    in   cs  us  sy  wa  id
> 1  3     0    5152  3468  976548   0   0   4  37924  1071  650   0   8  64  28
> 1  2     0    4888  3496  976588   0   0   0  23576  1075  726   0   6  66  27
> 0  3     0    5336  3532  976348   0   0   0  31264  1072  708   0   8  60  32
> 0  3     0    6168  3560  975504   0   0   0  40992  1072  683   0   6  63  31
> 0  3     0    4560  3580  976844   0   0   0  18448  1073  233   0   4  59  37
> 0  3     0    5840  3624  975712   0   0   4  26660  1072  800   1   8  46  45
> 0  3     0    4816  3648  976640   0   0   0  40992  1073  526   0   6  47  47
> 0  3     0    5456  3672  976072   0   0   0  19984  1070  320   0   5  60  35
>
> BKrev: 4051e448CiuO4KIoyJ6pqIVrkhuNnw
> --------------------------------------------------------
>
> At that time, kswapd didn't check memory contiguity at all.
> It had the following code.
>
> ------------------------------------------------------------
> +        if (zone->free_pages <= zone->pages_high) {
> +                end_zone = i;
> +                goto scan;
> +        }
> -----------------------------------------------------------------
>
>
>
> The second commit added a memory contiguity check.
>
> --------------------------------------------------------
> commit e0e1723229b6f96922d10bb932f94d899132b462
> Author: nickpiggin <nickpiggin>
> Date: Tue Jan 4 04:14:42 2005 +0000
>
> [PATCH] mm: teach kswapd about higher order areas
>
> Teach kswapd to free memory on behalf of higher order allocators. This
> could be important for higher order atomic allocations because they
> otherwise have no means to free the memory themselves.
>
> Signed-off-by: Nick Piggin <nickpiggin@xxxxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxx>
>
> BKrev: 41da1832E5flzqtNXq5m70WxihpcMw
> --------------------------------------------------------
>
> At that time, kswapd had the following code.
>
> --------------------------------------------------------
> -        if (zone->free_pages <= zone->pages_high) {
> +        if (!zone_watermark_ok(zone, order,
> +                        zone->pages_high, 0, 0, 0)) {
>                  end_zone = i;
>                  goto scan;
>          }
> --------------------------------------------------------
>
> The problem is that alloc_pages(GFP_KERNEL, 10) needs contiguous order-10 memory,
> but it doesn't need that contiguous memory to be in highmem.
>
> However, alloc_pages() passes on the order==10 information,
> but doesn't pass on the information that highmem contiguity is unnecessary.
>
> Oops, that is a bug, I think.
>
>
> So, I'd like to fix this bug.
> However, I want to check first whether my guess is right.
> Please post a reproducer program.
>
>
>
> > Details:
> > I built an instrumented kernel with debug messages in the
> > "zone_watermark_ok" function, and from the code and debug messages I
> > see that "zone_watermark_ok" returns 0 when kswapd invokes it (through
> > balance_pgdat) in order to decide whether zone highmem is balanced,
> > which leads in some configurations to an infinite loop in kswapd (if no
> > large chunks of highmem are released). I added a condition to
> > "balance_pgdat" so it doesn't try to balance orders higher than 1 in
> > zone highmem, and this condition solved the problem. What are the risks
> > of such a solution? Isn't it a bug that kswapd is looking for
> > contiguous memory in zone highmem (as I understand it, there is no 1-1
> > mapping in zone highmem, which makes this meaningless in kswapd)?
>
>
> Simply removing the check seems no good,
> because hugepages in highmem do need contiguous highmem.
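
For anyone following along, the order-aware check described in the report
can be modelled in user space roughly as below. This is only a sketch under
stated assumptions, not the kernel's code: struct zone_model,
model_free_pages() and model_watermark_ok() are made-up names,
MODEL_MAX_ORDER is an assumed value, and lowmem_reserve and the allocation
flags are left out. It shows why a zone with plenty of free order-0 pages
can still fail the check at order 10.

```c
#include <assert.h>

#define MODEL_MAX_ORDER 11	/* orders 0..10; assumed for this sketch */

/* Hypothetical stand-in for the free-list part of struct zone. */
struct zone_model {
	unsigned long nr_free[MODEL_MAX_ORDER];	/* free blocks per order */
	unsigned long pages_high;		/* high watermark, in pages */
};

/* Total number of free pages, summed over all orders. */
static unsigned long model_free_pages(const struct zone_model *z)
{
	unsigned long total = 0;
	int o;

	for (o = 0; o < MODEL_MAX_ORDER; o++)
		total += z->nr_free[o] << o;
	return total;
}

/*
 * Simplified order-aware watermark check (a model, not the kernel
 * function). Returns 1 if the zone looks balanced for an allocation
 * of the given order, 0 otherwise.
 */
static int model_watermark_ok(const struct zone_model *z, int order)
{
	long min = (long)z->pages_high;
	long free_pages;
	int o;

	assert(order >= 0 && order < MODEL_MAX_ORDER);

	free_pages = (long)model_free_pages(z) - (1L << order) + 1;
	if (free_pages <= min)
		return 0;

	for (o = 0; o < order; o++) {
		/* Blocks of this order cannot satisfy a larger request. */
		free_pages -= (long)(z->nr_free[o] << o);
		/* Require fewer free pages as the order goes up. */
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}
	return 1;
}
```

With 100000 free order-0 pages and pages_high of 1000, this model reports
the zone as balanced for order 0 but not for order 10, which is the state
kswapd keeps retrying on.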

kswapd_max_order check and reset should probably go inside
balance_pgdat:loop_again loop.

It is possible we could have a kswapd_max_order[MAX_NR_ZONES] or
something, but I don't know if the complexity would be worth while
given that huge order allocations aren't too common, and resetting
kswapd_max_order inside the loop should be a reasonable fix.
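
To make that concrete, here is a small user-space model of one reading of
the suggestion: pick the order up from kswapd_max_order and reset it on
every pass of the loop, so a high order is only retried while somebody
keeps asking for it. It is a sketch, not a patch: struct pgdat_model, the
model_* helpers, the callback type and max_passes are all invented for
illustration, and the real balance_pgdat does far more than this.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-node state kswapd looks at. */
struct pgdat_model {
	int kswapd_max_order;	/* highest order requested by wakers */
};

/* Model of a waker recording the order it needs. */
static void model_wakeup_kswapd(struct pgdat_model *pgdat, int order)
{
	if (pgdat->kswapd_max_order < order)
		pgdat->kswapd_max_order = order;
}

/* One balancing pass: returns 1 if every zone is balanced for `order`. */
typedef int (*balance_pass_fn)(int order, void *ctx);

/*
 * Example pass for the demo: pretend only order-0 balancing can ever
 * succeed, and remember the highest order that was attempted.
 */
static int model_only_order0_balances(int order, void *ctx)
{
	int *highest_seen = ctx;

	if (order > *highest_seen)
		*highest_seen = order;
	return order == 0;
}

/*
 * Balance loop that checks and resets kswapd_max_order on every pass
 * (the assumed change). Returns the number of passes made; max_passes
 * only bounds the model so it always terminates.
 */
static int model_balance_pgdat(struct pgdat_model *pgdat,
			       balance_pass_fn pass, void *ctx,
			       int max_passes)
{
	int passes = 0;

	assert(max_passes > 0);

	while (passes < max_passes) {
		/* Take whatever order is pending now, then clear it. */
		int order = pgdat->kswapd_max_order;

		pgdat->kswapd_max_order = 0;
		passes++;
		if (pass(order, ctx))
			break;
	}
	return passes;
}
```

Here a single order-10 wakeup costs one failed pass at order 10, after
which the loop falls back to order 0 and stops, instead of retrying
order 10 indefinitely.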


