Re: [PATCH 04/10] mm: slub: add knowledge of reserve pages

From: Christoph Lameter
Date: Wed Aug 08 2007 - 13:14:32 EST


On Tue, 7 Aug 2007, Matt Mackall wrote:

> > If you are in an atomic context and bound to a cpu then a per cpu slab is
> > assigned to you and no one else can take objects away from that process
> > since nothing else can run on the cpu.
>
> Servicing I/O over the network requires an allocation to send a buffer
> and an allocation to later receive the acknowledgement. We can't free
> our send buffer (or the memory it's supposed to clean) until the
> relevant ack is received. We have to hold our reserves privately
> throughout, even if an interrupt that wants to do GFP_ATOMIC
> allocation shows up in-between.

If you can take an interrupt then you can move to a different allocation
context. This means reclaim could free up more pages if we tell reclaim
not to allocate any memory.

> > If you are not in an atomic context and are preemptable or can switch
> > allocation context then you can create another context in which reclaim
> > could be run to remove some clean pages and get you more memory. Again no
> > need for the patch.
>
> By the point that this patch is relevant, there are already no clean
> pages. The only way to free up more memory is via I/O.

That is never true. The dirty ratio caps the number of dirty pages in
memory, so there is always a large percentage of memory that is kept
clean. Pages that are file backed and clean can be freed without any
additional memory allocation; that includes the executable code you need
in order to execute any instructions at all. We could guarantee that the
number of pages reclaimable without memory allocations stays above a
certain limit by checking the VM counters.
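
Such a check could be built on the existing global VM counters, e.g.
along these lines. This is only a sketch; the helper name and the
threshold are made up for illustration:

#include <linux/mmzone.h>
#include <linux/vmstat.h>

#define MIN_CLEAN_RECLAIMABLE   1024    /* example floor, in pages */

/*
 * Rough estimate: clean file pages that are not under writeback can be
 * freed without allocating anything.
 */
static int enough_clean_reclaimable(void)
{
        unsigned long file = global_page_state(NR_FILE_PAGES);
        unsigned long dirty = global_page_state(NR_FILE_DIRTY);
        unsigned long writeback = global_page_state(NR_WRITEBACK);

        return file > dirty + writeback + MIN_CLEAN_RECLAIMABLE;
}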

I think there are two simpler ways to address this:

1. Allow recursive calls into reclaim. Even while in a PF_MEMALLOC context
we can still scan the LRU lists and free clean pages. A concept patch
illustrating this follows below.

2. Make pageout() figure out whether the write action requires actual I/O
submission. If it does, the submission will *not* immediately free memory
and we have to wait for the I/O to complete. In that case do not initiate
the I/O right away (it would not free up memory, and it is a bad idea to
start I/O when we do not have enough free memory); instead put those pages
on a pageout list. When reclaim has freed enough memory, go through the
pageout list and trigger the I/O. That can be done without PF_MEMALLOC, so
additional reclaim could be triggered as needed. Maybe we can just get rid
of PF_MEMALLOC and some of the contorted code around it? A rough sketch of
such a pageout list follows below.
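
To make (2) a bit more concrete, a sketch of what the deferred pageout
list could look like. This is only an illustration, not a patch: the
function names are invented, and the locking and refcounting the real
scan would need (pages stay locked and referenced while queued) are
glossed over.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static LIST_HEAD(deferred_pageout);     /* dirty pages queued during the scan */

/* Called from the scan instead of starting I/O while under PF_MEMALLOC. */
static void defer_pageout(struct page *page)
{
        list_add_tail(&page->lru, &deferred_pageout);
}

/* Called after reclaim of clean pages has made enough progress. */
static void flush_deferred_pageout(void)
{
        struct page *page, *next;
        struct writeback_control wbc = {
                .sync_mode      = WB_SYNC_NONE,
                .nr_to_write    = 1,
                .for_reclaim    = 1,
        };

        list_for_each_entry_safe(page, next, &deferred_pageout, lru) {
                struct address_space *mapping = page_mapping(page);

                list_del(&page->lru);
                if (mapping && clear_page_dirty_for_io(page))
                        /* ->writepage() unlocks the page when it is done */
                        mapping->a_ops->writepage(page, &wbc);
                else
                        unlock_page(page);
        }
}

flush_deferred_pageout() would run outside PF_MEMALLOC, so the writepage
path may allocate and trigger further reclaim as needed.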




Recursive reclaim concept patch:

---
 include/linux/swap.h |    2 ++
 mm/page_alloc.c      |   11 +++++++++++
 mm/vmscan.c          |   27 +++++++++++++++++++++++++++
 3 files changed, 40 insertions(+)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h 2007-08-08 04:31:06.000000000 -0700
+++ linux-2.6/include/linux/swap.h 2007-08-08 04:31:28.000000000 -0700
@@ -190,6 +190,8 @@ extern void swap_setup(void);
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zone **zones, int order,
                                        gfp_t gfp_mask);
+extern unsigned long emergency_free_pages(struct zone **zones, int order,
+                                       gfp_t gfp_mask);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2007-08-08 04:17:33.000000000 -0700
+++ linux-2.6/mm/page_alloc.c 2007-08-08 04:39:26.000000000 -0700
@@ -1306,6 +1306,17 @@ nofail_alloc:
                                zonelist, ALLOC_NO_WATERMARKS);
                        if (page)
                                goto got_pg;
+
+                       /*
+                        * We cannot go into full synchronous reclaim
+                        * but we can still scan for easily reclaimable
+                        * pages.
+                        */
+                       if (p->flags & PF_MEMALLOC &&
+                               emergency_free_pages(zonelist->zones, order,
+                                               gfp_mask))
+                               goto nofail_alloc;
+
                        if (gfp_mask & __GFP_NOFAIL) {
                                congestion_wait(WRITE, HZ/50);
                                goto nofail_alloc;
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2007-08-08 04:21:14.000000000 -0700
+++ linux-2.6/mm/vmscan.c 2007-08-08 04:42:24.000000000 -0700
@@ -1204,6 +1204,33 @@ out:
}

/*
+ * Emergency reclaim. We are already in the VM writeout path
+ * and we have exhausted all memory. We have to free memory without
+ * any additional allocations. So no writes and no swap. Get
+ * as bare bones as we can.
+ */
+unsigned long emergency_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+{
+       int priority;
+       unsigned long nr_reclaimed = 0;
+       struct scan_control sc = {
+               .gfp_mask = gfp_mask,
+               .swap_cluster_max = SWAP_CLUSTER_MAX,
+               .order = order,
+       };
+
+       for (priority = DEF_PRIORITY; priority >= 0; priority--) {
+               sc.nr_scanned = 0;
+               nr_reclaimed += shrink_zones(priority, zones, &sc);
+               if (nr_reclaimed >= sc.swap_cluster_max)
+                       return 1;
+       }
+
+       /* top priority shrink_caches still had more to do? don't OOM, then */
+       return sc.all_unreclaimable;
+}
+
+/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at pages_high.
*


