[PATCH] (#2) VM kswapd autotuning vs. -ac7

From: Rik van Riel (riel@conectiva.com.br)
Date: Fri Jun 02 2000 - 13:53:08 EST


Hi Alan,

This patch improves a number of things over yesterday's
version:
- in do_try_to_free_pages, initialise swap_count to the number
  of pages we still have to free; this is friendlier to the
  running processes in the system and should improve IO performance
  (a stand-alone sketch of this accounting follows the list)
- also in do_try_to_free_pages, if both shrink_mmap *and* swap_out
  are having a hard time, we make the function more aggressive ...
- "beautify" the async swap case in shrink_mmap (thanks quintela)
- refine the kswapd_pause logic, now we waste less memory and
  swapping is a little bit smoother
- low-priority (niced, cpu hog, ...) processes will reschedule
  after waking up kswapd in very low memory situations; hopefully
  this will slow them down a bit under memory pressure, leading
  to less memory load
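
To make the accounting change a bit more concrete, here is a tiny
stand-alone userspace sketch (this is NOT part of the patch; shrink_mmap
and swap_out are replaced by dummy stubs with invented numbers, and the
dcache/icache/shm steps are left out):

/*
 * Stand-alone sketch of the new accounting in do_try_to_free_pages():
 * swap_count starts at the number of pages we still have to free, and
 * whatever swap_out() could not handle is added back to count, so the
 * next pass works harder.  shrink_mmap()/swap_out() are dummy stubs
 * here; the real ones live in mm/filemap.c and mm/vmscan.c.
 */
#include <stdio.h>

#define FREE_COUNT 8

static int mmap_freeable = 3;	/* invented numbers, just to drive the loop */
static int swappable = 5;

static int shrink_mmap(int priority) { (void)priority; return mmap_freeable-- > 0; }
static int swap_out(int priority)    { (void)priority; return swappable-- > 0; }

static int do_try_to_free_pages_sketch(void)
{
	int priority = 64;
	int count = FREE_COUNT;
	int swap_count, ret = 0;

	do {
		while (shrink_mmap(priority)) {
			ret = 1;
			if (!--count)
				return ret;
		}

		/* Page out at most as many pages as we are still short. */
		swap_count = count;
		while (swap_out(priority))
			if (--swap_count < 0)
				break;

		/* It was difficult? Push harder on the next pass. */
		count += swap_count;
	} while (--priority >= 0);

	return ret;
}

int main(void)
{
	printf("freed something: %d\n", do_try_to_free_pages_sketch());
	return 0;
}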

I've been stress-testing this patch like crazy for the last few hours
and everything seems to run fine.

[Under my .sig: 1) the patch 2) yesterday's message]

Regards,

Rik

--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/

---------------------[ the patch ]-------------------------

--- linux-2.4.0-t1-ac7/fs/buffer.c.orig	Thu Jun 1 10:37:59 2000
+++ linux-2.4.0-t1-ac7/fs/buffer.c	Thu Jun 1 14:51:14 2000
@@ -1868,6 +1868,7 @@
 	}
 	spin_unlock(&unused_list_lock);
+	wake_up(&buffer_wait);
 
 	return iosize;
 }
@@ -2004,6 +2005,8 @@
 		__put_unused_buffer_head(bh[bhind]);
 	}
 	spin_unlock(&unused_list_lock);
+	wake_up(&buffer_wait);
+
 	goto finished;
 }
@@ -2181,6 +2184,12 @@
 }
 
 /*
+ * Can the buffer be thrown out?
+ */
+#define BUFFER_BUSY_BITS	((1<<BH_Dirty) | (1<<BH_Lock) | (1<<BH_Protected))
+#define buffer_busy(bh)		(atomic_read(&(bh)->b_count) | ((bh)->b_state & BUFFER_BUSY_BITS))
+
+/*
  * Sync all the buffers on one page..
  *
  * If we have old buffers that are locked, we'll
@@ -2190,7 +2199,7 @@
  * This all is required so that we can free up memory
  * later.
  */
-static void sync_page_buffers(struct buffer_head *bh, int wait)
+static int sync_page_buffers(struct buffer_head *bh, int wait)
 {
 	struct buffer_head * tmp = bh;
 
@@ -2203,13 +2212,17 @@
 		} else if (buffer_dirty(p))
 			ll_rw_block(WRITE, 1, &p);
 	} while (tmp != bh);
-}
 
-/*
- * Can the buffer be thrown out?
- */
-#define BUFFER_BUSY_BITS	((1<<BH_Dirty) | (1<<BH_Lock) | (1<<BH_Protected))
-#define buffer_busy(bh)		(atomic_read(&(bh)->b_count) | ((bh)->b_state & BUFFER_BUSY_BITS))
+	do {
+		struct buffer_head *p = tmp;
+		tmp = tmp->b_this_page;
+		if (buffer_busy(p))
+			return 0;
+	} while (tmp != bh);
+
+	/* Success. Now try_to_free_buffers can free the page. */
+	return 1;
+}
 
 /*
  * try_to_free_buffers() checks if all the buffers on this particular page
@@ -2227,6 +2240,7 @@
 	struct buffer_head * tmp, * bh = page->buffers;
 	int index = BUFSIZE_INDEX(bh->b_size);
 
+again:
 	spin_lock(&lru_list_lock);
 	write_lock(&hash_table_lock);
 	spin_lock(&free_list[index].lock);
@@ -2272,7 +2286,8 @@
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
 	spin_unlock(&lru_list_lock);
-	sync_page_buffers(bh, wait);
+	if (sync_page_buffers(bh, wait))
+		goto again;
 	return 0;
 }
--- linux-2.4.0-t1-ac7/mm/vmscan.c.orig	Wed May 31 14:08:50 2000
+++ linux-2.4.0-t1-ac7/mm/vmscan.c	Fri Jun 2 15:39:53 2000
@@ -439,12 +439,12 @@
  * latency.
  */
 #define FREE_COUNT	8
-#define SWAP_COUNT	16
 
 static int do_try_to_free_pages(unsigned int gfp_mask)
 {
 	int priority;
 	int count = FREE_COUNT;
 	int swap_count;
+	int ret = 0;
 
 	/* Always trim SLAB caches when memory gets low. */
 	kmem_cache_reap(gfp_mask);
@@ -452,6 +452,7 @@
 	priority = 64;
 	do {
 		while (shrink_mmap(priority, gfp_mask)) {
+			ret = 1;
 			if (!--count)
 				goto done;
 		}
@@ -467,8 +468,10 @@
 		count -= shrink_dcache_memory(priority, gfp_mask);
 		count -= shrink_icache_memory(priority, gfp_mask);
 		if (count <= 0)
+			ret = 1;
 			goto done;
 		while (shm_swap(priority, gfp_mask)) {
+			ret = 1;
 			if (!--count)
 				goto done;
 		}
@@ -480,24 +483,28 @@
 		 * This will not actually free any pages (they get
 		 * put in the swap cache), so we must not count this
 		 * as a "count" success.
+		 *
+		 * The amount we page out is the amount of pages we're
+		 * short freeing.
 		 */
-		swap_count = SWAP_COUNT;
+		swap_count = count;
 		while (swap_out(priority, gfp_mask))
 			if (--swap_count < 0)
 				break;
 
+		/* It was difficult? Push harder... */
+		count += swap_count;
 	} while (--priority >= 0);
 
 	/* Always end on a shrink_mmap.. */
 	while (shrink_mmap(0, gfp_mask)) {
+		ret = 1;
 		if (!--count)
 			goto done;
 	}
-	/* We return 1 if we are freed some page */
-	return (count != FREE_COUNT);
 done:
-	return 1;
+	return ret;
 }
 
 DECLARE_WAIT_QUEUE_HEAD(kswapd_wait);
--- linux-2.4.0-t1-ac7/mm/page_alloc.c.orig	Wed May 31 14:08:50 2000
+++ linux-2.4.0-t1-ac7/mm/page_alloc.c	Fri Jun 2 15:29:21 2000
@@ -222,6 +222,9 @@
 {
 	zone_t **zone = zonelist->zones;
 	extern wait_queue_head_t kswapd_wait;
+	static int last_woke_kswapd;
+	static int kswapd_pause = HZ;
+	int gfp_mask = zonelist->gfp_mask;
 
 	/*
 	 * (If anyone calls gfp from interrupts nonatomically then it
@@ -248,14 +251,28 @@
 		}
 	}
 
-	/* All zones are in need of kswapd. */
-	if (waitqueue_active(&kswapd_wait))
+	/*
+	 * Kswapd should be freeing enough memory to satisfy all allocations
+	 * immediately. Calling try_to_free_pages from processes will slow
+	 * down the system a lot. On the other hand, waking up kswapd too
+	 * often means wasted memory and cpu time.
+	 *
+	 * We tune the kswapd pause interval in such a way that kswapd is
+	 * always just agressive enough to free the amount of memory we
+	 * want freed.
+	 */
+	if (waitqueue_active(&kswapd_wait) &&
+			time_after(jiffies, last_woke_kswapd + kswapd_pause)) {
+		kswapd_pause++;
+		last_woke_kswapd = jiffies;
 		wake_up_interruptible(&kswapd_wait);
+	}
 
 	/*
 	 * Ok, we don't have any zones that don't need some
 	 * balancing.. See if we have any that aren't critical..
 	 */
+again:
 	zone = zonelist->zones;
 	for (;;) {
 		zone_t *z = *(zone++);
@@ -267,16 +284,29 @@
 				z->low_on_memory = 1;
 			if (page)
 				return page;
+		} else {
+			if (kswapd_pause > 0)
+				kswapd_pause--;
 		}
 	}
 
+	/* We didn't kick kswapd often enough... */
+	kswapd_pause /= 2;
+	if (waitqueue_active(&kswapd_wait))
+		wake_up_interruptible(&kswapd_wait);
+
+	/* If we're low priority, we just wait a bit and try again later. */
+	if ((gfp_mask & __GFP_WAIT) && current->need_resched &&
+			current->state == TASK_RUNNING) {
+		schedule();
+		goto again;
+	}
+
 	/*
 	 * Uhhuh. All the zones have been critical, which means that
 	 * we'd better do some synchronous swap-out. kswapd has not
 	 * been able to cope..
 	 */
 	if (!(current->flags & PF_MEMALLOC)) {
-		int gfp_mask = zonelist->gfp_mask;
 		if (!try_to_free_pages(gfp_mask)) {
 			if (!(gfp_mask & __GFP_HIGH))
 				goto fail;
@@ -303,7 +333,6 @@
 	zone = zonelist->zones;
 	for (;;) {
 		zone_t *z = *(zone++);
-		int gfp_mask = zonelist->gfp_mask;
 		if (!z)
 			break;
 		if (z->free_pages > z->pages_min) {
--- linux-2.4.0-t1-ac7/mm/filemap.c.orig	Wed May 31 14:08:50 2000
+++ linux-2.4.0-t1-ac7/mm/filemap.c	Fri Jun 2 15:42:25 2000
@@ -334,13 +334,6 @@
 		count--;
 
 		/*
-		 * Page is from a zone we don't care about.
-		 * Don't drop page cache entries in vain.
-		 */
-		if (page->zone->free_pages > page->zone->pages_high)
-			goto dispose_continue;
-
-		/*
 		 * Avoid unscalable SMP locking for pages we can
 		 * immediate tell are untouchable..
@@ -375,6 +368,13 @@
 			}
 		}
 
+		/*
+		 * Page is from a zone we don't care about.
+		 * Don't drop page cache entries in vain.
+		 */
+		if (page->zone->free_pages > page->zone->pages_high)
+			goto unlock_continue;
+
 		/* Take the pagecache_lock spinlock held to avoid
 		   other tasks to notice the page while we are looking at its
 		   page count. If it's a pagecache-page we'll free it
@@ -400,8 +400,15 @@
 			goto made_inode_progress;
 		}
 		/* PageDeferswap -> we swap out the page now. */
-		if (gfp_mask & __GFP_IO)
-			goto async_swap_continue;
+		if (gfp_mask & __GFP_IO) {
+			spin_unlock(&pagecache_lock);
+			/* Do NOT unlock the page ... brw_page does. */
+			ClearPageDirty(page);
+			rw_swap_page(WRITE, page, 0);
+			spin_lock(&pagemap_lru_lock);
+			page_cache_release(page);
+			goto dispose_continue;
+		}
 		goto cache_unlock_continue;
 	}
 
@@ -422,14 +429,6 @@
 unlock_continue:
 	spin_lock(&pagemap_lru_lock);
 	UnlockPage(page);
-	page_cache_release(page);
-	goto dispose_continue;
-async_swap_continue:
-	spin_unlock(&pagecache_lock);
-	/* Do NOT unlock the page ... that is done after IO. */
-	ClearPageDirty(page);
-	rw_swap_page(WRITE, page, 0);
-	spin_lock(&pagemap_lru_lock);
 	page_cache_release(page);
 dispose_continue:
 	list_add(page_lru, &lru_cache);
--- linux-2.4.0-t1-ac7/include/linux/swap.h.orig	Wed May 31 21:00:06 2000
+++ linux-2.4.0-t1-ac7/include/linux/swap.h	Thu Jun 1 11:51:25 2000
@@ -166,7 +166,7 @@
  * The 2.4 code, however, is mostly simple and stable ;)
  */
 #define PG_AGE_MAX	64
-#define PG_AGE_START	5
+#define PG_AGE_START	2
 #define PG_AGE_ADV	3
 #define PG_AGE_DECL	1

----------------[ yesterday's message ]--------------------

this patch does the following things:
- move the starting page age to below the PG_AGE_ADV, so reclaimed
  pages have an advantage over pages which are new in the lru queue
- add two missing wake_up calls to buffer.c (I'm not 100% sure about
  these; I found them when digging through the classzone patch and
  they are consistent with other uses of the unused_list_lock)
- if try_to_free_buffers waits on a page with buffers and succeeds
  in freeing them, return success (partly mined from classzone; see
  the sketch right after this list)
- __alloc_pages is responsible for waking up kswapd, however there
  seemed to be some flaws in the wakeup logic:
  - if kswapd is woken up too early, we free too much memory and
    waste CPU time
  - if kswapd is woken up too late, processes will call
    try_to_free_pages() themselves and stall; extremely bad for
    performance
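
For the curious, here is a rough stand-alone sketch of the
sync-then-retry idea in try_to_free_buffers()/sync_page_buffers();
the struct below is a made-up stand-in for buffer_head, not the real
kernel code:

/*
 * Stand-alone sketch: sync_page_buffers() now reports whether every
 * buffer on the page ended up idle, and try_to_free_buffers() loops
 * back ("goto again" in the patch) only in that case.
 */
#include <stdio.h>

struct buf {
	int dirty, locked, count;	/* simplified b_state / b_count */
	struct buf *this_page;		/* circular list, like b_this_page */
};

static int buffer_busy(const struct buf *b)
{
	return b->count | b->dirty | b->locked;
}

/* Returns 1 when the page became freeable, 0 when a buffer is still busy. */
static int sync_page_buffers(struct buf *bh)
{
	struct buf *tmp = bh;

	do {				/* write back dirty buffers ... */
		if (tmp->dirty)
			tmp->dirty = 0;	/* pretend the write finished */
		tmp = tmp->this_page;
	} while (tmp != bh);

	do {				/* ... then check nothing is busy */
		if (buffer_busy(tmp))
			return 0;
		tmp = tmp->this_page;
	} while (tmp != bh);

	return 1;			/* caller can retry the free now */
}

int main(void)
{
	struct buf a = { 1, 0, 0, NULL }, b = { 0, 0, 0, &a };

	a.this_page = &b;
	printf("retry the free? %d\n", sync_page_buffers(&a));
	return 0;
}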

The obvious solution is to have an auto-tuning algorithm where the system tunes how often kswapd is woken up. To do that we use the zone->zone_wake_kswapd and zone->low_on_memory flags. Basically kswapd will always continue until no zone is low on memory any more, sometimes resulting in one zone which has too much free memory.

If we can keep all zones from being low on memory, allocations can succeed immediately and applications can run fast. To ensure that, we must wake up kswapd often enough (but not too often).

The goal is to have every allocation happen in the second "alloc loop" without any zones running low on memory. We achieve this by waking up kswapd whenever we fall through the first loop and it has been longer than kswapd_pause since we last woke up kswapd.
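
As an illustration only, the wakeup gate boils down to something like
this little userspace toy (jiffies and time_after are simulated and
HZ=100 is just a convenient value; the real thing is the gate in
__alloc_pages in mm/page_alloc.c):

/*
 * Userspace toy of the wakeup gate: kswapd is only woken when the
 * allocation fell through the first zone loop AND more than
 * kswapd_pause ticks have passed since the last wakeup.
 */
#include <stdio.h>

#define HZ 100
#define time_after(a, b) ((long)(b) - (long)(a) < 0)

static unsigned long jiffies;		/* stand-in for the kernel tick count */
static unsigned long last_woke_kswapd;
static long kswapd_pause = HZ;

static void maybe_wake_kswapd(void)
{
	if (!time_after(jiffies, last_woke_kswapd + kswapd_pause))
		return;			/* we woke it recently enough */

	kswapd_pause++;			/* drift towards waking it less often */
	last_woke_kswapd = jiffies;
	printf("tick %5lu: wake kswapd, pause is now %ld\n",
	       jiffies, kswapd_pause);
}

int main(void)
{
	/* Pretend every tick is an allocation that fell through the loop. */
	for (jiffies = 0; jiffies < 10UL * HZ; jiffies++)
		maybe_wake_kswapd();
	return 0;
}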

If a zone does get low on memory, we halve the value of kswapd_pause so next time we'll wake up kswapd earlier. As long as we never get low on memory, kswapd_pause grows slowly over time, balancing out the earlier halving of the period.
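
The tuning itself is just three small adjustments; here they are pulled
out into another stand-alone toy so the feedback is visible (the driver
loop in main() is pure invention -- in the kernel the signals come from
the zone free-page watermarks in __alloc_pages):

/*
 * The three adjustments to kswapd_pause: +1 on every wakeup, -1
 * whenever a zone still has free memory, /2 when every zone went
 * critical.
 */
#include <stdio.h>

static long kswapd_pause = 100;		/* the patch starts it at HZ */

static void woke_kswapd(void)		{ kswapd_pause++; }

static void zone_had_free_memory(void)
{
	if (kswapd_pause > 0)
		kswapd_pause--;		/* we are waking kswapd often enough */
}

static void all_zones_critical(void)
{
	kswapd_pause /= 2;		/* far too late: wake kswapd sooner */
}

int main(void)
{
	int i;

	for (i = 0; i < 3; i++) {
		all_zones_critical();
		printf("after critical episode %d: pause = %ld\n",
		       i + 1, kswapd_pause);
	}
	for (i = 0; i < 50; i++)
		woke_kswapd();		/* quiet period: pause creeps back up */
	for (i = 0; i < 10; i++)
		zone_had_free_memory();	/* plenty free: back it off a little */
	printf("after the quiet period:   pause = %ld\n", kswapd_pause);
	return 0;
}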

I'm running some tests now, and it seems that system performance is good, kswapd overhead is quite a bit lower than before, and the amount of free memory is very stable (between freepages.low and freepages.high, as it was called in 2.2 ;))

Please give this patch some exposure...

regards,

Rik

--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies http://www.conectiva.com/ http://www.surriel.com/



