Re: [patch] my latest oom stuff

Andrea Arcangeli (andrea@e-mind.com)
Sun, 25 Oct 1998 02:56:32 +0100 (CET)


On Sat, 24 Oct 1998, Linus Torvalds wrote:

>I have an alternate patch for low memory circumstances that I'd like you
>to test out.

I tried it now and it does not work.

The problem is still that kswapd runs all the time (I can see it by
pressing the magic keys). Nothing except kswapd is running and the kernel
deadlocks. No other process can complete a page fault while swapping in or
while allocating mmapped memory.

And I don't understand why you continue to allocate memory this way:

__get_free_pages()
{
	page = get_the_page();

	if (page)
		return page;

	free_memory_and_wait();
	return OOM;
}

Why are we freeing memory if we are going to fail for sure? The right
place to wake up kswapd is after a process has been killed, not before, I
think.

The only reason I can see to wake up kswapd, or to call try_to_free_page
by hand (which I think is better), is to retry the allocation once. I
found that very useful, since without that trick kswapd _has_ to work in
advance all the time.

In fact, the approach in my patch is this:

__get_free_pages()
{
try_again:
	page = get_the_page();

	if (page)
		return page;

	if (first_time) {
		free_memory_and_wait();
		goto try_again;
	}
	/* ok, try to kill a process */
	return OOM;
}

With this retry we can be lazier in kswapd and return at the first
failure of do_try_to_free_pages() without problems, I think (my machine
tells me that I am right, but I know it's not the only hardware out
there...).

Right now I implement free_memory_and_wait() as

	try_to_free_pages(SWAP_CLUSTER_MAX)

and I try again _only_ if SWAP_CLUSTER_MAX pages were actually freed.
This setting seems to be fine here.
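To make the control flow concrete, here is a compilable user-space sketch
of the retry-once policy. The stubs standing in for the allocator
internals (get_the_page(), free_memory_and_wait(), pages_reclaimed) are
hypothetical, just so the logic can be exercised outside the kernel:

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real allocator internals. */
static int pages_reclaimed;	/* becomes nonzero once reclaim "worked" */

static void *get_the_page(void)
{
	/* Fail until some memory has been reclaimed, as under OOM pressure. */
	return pages_reclaimed ? (void *)0x1000 : NULL;
}

static int free_memory_and_wait(void)
{
	/* Pretend try_to_free_pages(SWAP_CLUSTER_MAX) freed a full cluster. */
	pages_reclaimed = 1;
	return 1;
}

/* Retry the allocation once after reclaiming; on the second failure
 * give up and let the caller handle OOM (i.e. kill a process). */
static void *alloc_with_one_retry(void)
{
	void *page;
	int first_time = 1;

try_again:
	page = get_the_page();
	if (page)
		return page;
	if (first_time && free_memory_and_wait()) {
		first_time = 0;
		goto try_again;
	}
	return NULL;	/* OOM */
}
```

The point of the single retry is visible here: reclaim happens at most
once per allocation, and a second failure is reported to the caller
instead of looping in the allocator.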

>The problem with the old kswapd setup was at least partly that kswapd was
>woken up too late - by the time kswapd was woken up, it really had to work
>fairly hard. Also, kswapd really shouldn't be real-time at all: normally
>it should just be a fairly low-priority process, and the priority should
>grow as there is more urgent need for memory.

I agree that kswapd should not run at real-time priority during the
swapout, but your priority wakeup doesn't convince me. kswapd will run
tons of times when the system is OOM, so setting the counter to a high
value will probably only give kswapd a fast start. I'd like to give
kswapd that fast start in a simpler way: put it to sleep as a real-time
process and switch it back to SCHED_OTHER when it is woken up. kswapd
should sleep on its path only during the swapout, and the counter value
should not be so critical at that point, so I think the real-time /
non-real-time runtime difference is not critical in practice for kswapd
(at least here).

>This alternate approach seems to work for me, and is designed to avoid the

I think you are not using an evil enough testsuite. I suggest the one I
developed and use for the OOM testing. I run many of these in the
background:

#include <stdlib.h>

int main(void)
{
	char *p[20];
	int i, j;

	/* no error checks: with overcommit the mallocs succeed, and the
	   touching loop below is what actually triggers the OOM */
	for (j = 0; j < 20; j++)
		p[j] = malloc(1000000);

	for (;;)
		for (j = 0; j < 20; j++)
			for (i = 0; i < 1000000; i++)
				p[j][i] = 0;
}

You can increase the size of the arrays if you have tons of memory of
course...
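The endless loop above is the point of the stress test, but for a quick
sanity check of the touching loop itself (and for playing with bigger
sizes), a bounded variant can be sketched; the helper name and sizes here
are hypothetical:

```c
#include <stdlib.h>

/* Allocate nbuf buffers of size bytes each and zero every byte passes
 * times, like the stress program above but terminating. Returns the
 * total number of bytes touched, or 0 if an allocation failed. */
static size_t touch_memory(int nbuf, size_t size, int passes)
{
	char **p = malloc(nbuf * sizeof *p);
	size_t i, touched = 0;
	int j, pass;

	if (!p)
		return 0;
	for (j = 0; j < nbuf; j++)
		if (!(p[j] = malloc(size)))
			return 0;	/* leak on failure: fine for a test */

	for (pass = 0; pass < passes; pass++)
		for (j = 0; j < nbuf; j++)
			for (i = 0; i < size; i++, touched++)
				p[j][i] = 0;

	for (j = 0; j < nbuf; j++)
		free(p[j]);
	free(p);
	return touched;
}
```

For the real test you would call it with large sizes and an effectively
infinite pass count, matching the 20 x 1000000 layout above.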

To reproduce the deadlock even faster you can do a swapoff -a and then
trigger the OOM with one of my processes.

>"spikes" of heavy real-time kswapd activity during which the machine is
>fairly unusable in the old scheme.

kswapd has a long execution path even with nice +19 priority, because it
can't be preempted (1/2 sec of kswapd, then schedule(), then again 1/2
sec of kswapd). When the system is OOM, kswapd must stop calling
do_try_to_free_pages() to avoid the deadlock. We can use a cleverer
heuristic than my `if (!do_try_to_free_pages())' ;-) if somebody finds a
case where it fails.

Right now I have merged your good stuff, changed some things, and
verified that this latest patch of mine works great here (even more
aggressive while paging out, and even more responsive in killing a
process when OOM).

Patch against 2.1.127-pre1:

Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.4 linux/mm/vmscan.c:1.1.1.2.4.10
--- linux/mm/vmscan.c:1.1.1.4 Sun Oct 25 01:28:52 1998
+++ linux/mm/vmscan.c Sun Oct 25 02:46:55 1998
@@ -442,39 +442,43 @@
static int do_try_to_free_page(int gfp_mask)
{
static int state = 0;
- int i=6;
- int stop;
+ int from_prio, to_prio;

/* Always trim SLAB caches when memory gets low. */
kmem_cache_reap(gfp_mask);

/* We try harder if we are waiting .. */
- stop = 3;
if (gfp_mask & __GFP_WAIT)
- stop = 0;
+ {
+ from_prio = 3;
+ to_prio = 0;
+ } else {
+ from_prio = 6;
+ to_prio = 3;
+ }

if (buffer_over_borrow() || pgcache_over_borrow())
- shrink_mmap(i, gfp_mask);
+ state = 0;

switch (state) {
do {
case 0:
- if (shrink_mmap(i, gfp_mask))
+ if (shrink_mmap(from_prio, gfp_mask))
return 1;
state = 1;
case 1:
- if (shm_swap(i, gfp_mask))
+ if (shm_swap(from_prio, gfp_mask))
return 1;
state = 2;
case 2:
- if (swap_out(i, gfp_mask))
+ if (swap_out(from_prio, gfp_mask))
return 1;
state = 3;
case 3:
- shrink_dcache_memory(i, gfp_mask);
+ shrink_dcache_memory(from_prio, gfp_mask);
state = 0;
- i--;
- } while ((i - stop) >= 0);
+ from_prio--;
+ } while (from_prio >= to_prio);
}
return 0;
}
@@ -516,12 +520,10 @@
*/
lock_kernel();

- /*
- * Set the base priority to something smaller than a
- * regular process. We will scale up the priority
- * dynamically depending on how much memory we need.
- */
- current->priority = (DEF_PRIORITY * 2) / 3;
+ /* Give kswapd a realtime priority. */
+ current->rt_priority = 32; /* Fixme --- we need to standardise our
+ namings for POSIX.4 realtime scheduling
+ priorities. */

/*
* Tell the memory management that we're a "memory allocator",
@@ -540,12 +542,18 @@
init_swap_timer();
kswapd_task = current;
while (1) {
- int tries;
+ int tries, free_memory, count;

- current->state = TASK_INTERRUPTIBLE;
- flush_signals(current);
run_task_queue(&tq_disk);
+ flush_signals(current);
+ /*
+ * Remember to enable the swap tick before going to sleep.
+ */
+ timer_active |= 1<<SWAP_TIMER;
+ current->state = TASK_INTERRUPTIBLE;
+ current->policy = SCHED_FIFO;
schedule();
+ current->policy = SCHED_OTHER;
swapstats.wakeups++;

/*
@@ -564,20 +572,35 @@
* woken up more often and the rate will be even
* higher).
*/
- tries = pager_daemon.tries_base;
- tries >>= 4*free_memory_available();

- do {
- do_try_to_free_page(0);
+ /*
+ * free_memory_available() can return 0 or 1 or 2:
+ * case 0: very low on memory
+ * case 1: pretty low on memory
+ * case 2: we get here because buffer or page cache are
+ * too big
+ * -arca
+ */
+ free_memory = free_memory_available();
+
+ tries = pager_daemon.tries_base >> (free_memory + 1);
+
+ for (count = 0; count < tries; count++)
+ {
/*
+ * Stop carefully if we could eat all CPU power. -arca
+ */
+ if (!do_try_to_free_page(0))
+ break;
+ /*
* Syncing large chunks is faster than swapping
* synchronously (less head movement). -- Rik.
*/
if (atomic_read(&nr_async_pages) >= pager_daemon.swap_cluster)
run_task_queue(&tq_disk);
- if (free_memory_available() > 1)
+ if (free_memory_available() == 2)
break;
- } while (--tries > 0);
+ }
}
/* As if we could ever get here - maybe we want to make this killable */
kswapd_task = NULL;
@@ -592,81 +615,44 @@
*
* The "PF_MEMALLOC" flag protects us against recursion:
* if we need more memory as part of a swap-out effort we
- * will just silently return "success" to tell the page
- * allocator to accept the allocation.
+ * will just silently return "fail" to tell the page
+ * allocator that we are OOM.
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval = 1;
+ int retval = 0;

lock_kernel();
if (!(current->flags & PF_MEMALLOC)) {
current->flags |= PF_MEMALLOC;
- do {
+ while (count--)
+ {
retval = do_try_to_free_page(gfp_mask);
if (!retval)
break;
- count--;
- } while (count > 0);
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
return retval;
}

-/*
- * Wake up kswapd according to the priority
- * 0 - no wakeup
- * 1 - wake up as a low-priority process
- * 2 - wake up as a normal process
- * 3 - wake up as an almost real-time process
- *
- * This plays mind-games with the "goodness()"
- * function in kernel/sched.c.
- */
-static inline void kswapd_wakeup(int priority)
-{
- if (priority) {
- struct task_struct *p = kswapd_task;
- if (p) {
- p->counter = p->priority << priority;
- wake_up_process(p);
- }
- }
-}
-
/*
* The swap_tick function gets called on every clock tick.
*/
void swap_tick(void)
{
- unsigned int pages;
- int want_wakeup;
-
/*
* Schedule for wakeup if there isn't lots
* of free memory or if there is too much
* of it used for buffers or pgcache.
- *
- * "want_wakeup" is our priority: 0 means
- * not to wake anything up, while 3 means
- * that we'd better give kswapd a realtime
- * priority.
*/
- want_wakeup = 0;
- if (buffer_over_max() || pgcache_over_max())
- want_wakeup = 1;
- pages = nr_free_pages;
- if (pages < freepages.high)
- want_wakeup = 1;
- if (pages < freepages.low)
- want_wakeup = 2;
- if (pages < freepages.min)
- want_wakeup = 3;
-
- kswapd_wakeup(want_wakeup);

- timer_active |= (1<<SWAP_TIMER);
+ if (free_memory_available() < 2 || buffer_over_max() ||
+ pgcache_over_max())
+ kswapd_wakeup();
+ else
+ timer_active |= (1<<SWAP_TIMER);
}

/*
Index: linux/mm/page_alloc.c
diff -u linux/mm/page_alloc.c:1.1.1.3 linux/mm/page_alloc.c:1.1.1.1.18.4
--- linux/mm/page_alloc.c:1.1.1.3 Sun Oct 25 01:28:52 1998
+++ linux/mm/page_alloc.c Sat Oct 24 20:25:17 1998
@@ -237,43 +237,29 @@
unsigned long __get_free_pages(int gfp_mask, unsigned long order)
{
unsigned long flags;
+ int again = 0;
+ int wait = gfp_mask & __GFP_WAIT;

if (order >= NR_MEM_LISTS)
goto nopage;

- if (gfp_mask & __GFP_WAIT) {
- if (in_interrupt()) {
- static int count = 0;
- if (++count < 5) {
- printk("gfp called nonatomically from interrupt %p\n",
- __builtin_return_address(0));
- }
- goto nopage;
- }
-
- if (freepages.min > nr_free_pages) {
- int freed;
- freed = try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);
- /*
- * Low priority (user) allocations must not
- * succeed if we didn't have enough memory
- * and we couldn't get more..
- */
- if (!freed && !(gfp_mask & (__GFP_MED | __GFP_HIGH)))
- goto nopage;
- }
+ if (wait && in_interrupt()) {
+ printk("gfp called nonatomically from interrupt %p\n",
+ __builtin_return_address(0));
+ goto nopage;
}
+ again:
spin_lock_irqsave(&page_alloc_lock, flags);
RMQUEUE(order, (gfp_mask & GFP_DMA));
spin_unlock_irqrestore(&page_alloc_lock, flags);
+
+ if (!again && wait)
+ {
+ again = 1;
+ if (try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX))
+ goto again;
+ }

- /*
- * If we failed to find anything, we'll return NULL, but we'll
- * wake up kswapd _now_ and even wait for it synchronously if
- * we can.. This way we'll at least make some forward progress
- * over time.
- */
- kswapd_notify(gfp_mask);
nopage:
return 0;
}
Index: linux/kernel/fork.c
diff -u linux/kernel/fork.c:1.1.1.2 linux/kernel/fork.c:1.1.1.2.4.2
--- linux/kernel/fork.c:1.1.1.2 Fri Oct 9 17:44:09 1998
+++ linux/kernel/fork.c Sun Oct 25 02:43:48 1998
@@ -296,6 +296,8 @@
exit_mmap(mm);
free_page_tables(mm);
kmem_cache_free(mm_cachep, mm);
+ if (free_memory_available() != 2)
+ kswapd_wakeup();
}
}

Index: linux/include/linux/mm.h
diff -u linux/include/linux/mm.h:1.1.1.3 linux/include/linux/mm.h:1.1.1.1.16.2
--- linux/include/linux/mm.h:1.1.1.3 Sun Oct 25 01:28:37 1998
+++ linux/include/linux/mm.h Sun Oct 25 02:43:49 1998
@@ -330,15 +330,11 @@
extern int free_memory_available(void);
extern struct task_struct * kswapd_task;

-extern inline void kswapd_notify(unsigned int gfp_mask)
+static inline void kswapd_wakeup(void)
{
- if (kswapd_task) {
- wake_up_process(kswapd_task);
- if (gfp_mask & __GFP_WAIT) {
- current->policy |= SCHED_YIELD;
- schedule();
- }
- }
+ struct task_struct *p = kswapd_task;
+ if (p)
+ wake_up_process(p);
}

/* vma is the first one with address < vma->vm_end,

Andrea Arcangeli

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/