[patch] Re: Out Of Memory in v. 2.1

Andrea Arcangeli (andrea@e-mind.com)
Sun, 4 Oct 1998 18:49:01 +0200 (CEST)


On Sat, 3 Oct 1998, Bill Hawes wrote:

>Roy Bixler wrote:
>
>> I am running kernel 2.1.123 on a couple of workstations and find a
>> reliable way to make the machines completely non-responsive is simply
>> to run a buggy X application which tries to allocate more memory than
>> is available. The only way out is a hard reset or an Alt-SysRq
>> sequence. In ordinary circumstances, the memory and swap space on
>> each machine is perfectly adequate and so I don't see "get more
>> memory" as being a solution. I could just get rid of the buggy apps,
>> but feel that the system ought to offer some protection against the
>> Out Of Memory scenario (i.e. a program shouldn't be able to cause a
>> system freeze simply by using too much memory.)
>
>Can you send a pointer to a "buggy app" that illustrates the problem?
>
>OOM killers are dangerous and I'd like to think we can find a better
>solution.

I spent last night and this afternoon on the OOM problem. I discovered
many problems in the current MM of 2.1.123.

I developed a patch that fixes every problem I can reproduce. With my
patch applied I am not able to deadlock (or, more precisely,
persistently starve) 2.1.123. Linux is now _always_ able to kill a
process when _needed_.

Short description of the patch:

- __get_free_pages() was stalling everything. The system was running
  try_to_free_pages() all the time. Now it runs try_to_free_pages()
  only if there really isn't a free page in the page-order list. If
  try_to_free_pages() has freed a whole chunk we try again, because we
  are sure we'll find a free page (see the simplified sketch after
  this list). And if we can't free a chunk of pages by hand using
  try_to_free_pages(), kswapd won't do any better, so I don't wake it
  up.
- kswapd was running for 2 sec or so when the system was out of
  memory. When the system is out of swap and out of memory,
  do_try_to_free_page() always fails, so there's no need to run it for
  2 sec (`tries' times). If we can't free 1 page of RAM we obviously
  won't be able to free `tries' pages of RAM. With my patch kswapd
  knows when it has to return to sleep.
- shrink_mmap() had no upper limit on the number of pages it tries to
  free each time it runs. It could happen that shrink_mmap() scanned
  more than num_physpages pages. This makes no sense, because if a
  page is not freeable now it won't be freeable a moment later either.
  And even if one page could become freeable after a while, increasing
  the stall in shrink_mmap() is worse than managing to free that one
  page.
- swap_out() had the same problem as shrink_mmap(): there was no upper
  limit on the number of passes over the processes.
- I also stop swap_tick() while kswapd is running.
- I fixed the calculation of `tries' in kswapd, following the
  calculation described in the comment.
- try_to_free_pages() now returns 0 if it was recursing (PF_MEMALLOC
  already set). We should not return 1 if we really haven't freed a
  page, I think. And now try_to_free_pages(xx, 0) does nothing.
- do_try_to_free_page() now returns 1 if it has just freed a page in
  the preliminary shrink_mmap(), since we have just freed a page, and
  we have freed it from the best place at that.
- I replaced the ugly min/max/borrow checks with nice symmetric
  #defines.
- try_to_unuse() needs the PF_MEMALLOC flag.
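To make the first point concrete, here is a simplified sketch of the
new __get_free_pages() control flow. alloc_from_freelist() and
free_some_pages() are placeholder names standing in for the RMQUEUE()
machinery and try_to_free_pages(); the real code is in the
mm/page_alloc.c hunk below:

/* Sketch of the new allocation flow, not the literal kernel code. */
extern unsigned long alloc_from_freelist(unsigned long order, int dma);
extern int free_some_pages(int gfp_mask, int count);
extern int nr_free_pages, freepages_min;

unsigned long get_free_pages_sketch(int gfp_mask, unsigned long order)
{
        unsigned long page;
again:
        page = alloc_from_freelist(order, 0);
        if (page)                       /* common case: free page found */
                return page;
        /* Run the freeing code only when memory is really low... */
        if (nr_free_pages < freepages_min)
                /* ...and retry only if a whole chunk was freed, so a
                 * process that is really oom cannot loop forever. */
                if (free_some_pages(gfp_mask, 32 /* SWAP_CLUSTER_MAX */))
                        goto again;
        return 0;                       /* oom: the caller must handle it */
}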

I'd like to know if somebody is able to deadlock linux-2.1.123 with
this MM patch applied. I can't. `swapoff -a' now also works fine while
the swap is needed. If the swapped-out applications don't touch their
memory while swapoff is running, swapoff returns oom. If they do touch
the swapped-out memory, they get killed oom and swapoff returns
successfully. In this second case there are some swap_duplicate
messages, but they should be harmless. They mean that swapoff has
swap_free()d a swap-cache page while that entry was being added to the
swap cache. This can generate orphan swap entries, which will be freed
just fine by shrink_mmap() when needed.
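The second case can be exercised with something like the sketch below
running while `swapoff -a' is issued from another shell (a
hypothetical testcase, not the exact one I used; CHUNK and NCHUNKS are
placeholders to tune so the total exceeds physical RAM):

/* swapoff-test.c: fill more memory than there is RAM so part of it
 * lands in swap, then keep touching it while swapoff runs. */
#include <stdlib.h>
#include <string.h>

#define CHUNK   (1UL << 22)     /* 4 MB per chunk */
#define NCHUNKS 64              /* tune: CHUNK*NCHUNKS > physical RAM */

int main(void)
{
        static char *mem[NCHUNKS];
        int i, n;

        for (n = 0; n < NCHUNKS; n++) {
                mem[n] = malloc(CHUNK);
                if (!mem[n])
                        break;
                memset(mem[n], 1, CHUNK);       /* force real pages */
        }
        /* Start `swapoff -a' now: touching the swapped-out chunks
         * below swaps them back in, this process gets killed oom and
         * swapoff returns successfully. */
        for (;;)
                for (i = 0; i < n; i++)
                        memset(mem[i], 2, CHUNK);
}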

Index: linux/include/linux/mm.h
diff -u linux/include/linux/mm.h:1.1.1.1 linux/include/linux/mm.h:1.1.1.1.8.1
--- linux/include/linux/mm.h:1.1.1.1 Fri Oct 2 19:22:40 1998
+++ linux/include/linux/mm.h Sun Oct 4 16:22:19 1998
@@ -380,6 +380,31 @@
return vma;
}

+#define buffer_under_min() ((buffermem >> PAGE_SHIFT) * 100 < \
+ buffer_mem.min_percent * num_physpages)
+#define buffer_under_borrow() ((buffermem >> PAGE_SHIFT) * 100 < \
+ buffer_mem.borrow_percent * num_physpages)
+#define buffer_under_max() ((buffermem >> PAGE_SHIFT) * 100 < \
+ buffer_mem.max_percent * num_physpages)
+#define buffer_over_min() ((buffermem >> PAGE_SHIFT) * 100 > \
+ buffer_mem.min_percent * num_physpages)
+#define buffer_over_borrow() ((buffermem >> PAGE_SHIFT) * 100 > \
+ buffer_mem.borrow_percent * num_physpages)
+#define buffer_over_max() ((buffermem >> PAGE_SHIFT) * 100 > \
+ buffer_mem.max_percent * num_physpages)
+#define pgcache_under_min() (page_cache_size * 100 < \
+ page_cache.min_percent * num_physpages)
+#define pgcache_under_borrow() (page_cache_size * 100 < \
+ page_cache.borrow_percent * num_physpages)
+#define pgcache_under_max() (page_cache_size * 100 < \
+ page_cache.max_percent * num_physpages)
+#define pgcache_over_min() (page_cache_size * 100 > \
+ page_cache.min_percent * num_physpages)
+#define pgcache_over_borrow() (page_cache_size * 100 > \
+ page_cache.borrow_percent * num_physpages)
+#define pgcache_over_max() (page_cache_size * 100 > \
+ page_cache.max_percent * num_physpages)
+
#endif /* __KERNEL__ */

#endif
Index: linux/mm/filemap.c
diff -u linux/mm/filemap.c:1.1.1.1 linux/mm/filemap.c:1.1.1.1.8.1
--- linux/mm/filemap.c:1.1.1.1 Fri Oct 2 19:22:39 1998
+++ linux/mm/filemap.c Sun Oct 4 16:22:22 1998
@@ -153,7 +153,7 @@
} while (tmp != bh);

/* Refuse to swap out all buffer pages */
- if ((buffermem >> PAGE_SHIFT) * 100 < (buffer_mem.min_percent * num_physpages))
+ if (buffer_under_min())
goto next;
}

@@ -213,6 +213,11 @@

count_max = (limit<<2) >> (priority>>1);
count_min = (limit<<2) >> (priority);
+
+ if (count_max > num_physpages)
+ count_max = num_physpages;
+ if (count_min > num_physpages)
+ count_min = num_physpages >> 1;

page = mem_map + clock;
do {
Index: linux/mm/page_alloc.c
diff -u linux/mm/page_alloc.c:1.1.1.1 linux/mm/page_alloc.c:1.1.1.1.8.1
--- linux/mm/page_alloc.c:1.1.1.1 Fri Oct 2 19:22:39 1998
+++ linux/mm/page_alloc.c Sun Oct 4 18:29:38 1998
@@ -241,7 +241,7 @@
if (order >= NR_MEM_LISTS)
goto nopage;

- if (gfp_mask & __GFP_WAIT) {
+ if (gfp_mask & __GFP_WAIT)
if (in_interrupt()) {
static int count = 0;
if (++count < 5) {
@@ -249,33 +249,16 @@
__builtin_return_address(0));
}
goto nopage;
- }
-
- if (freepages.min > nr_free_pages) {
- int freed;
- freed = try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX);
- /*
- * Low priority (user) allocations must not
- * succeed if we didn't have enough memory
- * and we couldn't get more..
- */
- if (!freed && !(gfp_mask & (__GFP_MED | __GFP_HIGH)))
- goto nopage;
}
- }
+ again:
spin_lock_irqsave(&page_alloc_lock, flags);
RMQUEUE(order, (gfp_mask & GFP_DMA));
spin_unlock_irqrestore(&page_alloc_lock, flags);

- /*
- * If we failed to find anything, we'll return NULL, but we'll
- * wake up kswapd _now_ ad even wait for it synchronously if
- * we can.. This way we'll at least make some forward progress
- * over time.
- */
- wake_up(&kswapd_wait);
- if (gfp_mask & __GFP_WAIT)
- schedule();
+ if (nr_free_pages < freepages.min)
+ if (try_to_free_pages(gfp_mask, SWAP_CLUSTER_MAX))
+ goto again;
+
nopage:
return 0;
}
Index: linux/mm/swapfile.c
diff -u linux/mm/swapfile.c:1.1.1.1 linux/mm/swapfile.c:1.1.1.1.8.1
--- linux/mm/swapfile.c:1.1.1.1 Fri Oct 2 19:22:39 1998
+++ linux/mm/swapfile.c Sun Oct 4 19:37:04 1998
@@ -398,7 +398,9 @@
swap_list.next = swap_list.head;
}
p->flags = SWP_USED;
+ current->flags |= PF_MEMALLOC;
err = try_to_unuse(type);
+ current->flags &= ~PF_MEMALLOC;
if (err) {
/* re-insert swap space back into swap_list */
for (prev = -1, i = swap_list.head; i >= 0; prev = i, i = swap_info[i].next)
Index: linux/mm/vmscan.c
diff -u linux/mm/vmscan.c:1.1.1.1 linux/mm/vmscan.c:1.1.1.1.8.2
--- linux/mm/vmscan.c:1.1.1.1 Fri Oct 2 19:22:39 1998
+++ linux/mm/vmscan.c Sun Oct 4 18:29:38 1998
@@ -384,6 +384,8 @@
* task won't be selected again until all others have been tried.
*/
counter = ((PAGEOUT_WEIGHT * nr_tasks) >> 10) >> priority;
+ if (counter > nr_tasks)
+ counter = nr_tasks;
for (; counter >= 0; counter--) {
assign = 0;
max_cnt = 0;
@@ -458,9 +460,9 @@
if (gfp_mask & __GFP_WAIT)
stop = 0;

- if (((buffermem >> PAGE_SHIFT) * 100 > buffer_mem.borrow_percent * num_physpages)
- || (page_cache_size * 100 > page_cache.borrow_percent * num_physpages))
- shrink_mmap(i, gfp_mask);
+ if (buffer_over_borrow() || pgcache_over_borrow())
+ if (shrink_mmap(i, gfp_mask))
+ return 1;

switch (state) {
do {
@@ -546,12 +548,14 @@
init_swap_timer();
add_wait_queue(&kswapd_wait, &wait);
while (1) {
- int tries;
+ int tries, free_memory, count;

current->state = TASK_INTERRUPTIBLE;
flush_signals(current);
run_task_queue(&tq_disk);
+ timer_active |= 1<<SWAP_TIMER;
schedule();
+ timer_active &= ~(1<<SWAP_TIMER);
swapstats.wakeups++;

/*
@@ -570,12 +574,21 @@
* woken up more often and the rate will be even
* higher).
*/
- tries = pager_daemon.tries_base;
- tries >>= 4*free_memory_available();
+ free_memory = free_memory_available();

- do {
- do_try_to_free_page(0);
+ if (free_memory == 2)
+ continue;
+ tries = pager_daemon.tries_base >> (free_memory + 2);
+
+ for (count = 1; count <= tries; count++)
+ {
/*
+ * If we can't free one page we won't be able to
+ * free `tries' pages.
+ */
+ if (!do_try_to_free_page(0))
+ break;
+ /*
* Syncing large chunks is faster than swapping
* synchronously (less head movement). -- Rik.
*/
@@ -583,7 +596,7 @@
run_task_queue(&tq_disk);
if (free_memory_available() > 1)
break;
- } while (--tries > 0);
+ }
}
/* As if we could ever get here - maybe we want to make this killable */
remove_wait_queue(&kswapd_wait, &wait);
@@ -603,17 +616,17 @@
*/
int try_to_free_pages(unsigned int gfp_mask, int count)
{
- int retval = 1;
+ int retval = 0;

lock_kernel();
if (!(current->flags & PF_MEMALLOC)) {
current->flags |= PF_MEMALLOC;
- do {
+ while (count--)
+ {
retval = do_try_to_free_page(gfp_mask);
if (!retval)
break;
- count--;
- } while (count > 0);
+ }
current->flags &= ~PF_MEMALLOC;
}
unlock_kernel();
@@ -649,8 +662,8 @@
}

if ((long) (now - want) >= 0) {
- if (want_wakeup || (num_physpages * buffer_mem.max_percent) < (buffermem >> PAGE_SHIFT) * 100
- || (num_physpages * page_cache.max_percent < page_cache_size * 100)) {
+ if (want_wakeup || buffer_over_max() || pgcache_over_max())
+ {
/* Set the next wake-up time */
next_swap_jiffies = now + swapout_interval;
wake_up(&kswapd_wait);

Andrea[s] Arcangeli
