[RFC] [PATCH] Include LRU in page count

From: Daniel Phillips (phillips@arcor.de)
Date: Fri Aug 30 2002 - 18:03:07 EST


The executive summary is 'Houston, we have a bug'. The long version of this
is that all current linux kernels are now thought to suffer from a rarely
occurring page double-free bug that results from a race between shrink_cache
and page_cache_release, in which both attempt to remove the same page from
the lru and free it at the same time.

Andrew's succinct description of the race is:

        if (put_page_testzero(page)) {
                /* window here */
                lru_cache_del(page);
                __free_pages_ok(page, 0);
        }

versus:

        spin_lock(lru lock);
        page = list_entry(lru, ...);
        if (page_count(page) == 0)
                continue;
        /* window here */
        page_cache_get(page);
        page_cache_release(page); /* double-free */

This rare race happened to become not so rare in 2.5 recently, and was
analyzed by Christian Ehrhardt, who also proposed a solution based on a new
approach to locking, essentially put_page_testone. We went on to check 2.4
for the race, and while it's quite difficult to prove the race exists there,
after a year or two of plugging holes, it's even more difficult to prove it's
not there. In fact, sifting through recent kernel archives turns up
unexplained bug reports which fit the description of this race, for example:

   http://marc.theaimsgroup.com/?l=linux-xfs&m=103055330831850&w=2
   (2.4.18-xfs (xfs related?) oops report)

I proposed an alternate solution using the traditional put_page_testzero
primitive, which relies on counting membership on the lru list as one
additional page reference. A slightly racy heuristic is used for efficient
lru list removal. The resulting incarnation of page_cache_release is:

static inline void page_cache_release(struct page *page)
{
        if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
                if (PageLRU(page) && page_count(page) == 2)
                        __lru_cache_del(page);
                spin_unlock(&pagemap_lru_lock);
        }
        put_page(page);
}

which permits the following benign race, starting with page count of 3:

        holder A                                holder B

        if (page_count(page) == 2)  /* false */
                                                if (page_count(page) == 2)  /* false */
        put_page(page);
                                                put_page(page);

Neither holder of a page reference sees the count at 2, and so the page
is left on the lru with count = 1. This won't happen often (I haven't seen
it happen yet) and such pages will be recovered from the cold end of the list
in due course. There are a number of other variations of this race, but they
all have the same effect and are all rare.
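
For illustration only (my sketch, not a trace), here is one such variation,
starting with a page count of 2 (our reference plus the lru's) while
shrink_cache happens to be holding the lru lock:

        if (page_count(page) == 2                   /* true */
            && spin_trylock(&pagemap_lru_lock))     /* fails: shrink_cache holds the lock */
                ...
        put_page(page);                             /* page left on the lru with count = 1 */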

The important question is: can this code ever remove a page from the lru
erroneously, leaving somebody holding a reference to a non-lru page? In
other words, can the test PageLRU(page) && page_count(page) == 2 return a
false positive? Well, when this test is true we can account for both
references: the one we own, and the one the lru list owns. Since we hold
the lru lock, the latter won't change. Nobody else has the right to
increment the page count, since they would have to inherit that right from
somebody who holds a reference, and there are no other reference holders.

The spin_trylock is efficient - no contention ever, just some cache
pingponging at worst - and permits page_cache_release to be used even by a
process that holds the lru lock.
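
For example (a hypothetical caller, not code from the patch), a path that
already holds the lru lock can call page_cache_release without deadlocking;
the trylock simply fails, and if that leaves an orphan on the lru it will be
picked up later by shrink_cache:

        spin_lock(&pagemap_lru_lock);
        ...
        page_cache_release(page);       /* spin_trylock fails, put_page still runs */
        ...
        spin_unlock(&pagemap_lru_lock);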

The attached patch against 2.4.19 implements these ideas, along with some
proc statistics to help verify that the intended effect is being achieved.
The patch supports both Marcelo's current design (hopefully with no semantic
changes at all) and the new lru counting strategy, selected by setting
LRU_PLUS_CACHE to one for the existing strategy or two for the new one.
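
Both knobs live at the top of include/linux/pagemap.h; the two lines below
are quoted from the patch, with my comments added:

        #define LRU_PLUS_CACHE 2        /* 1: existing strategy, 2: lru holds a page reference */
        #define LRU_DEL_HEURISTIC       /* enables the trylock heuristic in page_cache_release */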

With the patch, there remain only three instances of __lru_cache_del: the one
in page_cache_release, the one in shrink_cache and the one in truncate. The
latter looks suspicious since it's not clear why anonymous pages can't be
allowed to appear on the lru. However, that is not the focus of the current
effort. The atomic tests for lru membership appear to be obsoleted by the
new regimen, since we are protected in every case by the lru lock, but I did
not disturb them.

Due to laziness, the correctness of the swapcache logic in swapfile.c has not
yet been verified. I've done some light testing on my 2GB, 2x1GHz PIII box,
using the following pair of scripts inherited from akpm:

file: doitlots
-----------------------
#!/bin/sh

doit()
{
        ( cat $1 | wc -l )
}
        
count=0
        
while [ $count != 500 ]
do
        doit doitlots > /dev/null

        count=$(expr $count + 1)
done
echo done
-----------------------

file: forklots
-----------------------
echo >foocount
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &
./doitlots >>foocount &

count=0

while [ $count != 10 ]
do
        count=$(wc foocount | cut -b -8)
done
-----------------------

The patch computes the following statistics and makes them available via
/proc/mmstats:

   page_cache_release ... Total calls to page_cache_release
   page_cache_release_test_lru ... How often the lru heuristic triggers
   page_cache_release_del_lru ... How often the heuristic is correct
   shrink_cache_del_lru ... How often an orphan is left on the lru
   put_page_testzero ... Total page releases
   lru_cache_del ... Total lru deletes
   truncate_lru_cache_del ... Total lru deletes during truncate
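
For reference, the /proc names above map positionally onto the fields of
struct mmstats in the patch; the comments are my annotation of that mapping:

        extern struct mmstats {
                unsigned release;                  /* page_cache_release */
                unsigned release_test_page;        /* page_cache_release_test_lru */
                unsigned release_free_page;        /* page_cache_release_del_lru */
                unsigned vmscan_free_page;         /* shrink_cache_del_lru */
                unsigned testzero;                 /* put_page_testzero */
                unsigned lru_cache_del;            /* lru_cache_del */
                unsigned truncate_lru_cache_del;   /* truncate_lru_cache_del */
        } mmstats;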

Now the benchmark results...

LRU_PLUS_CACHE = 2, with LRU_DEL_HEURISTIC:

   time sh forklots
   24.81user 28.15system 0:26.50elapsed 199%CPU

   cat /proc/mmstats
   page_cache_release 8555807
   page_cache_release_test_lru 771550
   page_cache_release_del_lru 771535
   shrink_cache_del_lru 0
   put_page_testzero 9209139
   lru_cache_del 771547
   truncate_lru_cache_del 12

According to the statistics, we took the lru lock to check whether the page
was really on the lru about 9% of the time. In just .002% (15) of those
cases the heuristic was wrong: the lock was taken but no page was removed
from the lru. No orphaned pages were ever found on the lru. In this case,
all 771,547 lru_cache_dels are accounted for: the 771,535 removed via the
page_cache_release heuristic, plus the 12 that were converted to anonymous
pages via truncate.

LRU_PLUS_CACHE = 2, without LRU_DEL_HEURISTIC:

   time sh forklots
   25.45user 28.70system 0:27.28elapsed 198%CPU

   cat /proc/mmstats
   page_cache_release 8611106
   page_cache_release_test_lru 0
   page_cache_release_del_lru 0
   shrink_cache_del_lru 614573
   put_page_testzero 9832924
   lru_cache_del 11
   truncate_lru_cache_del 11

Omitting the heuristic forces shrink_cache to pick up the orphaned lru pages,
which costs about 3% in terms of wall clock time on this test. At least this
exercises the orphan recovery path in shrink_cache.

LRU_PLUS_CACHE = 2, with LRU_DEL_HEURISTIC, no statistics:

   time sh forklots
   24.74user 28.02system 0:26.39elapsed 199%CPU

The statistics cost about .5%, which is basically noise.

Vanilla 2.4.19:

   time sh forklots
   24.15user 28.44system 0:26.31elapsed 199%CPU

Vanilla 2.4.19 may or may not have an advantage of .3%. On the other hand,
removing the atomic test for lru membership probably gives a slight advantage
to the new version.

The patch is against 2.4.19, apply with patch -p1:

--- 2.4.19.clean/fs/proc/proc_misc.c Fri Aug 2 20:39:45 2002
+++ 2.4.19/fs/proc/proc_misc.c Fri Aug 30 21:32:20 2002
@@ -539,6 +539,28 @@
                 entry->proc_fops = f;
 }
 
+struct mmstats mmstats;
+
+char *mmnames[] = {
+ "page_cache_release",
+ "page_cache_release_test_lru",
+ "page_cache_release_del_lru",
+ "shrink_cache_del_lru",
+ "put_page_testzero",
+ "lru_cache_del",
+ "truncate_lru_cache_del",
+ NULL,
+};
+
+static int read_mmstats(char *page, char **start, off_t off, int count, int *eof, void *data)
+{
+ char **names = mmnames;
+ unsigned *stats = (unsigned *) &mmstats, len = 0;
+ for (; *names; *names++, *stats++) if (strlen(*names))
+ len += sprintf(page + len, "%s %u\n", *names, *stats);
+ return proc_calc_metrics(page, start, off, count, eof, len);
+}
+
 void __init proc_misc_init(void)
 {
         struct proc_dir_entry *entry;
@@ -546,6 +568,7 @@
                 char *name;
                 int (*read_proc)(char*,char**,off_t,int,int*,void*);
         } *p, simple_ones[] = {
+ {"mmstats", read_mmstats},
                 {"loadavg", loadavg_read_proc},
                 {"uptime", uptime_read_proc},
                 {"meminfo", meminfo_read_proc},
--- 2.4.19.clean/include/linux/mm.h Fri Aug 2 20:39:45 2002
+++ 2.4.19/include/linux/mm.h Fri Aug 30 21:41:54 2002
@@ -35,6 +35,20 @@
  * mmap() functions).
  */
 
+extern struct mmstats {
+ unsigned release;
+ unsigned release_test_page;
+ unsigned release_free_page;
+ unsigned vmscan_free_page;
+ unsigned testzero;
+ unsigned lru_cache_del;
+ unsigned truncate_lru_cache_del;
+} mmstats;
+
+extern char *mmnames[];
+
+#define mmstat(field) mmstats.field++
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task. A VM area is any part of the process virtual memory
@@ -180,6 +194,15 @@
 } mem_map_t;
 
 /*
+ * There is only one 'core' page-freeing function.
+ */
+extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
+extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
+
+#define __free_page(page) __free_pages((page), 0)
+#define free_page(addr) free_pages((addr),0)
+
+/*
  * Methods to modify the page usage count.
  *
  * What counts for a page usage:
@@ -191,12 +214,34 @@
  * Also, many kernel routines increase the page count before a critical
  * routine so they can be sure the page doesn't go away from under them.
  */
-#define get_page(p) atomic_inc(&(p)->count)
-#define put_page(p) __free_page(p)
-#define put_page_testzero(p) atomic_dec_and_test(&(p)->count)
 #define page_count(p) atomic_read(&(p)->count)
 #define set_page_count(p,v) atomic_set(&(p)->count, v)
 
+extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
+
+static inline void get_page(struct page *page)
+{
+ atomic_inc(&page->count);
+}
+
+static inline void put_page_nofree(struct page *page)
+{
+ BUG_ON(!page_count(page));
+ atomic_dec(&page->count);
+}
+
+static inline int put_page_testzero(struct page *page)
+{
+ mmstat(testzero);
+ BUG_ON(!page_count(page));
+ return atomic_dec_and_test(&page->count);
+}
+
+static inline void put_page(struct page *page)
+{
+ __free_page(page);
+}
+
 /*
  * Various page->flags bits:
  *
@@ -451,15 +496,6 @@
  */
 #define get_free_page get_zeroed_page
 
-/*
- * There is only one 'core' page-freeing function.
- */
-extern void FASTCALL(__free_pages(struct page *page, unsigned int order));
-extern void FASTCALL(free_pages(unsigned long addr, unsigned int order));
-
-#define __free_page(page) __free_pages((page), 0)
-#define free_page(addr) free_pages((addr),0)
-
 extern void show_free_areas(void);
 extern void show_free_areas_node(pg_data_t *pgdat);
 
@@ -518,11 +554,6 @@
 
 extern struct address_space swapper_space;
 #define PageSwapCache(page) ((page)->mapping == &swapper_space)
-
-static inline int is_page_cache_freeable(struct page * page)
-{
- return page_count(page) - !!page->buffers == 1;
-}
 
 extern int can_share_swap_page(struct page *);
 extern int remove_exclusive_swap_page(struct page *);
--- 2.4.19.clean/include/linux/pagemap.h Mon Feb 25 14:38:13 2002
+++ 2.4.19/include/linux/pagemap.h Fri Aug 30 22:17:21 2002
@@ -27,9 +27,38 @@
 #define PAGE_CACHE_SIZE PAGE_SIZE
 #define PAGE_CACHE_MASK PAGE_MASK
 #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK)
+#define LRU_PLUS_CACHE 2
+#define LRU_DEL_HEURISTIC
 
 #define page_cache_get(x) get_page(x)
-#define page_cache_release(x) __free_page(x)
+
+static inline void page_cache_release(struct page *page)
+{
+ mmstat(release);
+#if LRU_PLUS_CACHE==2
+#ifdef LRU_DEL_HEURISTIC
+ if (page_count(page) == 2 && spin_trylock(&pagemap_lru_lock)) {
+ mmstat(release_test_page);
+ if (PageLRU(page) && page_count(page) == 2) {
+ mmstat(release_free_page);
+ __lru_cache_del(page);
+ }
+ spin_unlock(&pagemap_lru_lock);
+ }
+#endif
+#endif
+ put_page(page);
+}
+
+static inline int has_buffers(struct page *page)
+{
+ return !!page->buffers;
+}
+
+static inline int is_page_cache_freeable(struct page *page)
+{
+ return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE;
+}
 
 static inline struct page *page_cache_alloc(struct address_space *x)
 {
--- 2.4.19.clean/mm/filemap.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/filemap.c Fri Aug 30 21:40:33 2002
@@ -198,10 +198,12 @@
                 if (page->buffers && !try_to_free_buffers(page, 0))
                         goto unlock;
 
- if (page_count(page) != 1)
+ if (page_count(page) != LRU_PLUS_CACHE)
                         goto unlock;
 
+#if LRU_PLUS_CACHE==1
                 __lru_cache_del(page);
+#endif
                 __remove_inode_page(page);
                 UnlockPage(page);
                 page_cache_release(page);
@@ -234,8 +236,10 @@
 static void truncate_complete_page(struct page *page)
 {
         /* Leave it on the LRU if it gets converted into anonymous buffers */
- if (!page->buffers || do_flushpage(page, 0))
+ if (!page->buffers || do_flushpage(page, 0)) {
+ mmstat(truncate_lru_cache_del);
                 lru_cache_del(page);
+ }
 
         /*
          * We remove the page from the page cache _after_ we have
@@ -345,7 +349,7 @@
          * The page is locked and we hold the pagecache_lock as well
          * so both page_count(page) and page->buffers stays constant here.
          */
- if (page_count(page) == 1 + !!page->buffers) {
+ if (is_page_cache_freeable(page)) {
                 /* Restart after this page */
                 list_del(head);
                 list_add_tail(head, curr);
--- 2.4.19.clean/mm/page_alloc.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/page_alloc.c Fri Aug 30 21:55:03 2002
@@ -82,9 +82,12 @@
         /* Yes, think what happens when other parts of the kernel take
          * a reference to a page in order to pin it for io. -ben
          */
+#if LRU_PLUS_CACHE==2
+ BUG_ON(PageLRU(page));
+#else
         if (PageLRU(page))
                 lru_cache_del(page);
-
+#endif
         if (page->buffers)
                 BUG();
         if (page->mapping)
--- 2.4.19.clean/mm/swap.c Wed Nov 7 01:44:20 2001
+++ 2.4.19/mm/swap.c Fri Aug 30 21:32:20 2002
@@ -60,6 +60,9 @@
         if (!TestSetPageLRU(page)) {
                 spin_lock(&pagemap_lru_lock);
                 add_page_to_inactive_list(page);
+#if LRU_PLUS_CACHE==2
+ get_page(page);
+#endif
                 spin_unlock(&pagemap_lru_lock);
         }
 }
@@ -79,6 +82,10 @@
                 } else {
                         del_page_from_inactive_list(page);
                 }
+ mmstat(lru_cache_del);
+#if LRU_PLUS_CACHE==2
+ put_page_nofree(page);
+#endif
         }
 }
 
--- 2.4.19.clean/mm/swapfile.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/swapfile.c Fri Aug 30 21:32:20 2002
@@ -239,7 +239,7 @@
                 if (p->swap_map[SWP_OFFSET(entry)] == 1) {
                         /* Recheck the page count with the pagecache lock held.. */
                         spin_lock(&pagecache_lock);
- if (page_count(page) - !!page->buffers == 2)
+ if (page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1)
                                 retval = 1;
                         spin_unlock(&pagecache_lock);
                 }
@@ -263,16 +263,16 @@
         if (!PageLocked(page))
                 BUG();
         switch (page_count(page)) {
- case 3:
- if (!page->buffers)
+ case LRU_PLUS_CACHE + 2:
+ if (!has_buffers(page))
                         break;
                 /* Fallthrough */
- case 2:
+ case LRU_PLUS_CACHE + 1:
                 if (!PageSwapCache(page))
                         break;
                 retval = exclusive_swap_page(page);
                 break;
- case 1:
+ case LRU_PLUS_CACHE:
                 if (PageReserved(page))
                         break;
                 retval = 1;
@@ -280,6 +280,11 @@
         return retval;
 }
 
+static inline int only_cached(struct page *page)
+{
+ return page_count(page) - has_buffers(page) == LRU_PLUS_CACHE + 1;
+}
+
 /*
  * Work out if there are any other processes sharing this
  * swap cache page. Free it if you can. Return success.
@@ -294,7 +299,7 @@
                 BUG();
         if (!PageSwapCache(page))
                 return 0;
- if (page_count(page) - !!page->buffers != 2) /* 2: us + cache */
+ if (!only_cached(page))
                 return 0;
 
         entry.val = page->index;
@@ -307,7 +312,7 @@
         if (p->swap_map[SWP_OFFSET(entry)] == 1) {
                 /* Recheck the page count with the pagecache lock held.. */
                 spin_lock(&pagecache_lock);
- if (page_count(page) - !!page->buffers == 2) {
+ if (only_cached(page)) {
                         __delete_from_swap_cache(page);
                         SetPageDirty(page);
                         retval = 1;
@@ -343,7 +348,7 @@
         if (page) {
                 page_cache_get(page);
                 /* Only cache user (+us), or swap space full? Free it! */
- if (page_count(page) - !!page->buffers == 2 || vm_swap_full()) {
+ if (only_cached(page) || vm_swap_full()) {
                         delete_from_swap_cache(page);
                         SetPageDirty(page);
                 }
--- 2.4.19.clean/mm/vmscan.c Fri Aug 2 20:39:46 2002
+++ 2.4.19/mm/vmscan.c Fri Aug 30 21:40:30 2002
@@ -89,7 +89,7 @@
                 mm->rss--;
                 UnlockPage(page);
                 {
- int freeable = page_count(page) - !!page->buffers <= 2;
+ int freeable = page_count(page) - has_buffers(page) <= LRU_PLUS_CACHE + 1;
                         page_cache_release(page);
                         return freeable;
                 }
@@ -356,20 +356,30 @@
                 BUG_ON(PageActive(page));
 
                 list_del(entry);
+#if LRU_PLUS_CACHE==2
+ BUG_ON(!page_count(page));
+ if (unlikely(page_count(page) == 1)) {
+ mmstat(vmscan_free_page);
+ BUG_ON(!TestClearPageLRU(page)); // side effect abuse!!
+ put_page(page);
+ continue;
+ }
+#endif
                 list_add(entry, &inactive_list);
 
+#if LRU_PLUS_CACHE==1
                 /*
                  * Zero page counts can happen because we unlink the pages
                  * _after_ decrementing the usage count..
                  */
                 if (unlikely(!page_count(page)))
                         continue;
-
+#endif
                 if (!memclass(page_zone(page), classzone))
                         continue;
 
                 /* Racy check to avoid trylocking when not worthwhile */
- if (!page->buffers && (page_count(page) != 1 || !page->mapping))
+ if (!page->buffers && (page_count(page) != LRU_PLUS_CACHE || !page->mapping))
                         goto page_mapped;
 
                 /*
@@ -434,9 +444,9 @@
                                          */
                                         spin_lock(&pagemap_lru_lock);
                                         UnlockPage(page);
+#if LRU_PLUS_CACHE==1
                                         __lru_cache_del(page);
-
- /* effectively free the page here */
+#endif
                                         page_cache_release(page);
 
                                         if (--nr_pages)
@@ -505,10 +515,10 @@
                         swap_free(swap);
                 }
 
+#if LRU_PLUS_CACHE==1
                 __lru_cache_del(page);
+#endif
                 UnlockPage(page);
-
- /* effectively free the page here */
                 page_cache_release(page);
 
                 if (--nr_pages)