Re: Default zone_reclaim_mode = 1 on NUMA kernel is bad for file/email/web servers

From: KOSAKI Motohiro
Date: Thu Sep 16 2010 - 06:01:42 EST


Cc'ing linux-mm and the HPC guys, and intentionally quoting in full.


> So over the last couple of weeks, I've noticed that our shiny new IMAP
> servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
> been performing as well as expected, and there were some big oddities.
> Namely two things stuck out:
>
> 1. There was free memory. There's 20T of data on these machines. The
> kernel should have used lots of memory for caching, but for some
> reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G
> 2. The machine has an SSD for very hot data. In total, there's about 16G
> of data on the SSD. Almost all of that 16G of data should end up
> being cached, so there should be little reading from the SSDs at all.
> Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a
> sign that caching wasn't working.
>
> After a bunch of googling, I found this thread.
>
> http://lkml.org/lkml/2009/5/12/586
>
> It appears that patch never went anywhere, and zone_reclaim_mode is
> still defaulting to 1 on our pretty standard file/email/web server type
> machine with a NUMA kernel.
>
> By changing it to 0, we saw an immediate massive change in caching
> behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads
> from the SSD dropped to 100/s instead of 2000/s.
>
> Having very little knowledge of what this actually does, I'd just
> like to point out that from a users point of view, it's really
> annoying for your machine to be crippled by a default kernel setting
> that's pretty obscure.
>
> I don't think our usage scenario of serving lots of files is that
> uncommon, every file server/email server/web server will be doing pretty
> much that and expecting a large part of their memory to be used as a
> cache, which clearly isn't what actually happens.
>
> Rob
> Rob Mueller
> robm@xxxxxxxxxxx
>

Yes, sadly, zone_reclaim_mode ends up turned on by default on Intel
motherboards, and the current zone_reclaim_mode doesn't fit the
file/web server use case ;-)

So, I've created a new proof-of-concept patch. It doesn't disable
zone_reclaim entirely. Instead, it distinguishes file cache allocations
from anon allocations, and only file cache allocations skip zone reclaim.
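
To make the intended behaviour concrete, here is a minimal userspace sketch
of the decision the patch adds to the zonelist scan in mm/page_alloc.c, for
the case where the local zone has already failed its watermark check. The
RECLAIM_* values are the ones the patch defines. Everything prefixed
sketch_/SKETCH_ is hypothetical and exists only for illustration, and the
gfp bit values are placeholders, not the kernel's real ones.

```c
/* zone_reclaim_mode bits, as defined by the patch in include/linux/swap.h. */
#define RECLAIM_OFF   0
#define RECLAIM_ZONE  (1 << 0)  /* Run shrink_inactive_list on the zone */
#define RECLAIM_WRITE (1 << 1)  /* Writeout pages during reclaim */
#define RECLAIM_SWAP  (1 << 2)  /* Swap pages out during reclaim */
#define RECLAIM_CACHE (1 << 3)  /* Zone reclaim for file cache too */

/* Placeholder stand-ins for __GFP_MOVABLE and __GFP_RECLAIMABLE. */
#define SKETCH_GFP_MOVABLE      0x1u
#define SKETCH_GFP_RECLAIMABLE  0x2u
#define SKETCH_GFP_MOVABLE_MASK (SKETCH_GFP_MOVABLE | SKETCH_GFP_RECLAIMABLE)

/* What the allocator does next with a zone that is below its watermark. */
enum sketch_action {
	SKETCH_ZONE_FULL,   /* zone reclaim is off: mark the zone full */
	SKETCH_NEXT_ZONE,   /* file cache: fall through to the next zone */
	SKETCH_RECLAIM      /* everything else: call zone_reclaim() */
};

static enum sketch_action sketch_zone_action(int zone_reclaim_mode,
					     unsigned int gfp_mask)
{
	if (zone_reclaim_mode == RECLAIM_OFF)
		return SKETCH_ZONE_FULL;

	/*
	 * GFP_FILE_CACHE sets both mobility bits, which is how the patch
	 * recognises a file cache allocation. Unless RECLAIM_CACHE is set,
	 * such an allocation skips zone reclaim and tries another zone.
	 */
	if (!(zone_reclaim_mode & RECLAIM_CACHE) &&
	    (gfp_mask & SKETCH_GFP_MOVABLE_MASK) == SKETCH_GFP_MOVABLE_MASK)
		return SKETCH_NEXT_ZONE;

	return SKETCH_RECLAIM;
}
```

So with zone_reclaim_mode = 1, an allocation carrying only the movable bit
still goes through zone reclaim, while a file cache allocation moves on to
another zone. Setting bit 8 brings back the old behaviour for file cache.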

That said, high-end HPC users often turn on cpuset.memory_spread_page and
so avoid this issue. But why don't we consider avoiding it by default?


Rob, I wonder if the following patch helps you. Could you please try it?
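
When testing, the two debugging counters the patch exports through
/proc/vmstat ("zone_cache_avoid" and "zone_reclaim") should show whether
file cache allocations are really bypassing zone reclaim. A minimal
userspace sketch for reading them follows. The helper names are
hypothetical, and it assumes the usual one "name value" pair per line
layout of /proc/vmstat.

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in /proc/vmstat-style text ("name value" per line).
 * Returns 0 and stores the value on success, -1 if the name is absent.
 * The name must match a whole field, so "zone_reclaim" does not match
 * "zone_reclaim_failed".
 */
static int sketch_vmstat_lookup(const char *text, const char *name,
				unsigned long long *value)
{
	size_t name_len = strlen(name);
	const char *line = text;

	while (line && *line) {
		if (strncmp(line, name, name_len) == 0 &&
		    line[name_len] == ' ') {
			if (sscanf(line + name_len + 1, "%llu", value) == 1)
				return 0;
			return -1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/* Read /proc/vmstat into buf (NUL-terminated); returns 0 on success. */
static int sketch_read_vmstat(char *buf, size_t size)
{
	FILE *fp = fopen("/proc/vmstat", "r");
	size_t n;

	if (!fp || size == 0) {
		if (fp)
			fclose(fp);
		return -1;
	}
	n = fread(buf, 1, size - 1, fp);
	buf[n] = '\0';
	fclose(fp);
	return 0;
}
```

Call sketch_read_vmstat() on a buffer of a few tens of kilobytes, then
sketch_vmstat_lookup() for each counter. Sampling before and after a
peak-load period and comparing the two values gives the number of avoided
and performed zone reclaims in that window.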


Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default

---
Need to remove the debugging piece.

Documentation/sysctl/vm.txt | 7 +++----
fs/inode.c | 2 +-
include/linux/gfp.h | 9 +++++++--
include/linux/mmzone.h | 2 ++
include/linux/swap.h | 6 ++++++
mm/filemap.c | 1 +
mm/page_alloc.c | 8 +++++++-
mm/vmscan.c | 7 ++-----
mm/vmstat.c | 2 ++
9 files changed, 31 insertions(+), 13 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..4be569e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -671,16 +671,15 @@ This is value ORed together of
1 = Zone reclaim on
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
+8 = Zone reclaim for file cache on

zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
-data locality.
+By default, file cache allocations don't use zone reclaim, but it can
+be turned on manually.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
diff --git a/fs/inode.c b/fs/inode.c
index 8646433..02a51b1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -166,7 +166,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
- mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
+ mapping_set_gfp_mask(mapping, GFP_FILE_CACHE);
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 975609c..f263b1f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -84,6 +84,10 @@ struct vm_area_struct;
#define GFP_HIGHUSER_MOVABLE (__GFP_WAIT | __GFP_IO | __GFP_FS | \
__GFP_HARDWALL | __GFP_HIGHMEM | \
__GFP_MOVABLE)
+
+#define GFP_FILE_CACHE (GFP_HIGHUSER | __GFP_RECLAIMABLE | __GFP_MOVABLE)
+
+
#define GFP_IOFS (__GFP_IO | __GFP_FS)

#ifdef CONFIG_NUMA
@@ -120,11 +124,12 @@ struct vm_area_struct;
/* Convert GFP flags to their corresponding migrate type */
static inline int allocflags_to_migratetype(gfp_t gfp_flags)
{
- WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
-
if (unlikely(page_group_by_mobility_disabled))
return MIGRATE_UNMOVABLE;

+ if ((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK)
+ gfp_flags &= ~__GFP_RECLAIMABLE;
+
/* Group based on mobility */
return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
((gfp_flags & __GFP_RECLAIMABLE) != 0);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e6e626..2eead52 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -112,6 +112,8 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
+ NR_ZONE_CACHE_AVOID,
+ NR_ZONE_RECLAIM,
NR_VM_ZONE_STAT_ITEMS };

/*
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fee51a..487bc3b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -65,6 +65,12 @@ static inline int current_is_kswapd(void)
#define MAX_SWAPFILES \
((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

+#define RECLAIM_OFF 0
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
+#define RECLAIM_CACHE (1<<3) /* Reclaim even for file cache allocations */
+
/*
* Magic header for a swap area. The first part of the union is
* what the swap magic looks like for the old (limited to 128MB)
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..97298c0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -468,6 +468,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
if (cpuset_do_page_mem_spread()) {
get_mems_allowed();
n = cpuset_mem_spread_node();
+ gfp &= ~__GFP_RECLAIMABLE;
page = alloc_pages_exact_node(n, gfp, 0);
put_mems_allowed();
return page;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8587c10..f81c28f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1646,9 +1646,15 @@ zonelist_scan:
classzone_idx, alloc_flags))
goto try_this_zone;

- if (zone_reclaim_mode == 0)
+ if (zone_reclaim_mode == RECLAIM_OFF)
goto this_zone_full;

+ if (!(zone_reclaim_mode & RECLAIM_CACHE) &&
+ (gfp_mask & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK) {
+ inc_zone_state(zone, NR_ZONE_CACHE_AVOID);
+ goto try_next_zone;
+ }
+
ret = zone_reclaim(zone, gfp_mask, order);
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c391c32..6f63eea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2558,11 +2558,6 @@ module_init(kswapd_init)
*/
int zone_reclaim_mode __read_mostly;

-#define RECLAIM_OFF 0
-#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
-#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
-#define RECLAIM_SWAP (1<<2) /* Swap pages out during reclaim */
-
/*
* Priority for ZONE_RECLAIM. This determines the fraction of pages
* of a node considered for each zone_reclaim. 4 scans 1/16th of
@@ -2646,6 +2641,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
};
unsigned long nr_slab_pages0, nr_slab_pages1;

+ inc_zone_state(zone, NR_ZONE_RECLAIM);
+
cond_resched();
/*
* We need to be able to allocate from the reserves for RECLAIM_SWAP
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f389168..8988688 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -740,6 +740,8 @@ static const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
+ "zone_cache_avoid",
+ "zone_reclaim",

#ifdef CONFIG_VM_EVENT_COUNTERS
"pgpgin",
--
1.6.5.2



