Re: [PATCH 0/13] Parallel struct page initialisation v4

From: Waiman Long
Date: Mon May 04 2015 - 23:32:55 EST


On 05/04/2015 05:30 PM, Andrew Morton wrote:
> On Fri, 01 May 2015 20:09:21 -0400 Waiman Long <waiman.long@xxxxxx> wrote:
>
>> On 05/01/2015 06:02 PM, Waiman Long wrote:
>>> Bad news!
>>>
>>> I tried your patch on a 24-TB DragonHawk and got an out-of-memory
>>> panic. The kernel log messages were:
>>> ...
>>>
>>> [ 81.360287] [<ffffffff8151b0c9>] dump_stack+0x68/0x77
>>> [ 81.365942] [<ffffffff8151ae1e>] panic+0xb9/0x219
>>> [ 81.371213] [<ffffffff810785c3>] ? __blocking_notifier_call_chain+0x63/0x80
>>> [ 81.378971] [<ffffffff811384ce>] __out_of_memory+0x34e/0x350
>>> [ 81.385292] [<ffffffff811385ee>] out_of_memory+0x5e/0x90
>>> [ 81.391230] [<ffffffff8113ce9e>] __alloc_pages_slowpath+0x6be/0x740
>>> [ 81.398219] [<ffffffff8113d15c>] __alloc_pages_nodemask+0x23c/0x250
>>> [ 81.405212] [<ffffffff81186346>] kmem_getpages+0x56/0x110
>>> [ 81.411246] [<ffffffff81187f44>] fallback_alloc+0x164/0x200
>>> [ 81.417474] [<ffffffff81187cfd>] ____cache_alloc_node+0x8d/0x170
>>> [ 81.424179] [<ffffffff811887bb>] kmem_cache_alloc_trace+0x17b/0x240
>>> [ 81.431169] [<ffffffff813d5f3a>] init_memory_block+0x3a/0x110
>>> [ 81.437586] [<ffffffff81b5f687>] memory_dev_init+0xd7/0x13d
>>> [ 81.443810] [<ffffffff81b5f2af>] driver_init+0x2f/0x37
>>> [ 81.449556] [<ffffffff81b1599b>] do_basic_setup+0x29/0xd5
>>> [ 81.455597] [<ffffffff81b372c4>] ? sched_init_smp+0x140/0x147
>>> [ 81.462015] [<ffffffff81b15c55>] kernel_init_freeable+0x20e/0x297
>>> [ 81.468815] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
>>> [ 81.474565] [<ffffffff81512ea9>] kernel_init+0x9/0xf0
>>> [ 81.480216] [<ffffffff8151f788>] ret_from_fork+0x58/0x90
>>> [ 81.486156] [<ffffffff81512ea0>] ? rest_init+0x80/0x80
>>> [ 81.492350] ---[ end Kernel panic - not syncing: Out of memory and no killable processes...
>>>
>>> -Longman
>> I increased the pre-initialized memory per node in update_defer_init()
>> of mm/page_alloc.c from 2G to 4G. Now I am able to boot the 24-TB
>> machine without error. The 12-TB machine has 0.75TB/node, while the 24-TB
>> machine has 1.5TB/node. I would suggest pre-initializing something like
>> 1G per 0.25TB/node, so that it scales properly with the memory size.
> We're using more than 2G before we've even completed do_basic_setup()?
> Where did it all go?

I think most of it may have been used in the allocation of the hash tables, like:

[ 2.367440] Dentry cache hash table entries: 2147483648 (order: 22, 17179869184 bytes)
[ 11.522768] Inode-cache hash table entries: 2147483648 (order: 22, 17179869184 bytes)
[ 18.598513] Mount-cache hash table entries: 67108864 (order: 17, 536870912 bytes)
[ 18.667485] Mountpoint-cache hash table entries: 67108864 (order: 17, 536870912 bytes)

The sizes of those hash tables scale roughly linearly with the total amount of memory available.

>> Before the patch, the boot time from elilo prompt to ssh login was 694s.
>> After the patch, the boot-up time was 346s, a saving of 348s (about 50%).
> Having to guesstimate the amount of memory which is needed for a
> successful boot will be painful. Any number we choose will be wrong
> 99% of the time.
>
> If the kswapd threads have started, all we need to do is wait: take
> a little nap in the allocator's page==NULL slowpath.
>
> I'm not seeing any reason why we can't start kswapd much earlier -
> right at the start of do_basic_setup()?

I think we can; we just have to change the hash table allocator to support that.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/