Re: [RFC PATCH 0/14] Parallel memory initialisation

From: Mel Gorman
Date: Wed Apr 15 2015 - 11:44:36 EST


On Wed, Apr 15, 2015 at 10:50:45AM -0400, Waiman Long wrote:
> On 04/15/2015 09:38 AM, Mel Gorman wrote:
> >On Wed, Apr 15, 2015 at 09:15:50AM -0400, Waiman Long wrote:
> >>><SNIP>
> >>>Patches are against 4.0-rc7.
> >>>
> >>> Documentation/kernel-parameters.txt | 8 +
> >>> arch/ia64/mm/numa.c | 19 +-
> >>> arch/x86/Kconfig | 2 +
> >>> include/linux/memblock.h | 18 ++
> >>> include/linux/mm.h | 8 +-
> >>> include/linux/mmzone.h | 37 +++-
> >>> init/main.c | 1 +
> >>> mm/Kconfig | 29 +++
> >>> mm/bootmem.c | 6 +-
> >>> mm/internal.h | 23 ++-
> >>> mm/memblock.c | 34 ++-
> >>> mm/mm_init.c | 9 +-
> >>> mm/nobootmem.c | 7 +-
> >>> mm/page_alloc.c | 398 +++++++++++++++++++++++++++++++-----
> >>> mm/vmscan.c | 6 +-
> >>> 15 files changed, 507 insertions(+), 98 deletions(-)
> >>>
> >>I had included your patch with the 4.0 kernel and booted up a
> >>16-socket 12-TB machine. I measured the elapsed time from the elilo
> >>prompt to the availability of ssh login. Without the patch, the
> >>bootup time was 404s. It was reduced to 298s with the patch. So
> >>there was about 100s reduction in bootup time (1/4 of the total).
> >>
> >Cool, thanks for testing. Would you be able to state if this is really
> >important or not? Does booting 100s faster on a 12TB machine really
> >matter? I can then add that justification to the changelog to avoid a
> >conversation with Andrew that goes something like
> >
> >Andrew: Why are we doing this?
> >Mel: Because we can and apparently people might want it.
> >Andrew: What's the maintenance cost of this?
> >Mel: Magic beans
> >
> >I prefer talking to Andrew when it's harder to predict what he'll say.
>
> Booting 100s faster is certainly something that is nice to have.
> Right now, more time is spent in the firmware POST portion of the
> bootup process than in the OS boot.

I'm not surprised. On two different 1TB machines, I've seen a POST time
of 2 minutes on one and 35 minutes on the other. No idea what it's doing
for 35 minutes... plotting world domination probably.

> So I would say this patch isn't
> really critical right now as machines with that much memory are
> relatively rare. However, if we look forward to the near future,
> some new memory technology like persistent memory is coming and
> machines with large amounts of memory (whether persistent or not)
> will become more common. This patch will certainly be useful if we
> look forward into the future.
>

Whether persistent memory needs struct pages or not is up in the air and
I'm not getting stuck in that can of worms. 100 seconds off kernel init
time is a starting point. I can try pushing it on that basis but I
really would like to see SGI and Intel people also chime in on how it
affects their really large machines.

> >>However, there were 2 bootup problems in the dmesg log that needed
> >>to be addressed.
> >>1. There were 2 vmalloc allocation failures:
> >>[ 2.284686] vmalloc: allocation failure, allocated 16578404352 of 17179873280 bytes
> >>[ 10.399938] vmalloc: allocation failure, allocated 7970922496 of 8589938688 bytes
> >>
> >>2. There were 2 soft lockup warnings:
> >>[ 57.319453] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [swapper/0:1]
> >>[ 85.409263] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/0:1]
> >>
> >>Once those problems are fixed, the patch should be in a pretty good
> >>shape. I have attached the dmesg log for your reference.
> >>
> >The obvious conclusion is that initialising 1G per node is not enough for
> >really large machines. Can you try this on top? It's untested but should
> >work. The low value was chosen because it happened to work and I wanted
> >to get test coverage on common hardware but broke is broke.
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index f2c96d02662f..6b3bec304e35 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -276,9 +276,9 @@ static inline bool update_defer_init(pg_data_t *pgdat,
> > if (pgdat->first_deferred_pfn != ULONG_MAX)
> > return false;
> >
> >- /* Initialise at least 1G per zone */
> >+ /* Initialise at least 32G per node */
> > (*nr_initialised)++;
> >- if (*nr_initialised > (1UL << (30 - PAGE_SHIFT)) &&
> >+ if (*nr_initialised > (32UL << (30 - PAGE_SHIFT)) &&
> > (pfn & (PAGES_PER_SECTION - 1)) == 0) {
> > pgdat->first_deferred_pfn = pfn;
> > return false;
>
> I will try this out when I can get hold of the 12-TB machine again.
>

Thanks.
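
As a rough check of what that hunk changes, the threshold arithmetic can
be sketched in userspace. This is only an illustration, not the kernel
code: PAGE_SHIFT of 12 (4K pages) and PAGES_PER_SECTION of 32768 (128M
sections, as on x86-64) are assumptions, and both helper names are made up.

```c
/* Assumption: 4K pages. The kernel takes PAGE_SHIFT from arch headers. */
#define PAGE_SHIFT 12
/* Assumption: 128M memory sections, i.e. 32768 pages of 4K each. */
#define PAGES_PER_SECTION 32768UL

/* Hypothetical helper: pages that must be initialised early for a limit in GiB. */
unsigned long defer_threshold_pages(unsigned long gib)
{
	return gib << (30 - PAGE_SHIFT);
}

/*
 * Hypothetical helper mirroring the condition in the hunk: deferral starts
 * once more than 'threshold' pages are initialised and 'pfn' sits on a
 * section boundary.
 */
int should_defer(unsigned long nr_initialised, unsigned long pfn,
		 unsigned long threshold)
{
	return nr_initialised > threshold &&
	       (pfn & (PAGES_PER_SECTION - 1)) == 0;
}
```

Under those assumptions the old limit is 262144 pages and the new one is
8388608 pages, so each node initialises 32 times as much memory before
anything is deferred.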

> The vmalloc allocation failures were for the following hash tables:
> - Dentry cache hash table entries
> - Inode-cache hash table entries
>
> Those hash tables scale linearly with the amount of memory available
> in the system. So instead of hardcoding a certain value, why don't
> we make it a certain % of the total memory but bottomed out to 1G at
> the low end?
>

Because then the question becomes what percentage is the right percentage,
and what happens if it's a percentage of total memory but the NUMA nodes
are not all the same size? I want to start simple until there is more data
on what these really large machines look like. If it ever fails in the
field, there is the command-line switch until a patch is available.
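
To make the uneven-node concern concrete, here is a minimal sketch of one
possible percentage policy. Nothing here is in the series: the policy
itself, the 1% figure used below and both function names are hypothetical.

```c
#define GIB (1ULL << 30)

/* Hypothetical policy: 'pct' percent of total memory, never below 1G. */
unsigned long long pct_threshold_bytes(unsigned long long total_bytes,
				       unsigned int pct)
{
	unsigned long long t = total_bytes / 100 * pct;

	return t < GIB ? GIB : t;
}

/*
 * Returns 1 if a node of 'node_bytes' fits entirely under the threshold,
 * meaning none of its memory would be deferred.
 */
int node_fully_initialised(unsigned long long node_bytes,
			   unsigned long long threshold)
{
	return node_bytes <= threshold;
}
```

With 12TB total and 1%, the threshold is a little under 123G. A 64G node
would then be initialised entirely up front while a 1TB node defers most of
its memory, so a single global percentage behaves very differently from
node to node.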

--
Mel Gorman
SUSE Labs