Re: [PATCH v3 01/11] mm/memremap: Get rid of memmap_init_zone_device()

From: David Hildenbrand
Date: Thu Aug 29 2019 - 12:55:53 EST


On 29.08.19 18:39, Alexander Duyck wrote:
> On Thu, Aug 29, 2019 at 12:00 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> As far as I can see, the original split wanted to speed up a duplicate
>> initialization. We now only initialize once - and we really should
>> initialize under the lock, when resizing the zones.
>
> What do you mean by duplicate initialization? Basically the issue was
> that we can have systems with massive memory footprints and I was just
> trying to get the initialization time under a minute. The compromise I
> made was to split the initialization so that we only initialized the
> pages in the altmap and defer the rest so that they can be initialized
> in parallel.
>
> What this patch does is serialize the initialization so it will likely
> take 2 to 4 minutes or more to initialize memory on a system where I
> had brought the init time under about 30 seconds.
>
>> As soon as we drop the lock we might have memory unplug running, trying
>> to shrink the zone again. In case the memmap was not properly initialized,
>> the zone/node shrinking code might get false negatives when searching for
>> the new zone/node boundaries - bad. We could suddenly no longer span the
>> memory we just added.
>
> The issue as I see it is that we can have systems with multiple nodes
> and each node can populate a fairly significant amount of data. By
> pushing this all back under the hotplug lock you are serializing the
> initialization for each node resulting in a 4x or 8x increase in
> initialization time on some of these bigger systems.
>
>> Also, once we want to fix set_zone_contiguous(zone) inside
>> move_pfn_range_to_zone() to actually work with ZONE_DEVICE (instead of
>> always immediately stopping and never setting zone->contiguous) we have
>> to have the whole memmap initialized at that point. (not sure if we
>> want that in the future, just a note)
>>
>> Let's just keep things simple and initialize the memmap when resizing
>> the zones under the lock.
>>
>> If this is a real performance issue, we have to watch out for
>> alternatives.
>
> My concern is that this is going to become a performance issue, but I
> don't have access to the hardware at the moment to test how much of
> one. I'll have to check to see if I can get access to a system to test
> this patch set.
>

Thanks for having a look - I already dropped this patch. We will instead
stop shrinking ZONE_DEVICE zones, so assume this patch is gone.

[...]

> So if you are moving this all under the lock then this is going to
> serialize initialization, and it is going to be quite expensive on big
> systems. I was only ever able to get the init time down to something
> like ~20s with the optimized function. Since that has been torn apart
> and you are back to doing things with memmap_init_zone we are probably
> looking at more like 25-30s for each node, and on a 4 node system we
> are looking at 2 minutes or so, which may lead to issues if people are
> mounting this.
>
> Instead of forcing this all to be done under the hotplug lock is there
> some way we could do this under the zone span_seqlock instead to
> achieve the same result?

I guess the right approach really is, as Michal suggests, to not shrink
at all (at least for ZONE_DEVICE) :)

>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index c5d62f1c2851..44038665fe8e 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5845,7 +5845,7 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>> */
>> void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> unsigned long start_pfn, enum memmap_context context,
>> - struct vmem_altmap *altmap)
>> + struct dev_pagemap *pgmap)
>> {
>> unsigned long pfn, end_pfn = start_pfn + size;
>> struct page *page;
>> @@ -5853,24 +5853,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> if (highest_memmap_pfn < end_pfn - 1)
>> highest_memmap_pfn = end_pfn - 1;
>>
>> -#ifdef CONFIG_ZONE_DEVICE
>> - /*
>> - * Honor reservation requested by the driver for this ZONE_DEVICE
>> - * memory. We limit the total number of pages to initialize to just
>> - * those that might contain the memory mapping. We will defer the
>> - * ZONE_DEVICE page initialization until after we have released
>> - * the hotplug lock.
>> - */
>> - if (zone == ZONE_DEVICE) {
>> - if (!altmap)
>> - return;
>> -
>> - if (start_pfn == altmap->base_pfn)
>> - start_pfn += altmap->reserve;
>> - end_pfn = altmap->base_pfn + vmem_altmap_offset(altmap);
>> - }
>> -#endif
>> -
>> for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> /*
>> * There can be holes in boot-time mem_map[]s handed to this
>> @@ -5892,6 +5874,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> if (context == MEMMAP_HOTPLUG)
>> __SetPageReserved(page);
>>
>> +#ifdef CONFIG_ZONE_DEVICE
>> + if (zone == ZONE_DEVICE) {
>> + WARN_ON_ONCE(!pgmap);
>> + /*
>> + * ZONE_DEVICE pages union ->lru with a ->pgmap back
>> + * pointer and zone_device_data. It is a bug if a
>> + * ZONE_DEVICE page is ever freed or placed on a driver
>> + * private list.
>> + */
>> + page->pgmap = pgmap;
>> + page->zone_device_data = NULL;
>> + }
>> +#endif
>> +
>
> So I am not sure this is right. Part of the reason for the split is
> that the pages that were a part of the altmap had an LRU setup, not
> the pgmap/zone_device_data setup. This is changing that.

Yeah, you might be right, we would have to handle the altmap part
separately.

>
> Also, I am more a fan of just testing pgmap and if it is present then
> assign page->pgmap and reset zone_device_data. Then you can avoid the
> test for zone on every iteration and the WARN_ON_ONCE check, or at
> least you could move it outside the loop so we don't incur the cost
> with each page. Are there situations where you would have pgmap but
> not a ZONE_DEVICE page?

Don't think so. But I am definitely not a ZONE_DEVICE expert.

>
>> /*
>> * Mark the block movable so that blocks are reserved for
>> * movable at startup. This will force kernel allocations
>> @@ -5951,14 +5947,6 @@ void __ref memmap_init_zone_device(struct zone *zone,
>> */
>> __SetPageReserved(page);
>>
>> - /*
>> - * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
>> - * and zone_device_data. It is a bug if a ZONE_DEVICE page is
>> - * ever freed or placed on a driver-private list.
>> - */
>> - page->pgmap = pgmap;
>> - page->zone_device_data = NULL;
>> -
>> /*
>> * Mark the block movable so that blocks are reserved for
>> * movable at startup. This will force kernel allocations
>
> Shouldn't you be removing this function instead of just gutting it?
> I'm kind of surprised you aren't getting warnings about unused code
> since you also pulled the declaration for it from the header.
>

Don't ask me what went wrong here, I guess I messed this up while rebasing.

--

Thanks,

David / dhildenb