Re: [PATCH v9 08/12] mm/sparsemem: Support sub-section hotplug

From: Dan Williams
Date: Fri Jun 07 2019 - 11:42:39 EST


On Fri, Jun 7, 2019 at 1:34 AM Oscar Salvador <osalvador@xxxxxxx> wrote:
>
> On Wed, Jun 05, 2019 at 02:58:42PM -0700, Dan Williams wrote:
> > The libnvdimm sub-system has suffered a series of hacks and broken
> > workarounds for the memory-hotplug implementation's awkward
> > section-aligned (128MB) granularity. For example the following backtrace
> > is emitted when attempting arch_add_memory() with physical address
> > ranges that intersect 'System RAM' (RAM) with 'Persistent Memory' (PMEM)
> > within a given section:
> >
> > WARNING: CPU: 0 PID: 558 at kernel/memremap.c:300 devm_memremap_pages+0x3b5/0x4c0
> > devm_memremap_pages attempted on mixed region [mem 0x200000000-0x2fbffffff flags 0x200]
> > [..]
> > Call Trace:
> > dump_stack+0x86/0xc3
> > __warn+0xcb/0xf0
> > warn_slowpath_fmt+0x5f/0x80
> > devm_memremap_pages+0x3b5/0x4c0
> > __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
> > pmem_attach_disk+0x19a/0x440 [nd_pmem]
> >
> > Recently it was discovered that the problem goes beyond RAM vs PMEM
> > collisions as some platform produce PMEM vs PMEM collisions within a
> > given section. The libnvdimm workaround for that case revealed that the
> > libnvdimm section-alignment-padding implementation has been broken for a
> > long while. A fix for that long-standing breakage introduces as many
> > problems as it solves as it would require a backward-incompatible change
> > to the namespace metadata interpretation. Instead of that dubious route
> > [1], address the root problem in the memory-hotplug implementation.
> >
> > [1]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > Cc: Vlastimil Babka <vbabka@xxxxxxx>
> > Cc: Logan Gunthorpe <logang@xxxxxxxxxxxx>
> > Cc: Oscar Salvador <osalvador@xxxxxxx>
> > Cc: Pavel Tatashin <pasha.tatashin@xxxxxxxxxx>
> > Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> > ---
> > include/linux/memory_hotplug.h | 2
> > mm/memory_hotplug.c | 7 -
> > mm/page_alloc.c | 2
> > mm/sparse.c | 225 +++++++++++++++++++++++++++-------------
> > 4 files changed, 155 insertions(+), 81 deletions(-)
> >
> [...]
> > @@ -325,6 +332,15 @@ static void __meminit sparse_init_one_section(struct mem_section *ms,
> > unsigned long pnum, struct page *mem_map,
> > struct mem_section_usage *usage)
> > {
> > + /*
> > + * Given that SPARSEMEM_VMEMMAP=y supports sub-section hotplug,
> > + * ->section_mem_map can not be guaranteed to point to a full
> > + * section's worth of memory. The field is only valid / used
> > + * in the SPARSEMEM_VMEMMAP=n case.
> > + */
> > + if (IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP))
> > + mem_map = NULL;
>
> Will this be a problem when reading mem_map with the crash-tool?
> I do not expect it to be, but I am not sure if crash internally tries
> to read ms->section_mem_map and do some sort of translation.
> And since ms->section_mem_map SECTION_HAS_MEM_MAP, it might be that it expects
> a valid mem_map?

I don't know, but I can't imagine it would because it's much easier to
do mem_map relative translations by simple PAGE_OFFSET arithmetic.

> > +static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
> > + struct vmem_altmap *altmap)
> > +{
> > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> > + DECLARE_BITMAP(tmp, SUBSECTIONS_PER_SECTION) = { 0 };
> > + struct mem_section *ms = __pfn_to_section(pfn);
> > + bool early_section = is_early_section(ms);
> > + struct page *memmap = NULL;
> > + unsigned long *subsection_map = ms->usage
> > + ? &ms->usage->subsection_map[0] : NULL;
> > +
> > + subsection_mask_set(map, pfn, nr_pages);
> > + if (subsection_map)
> > + bitmap_and(tmp, map, subsection_map, SUBSECTIONS_PER_SECTION);
> > +
> > + if (WARN(!subsection_map || !bitmap_equal(tmp, map, SUBSECTIONS_PER_SECTION),
> > + "section already deactivated (%#lx + %ld)\n",
> > + pfn, nr_pages))
> > + return;
> > +
> > + /*
> > + * There are 3 cases to handle across two configurations
> > + * (SPARSEMEM_VMEMMAP={y,n}):
> > + *
> > + * 1/ deactivation of a partial hot-added section (only possible
> > + * in the SPARSEMEM_VMEMMAP=y case).
> > + * a/ section was present at memory init
> > + * b/ section was hot-added post memory init
> > + * 2/ deactivation of a complete hot-added section
> > + * 3/ deactivation of a complete section from memory init
> > + *
> > + * For 1/, when subsection_map does not empty we will not be
> > + * freeing the usage map, but still need to free the vmemmap
> > + * range.
> > + *
> > + * For 2/ and 3/ the SPARSEMEM_VMEMMAP={y,n} cases are unified
> > + */
> > + bitmap_xor(subsection_map, map, subsection_map, SUBSECTIONS_PER_SECTION);
> > + if (bitmap_empty(subsection_map, SUBSECTIONS_PER_SECTION)) {
> > + unsigned long section_nr = pfn_to_section_nr(pfn);
> > +
> > + if (!early_section) {
> > + kfree(ms->usage);
> > + ms->usage = NULL;
> > + }
> > + memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
> > + ms->section_mem_map = sparse_encode_mem_map(NULL, section_nr);
> > + }
> > +
> > + if (early_section && memmap)
> > + free_map_bootmem(memmap);
> > + else
> > + depopulate_section_memmap(pfn, nr_pages, altmap);
> > +}
> > +
> > +static struct page * __meminit section_activate(int nid, unsigned long pfn,
> > + unsigned long nr_pages, struct vmem_altmap *altmap)
> > +{
> > + DECLARE_BITMAP(map, SUBSECTIONS_PER_SECTION) = { 0 };
> > + struct mem_section *ms = __pfn_to_section(pfn);
> > + struct mem_section_usage *usage = NULL;
> > + unsigned long *subsection_map;
> > + struct page *memmap;
> > + int rc = 0;
> > +
> > + subsection_mask_set(map, pfn, nr_pages);
> > +
> > + if (!ms->usage) {
> > + usage = kzalloc(mem_section_usage_size(), GFP_KERNEL);
> > + if (!usage)
> > + return ERR_PTR(-ENOMEM);
> > + ms->usage = usage;
> > + }
> > + subsection_map = &ms->usage->subsection_map[0];
> > +
> > + if (bitmap_empty(map, SUBSECTIONS_PER_SECTION))
> > + rc = -EINVAL;
> > + else if (bitmap_intersects(map, subsection_map, SUBSECTIONS_PER_SECTION))
> > + rc = -EEXIST;
> > + else
> > + bitmap_or(subsection_map, map, subsection_map,
> > + SUBSECTIONS_PER_SECTION);
> > +
> > + if (rc) {
> > + if (usage)
> > + ms->usage = NULL;
> > + kfree(usage);
> > + return ERR_PTR(rc);
> > + }
>
> We should not be really looking at subsection_map stuff when running on
> !CONFIG_SPARSE_VMEMMAP, right?
> Would it make sense to hide the bitmap dance behind
>
> if(IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) ?
>
> Sorry for nagging here

No worries, its a valid question. The bitmap dance is still valid it
will just happen on section boundaries instead of subsection. If
anything breaks that's beneficial additional testing that we got from
the SPARSEMEM sub-case for the SPARSEMEM_VMEMMAP superset-case. That's
the gain for keeping them unified, what's the practical gain from
hiding this bit manipulation from the SPARSEMEM case?

>
> > /**
> > - * sparse_add_one_section - add a memory section
> > + * sparse_add_section - add a memory section, or populate an existing one
> > * @nid: The node to add section on
> > * @start_pfn: start pfn of the memory range
> > + * @nr_pages: number of pfns to add in the section
> > * @altmap: device page map
> > *
> > * This is only intended for hotplug.
>
> Below this, the return codes are specified:
>
> ---
> * Return:
> * * 0 - On success.
> * * -EEXIST - Section has been present.
> * * -ENOMEM - Out of memory.
> */
> ---
>
> We can get rid of -EEXIST since we do not return that anymore.

Good catch.