Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

From: Aneesh Kumar K.V
Date: Fri Jun 14 2019 - 12:55:29 EST


On 6/14/19 10:06 PM, Dan Williams wrote:
> On Fri, Jun 14, 2019 at 9:26 AM Aneesh Kumar K.V
> <aneesh.kumar@xxxxxxxxxxxxx> wrote:
> >
> > On 6/14/19 9:52 PM, Dan Williams wrote:
> > > On Fri, Jun 14, 2019 at 9:18 AM Aneesh Kumar K.V
> > > <aneesh.kumar@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On 6/14/19 9:05 PM, Oscar Salvador wrote:
> > > > > On Fri, Jun 14, 2019 at 02:28:40PM +0530, Aneesh Kumar K.V wrote:
> > > > > > Can you check with this change on ppc64? I haven't reviewed this series yet.
> > > > > > I did limited testing with the change. Before merging this I need to go
> > > > > > through the full series again. The vmemmap populate on ppc64 needs to
> > > > > > handle two translation modes (hash and radix). With respect to vmemmap,
> > > > > > hash doesn't set up a translation in the linux page table. Hence we need
> > > > > > to make sure we don't try to set up a mapping for a range which is
> > > > > > already covered by an existing mapping.

> > > > > > diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> > > > > > index a4e17a979e45..15c342f0a543 100644
> > > > > > --- a/arch/powerpc/mm/init_64.c
> > > > > > +++ b/arch/powerpc/mm/init_64.c
> > > > > > @@ -88,16 +88,23 @@ static unsigned long __meminit vmemmap_section_start(unsigned long page)
> > > > > >   * which overlaps this vmemmap page is initialised then this page is
> > > > > >   * initialised already.
> > > > > >   */
> > > > > > -static int __meminit vmemmap_populated(unsigned long start, int page_size)
> > > > > > +static bool __meminit vmemmap_populated(unsigned long start, int page_size)
> > > > > >  {
> > > > > >  	unsigned long end = start + page_size;
> > > > > >  	start = (unsigned long)(pfn_to_page(vmemmap_section_start(start)));
> > > > > >
> > > > > > -	for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page)))
> > > > > > -		if (pfn_valid(page_to_pfn((struct page *)start)))
> > > > > > -			return 1;
> > > > > > +	for (; start < end; start += (PAGES_PER_SECTION * sizeof(struct page))) {
> > > > > >
> > > > > > -	return 0;
> > > > > > +		struct mem_section *ms;
> > > > > > +		unsigned long pfn = page_to_pfn((struct page *)start);
> > > > > > +
> > > > > > +		if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
> > > > > > +			return 0;

> > > > > I might be missing something, but is this right?
> > > > > Having a section_nr above NR_MEM_SECTIONS is invalid, but if we return 0 here,
> > > > > vmemmap_populate will go on and populate it.
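
(For context, the caller side in arch/powerpc/mm/init_64.c looks roughly
like the following; this is a simplified sketch rather than the exact
upstream code, with altmap handling omitted. It shows why a "false"
return from vmemmap_populated() makes vmemmap_populate() go ahead and
map the range.)

int __meminit vmemmap_populate(unsigned long start, unsigned long end,
			       int node, struct vmem_altmap *altmap)
{
	unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;

	start = ALIGN_DOWN(start, page_size);
	for (; start < end; start += page_size) {
		void *p;

		/* skip vmemmap ranges whose backing is already populated */
		if (vmemmap_populated(start, page_size))
			continue;

		p = vmemmap_alloc_block_buf(page_size, node);
		if (!p)
			return -ENOMEM;

		vmemmap_create_mapping(start, page_size, __pa(p));
	}
	return 0;
}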

> > > > I should drop that completely. We should not hit that condition at all.
> > > > I will send a final patch once I go through the full patch series, making
> > > > sure we are not breaking any ppc64 details.

> > > > Wondering why we did the below
> > > >
> > > > #if defined(ARCH_SUBSECTION_SHIFT)
> > > > #define SUBSECTION_SHIFT (ARCH_SUBSECTION_SHIFT)
> > > > #elif defined(PMD_SHIFT)
> > > > #define SUBSECTION_SHIFT (PMD_SHIFT)
> > > > #else
> > > > /*
> > > >  * Memory hotplug enabled platforms avoid this default because they
> > > >  * either define ARCH_SUBSECTION_SHIFT, or PMD_SHIFT is a constant, but
> > > >  * this is kept as a backstop to allow compilation on
> > > >  * !ARCH_ENABLE_MEMORY_HOTPLUG archs.
> > > >  */
> > > > #define SUBSECTION_SHIFT 21
> > > > #endif
> > > >
> > > > why not
> > > >
> > > > #if defined(ARCH_SUBSECTION_SHIFT)
> > > > #define SUBSECTION_SHIFT (ARCH_SUBSECTION_SHIFT)
> > > > #else
> > > > #define SUBSECTION_SHIFT SECTION_SHIFT
> >
> > That should be SECTION_SIZE_SHIFT.
> >
> > > > #endif
> > > >
> > > > i.e., if SUBSECTION is not supported by the arch, do we have one
> > > > sub-section per section?
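
(For reference, the derived constants in the patch set look roughly
like the following; this is paraphrased from memory, not quoted. With
SUBSECTION_SHIFT equal to SECTION_SIZE_SHIFT, the last one would indeed
evaluate to a single sub-section per section.)

#define SUBSECTION_SIZE		(1UL << SUBSECTION_SHIFT)
#define PFN_SUBSECTION_SHIFT	(SUBSECTION_SHIFT - PAGE_SHIFT)
#define PAGES_PER_SUBSECTION	(1UL << PFN_SUBSECTION_SHIFT)
#define SUBSECTIONS_PER_SECTION	(1UL << (SECTION_SIZE_SHIFT - SUBSECTION_SHIFT))
/* SUBSECTION_SHIFT == SECTION_SIZE_SHIFT => SUBSECTIONS_PER_SECTION == 1 */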

> > > A couple comments:
> > >
> > > The only reason ARCH_SUBSECTION_SHIFT exists is because PMD_SHIFT on
> > > PowerPC was a non-constant value. However, I'm planning to remove the
> > > distinction in the next rev of the patches. Jeff rightly points out
> > > that having a variable subsection size per arch will lead to
> > > situations where persistent memory namespaces are not portable across
> > > archs. So I plan to just make SUBSECTION_SHIFT 21 everywhere.
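
(If I read that right, the #if/#elif block above would then collapse to
something like the following; my paraphrase of the plan, not a quoted
patch.)

/* fixed subsection granularity on all architectures */
#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(1UL << SUBSECTION_SHIFT)	/* 2MB everywhere */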



> > Persistent memory namespaces are not portable across archs because they
> > have a PAGE_SIZE dependency.

> We can fix that by reserving mem_map capacity assuming the smallest
> PAGE_SIZE across archs.

> > Then we have dependencies like the page size
> > with which we map the vmemmap area.

> How does that lead to cross-arch incompatibility? Even on a single
> arch the vmemmap area will be mapped with 2MB pages for 128MB-aligned
> spans of pmem address space and 4K pages for subsections.
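
(Back-of-the-envelope numbers for that claim, assuming a 4K PAGE_SIZE
and a 64-byte struct page as on x86-64; my arithmetic, not from the
patch:)

/*
 * 128MB section  : 128MB / 4KB = 32768 pages
 *                  32768 * 64B = 2MB of mem_map -> one 2MB PMD mapping
 * 2MB subsection : 2MB / 4KB  = 512 pages
 *                  512 * 64B  = 32KB of mem_map -> eight 4KB PTE mappings
 */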

I am not sure I understood those details. On ppc64, vmemmap can be mapped with either 16M, 2M, or 64K pages depending on the translation mode (hash or radix). Doesn't that imply our reserve area size will vary between these configs? I was thinking we should let the arch pick the largest value as the alignment and align things based on that. Otherwise, if you align the vmemmap/altmap area to 2M and we move to a platform that maps the vmemmap area using a 16MB page size, we fail, right? In other words, if you want to build a portable pmem region, we have to configure these alignments correctly.

Also, the label area storage is completely hidden in firmware, right? So the portability will be limited to platforms that support the same firmware?



> > Why not let the arch decide the SUBSECTION_SHIFT and default to one
> > subsection per section if the arch is not enabled to work with
> > subsections?

> Because that keeps the implementation from ever reaching a point where
> a namespace might be able to be moved from one arch to another. If we
> can squash these arch differences then we can have a common tool to
> initialize namespaces outside of the kernel. The one wrinkle is
> device-dax that wants to enforce the mapping size, but I think we can
> have a module-option compatibility override for that case for the
> admin to say "yes, I know this namespace is defined for 2MB x86 pages,
> but I want to force enable it with 64K pages on PowerPC".

But then you can't say "I want to enable this with 16M pages on PowerPC".
But I understand what you are suggesting here.

-aneesh