Re: [PATCH v12 7/7] x86/crash: Add x86 crash hotplug support

From: Eric DeVolder
Date: Thu Oct 27 2022 - 15:25:13 EST




On 10/26/22 09:48, Baoquan He wrote:
On 10/25/22 at 12:31pm, Borislav Petkov wrote:
On Thu, Oct 13, 2022 at 10:57:28AM +0800, Baoquan He wrote:
The concern to range number mainly is on Virt guest systems.

And why would virt emulate 1K hotpluggable DIMM slots and not emulate a
real machine?

Well, currently, mem hotpug is an important feature on virt system to
dynamically increase/shrink memory on the system. If only emulating real
machine, it won't be different than bare metal system.

IIRC, the ballon driver or virtio-mem feature can add memory board, e.g
1G, block size is 128M, 8 blocks added. When shrinking this 1G memory
later, it will take best effort way to hot remove memory. Means if any
memory block is occupied, it will be kept there. Finally we could only
remove every second blocks, 4 blocks altogether. Then the left
un-removed blocks will produce 4 separate memory regions. Like this, a
virt guest could have many memory regions in kernel after memory
being added/removed.

If I am wrong, Please correct me, David.


On baremetal system, basically only very high end server support
memory hotplug. I ever visited customer's lab and saw one server,
it owns 8 slots, on each slot a box containing about 20 cpus and 2T
memory at most can be plugged in at one time. So people won't make too
many slots for hotplugging since it's too expensive.

There you have it - the persuading argument.

So after re-reading the exchanges, many times, I think I realized that I have introduced confusion by using "hotplug", and specifically "memory hotplug" and DIMMs in the same breath, and thus that perhaps equated hotplug with ACPI DIMMs in these discussions.

Allow me to state that "hotplug" in this patch series refers to CPU and memory hot un/plug, and does *not* explicitly refer to any particular underlying technology to generate CPU and/or memory hot un/plug events.

For example, I have been using DIMMs as my example because that has been my test vehicle for exercising this code; as such I think the discussion cornered itself into real world vs virt discussion about DIMMs, with no end in sight. To be plain, this patch series does not intend to convey or change anything specific about ACPI DIMMs.

In reality, when I state "hotplug" in these patches, I am talking generically and therefore inclusive of any technology that can hot un/plug CPUs or memory. For memory specifically, this includes ACPI DIMMs (whether baremetal or QEMU), ballooning, virtio-mem, PPC dlpar (per David), Microsoft DynamicMemory, and the upcoming CXL.mem technology. Probably others that I am not aware. Any of these technologies can add or remove memory from a bare metal and/or virtual system.

I apologize.

What is important is the number of memory regions (ie. /proc/iomem entries) that can be considered to be the maximum. There is no kernel definition of such. The need to identify a maximum number is so that the buffer containing the elfcorehdr can be sized and allocated at kdump load time, *once*. This elfcorehdr buffer is then modified/updated repeatedly as hot un/plug events occur. It is *not* re-allocated on each hot un/plug event; that is what the current solution does, sort of.


I checked user space kexec code, the maximum memory range number is
honored to x86_64 because of a HPE SGI system. After that, nobody
complains about it. Please see below user space kexec-tools commit in
https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git

The memory ranges may be not all made by different DIMM slots, could be
firmware reservatoin, e.g efi/BIOS diggged out physical memory,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I don't know what that means.

If it is firmware crap, you want to exclude that from kdump anyway.

Yes, now assume we have a HPE SGI system and it has memory hotplug
capacity. The system itself has already got memory regions more than
1024. Then when we hot add extra memory board, we want to include the
newly added memory regions into elfcorehdr so that it will be dumped out
in kdump kernel.

That's why I earlier suggested 2048 for number of memory regions.


Now CONFIG_NR_CPUS has the maximum number as 8192. And user space
kexec-tools has maximum memory range number as 2048. We can take
the current 8192 + 2048 = 10K as default value conservatively. Or
take 8192 + 2048 * 2 = 12K which has two times of maximum memory range
bumber in kexec-tools. What do you think?

I still think that we should stick to reality and support what is
possible not what is potentially and theoretically there.

Yes, agree. We should try to get a number which satisfies needs in
reality.



For Kconfig CRASH_MAX_MEMORY_RANGES in this patch, I have three items to
suggest:

1) the name is not good, it doesn't reflect the fact that it's the
number of program headers of elfcorehdr which includes the cpu note
numbers and memory region numers.

The total number of program headers is, generally speaking, the number of CPUs and the number of memory regions (plug one for VMCOREINFO and maybe a few others). The NR_CPUS_DEFAULT conveys the maximum number of CPUs possible, and likewise the CRASH_MAX_MEMORY_RANGES is intended to convey the maximum number of memory regions possible.

It is not misnamed, imho; rather, I think due to the confusion I outline above, misunderstood.


2) default cpu number, I suggest 512 or 1024. The biggest number I
ever saw in reality is 384. On virt system, it won't be too big. Below
is abstracted from arch/x86/Kconfig. A smaller one is also OK, we can
enlarge it when people really have a super machine and run into the
problem.

config NR_CPUS_DEFAULT
int
depends on X86_64
default 8192 if MAXSMP
default 64 if SMP
default 1 if !SMP

I'm all for making this a sane number, I'm just not sure this patch series is the place to do this?


3) For memory regions, I would suggest 2048. Likewise, smaller value is
also fine, we can enlarge it when a real system run into this.

As David points out, if the memblock size if 128MiB, then 2048 allows for 256GiB, which I believe to be too low for a maximum default. I'd at least target 1TiB with 128MiB memblock size, which would put the number at 8192.

Should the memblock size be 2GiB, for really big systems, then 8192 entries allow handling 16GiB.

With sizeof(elf64_phdr) of 64 bytes, that means the elfcorehdr buffer/memory segment is essentially 512KiB.

Be aware, in reality, that if the system was fully populated, it would not actually consume all 8192 phdrs. Rather /proc/iomem would essentially show a large contiguous address space which would require just a single phdr. The reason to consider having the larger number of phdrs is so that if the memory becomes fragmented due to hot un/plug events, then you need the phdrs to record the sparse mapping. The sadistic case is that every other memblock is offlined/removed, not likely, but possible.



I made a draft here for reference, with my undertanding. Please feel
free to change it.

+config CRASH_ELF_CORE_PHDRS_NUM
+ depends on CRASH_DUMP && KEXEC_FILE && (HOTPLUG_CPU || MEMORY_HOTPLUG)
+ int
+ default 3072
+ help
+ For the kexec_file_load path, specify the default number of
+ phdr for the vmcore. E.g the memory regions represented by the
+ 'System RAM' entries in /proc/iomem, the cpu notes of each
+ present cpu stored in /sys/devices/system/cpu/cpuX/crash_notes.

Thanks


I'd prefer keeping CRASH_MAX_MEMORY_RANGES as that allow the maximum phdr number value to be reflective of CPUs and/or memory; not all systems support both CPU and memory hotplug. For example, I have queued up this change to reflect this:

if (IS_ENABLED(CONFIG_HOTPLUG_CPU) || IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
/* Ensure elfcorehdr segment large enough for hotplug changes */
unsigned long pnum = 2; /* VMCOREINFO and kernel_map */
if (IS_ENABLED(CONFIG_HOTPLUG_CPU))
pnum += CONFIG_NR_CPUS_DEFAULT;
if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG))
pnum += CONFIG_CRASH_MAX_MEMORY_RANGES;
if (pnum < (unsigned long)PN_XNUM) {
kbuf.memsz = pnum * sizeof(Elf64_Phdr);
kbuf.memsz += sizeof(Elf64_Ehdr);
image->elfcorehdr_index = image->nr_segments;
image->elfcorehdr_index_valid = true;
/* Mark as usable to crash kernel, else crash kernel fails on boot *
image->elf_headers_sz = kbuf.memsz;
} else {
pr_err("number of Phdrs %lu exceeds max %lu\n", pnum, (unsigned long
}
}

I hope this helps clarify my intentions with this patch series.
Thanks,
eric