Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver

From: David Hildenbrand
Date: Fri May 25 2018 - 11:09:14 EST

Next message: Arnaldo Carvalho de Melo: "Re: [PATCH] perf test 39 (Session topology) dumps core on s390"
Previous message: Jason Gunthorpe: "Re: [PATCH V2 rdma-next 3/4] RDMA/hns: Add reset process for RoCE in hip08"
In reply to: David Hildenbrand: "Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

>> So, no, virtio-mem is not a balloon driver :)
> [...]
>>>> 1. "hotplug should simply not depend on kdump at all"
>>>>
>>>> In theory yes. In the current state we already have to trigger kdump to
>>>> reload whenever we add/remove a memory block.
>>>
>>> More details please.
>>
>> I just had another look at the whole complexity of
>> makedumpfile/kdump/uevents and I'll follow up with a detailed description.
>>
>> kdump.service is definitely reloaded when setting a memory block
>> online/offline (not when adding/removing as I wrongly claimed before).
>>
>> I'll follow up with a more detailed description and all the pointers.
>
> Please make sure to describe what is the architecture then. I have no
> idea what kdump.service is supposed to do for example.

Here is a high-level description, with details where they matter:

Dump tools always generate the dump file from /proc/vmcore inside the
kexec environment. This is a vmcore dump in ELF format, with required
and optional headers and notes.

1. Core collectors

The tool that writes /proc/vmcore into a file is called the "core
collector".

"This allows you to specify the command to copy the vmcore. You could
use the dump filtering program makedumpfile, the default one, to
retrieve your core, which on some arches can drastically reduce core
file size." [1]

E.g. under RHEL, the only supported core collector is in fact
makedumpfile [2][3]. It is able to exclude e.g. hwpoison pages, which
could otherwise result in a crash if you simply copied /proc/vmcore
into a file on disk.

2. vmcoreinfo

/proc/vmcore can optionally contain a vmcoreinfo note that exposes some
magic variables necessary to e.g. find and interpret segments but also
struct pages. This is generated in "kernel/crash_core.c" in the crashed
Linux kernel.

...
VMCOREINFO_SYMBOL_ARRAY(mem_section);
VMCOREINFO_LENGTH(mem_section, NR_SECTION_ROOTS);
...
VMCOREINFO_NUMBER(PG_hwpoison);
...
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
...

If it is not available, the dump tool tries to extract the relevant
symbols/variables/pointers from vmlinux instead (similar to what e.g.
GDB does).
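
To illustrate how a dump tool might consume this: as far as I know, the
VMCOREINFO_NUMBER() entries end up as text lines of the form
"NUMBER(<name>)=<value>" in the vmcoreinfo. The following is a minimal
user-space sketch of looking up such an entry. The function name
vmcoreinfo_number() is made up for this example and this is not
makedumpfile code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: look up "NUMBER(<name>)=<value>" in a vmcoreinfo
 * text blob. Returns 0 and stores the value in *out on success, -1 if
 * the entry is not present.
 * Assumption: blob is NUL-terminated and uses '\n' line endings.
 */
int vmcoreinfo_number(const char *blob, const char *name, long *out)
{
    char key[128];
    int len = snprintf(key, sizeof(key), "NUMBER(%s)=", name);

    if (len < 0 || (size_t)len >= sizeof(key))
        return -1;

    for (const char *line = blob; line && *line; ) {
        /* Only match the key at the start of a line. */
        if (strncmp(line, key, (size_t)len) == 0) {
            *out = strtol(line + len, NULL, 10);
            return 0;
        }
        /* Advance to the character after the next newline, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A caller would pass the vmcoreinfo text and e.g. "PG_hwpoison" as name,
and fall back to vmlinux when -1 is returned.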

3. PT_LOAD / Memory holes

Each vmcore contains "PT_LOAD" sections. These sections define which
physical memory areas are available in the vmcore (and to which virtual
addresses they translate). They are generated e.g. in
"kernel/kexec_file.c" and in some other places (see
"git grep Elf64_Phdr").

This information is generated in the crashed kernel.

arch/x86/kernel/crash.c:
walk_system_ram_res() is effectively used to generate PT_LOAD segments

arch/s390/kernel/crash_dump.c:
for_each_mem_range() is effectively used to generate PT_LOAD
information

At this point, I don't see how offline sections are treated. I assume
they are always included as well, so PT_LOAD will cover all memory, no
matter whether it is online or offline.
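
As an illustration of what "available in the vmcore" means for a dump
tool, here is a minimal user-space sketch that checks whether a
physical address is covered by any PT_LOAD program header. It uses the
Elf64_Phdr definition from <elf.h>; the function name
paddr_in_pt_load() is made up for this example. Anything for which it
returns 0 is a "memory hole" from the dump tool's point of view.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return 1 if the physical address paddr is covered
 * by any PT_LOAD program header in phdrs[0..n), else 0.
 */
int paddr_in_pt_load(const Elf64_Phdr *phdrs, size_t n, uint64_t paddr)
{
    for (size_t i = 0; i < n; i++) {
        const Elf64_Phdr *p = &phdrs[i];

        /* Notes and other header types do not describe memory. */
        if (p->p_type != PT_LOAD)
            continue;
        /* Overflow-safe check for p_paddr <= paddr < p_paddr + p_memsz. */
        if (paddr >= p->p_paddr && paddr - p->p_paddr < p->p_memsz)
            return 1;
    }
    return 0;
}
```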

4. Reloading kexec/kdump.service

The important thing is that the vmcore *excluding* the actual memory has
to be prepared by the *old* kernel. The kexec kernel then makes it
possible to
- Read the prepared vmcore (contained in kexec kernel)
- Read the memory

So dump tools only have the vmcore (esp. PT_LOAD) to figure out which
physical memory was available in the *old* system. The kexec kernel
neither reads nor interprets segments/struct pages from the old kernel
(and there would be no way to really do it). All it does is allow
reading old memory as defined in the prepared vmcore. If that memory is
not accessible or broken (hwpoison), we will crash the system.

So what does this imply? The vmcore (including PT_LOAD sections) has to
be regenerated every time memory is added to or removed from the
system. Otherwise the data contained in the prepared vmcore is stale.
As far as I understand, this cannot be done by the actual kernel when
adding/removing memory but has to be done by user space.

The same is e.g. also true when hot(un)plugging CPUs.

This is done by reloading kexec, resulting in a regeneration of the
vmcore. udev events are used to reload kdump.service and thereby
regenerate it. These events are triggered when onlining/offlining a
memory block.

...
SUBSYSTEM=="memory", ACTION=="online", PROGRAM="/bin/systemctl try-restart kdump.service"
SUBSYSTEM=="memory", ACTION=="offline", PROGRAM="/bin/systemctl try-restart kdump.service"
...

For "online", this is the right thing to do.

Right now I am not 100% sure whether that is the right thing to do for
"offline". I guess we should actually regenerate after "remove" events,
but I didn't follow the details. Otherwise it could happen that the
vmcore is regenerated before the memory blocks are actually removed, so
the applicable memory blocks would still be included as PT_LOAD in the
vmcore. If we then remove the actual DIMM, trying to dump the vmcore
will result in reading invalid memory. But maybe I am missing something
there.
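
The concern can be expressed as a simple check. The following is a
user-space sketch only, with made-up names (struct mem_range,
prepared_vmcore_is_stale()): it compares the PT_LOAD ranges of the
prepared vmcore against the memory that is actually present, and
reports the dangerous case where a prepared range refers to memory that
is gone.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a physical memory range. */
struct mem_range {
    uint64_t start;
    uint64_t size;
};

/* Return 1 if r lies completely inside one of the present ranges. */
int range_is_present(const struct mem_range *r,
                     const struct mem_range *present, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (r->start < present[i].start)
            continue;
        uint64_t off = r->start - present[i].start;

        if (off <= present[i].size && r->size <= present[i].size - off)
            return 1;
    }
    return 0;
}

/*
 * Return 1 if any PT_LOAD range of the prepared vmcore refers to memory
 * that is no longer present, i.e. dumping would read invalid memory.
 */
int prepared_vmcore_is_stale(const struct mem_range *loads, size_t nloads,
                             const struct mem_range *present, size_t npresent)
{
    for (size_t i = 0; i < nloads; i++) {
        if (!range_is_present(&loads[i], present, npresent))
            return 1;
    }
    return 0;
}
```

If the vmcore is regenerated on "offline" but the DIMM is removed
afterwards, this check would report the prepared vmcore as stale until
the next regeneration.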

5. Access to vmcore / memory in the kexec environment

fs/proc/vmcore.c: contains the code for parsing vmcore in the kexec
kernel, prepared by the crashed kernel. The kexec kernel provides read
access to /proc/vmcore on this basis.

All PT_LOAD sections will be converted and stored in "vmcore_list".

When reading the vmcore, this list will be used to actually provide
access to the original crash memory (__read_vmcore()).

So only memory that was originally in vmcore PT_LOAD will be allowed to
be read.

read_from_oldmem() will perform the actual read. At that point we have
no control over old page flags or segments. Just a straight memory read.
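
To make the lookup concrete, here is a simplified user-space model of
translating an offset in /proc/vmcore into a physical address of the
old kernel. The struct and function names are made up for this sketch;
the real code in fs/proc/vmcore.c walks a linked list and copies data,
which is left out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one converted PT_LOAD entry. */
struct vmcore_chunk {
    uint64_t paddr;   /* physical start address in the old kernel */
    uint64_t size;    /* length of the chunk in bytes */
    uint64_t offset;  /* where the chunk starts within /proc/vmcore */
};

/*
 * Translate a file offset into an old-memory physical address.
 * Returns 0 and stores the address in *paddr, or -1 if the offset is
 * not backed by any chunk (so no read is allowed there).
 */
int vmcore_offset_to_paddr(const struct vmcore_chunk *chunks, size_t n,
                           uint64_t off, uint64_t *paddr)
{
    for (size_t i = 0; i < n; i++) {
        const struct vmcore_chunk *c = &chunks[i];

        if (off >= c->offset && off - c->offset < c->size) {
            *paddr = c->paddr + (off - c->offset);
            return 0;
        }
    }
    return -1;
}
```

Note that nothing in this lookup looks at page flags or section state.
Whatever was put into PT_LOAD by the old kernel is readable.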

There is special handling for e.g. XEN in there: pfn_is_ram() can be
used to prevent reading inflated memory. (register_oldmem_pfn_is_ram)

However reusing that for virtio-mem with multiple devices and queues and
such might not be possible. It is the last resort :)

6. makedumpfile

makedumpfile can exclude free (buddy) pages, hwpoison pages and some
more. It will *not* exclude reserved pages or balloon (e.g.
virtio-balloon) inflated pages. So it will read inflated pages and if
they are zero, save a compressed zero page. However it will (read)
access that memory.

makedumpfile was adapted to the new SECTION_IS_ONLINE bit (to mask the
right section address), but offline sections will *not* be excluded. So
all memory in offline sections will also be accessed and dumped. The
only exception is pages that don't fall into PT_LOAD sections ("memory
holes"); these are not accessed.
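
Summarized as a decision function, the behavior described in this
section looks roughly like the following. This is only a model of the
description above, not makedumpfile code: the enum values and the
function name are made up, and the real tool derives these states from
struct page data and only filters what the requested dump level asks
for.

```c
/* Hypothetical per-page state bits, for illustration only. */
enum page_state {
    PAGE_IN_PT_LOAD      = 1u << 0, /* covered by a PT_LOAD section */
    PAGE_FREE_BUDDY      = 1u << 1, /* free page in the buddy allocator */
    PAGE_HWPOISON        = 1u << 2, /* marked hwpoison */
    PAGE_RESERVED        = 1u << 3, /* reserved page */
    PAGE_BALLOON         = 1u << 4, /* inflated by a balloon driver */
    PAGE_SECTION_OFFLINE = 1u << 5, /* belongs to an offline section */
};

/* Return 1 if the page content would be read from old memory, else 0. */
int page_is_read(unsigned int state)
{
    /* Memory holes are never accessed. */
    if (!(state & PAGE_IN_PT_LOAD))
        return 0;
    /* Free and hwpoison pages can be excluded. */
    if (state & (PAGE_FREE_BUDDY | PAGE_HWPOISON))
        return 0;
    /* Reserved, balloon-inflated and offline pages are still read. */
    return 1;
}
```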

7. Further information

Some more details can be found in "Documentation/kdump/kdump.txt".

"All of the necessary information about the system kernel's core image
is encoded in the ELF format, and stored in a reserved area of memory
before a crash. The physical address of the start of the ELF header is
passed to the dump-capture kernel through the elfcorehdr= boot
parameter."
-> I am pretty sure this is why the kexec reload from user space is
necessary

"For s390x there are two kdump modes: If a ELF header is specified with
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
is done on all other architectures. If no elfcorehdr= kernel parameter
is specified, the s390x kdump kernel dynamically creates the header.
The second mode has the advantage that for CPU and memory hotplug,
kdump has not to be reloaded with kexec_load()."

Any experts, please jump in :)

[1] https://www.systutorials.com/docs/linux/man/5-kdump/
[2] https://sourceforge.net/projects/makedumpfile/
[3] git://git.code.sf.net/p/makedumpfile/code

--

Thanks,

David / dhildenb
