Re: /proc/vmcore and wrong PAGE_OFFSET

From: Donald Buczek
Date: Wed Aug 28 2019 - 11:08:13 EST


On 8/20/19 11:21 PM, Donald Buczek wrote:
Dear Linux folks,

I'm investigating a problem where the crash utility fails to work with our crash dumps:

    buczek@kreios:/mnt$ crash vmlinux crash.vmcore
    crash 7.2.6
    Copyright (C) 2002-2019  Red Hat, Inc.
    Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
    Copyright (C) 1999-2006  Hewlett-Packard Co
    Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
    Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
    Copyright (C) 2005, 2011  NEC Corporation
    Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
    Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
    This program is free software, covered by the GNU General Public License,
    and you are welcome to change it and/or distribute copies of it under
    certain conditions. Enter "help copying" to see the conditions.
    This program has absolutely no warranty. Enter "help warranty" for details.
    GNU gdb (GDB) 7.6
    Copyright (C) 2013 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law. Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-unknown-linux-gnu"...
    crash: read error: kernel virtual address: ffff89807ff77000  type: "memory section root table"

The crash file is a copy of /proc/vmcore taken by a crashkernel after a sysctl-forced panic.

It looks to me as if 0xffff89807ff77000 is not readable because the virtual addresses stored in the ELF headers of the dump file are off by 0x0000008000000000:

    buczek@kreios:/mnt$ readelf -a crash.vmcore | grep LOAD | perl -lane 'printf "%s (%016x)\n",$_,hex($F[2])-hex($F[3])'
      LOAD           0x000000000000d000 0xffffffff81000000 0x000001007d000000 (fffffeff04000000)
      LOAD           0x0000000001c33000 0xffff880000001000 0x0000000000001000 (ffff880000000000)
      LOAD           0x0000000001cc1000 0xffff880000090000 0x0000000000090000 (ffff880000000000)
      LOAD           0x0000000001cd1000 0xffff880000100000 0x0000000000100000 (ffff880000000000)
      LOAD           0x0000000001cd2070 0xffff880000100070 0x0000000000100070 (ffff880000000000)
      LOAD           0x0000000019bd2000 0xffff880038000000 0x0000000038000000 (ffff880000000000)
      LOAD           0x000000004e6a1000 0xffff88006ffff000 0x000000006ffff000 (ffff880000000000)
      LOAD           0x000000004e6a2000 0xffff880100000000 0x0000000100000000 (ffff880000000000)
      LOAD           0x0000001fcda22000 0xffff882080000000 0x0000002080000000 (ffff880000000000)
      LOAD           0x0000003fcd9a2000 0xffff884080000000 0x0000004080000000 (ffff880000000000)
      LOAD           0x0000005fcd922000 0xffff886080000000 0x0000006080000000 (ffff880000000000)
      LOAD           0x0000007fcd8a2000 0xffff888080000000 0x0000008080000000 (ffff880000000000)
      LOAD           0x0000009fcd822000 0xffff88a080000000 0x000000a080000000 (ffff880000000000)
      LOAD           0x000000bfcd7a2000 0xffff88c080000000 0x000000c080000000 (ffff880000000000)
      LOAD           0x000000dfcd722000 0xffff88e080000000 0x000000e080000000 (ffff880000000000)
      LOAD           0x000000fc4d722000 0xffff88fe00000000 0x000000fe00000000 (ffff880000000000)

(Columns are file offset, virtual address, physical address, and the computed offset, i.e. virtual minus physical address.)

I would expect the offset between the virtual and the physical address to be PAGE_OFFSET, which is 0xffff888000000000 on x86_64, not 0xffff880000000000. Unlike /proc/vmcore, /proc/kcore shows the same physical memory (the last memory section above) with the correct offset:

    buczek@kreios:/mnt$ sudo readelf -a /proc/kcore | grep 0x000000fe00000000 | perl -lane 'printf "%s (%016x)\n",$_,hex($F[2])-hex($F[3])'
      LOAD           0x0000097e00004000 0xffff897e00000000 0x000000fe00000000 (ffff888000000000)
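
The same subtraction can be written out as a short sketch. The Python below is only an illustration: the address pairs are copied from the last LOAD line of each readelf listing above, and the names EXPECTED_PAGE_OFFSET and direct_map_offset are made up for this example.

```python
# Virtual and physical start addresses of the last memory section,
# copied from the readelf output of crash.vmcore and /proc/kcore.
VMCORE_LAST = (0xffff88fe00000000, 0x000000fe00000000)
KCORE_LAST = (0xffff897e00000000, 0x000000fe00000000)

# Direct-mapping base the running kernel uses (as seen in /proc/kcore).
EXPECTED_PAGE_OFFSET = 0xffff888000000000


def direct_map_offset(vaddr, paddr):
    """Return the offset between a segment's virtual and physical address."""
    return vaddr - paddr


if __name__ == "__main__":
    for name, (vaddr, paddr) in (("vmcore", VMCORE_LAST), ("kcore", KCORE_LAST)):
        offset = direct_map_offset(vaddr, paddr)
        error = EXPECTED_PAGE_OFFSET - offset
        print(f"{name}: offset {offset:#018x}, error {error:#x}")
```

For the vmcore entry the error comes out as 0x8000000000, for the kcore entry as zero.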

The failing address 0xffff89807ff77000 happens to be at the end of the last memory section. It is the mem_section array, which crash wants to load and which is visible in the running system:

    buczek@kreios:/mnt$ sudo gdb vmlinux /proc/kcore
    [...]
    (gdb) print mem_section
    $1 = (struct mem_section **) 0xffff89807ff77000
    (gdb) print *mem_section
    $2 = (struct mem_section *) 0xffff88a07f37b000
    (gdb) print **mem_section
    $3 = {section_mem_map = 18446719884453740551, pageblock_flags = 0xffff88a07f36f040}

I can read the same information from the crash dump if I compensate for the 0x0000008000000000 error:

    buczek@kreios:/mnt$ gdb vmlinux crash.vmcore
    [...]
    (gdb) print mem_section
    $1 = (struct mem_section **) 0xffff89807ff77000
    (gdb) print *mem_section
    Cannot access memory at address 0xffff89807ff77000
    (gdb) set $t=(struct mem_section **) ((char *)mem_section - 0x0000008000000000)
    (gdb) print *$t
    $2 = (struct mem_section *) 0xffff88a07f37b000
    (gdb) set $s=(struct mem_section *)((char *)*$t - 0x0000008000000000 )
    (gdb) print *$s
    $3 = {section_mem_map = 18446719884453740551, pageblock_flags = 0xffff88a07f36f040}
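
The manual correction in the gdb session amounts to rebasing every direct-map pointer from the kernel's base to the base recorded in the dump. A small Python sketch of that arithmetic, using only the addresses shown above (the function name to_dump_address is hypothetical):

```python
KERNEL_PAGE_OFFSET = 0xffff888000000000  # base used by the crashed kernel
DUMP_PAGE_OFFSET = 0xffff880000000000    # base found in the vmcore headers


def to_dump_address(kernel_vaddr):
    """Rebase a direct-map address so it can be looked up in the dump."""
    physical = kernel_vaddr - KERNEL_PAGE_OFFSET
    return physical + DUMP_PAGE_OFFSET


if __name__ == "__main__":
    # mem_section and *mem_section as printed by gdb.
    for addr in (0xffff89807ff77000, 0xffff88a07f37b000):
        print(f"{addr:#x} -> {to_dump_address(addr):#x}")
```

This only applies to pointers into the direct mapping; the kernel text mapping (the first LOAD entry) uses a different offset and is not covered by the sketch.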

In the above example, the running kernel, the crashed kernel and the crashkernel are all the same 4.19.57 build. I've also tried several other versions (crashkernel 4.4, running kernels from 4.0 to Linux master) with the same result.

The machine in the above example has several NUMA nodes (this is why there are so many LOAD headers). But I've also tried this with a small KVM virtual machine and got the same result.

    buczek@kreios:/mnt/linux-4.19.57-286.x86_64/build$ grep RANDOMIZE_BASE .config
    # CONFIG_RANDOMIZE_BASE is not set
    buczek@kreios:/mnt/linux-4.19.57-286.x86_64/build$ grep SPARSEMEM .config
    CONFIG_ARCH_SPARSEMEM_ENABLE=y
    CONFIG_ARCH_SPARSEMEM_DEFAULT=y
    CONFIG_SPARSEMEM_MANUAL=y
    CONFIG_SPARSEMEM=y
    CONFIG_SPARSEMEM_EXTREME=y
    CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
    CONFIG_SPARSEMEM_VMEMMAP=y
    buczek@kreios:/mnt/linux-4.19.57-286.x86_64/build$ grep PAGE_TABLE_ISOLATION .config
    CONFIG_PAGE_TABLE_ISOLATION=y

Any ideas?

Donald

To answer my own question for the record:

Our kexec command line is

/usr/sbin/kexec -p /boot/bzImage.crash --initrd=/boot/grub/initramfs.igz --command-line="root=LABEL=root ro console=ttyS1,115200n8 console=tty0 irqpoll nr_cpus=1 reset_devices panic=5 CRASH"

So we passed neither -s (--kexec-file-syscall) nor -a (--kexec-syscall-auto). For this reason, kexec used the kexec_load() syscall instead of the newer kexec_file_load() syscall.

With kexec_load(), the ELF headers for the crash dump, which include the program headers for the old system RAM, are computed not by the kernel but by the userspace kexec program from kexec-tools.

Linux kernel commit d52888aa ("x86/mm: Move LDT remap out of KASLR region on 5-level paging") changed the base of the direct mapping from 0xffff880000000000 to 0xffff888000000000. This was merged into v4.20-rc2.

kexec-tools, however, still has the old address hard-coded:

buczek@avaritia:/scratch/cluster/buczek/kexec-tools (master)$ git grep X86_64_PAGE_OFFSET
kexec/arch/i386/crashdump-x86.c: elf_info->page_offset = X86_64_PAGE_OFFSET_PRE_2_6_27;
kexec/arch/i386/crashdump-x86.c: elf_info->page_offset = X86_64_PAGE_OFFSET;
kexec/arch/i386/crashdump-x86.h:#define X86_64_PAGE_OFFSET_PRE_2_6_27 0xffff810000000000ULL
kexec/arch/i386/crashdump-x86.h:#define X86_64_PAGE_OFFSET 0xffff880000000000ULL
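
To see how a stale constant leads to the headers found in the dump, here is a minimal sketch in Python. It is not the kexec-tools code: load_vaddr is a hypothetical stand-in for whatever fills in the virtual address of a direct-map LOAD segment, and the sketch assumes that this address is simply the physical address plus the page offset.

```python
STALE_PAGE_OFFSET = 0xffff880000000000    # X86_64_PAGE_OFFSET in kexec-tools
CURRENT_PAGE_OFFSET = 0xffff888000000000  # base after commit d52888aa


def load_vaddr(paddr, page_offset):
    """Hypothetical helper: virtual address recorded for a direct-map segment."""
    return paddr + page_offset


if __name__ == "__main__":
    paddr = 0x000000fe00000000  # start of the last memory section
    print(f"stale:   {load_vaddr(paddr, STALE_PAGE_OFFSET):#018x}")
    print(f"current: {load_vaddr(paddr, CURRENT_PAGE_OFFSET):#018x}")
```

The stale constant yields 0xffff88fe00000000, which is the virtual address in the vmcore header; the current base yields 0xffff897e00000000, which is what /proc/kcore reports for the same physical memory.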

Best
Donald