[PATCH 1/2 v3] kdump: add the vmcoreinfo documentation

From: Lianbo Jiang
Date: Sun Dec 16 2018 - 08:16:41 EST


This document lists some of the variables exported to vmcoreinfo and briefly
describes what they indicate. It should be instructive for people who are
not familiar with vmcoreinfo, and it helps to establish the exported
variables as a standard ABI between the kernel and user space.

Suggested-by: Borislav Petkov <bp@xxxxxxx>
Signed-off-by: Lianbo Jiang <lijiang@xxxxxxxxxx>
---
Documentation/kdump/vmcoreinfo.txt | 456 +++++++++++++++++++++++++++++
1 file changed, 456 insertions(+)
create mode 100644 Documentation/kdump/vmcoreinfo.txt

diff --git a/Documentation/kdump/vmcoreinfo.txt b/Documentation/kdump/vmcoreinfo.txt
new file mode 100644
index 000000000000..d71260bf383a
--- /dev/null
+++ b/Documentation/kdump/vmcoreinfo.txt
@@ -0,0 +1,456 @@
+================================================================
+ Documentation for VMCOREINFO
+================================================================
+
+=======================
+What is the VMCOREINFO?
+=======================
+It is a special ELF note section. The VMCOREINFO contains various
+information about the first kernel, for example structure sizes, the page
+size, symbol values and field offsets. These data are packed into an ELF
+note section and help user-space tools (e.g. crash, makedumpfile) analyze
+the first kernel's memory usage.
+
+In general, makedumpfile can dump the VMCOREINFO contents from vmlinux
+in the first kernel. For example:
+# makedumpfile -g VMCOREINFO -x vmlinux
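Annotation (not part of the patch): the generated VMCOREINFO is plain text, one "key=value" entry per line, so it is straightforward to load into a lookup table. The following is a minimal sketch of such a parser. The sample values are made up for illustration, and the helper name parse_vmcoreinfo is hypothetical. It assumes that SYMBOL() values are hexadecimal without a "0x" prefix and that SIZE(), OFFSET(), LENGTH() and NUMBER() values are decimal.

```python
import re

# Made-up sample entries, for illustration only.
SAMPLE = """\
OSRELEASE=4.20.0
PAGESIZE=4096
SYMBOL(swapper_pg_dir)=ffffffff8220a000
SIZE(page)=64
OFFSET(page.flags)=0
LENGTH(mem_section)=2048
NUMBER(NR_FREE_PAGES)=0
KERNELOFFSET=0
"""

# Either KIND(name)=value or a plain KEY=value.
ENTRY = re.compile(
    r"^(?:(SYMBOL|SIZE|OFFSET|LENGTH|NUMBER)\((.+)\)|([^=()]+))=(.*)$"
)


def parse_vmcoreinfo(text):
    """Return {(kind, name): raw string value}; kind is None for plain keys."""
    entries = {}
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            kind, name, plain, value = match.groups()
            entries[(kind, name or plain)] = value
    return entries


info = parse_vmcoreinfo(SAMPLE)
page_size = int(info[(None, "PAGESIZE")])                # decimal
pgd_addr = int(info[("SYMBOL", "swapper_pg_dir")], 16)   # hex, no 0x prefix
```

Keeping the raw strings in the table and converting at the point of use avoids guessing the base of entries the parser does not know about.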
+
+================
+Common variables
+================
+
+init_uts_ns.name.release
+========================
+The OS release string, i.e. the kernel version. Based on this version,
+people can find the corresponding source code. When analyzing the vmcore,
+people usually need to read the source code to find out why the kernel crashed.
+
+PAGE_SIZE
+=========
+The size of a page. It is the smallest unit of data for memory management
+in the kernel. It is usually 4k bytes, and pages are aligned on 4k-byte
+boundaries, which is very important when computing addresses.
+
+init_uts_ns
+===========
+This is the UTS namespace, which is used to isolate two specific elements
+of the system that relate to the uname system call. The UTS namespace is
+named after the data structure used to store information returned by the
+uname system call.
+
+User-space tools can get the kernel name, host name, kernel release number,
+kernel version, architecture name and OS type from the 'init_uts_ns'.
+
+node_online_map
+===============
+It is a macro that expands to the array element node_states[N_ONLINE].
+It represents the set of online nodes in a system, one bit position
+per node number.
+
+This is used to keep track of which nodes are in the system and online.
+
+swapper_pg_dir
+==============
+It generally indicates the pgd for the kernel. When the MMU is enabled in
+the kernel configuration, 'swapper_pg_dir' is valid.
+
+'swapper_pg_dir' helps to translate a virtual address to a physical
+address.
+
+_stext
+======
+It is an assembler symbol that defines the beginning of the text section.
+In general, '_stext' indicates the kernel start address. It is used to
+convert a virtual address to a physical address when the virtual address
+does not belong to the vmalloc area.
+
+vmap_area_list
+==============
+It stores the list of virtual memory areas. Makedumpfile can get the vmalloc
+start value from this variable. This value is necessary for vmalloc translation.
+
+mem_map
+=======
+Physical addresses are translated to struct pages by treating them as an
+index into the mem_map array. Shifting a physical address PAGE_SHIFT bits
+to the right will treat it as a PFN from physical address 0, which is also
+an index within the mem_map array.
+
+In short, it maps a physical address to its struct page.
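Annotation (not part of the patch): the translation described above is plain arithmetic. Below is a minimal sketch, assuming the flat memory model with memory starting at PFN 0. The function name paddr_to_page and the numbers in the usage lines are hypothetical; in practice the inputs would come from PAGESIZE, SYMBOL(mem_map) and SIZE(page).

```python
def paddr_to_page(paddr, mem_map, page_size, sizeof_page):
    """Return the virtual address of the struct page describing 'paddr'."""
    page_shift = page_size.bit_length() - 1   # 4096 -> 12
    pfn = paddr >> page_shift                 # page frame number
    return mem_map + pfn * sizeof_page        # index into the mem_map array


# Hypothetical values: 4k pages, 64-byte struct page, mem_map at 0x1000000.
page_addr = paddr_to_page(0x3000, 0x1000000, 4096, 64)
```

Physical address 0x3000 is PFN 3 here, so the result is the fourth element of the array.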
+
+contig_page_data
+================
+Makedumpfile can get the pglist_data structure from the symbol
+'contig_page_data'. The pglist_data structure is used to describe the
+memory layout.
+
+User-space tools can use this symbol for excluding free pages.
+
+mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
+==========================================================================
+Export the address of the 'mem_section' array, its length, its structure
+size, and the 'section_mem_map' offset.
+
+It exists in the sparse memory mapping model and is somewhat similar to
+the mem_map variable: both of them help to translate addresses.
+
+page
+====
+The size of a 'page' structure. In the kernel, the page is an important data
+structure; it is widely used to compute contiguous memory.
+
+pglist_data
+===========
+The size of a 'pglist_data' structure. This value will be used to check if
+the 'pglist_data' structure is valid. It is also one of the conditions for
+checking the memory type.
+
+zone
+====
+The size of a 'zone' structure. This value is often used to check whether
+the 'zone' structure is found. It is a necessary structure for excluding
+free pages.
+
+free_area
+=========
+The size of a 'free_area' structure. It indicates whether the 'free_area'
+structure is valid or not. This is useful for excluding free pages.
+
+list_head
+=========
+The size of a 'list_head' structure. User-space tools depend on this value
+when iterating the free list.
+
+nodemask_t
+==========
+The size of a 'nodemask_t' type. This value is used to compute the number
+of online nodes.
+
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|
+ compound_order|compound_head)
+===================================================================
+User-space tools can compute their values based on the offset of these
+variables. The variables are helpful to exclude unnecessary pages.
+
+(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|
+ node_spanned_pages|node_id)
+===================================================================
+On NUMA machines, each NUMA node has a pg_data_t to describe its memory
+layout. On UMA machines there is a single pglist_data which describes the
+whole memory.
+
+These values are used to check the memory type, and they are also helpful
+to compute the virtual address for memory map.
+
+(zone, free_area|vm_stat|spanned_pages)
+=======================================
+Each node is divided up into a number of blocks called zones which
+represent ranges within memory. A zone is described by a structure zone.
+Each zone type is suitable for a different type of usage.
+
+User-space tools can compute their values based on the offset of these
+variables.
+
+(free_area, free_list)
+======================
+Offset of the 'free_list' member. This value is used to compute the number
+of free pages.
+
+Each zone has a free_area structure array called free_area[MAX_ORDER].
+The fields in this structure are simple: free_list represents a linked
+list of free page blocks.
+
+(list_head, next|prev)
+======================
+Offsets of the list_head's members. In general, the list_head is used to
+define a circular linked list. User-space tools often need to traverse
+the lists to get specific pages.
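Annotation (not part of the patch): a tool that only has the offsets of 'next' and 'prev' can still walk such a circular list by following 'next' until it returns to the head. The sketch below illustrates this on a toy memory image. The names walk_list and read_ptr and the addresses are hypothetical; a real tool would read pointers from the vmcore instead of a dictionary.

```python
def walk_list(read_ptr, head, next_offset):
    """Yield the address of every list_head linked after 'head'.

    read_ptr(addr) returns the pointer stored at 'addr'.
    next_offset is OFFSET(list_head.next).
    """
    seen = set()                      # guards against a corrupted list
    node = read_ptr(head + next_offset)
    while node != head and node not in seen:
        seen.add(node)
        yield node
        node = read_ptr(node + next_offset)


# Toy memory: address -> pointer value, holding only the 'next' pointers.
memory = {0x100: 0x200, 0x200: 0x300, 0x300: 0x100}
nodes = list(walk_list(memory.__getitem__, 0x100, 0))
```

The 'seen' set is a defensive choice for crash dumps, where a list may have been damaged before the dump was taken.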
+
+(vmap_area, va_start|list)
+==========================
+Offsets of the vmap_area's members. They carry vmalloc layer information.
+Makedumpfile can get the start address of the vmalloc region from them.
+
+(zone.free_area, MAX_ORDER)
+===========================
+It indicates the number of elements in the free_area array. This macro is
+used by the zone buddy allocator. User-space tools use this value to
+iterate over the free_area array.
+
+log_buf
+=======
+In general, console output is written to the ring buffer 'log_buf'; the
+oldest record is at index 'log_first_idx'. It is used to get the kernel log.
+
+log_buf_len
+===========
+The length of 'log_buf'. Makedumpfile needs it to know how much of the
+log_buf to read.
+
+log_first_idx
+=============
+Index of the first record stored in the buffer 'log_buf'. This value
+tells user-space tools where to start reading the records in the
+log_buf.
+
+clear_idx
+=========
+The index of the next printk record to read after the last 'clear'
+command. It indicates the first record after the last
+SYSLOG_ACTION_CLEAR, such as one issued by 'dmesg -c'.
+
+log_next_idx
+============
+The index of the next record to store in the buffer 'log_buf'. It helps
+to compute the position at which the stored records end.
+
+printk_log
+==========
+The size of a structure 'printk_log'. It helps to compute the size of
+messages, and extract dmesg log.
+
+(printk_log, ts_nsec|len|text_len|dict_len)
+===========================================
+The offsets of these fields in the structure 'printk_log'. User-space
+tools can use them to parse the records and to detect any changes to the
+structure down the line.
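Annotation (not part of the patch): taken together, log_buf, log_first_idx, log_next_idx and the printk_log offsets are enough to extract the kernel log. The following is a minimal sketch of that walk over an in-memory copy of the buffer. It assumes a little-endian dump, 'len' and 'text_len' being 16-bit fields, a 64-bit 'ts_nsec', the message text directly following the fixed-size header, and a zero 'len' marking a wrap to the start of the buffer. The helper names and the layout in the usage lines are hypothetical.

```python
import struct


def iter_log(buf, first_idx, next_idx, off):
    """Yield (ts_nsec, text) for each record from first_idx up to next_idx.

    'off' holds OFFSET(printk_log.*) values plus SIZE(printk_log) as 'size'.
    """
    idx = first_idx
    for _ in range(len(buf)):            # bound the walk if the dump is corrupt
        if idx == next_idx:
            break
        (length,) = struct.unpack_from("<H", buf, idx + off["len"])
        if length == 0:                  # wrap marker: continue at offset 0
            idx = 0
            continue
        (ts_nsec,) = struct.unpack_from("<Q", buf, idx + off["ts_nsec"])
        (text_len,) = struct.unpack_from("<H", buf, idx + off["text_len"])
        start = idx + off["size"]        # text follows the record header
        yield ts_nsec, buf[start:start + text_len].decode(errors="replace")
        idx += length


def make_record(ts_nsec, text):
    """Build one toy record: 16-byte header, text, padding to 8 bytes."""
    body = text.encode()
    length = (16 + len(body) + 7) & ~7
    header = struct.pack("<QHHHBB", ts_nsec, length, len(body), 0, 0, 0)
    return (header + body).ljust(length, b"\0")


# Hypothetical layout matching make_record() above.
OFFSETS = {"ts_nsec": 0, "len": 8, "text_len": 10, "size": 16}
log_buf = make_record(1, "hello") + make_record(2, "world")
records = list(iter_log(log_buf, 0, len(log_buf), OFFSETS))
```

Passing the offsets in as data, rather than hard-coding them, is what lets a tool keep working when the structure layout changes between kernel versions.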
+
+(free_area.free_list, MIGRATE_TYPES)
+====================================
+The number of migrate types for pages. The free_list is an array indexed
+by migrate type, so user-space tools need to know its length.
+
+NR_FREE_PAGES
+=============
+On linux-2.6.21 or later, the number of free pages is in
+vm_stat[NR_FREE_PAGES]. User-space tools can get the number of free pages
+from this array.
+
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|
+PG_hwpoison|PG_head_mask
+=====================================================
+They describe attributes of a page. These flags are used to filter out
+pages that do not need to be dumped.
+
+PAGE_BUDDY_MAPCOUNT_VALUE or ~PG_buddy
+======================================
+The 'PG_buddy' flag indicates that the page is free and in the buddy
+system. Makedumpfile can exclude the free pages managed by the buddy system.
+
+HUGETLB_PAGE_DTOR
+=================
+The 'HUGETLB_PAGE_DTOR' value identifies hugetlbfs pages. Makedumpfile
+will exclude these pages.
+
+================
+x86_64 variables
+================
+
+phys_base
+=========
+On x86_64, 'phys_base' is necessary to convert the virtual address of an
+exported kernel symbol to a physical address.
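Annotation (not part of the patch): for addresses in the kernel text mapping, the conversion is an offset calculation relative to the start of that mapping. Below is a minimal sketch, assuming the x86_64 value of __START_KERNEL_map (0xffffffff80000000). The function name and the sample numbers are hypothetical, and the sketch does not cover direct-mapped or vmalloc addresses.

```python
START_KERNEL_MAP = 0xffffffff80000000   # x86_64 __START_KERNEL_map


def kernel_text_vtop(vaddr, phys_base):
    """Translate a kernel text mapping address using phys_base."""
    if vaddr < START_KERNEL_MAP:
        raise ValueError("address is not in the kernel text mapping")
    return vaddr - START_KERNEL_MAP + phys_base


# Hypothetical symbol address and phys_base.
paddr = kernel_text_vtop(0xffffffff81000000, 0x1000000)
```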
+
+init_top_pgt
+============
+'init_top_pgt' is used to walk through the whole page table and convert
+virtual addresses to physical addresses.
+
+pgtable_l5_enabled
+==================
+User-space tools need to know whether the crashed kernel was in 5-level
+paging mode or not.
+
+node_data
+=========
+This is an array of pointers to struct 'pglist_data'; it stores information
+about all NUMA nodes. In general, makedumpfile can get the pglist_data
+structure from the symbol 'node_data'.
+
+(node_data, MAX_NUMNODES)
+=========================
+The length of the 'node_data' array. It is the maximum number of nodes
+in the system.
+
+KERNELOFFSET
+============
+The kernel randomization offset, i.e. the offset that KASLR applied to the
+kernel image address. It is used to compute the page offset on x86_64. If
+KASLR is disabled, this value is zero.
+
+KERNEL_IMAGE_SIZE
+=================
+The value of 'KERNEL_IMAGE_SIZE', currently unused.
+
+Previously, MODULES_VADDR had to be derived from KERNEL_IMAGE_SIZE when
+KASLR was enabled. MODULES_VADDR is no longer needed since Pratyush made
+all VA to PA conversions go through page table lookup.
+
+PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+========================================
+The value of 'PG_offline' flag can be used for marking pages as logically
+offline. Makedumpfile can directly skip pages that are logically offline.
+
+sme_mask
+========
+For AMD machines with the SME feature, it indicates the secure memory
+encryption mask. Makedumpfile needs to know whether the crashed kernel's
+memory was encrypted. If SME is enabled in the first kernel, its page
+table entries (pgd/pud/pmd/pte) contain the memory encryption mask, so the
+SME mask must be removed to obtain the true physical address.
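Annotation (not part of the patch): removing the mask amounts to clearing the encryption bit before the physical address is extracted from a page table entry. The following is a minimal sketch. It assumes 4k pages and a 52-bit physical address field; the function name is hypothetical, and the mask value (bit 47) is only an example, since the real value is the one exported as sme_mask.

```python
PHYS_ADDR_MASK = (1 << 52) - 1   # assumed: physical address bits 0..51


def pte_to_paddr(entry, sme_mask, page_size=4096):
    """Return the page frame address stored in a page table entry."""
    without_sme = entry & ~sme_mask                 # drop the encryption bit
    return without_sme & PHYS_ADDR_MASK & ~(page_size - 1)


# Example entry: frame 0x1234000, low flag bits 0x63, encryption bit 47 set.
frame = pte_to_paddr((1 << 47) | 0x1234063, 1 << 47)
```

Without the first step, the encryption bit would be interpreted as part of the address and the tool would read from the wrong location in the dump.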
+
+=============
+x86 variables
+=============
+
+X86_PAE
+=======
+It denotes the physical address extension (PAE). It has the cost of more
+page table lookup overhead, and also consumes more page table space
+per process. This flag is used to check whether PAE was enabled in
+the crashed kernel when converting a virtual address to a physical
+address.
+
+==============
+ia64 variables
+==============
+
+pgdat_list|(pgdat_list, MAX_NUMNODES)
+=====================================
+This is a 'pg_data_t' array; it stores information about all NUMA nodes.
+'MAX_NUMNODES' indicates the number of nodes.
+
+node_memblk|(node_memblk, NR_NODE_MEMBLKS)
+==========================================
+List of node memory chunks. It is filled when parsing the SRAT table to
+obtain information about memory nodes. 'NR_NODE_MEMBLKS' indicates the
+number of node memory chunks.
+
+These values are used to compute the number of nodes in the crashed kernel.
+
+node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
+================================================================
+The size of a struct 'node_memblk_s', and the offsets of the
+node_memblk_s's members. It helps to compute the number of nodes.
+
+PGTABLE_3|PGTABLE_4
+===================
+User-space tools need to know whether the crashed kernel was in 3-level or
+4-level paging mode. This flag helps to distinguish the page table layout.
+
+===============
+arm64 variables
+===============
+
+VA_BITS
+=======
+The maximum number of bits for virtual addresses. This value helps to
+compute the virtual memory ranges.
+
+kimage_voffset
+==============
+The offset between the kernel virtual and physical mappings. This value
+helps to translate virtual address to physical address.
+
+PHYS_OFFSET
+===========
+It indicates the physical address of the start of memory. It is similar
+to kimage_voffset in that it is used to translate virtual addresses to
+physical addresses.
+
+KERNELOFFSET
+============
+It is similar to x86_64.
+
+=============
+arm variables
+=============
+
+ARM_LPAE
+========
+It indicates whether the crashed kernel supports the large physical address
+extension. This value tells user-space tools how to translate virtual
+addresses to physical addresses.
+
+==============
+s390 variables
+==============
+
+lowcore_ptr
+===========
+An array with a pointer to the lowcore of every CPU. This value
+helps to print the PSW and all register information.
+
+high_memory
+===========
+User-space tools can get the vmalloc_start address from the high_memory symbol.
+
+(lowcore_ptr, NR_CPUS)
+======================
+The maximum number of cpus.
+
+TODO.
+
+=================
+powerpc variables
+=================
+
+node_data|(node_data, MAX_NUMNODES)
+===================================
+Please refer to common variables.
+
+contig_page_data
+================
+Please refer to common variables.
+
+vmemmap_list
+============
+The 'vmemmap_list' maintains the entire vmemmap physical mapping. It is
+used to get the vmemmap list count and to populate vmemmap region info. If
+the vmemmap address translation information is stored in the crashed
+kernel, it helps to translate vmemmap kernel virtual addresses.
+
+mmu_vmemmap_psize
+=================
+The size of a page. The kernel tries to use this page size for vmemmap if
+it is supported. This value helps to translate virtual addresses to
+physical addresses.
+
+mmu_psize_defs
+==============
+It stores a variety of page size definitions, e.g. 4k, 64k, or 16M.
+
+User-space tools depend on this value when making vtop translations.
+
+vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|
+(vmemmap_backing, virt_addr)
+================================================================
+The vmemmap virtual address space management does not have a traditional
+page table to track which virtual struct pages are backed by physical
+mapping. The virtual to physical mappings are tracked in a simple linked
+list format.
+
+User-space tools need to know the offsets of 'list', 'phys' and
+'virt_addr'. They depend on these values when computing the count of
+vmemmap regions.
+
+mmu_psize_def|(mmu_psize_def, shift)
+====================================
+The size of a struct 'mmu_psize_def', and the offset of mmu_psize_def's
+member.
+
+These values help to make the vtop translations.
+
+============
+sh variables
+============
+
+node_data|(node_data, MAX_NUMNODES)
+===================================
+It is similar to x86_64, please refer to the description above.
+
+X2TLB
+=====
+It indicates whether the crashed kernel enables the extended mode of the SH.
+
+TODO.
--
2.17.1