Re: [PATCH 1/2 v5] kdump: add the vmcoreinfo documentation

From: lijiang
Date: Mon Jan 07 2019 - 04:39:17 EST


On 2019-01-07 15:55, Hatayama, Daisuke wrote:
> Hi,
>
>> -----Original Message-----
>> From: linux-kernel-owner@xxxxxxxxxxxxxxx
>> [mailto:linux-kernel-owner@xxxxxxxxxxxxxxx] On Behalf Of Lianbo Jiang
>> Sent: Monday, January 7, 2019 10:48 AM
>> To: linux-kernel@xxxxxxxxxxxxxxx
>> Cc: kexec@xxxxxxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx;
>> bp@xxxxxxxxx; x86@xxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; bhe@xxxxxxxxxx;
>> dyoung@xxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx; k-hagio@xxxxxxxxxxxxx;
>> anderson@xxxxxxxxxx
>> Subject: [PATCH 1/2 v5] kdump: add the vmcoreinfo documentation
>>
>> This document lists some variables that are exported to vmcoreinfo, and
>> briefly describes what these variables indicate. It should be instructive for
>> many people who do not know the vmcoreinfo, and it also normalizes the
>
> I agree to this part, but
>
>> exported variables as a convention between kernel and user-space.
>
> I don't agree to this part.
>
> The meaning of each symbol is decided by each feature in the kernel,
> not by vmcoreinfo. I suspect someone may mistakenly take this document as
> an ABI that enforces each symbol working as described. We can change
> symbols and their meaning regardless of this document.
>
> Oh, I found this topic has already been discussed at v3, and
> you removed "ABI" in the patch description at v4.
>
> But it seems still confusing to me.
> I think an explicit statement is necessary, e.g. in a "Purpose of this
> document" section, saying that this is for user-land tools, that they
> treat each symbol as described, and that the document never affects the
> implementation of any kernel component.
>

Thanks for your advice.

If this part could turn the document into a rope tied around our necks, I would
like to remove it from the patch log in the next post.

Regards,
Lianbo

>>
>> Suggested-by: Borislav Petkov <bp@xxxxxxx>
>> Signed-off-by: Lianbo Jiang <lijiang@xxxxxxxxxx>
>> ---
>> Documentation/kdump/vmcoreinfo.txt | 500 +++++++++++++++++++++++++++++
>> 1 file changed, 500 insertions(+)
>> create mode 100644 Documentation/kdump/vmcoreinfo.txt
>>
>> diff --git a/Documentation/kdump/vmcoreinfo.txt
>> b/Documentation/kdump/vmcoreinfo.txt
>> new file mode 100644
>> index 000000000000..8e444586b87b
>> --- /dev/null
>> +++ b/Documentation/kdump/vmcoreinfo.txt
>> @@ -0,0 +1,500 @@
>> +================================================================
>> + VMCOREINFO
>> +================================================================
>> +
>> +=======================
>> +What is the VMCOREINFO?
>> +=======================
>> +
>> +VMCOREINFO is a special ELF note section. It contains various
>> +information from the kernel like structure size, page size, symbol
>> +values, field offsets, etc. These data are packed into an ELF note
>> +section and used by user-space tools like crash and makedumpfile to
>> +analyze a kernel's memory layout.
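
For readers of the thread who have not looked inside the note: the data
is plain text, one key=value pair per line. Below is a minimal Python
sketch of how a tool might split it up. parse_vmcoreinfo is a
hypothetical helper and the sample values are made up; this is not code
taken from crash or makedumpfile.

```python
import re

# Matches typed keys such as SYMBOL(_stext), OFFSET(page.flags), SIZE(page).
_TYPED_KEY = re.compile(r"^(SYMBOL|OFFSET|SIZE|LENGTH|NUMBER)\((.+)\)$")


def parse_vmcoreinfo(text):
    """Split vmcoreinfo note text into {kind: {name: raw string value}}."""
    info = {"PLAIN": {}}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        m = _TYPED_KEY.match(key)
        if m:
            info.setdefault(m.group(1), {})[m.group(2)] = value
        else:
            info["PLAIN"][key] = value
    return info


# Hypothetical sample; real values differ for every kernel build.
sample = """OSRELEASE=4.20.0
PAGESIZE=4096
SYMBOL(swapper_pg_dir)=ffffffff8220e000
SIZE(page)=64
OFFSET(page.flags)=0
NUMBER(NR_FREE_PAGES)=0
KERNELOFFSET=0
"""

info = parse_vmcoreinfo(sample)
page_size = int(info["PLAIN"]["PAGESIZE"])            # decimal
swapper = int(info["SYMBOL"]["swapper_pg_dir"], 16)   # hex, no 0x prefix
```

Values are kept as raw strings because the base differs by kind: symbol
addresses are hexadecimal, while sizes and offsets are decimal.
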
>> +
>> +================
>> +Common variables
>> +================
>> +
>> +init_uts_ns.name.release
>> +------------------------
>> +
>> +The version of the Linux kernel. Used to find the corresponding source
>> +code from which the kernel has been built.
>> +
>> +PAGE_SIZE
>> +---------
>> +
>> +The size of a page. It is the smallest unit of data for memory
>> +management in the kernel. It is usually 4096 bytes and a page is aligned
>> +on 4096 bytes. Used for computing page addresses.
>> +
>> +init_uts_ns
>> +-----------
>> +
>> +This is the UTS namespace, which is used to isolate two specific
>> +elements of the system that relate to the uname(2) system call. The UTS
>> +namespace is named after the data structure used to store information
>> +returned by the uname(2) system call.
>> +
>> +User-space tools can get the kernel name, host name, kernel release
>> +number, kernel version, architecture name and OS type from it.
>> +
>> +node_online_map
>> +---------------
>> +
>> +An array node_states[N_ONLINE] which represents the set of online nodes
>> +in a system, one bit position per node number. Used to keep track of
>> +which nodes are in the system and online.
>> +
>> +swapper_pg_dir
>> +--------------
>> +
>> +The global page directory pointer of the kernel. Used to translate
>> +virtual to physical addresses.
>> +
>> +_stext
>> +------
>> +
>> +Defines the beginning of the text section. In general, _stext indicates
>> +the kernel start address. Used to convert a virtual address from the
>> +direct kernel map to a physical address.
>> +
>> +vmap_area_list
>> +--------------
>> +
>> +Stores the virtual area list. makedumpfile can get the vmalloc start
>> +value from this variable. This value is necessary for vmalloc translation.
>> +
>> +mem_map
>> +-------
>> +
>> +Physical addresses are translated to struct pages by treating them as
>> +an index into the mem_map array. Right-shifting a physical address
>> +PAGE_SHIFT bits converts it into a page frame number which is an index
>> +into that mem_map array.
>> +
>> +Used to map an address to the corresponding struct page.
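
The arithmetic described here can be sketched as follows. This assumes
the flat memory model with mem_map[0] describing page frame 0;
phys_to_page and the numbers in the example are hypothetical, and the
sparse model needs the mem_section lookup instead.

```python
def phys_to_page(paddr, mem_map, page_shift, page_struct_size):
    """Return the address of the struct page describing paddr.

    Only valid for a flat mem_map that starts at page frame 0.
    """
    pfn = paddr >> page_shift                  # page frame number
    return mem_map + pfn * page_struct_size    # &mem_map[pfn]


# Hypothetical values: 4 KiB pages (shift 12), 64-byte struct page.
page_addr = phys_to_page(0x3000, mem_map=0x1000, page_shift=12,
                         page_struct_size=64)
```

The struct page size is the SIZE(page) value, and the shift follows from
PAGESIZE.
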
>> +
>> +contig_page_data
>> +----------------
>> +
>> +Makedumpfile can get the pglist_data structure from this symbol, which
>> +is used to describe the memory layout.
>> +
>> +User-space tools use this to exclude free pages when dumping memory.
>> +
>> +mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
>> +--------------------------------------------------------------------------
>> +
>> +The address of the mem_section array, its length, structure size, and
>> +the section_mem_map offset.
>> +
>> +It exists in the sparse memory mapping model, and it is also somewhat
>> +similar to the mem_map variable; both are used to translate an
>> +address.
>> +
>> +page
>> +----
>> +
>> +The size of a page structure. struct page is an important data structure
>> +and it is widely used to compute the contiguous memory.
>> +
>> +pglist_data
>> +-----------
>> +
>> +The size of a pglist_data structure. This value will be used to check
>> +if the pglist_data structure is valid. It is also used for checking the
>> +memory type.
>> +
>> +zone
>> +----
>> +
>> +The size of a zone structure. This value is often used to check if the
>> +zone structure has been found. It is also used for excluding free pages.
>> +
>> +free_area
>> +---------
>> +
>> +The size of a free_area structure. It indicates whether the free_area
>> +structure is valid or not. Useful for excluding free pages.
>> +
>> +list_head
>> +---------
>> +
>> +The size of a list_head structure. Used when iterating lists in a
>> +post-mortem analysis session.
>> +
>> +nodemask_t
>> +----------
>> +
>> +The size of a nodemask_t type. Used to compute the number of online
>> +nodes.
>> +
>> +(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|
>> + compound_order|compound_head)
>> +-------------------------------------------------------------------
>> +
>> +User-space tools can compute their values based on the offset of these
>> +variables. The variables are helpful to exclude unnecessary pages.
>> +
>> +(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|
>> + node_spanned_pages|node_id)
>> +-------------------------------------------------------------------
>> +
>> +On NUMA machines, each NUMA node has a pg_data_t to describe its memory
>> +layout. On UMA machines there is a single pglist_data which describes the
>> +whole memory.
>> +
>> +These values are used to check the memory type, and they are also helpful
>> +to compute the virtual address for memory map.
>> +
>> +(zone, free_area|vm_stat|spanned_pages)
>> +---------------------------------------
>> +
>> +Each node is divided into a number of blocks called zones which
>> +represent ranges within memory. A zone is described by a structure zone.
>> +Each zone type is suitable for a different type of usage.
>> +
>> +User-space tools can compute required values based on the offset of these
>> +variables.
>> +
>> +(free_area, free_list)
>> +----------------------
>> +
>> +Offset of the free_list's member. This value is used to compute the number
>> +of free pages.
>> +
>> +Each zone has a free_area structure array called free_area[MAX_ORDER].
>> +The fields in this structure are simple; free_list represents a linked
>> +list of free page blocks.
>> +
>> +(list_head, next|prev)
>> +----------------------
>> +
>> +Offsets of the list_head's members. list_head is used to define a
>> +circular linked list. User-space tools need these in order to traverse
>> +lists.
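
A minimal sketch of such a traversal over a memory image, assuming
8-byte little-endian pointers. walk_list and read_ptr are hypothetical
names; a real tool would read through its own dump-access layer.

```python
import struct


def walk_list(read_ptr, head, next_offset, limit=1_000_000):
    """Yield the address of every list_head linked after head.

    read_ptr(addr) must return the pointer stored at addr.
    The walk stops when the circular list comes back to head.
    """
    node = read_ptr(head + next_offset)
    for _ in range(limit):
        if node == head:
            return
        yield node
        node = read_ptr(node + next_offset)
    raise RuntimeError("list never wrapped back to head; dump may be corrupt")


# Fake memory image with three list_heads at 0x00 (head), 0x10 and 0x20.
# Each holds (next, prev); OFFSET(list_head.next) is assumed to be 0.
mem = bytearray(0x30)
struct.pack_into("<QQ", mem, 0x00, 0x10, 0x20)
struct.pack_into("<QQ", mem, 0x10, 0x20, 0x00)
struct.pack_into("<QQ", mem, 0x20, 0x00, 0x10)


def read_ptr(addr):
    return struct.unpack_from("<Q", mem, addr)[0]


nodes = list(walk_list(read_ptr, head=0x00, next_offset=0))
```

The limit guards against a corrupted dump in which the list never
returns to its head. Passing the prev offset instead walks the list
backwards.
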
>> +
>> +(vmap_area, va_start|list)
>> +--------------------------
>> +
>> +Offsets of the vmap_area's members. They indicate the vmalloc layer
>> +information. Makedumpfile gets the start address of the vmalloc region.
>> +
>> +(zone.free_area, MAX_ORDER)
>> +---------------------------
>> +
>> +The number of elements in the free_area array. This macro is used by
>> +the zone buddy allocator. User-space tools use this value to iterate
>> +over free_area.
>> +
>> +log_buf
>> +-------
>> +
>> +Console output is written to the ring buffer log_buf at index
>> +log_first_idx. Used to get the kernel log.
>> +
>> +log_buf_len
>> +-----------
>> +
>> +Length of a log_buf. Used to read the number of strings from the
>> +log_buf.
>> +
>> +log_first_idx
>> +-------------
>> +
>> +Index of the first record stored in the buffer log_buf. Used by
>> +user-space tools to read the strings in the log_buf.
>> +
>> +clear_idx
>> +---------
>> +
>> +The index of the next printk() record to read after the last clear
>> +command. It indicates the first record after the last
>> +SYSLOG_ACTION_CLEAR, such as one issued by 'dmesg -c'. Used by
>> +user-space tools to dump the dmesg log.
>> +
>> +log_next_idx
>> +------------
>> +
>> +The index of the next record to store in the buffer log_buf. Used to
>> +compute the index of the current string position.
>> +
>> +printk_log
>> +----------
>> +
>> +The size of a structure printk_log. Used to compute the size of
>> +messages and to extract the dmesg log as human-readable text. The
>> +structure encapsulates header information for log_buf entries, such as
>> +timestamp, syslog level, etc.
>> +
>> +(printk_log, ts_nsec|len|text_len|dict_len)
>> +-------------------------------------------
>> +
>> +It represents field offsets in struct printk_log. User-space tools can
>> +parse it and check whether the values of printk_log's members have been
>> +changed.
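
To show how log_buf, log_first_idx, log_next_idx and the printk_log
offsets fit together, here is a rough sketch of walking the records. It
assumes a little-endian dump, a 64-bit ts_nsec, 16-bit len and text_len
fields, and a zero len marking the point where the buffer wraps;
iter_log_records and the layout values are hypothetical, not
makedumpfile code.

```python
import struct


def iter_log_records(log_buf, first_idx, next_idx, layout):
    """Yield (ts_nsec, text) for records from first_idx up to next_idx."""
    header_size = layout["SIZE(printk_log)"]
    idx = first_idx
    # Bound the walk so a corrupted buffer cannot loop forever.
    for _ in range(len(log_buf)):
        if idx == next_idx:
            return
        (rec_len,) = struct.unpack_from(
            "<H", log_buf, idx + layout["OFFSET(printk_log.len)"])
        if rec_len == 0:
            if idx == 0:
                return  # empty or corrupted buffer
            idx = 0     # zero length: continue at the start of the buffer
            continue
        (ts_nsec,) = struct.unpack_from(
            "<Q", log_buf, idx + layout["OFFSET(printk_log.ts_nsec)"])
        (text_len,) = struct.unpack_from(
            "<H", log_buf, idx + layout["OFFSET(printk_log.text_len)"])
        start = idx + header_size  # message text follows the header
        text = bytes(log_buf[start:start + text_len]).decode("utf-8", "replace")
        yield ts_nsec, text
        idx += rec_len


# Hypothetical layout and a hand-built two-record buffer.
layout = {
    "SIZE(printk_log)": 16,
    "OFFSET(printk_log.ts_nsec)": 0,
    "OFFSET(printk_log.len)": 8,
    "OFFSET(printk_log.text_len)": 10,
}
buf = bytearray(64)
for idx, ts, text in ((0, 1000, b"hello"), (24, 2000, b"world!")):
    struct.pack_into("<QHHH", buf, idx, ts, 24, len(text), 0)
    buf[idx + 16:idx + 16 + len(text)] = text

records = list(iter_log_records(buf, first_idx=0, next_idx=48, layout=layout))
```

Taking the offsets from vmcoreinfo rather than hard-coding them is what
lets a tool keep working when the structure layout changes.
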
>> +
>> +(free_area.free_list, MIGRATE_TYPES)
>> +------------------------------------
>> +
>> +The number of migrate types for pages. The free_list is an array with
>> +one list per migrate type, so makedumpfile needs the number of entries
>> +when it computes the number of free pages.
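
Combining this with the zone and free_area entries above, the address of
one free list head is plain offset arithmetic. A sketch, with a
hypothetical free_list_head helper and made-up layout values:

```python
def free_list_head(zone_addr, order, migrate_type, layout):
    """Address of zone->free_area[order].free_list[migrate_type]."""
    if not 0 <= order < layout["LENGTH(zone.free_area)"]:
        raise IndexError("order out of range")
    if not 0 <= migrate_type < layout["LENGTH(free_area.free_list)"]:
        raise IndexError("migrate type out of range")
    return (zone_addr
            + layout["OFFSET(zone.free_area)"]
            + order * layout["SIZE(free_area)"]
            + layout["OFFSET(free_area.free_list)"]
            + migrate_type * layout["SIZE(list_head)"])


# Hypothetical layout; real numbers come from the vmcoreinfo note.
layout = {
    "OFFSET(zone.free_area)": 192,
    "SIZE(free_area)": 88,
    "OFFSET(free_area.free_list)": 0,
    "SIZE(list_head)": 16,
    "LENGTH(zone.free_area)": 11,       # MAX_ORDER
    "LENGTH(free_area.free_list)": 5,   # MIGRATE_TYPES
}
head = free_list_head(0x1000, order=2, migrate_type=3, layout=layout)
```

Each block found on the list at a given order covers 2^order pages,
which is how the free page count is accumulated.
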
>> +
>> +NR_FREE_PAGES
>> +-------------
>> +
>> +On linux-2.6.21 or later, the number of free_pages is in
>> +vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
>> +
>> +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison
>> +|PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)
>> +|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
>> +-----------------------------------------------------------------
>> +
>> +Page attributes. These flags are used to filter various unnecessary
>> +pages.
>> +
>> +HUGETLB_PAGE_DTOR
>> +-----------------
>> +
>> +The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
>> +excludes these pages.
>> +
>> +======
>> +x86_64
>> +======
>> +
>> +phys_base
>> +---------
>> +
>> +Used to convert the virtual address of an exported kernel symbol to its
>> +physical address.
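
A small sketch of that conversion for addresses inside the kernel image
mapping. 0xffffffff80000000 is the x86_64 __START_KERNEL_map base;
kernel_text_vtop is a hypothetical name and the phys_base value in the
example is made up.

```python
START_KERNEL_MAP = 0xffffffff80000000  # x86_64 __START_KERNEL_map


def kernel_text_vtop(vaddr, phys_base):
    """Translate a kernel-image virtual address to a physical address."""
    if vaddr < START_KERNEL_MAP:
        raise ValueError("address is not in the kernel image mapping")
    return vaddr - START_KERNEL_MAP + phys_base


# Hypothetical: kernel image loaded 16 MiB above its default location.
paddr = kernel_text_vtop(0xffffffff81000000, phys_base=0x1000000)
```

Addresses in the direct map or the vmalloc area need a different
translation; this formula only covers symbols in the kernel image.
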
>> +
>> +init_top_pgt
>> +------------
>> +
>> +Used to walk through the whole page table and convert virtual addresses
>> +to physical addresses. The init_top_pgt is somewhat similar to the
>> +swapper_pg_dir, but it is only used in x86_64.
>> +
>> +pgtable_l5_enabled
>> +------------------
>> +
>> +User-space tools need to know whether the crash kernel was in 5-level
>> +paging mode.
>> +
>> +node_data
>> +---------
>> +
>> +This is a struct pglist_data array and stores information about all
>> +NUMA nodes. Makedumpfile gets the pglist_data structure from it.
>> +
>> +(node_data, MAX_NUMNODES)
>> +-------------------------
>> +
>> +The maximum number of nodes in the system.
>> +
>> +KERNELOFFSET
>> +------------
>> +
>> +The kernel randomization offset. Used to compute the page offset. If
>> +KASLR is disabled, this value is zero.
>> +
>> +KERNEL_IMAGE_SIZE
>> +-----------------
>> +
>> +Currently unused by Makedumpfile. Used to compute the module virtual
>> +address by Crash.
>> +
>> +sme_mask
>> +--------
>> +
>> +For AMD machines with the SME feature, it indicates the secure memory
>> +encryption mask. Makedumpfile tools need to know whether the crash
>> +kernel was encrypted. If SME is enabled in the first kernel, the crash
>> +kernel's page table (pgd/pud/pmd/pte) contains the memory encryption
>> +mask and this is used to remove the SME mask to obtain the true physical
>> +address.
>> +
>> +Currently, the sme_mask stores the value of sme_me_mask (bit 47). If
>> +needed, the bit (sme_mask) might be redefined in the future, but bit 63
>> +will be reserved.
>> +
>> +For example:
>> +[ misc  ][ enc bit  ][ other misc SME info       ]
>> +0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
>> +63   59   55   51   47   43   39   35   31   27   ... 3
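
As a sketch of what "remove the SME mask" means for a page table entry:
strip_sme_mask is a hypothetical helper and the entry value is made up.

```python
def strip_sme_mask(entry, sme_mask):
    """Clear the memory encryption bit(s) from a page table entry."""
    return entry & ~sme_mask


sme_mask = 1 << 47                          # encryption bit, per the text above
entry = (1 << 47) | 0x12345000 | 0x067      # made-up entry with low flag bits
clean = strip_sme_mask(entry, sme_mask)
```

The low flag bits are still present afterwards, so a tool has to apply
its usual physical address mask as well. With sme_mask equal to zero the
entry is returned unchanged, which covers kernels without SME.
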
>> +
>> +======
>> +x86_32
>> +======
>> +
>> +X86_PAE
>> +-------
>> +
>> +Denotes whether physical address extensions are enabled. It has the cost
>> +of more page table lookup overhead, and also consumes more page table
>> +space per process. Used to check whether PAE was enabled in the crash
>> +kernel when converting virtual addresses to physical addresses.
>> +
>> +====
>> +ia64
>> +====
>> +
>> +pgdat_list|(pgdat_list, MAX_NUMNODES)
>> +-------------------------------------
>> +
>> +A pg_data_t array storing information about all NUMA nodes.
>> +MAX_NUMNODES indicates the number of nodes.
>> +
>> +node_memblk|(node_memblk, NR_NODE_MEMBLKS)
>> +------------------------------------------
>> +
>> +List of node memory chunks. Filled when parsing SRAT table to obtain
>> +information about memory nodes. NR_NODE_MEMBLKS indicates the number
>> +of node memory chunks.
>> +
>> +These values are used to compute the number of nodes in the crash kernel.
>> +
>> +node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
>> +----------------------------------------------------------------
>> +
>> +The size of a struct node_memblk_s and the offsets of the
>> +node_memblk_s's members. Used to compute the number of nodes.
>> +
>> +PGTABLE_3|PGTABLE_4
>> +-------------------
>> +
>> +User-space tools need to know whether the crash kernel was in 3-level or
>> +4-level paging mode. Used to distinguish the page table.
>> +
>> +=====
>> +ARM64
>> +=====
>> +
>> +VA_BITS
>> +-------
>> +
>> +The maximum number of bits for virtual addresses. Used to compute the
>> +virtual memory ranges.
>> +
>> +kimage_voffset
>> +--------------
>> +
>> +The offset between the kernel virtual and physical mappings. Used to
>> +translate virtual to physical addresses.
>> +
>> +PHYS_OFFSET
>> +-----------
>> +
>> +Indicates the physical address of the start of memory. Like
>> +kimage_voffset, it is used to translate virtual addresses to physical
>> +addresses.
>> +
>> +KERNELOFFSET
>> +------------
>> +
>> +The kernel randomization offset. Used to compute the page offset. If
>> +KASLR is disabled, this value is zero.
>> +
>> +====
>> +arm
>> +====
>> +
>> +ARM_LPAE
>> +--------
>> +
>> +It indicates whether the crash kernel supports large physical address
>> +extensions. Used to translate virtual addresses to physical addresses.
>> +
>> +====
>> +s390
>> +====
>> +
>> +lowcore_ptr
>> +-----------
>> +
>> +An array with a pointer to the lowcore of every CPU. Used to print the
>> +PSW and all register information.
>> +
>> +high_memory
>> +-----------
>> +
>> +Used to get the vmalloc_start address from the high_memory symbol.
>> +
>> +(lowcore_ptr, NR_CPUS)
>> +----------------------
>> +
>> +The maximum number of CPUs.
>> +
>> +=======
>> +powerpc
>> +=======
>> +
>> +
>> +node_data|(node_data, MAX_NUMNODES)
>> +-----------------------------------
>> +
>> +See above.
>> +
>> +contig_page_data
>> +----------------
>> +
>> +See above.
>> +
>> +vmemmap_list
>> +------------
>> +
>> +The vmemmap_list maintains the entire vmemmap physical mapping. It is
>> +used to get the vmemmap list count and to populate vmemmap region
>> +information. If the vmemmap address translation information is stored
>> +in the crash kernel, it helps to translate vmemmap kernel virtual
>> +addresses.
>> +
>> +mmu_vmemmap_psize
>> +-----------------
>> +
>> +The size of a page. Used to translate virtual addresses to physical
>> +addresses.
>> +
>> +mmu_psize_defs
>> +--------------
>> +
>> +Page size definitions, i.e. 4k, 64k, or 16M.
>> +
>> +Used to make vtop translations.
>> +
>> +vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|
>> +(vmemmap_backing, virt_addr)
>> +----------------------------------------------------------------
>> +
>> +The vmemmap virtual address space management does not have a traditional
>> +page table to track which virtual struct pages are backed by physical
>> +mapping. The virtual to physical mappings are tracked in a simple linked
>> +list format.
>> +
>> +User-space tools need to know the offset of list, phys and virt_addr
>> +when computing the count of vmemmap regions.
>> +
>> +mmu_psize_def|(mmu_psize_def, shift)
>> +------------------------------------
>> +
>> +The size of a struct mmu_psize_def and the offset of mmu_psize_def's
>> +member.
>> +
>> +Used in vtop translations.
>> +
>> +==
>> +sh
>> +==
>> +
>> +node_data|(node_data, MAX_NUMNODES)
>> +-----------------------------------
>> +
>> +See above.
>> +
>> +X2TLB
>> +-----
>> +
>> +Indicates whether the crash kernel enables SH extended mode.
>> --
>> 2.17.1
>>
>
>