Re: [PATCH v15 00/23] Generic page walk and ptdump

From: Qian Cai
Date: Mon Nov 04 2019 - 14:35:54 EST


On Fri, 2019-11-01 at 14:09 +0000, Steven Price wrote:
> Many architectures currently have a debugfs file for dumping the kernel
> page tables. Currently each architecture has to implement custom
> functions for this because the details of walking the page tables used
> by the kernel are different between architectures.
>
> This series extends the capabilities of walk_page_range() so that it can
> deal with the page tables of the kernel (which have no VMAs and can
> contain larger huge pages than exist for user space). A generic PTDUMP
> implementation is then built on top of the new functionality of
> walk_page_range(), and finally arm64 and x86 are switched to using it,
> removing the custom table walkers.
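To make the idea above concrete, here is a minimal userspace sketch of a callback-driven walker that stops descending once it reaches a leaf entry and reports empty slots as holes. It is not the kernel's walk_page_range(): the two-level table, the toy_* names and the callback signatures are all hypothetical, only loosely modelled on struct mm_walk_ops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical two-level table: each top-level slot covers TOY_PMD_SIZE
 * bytes and is empty, a leaf (one huge mapping) or a pointer to a table of
 * TOY_PTRS_PER_PTE small entries (0 = not present).
 */
#define TOY_PAGE_SIZE    0x1000UL
#define TOY_PTRS_PER_PTE 512UL
#define TOY_PMD_SIZE     (TOY_PAGE_SIZE * TOY_PTRS_PER_PTE)
#define TOY_PTRS_PER_PMD 4UL

enum toy_kind { TOY_NONE, TOY_LEAF, TOY_TABLE };

struct toy_pmd {
	enum toy_kind kind;
	const uint64_t *ptes;	/* only used for TOY_TABLE */
};

/* Callback table loosely modelled on struct mm_walk_ops; names are made up. */
struct toy_walk_ops {
	void (*leaf_entry)(unsigned long addr, unsigned long next, void *priv);
	void (*pte_entry)(uint64_t pte, unsigned long addr, void *priv);
	void (*hole)(unsigned long addr, unsigned long next, void *priv);
};

/* Walk [start, end): stop descending at leaf entries, report empty slots as holes. */
static void toy_walk_range(const struct toy_pmd *pmds, unsigned long start,
			   unsigned long end, const struct toy_walk_ops *ops,
			   void *priv)
{
	assert(start % TOY_PAGE_SIZE == 0 && end % TOY_PAGE_SIZE == 0);
	assert(start <= end && end <= TOY_PMD_SIZE * TOY_PTRS_PER_PMD);

	for (unsigned long addr = start; addr < end;) {
		unsigned long i = addr / TOY_PMD_SIZE;
		unsigned long next = (i + 1) * TOY_PMD_SIZE;

		if (next > end)
			next = end;
		if (pmds[i].kind == TOY_NONE) {
			if (ops->hole)
				ops->hole(addr, next, priv);
		} else if (pmds[i].kind == TOY_LEAF) {
			/* Leaf: one entry maps the whole range; there is no lower table. */
			if (ops->leaf_entry)
				ops->leaf_entry(addr, next, priv);
		} else {
			for (unsigned long a = addr; a < next; a += TOY_PAGE_SIZE) {
				uint64_t pte = pmds[i].ptes[(a - i * TOY_PMD_SIZE) / TOY_PAGE_SIZE];

				if (pte == 0) {
					if (ops->hole)
						ops->hole(a, a + TOY_PAGE_SIZE, priv);
				} else if (ops->pte_entry) {
					ops->pte_entry(pte, a, priv);
				}
			}
		}
		addr = next;
	}
}

/* Example callbacks: tally what the walker reported. */
struct toy_counts { unsigned long leaves, ptes, hole_bytes; };

static void count_leaf(unsigned long addr, unsigned long next, void *priv)
{
	(void)addr; (void)next;
	((struct toy_counts *)priv)->leaves++;
}

static void count_pte(uint64_t pte, unsigned long addr, void *priv)
{
	(void)pte; (void)addr;
	((struct toy_counts *)priv)->ptes++;
}

static void count_hole(unsigned long addr, unsigned long next, void *priv)
{
	((struct toy_counts *)priv)->hole_bytes += next - addr;
}

static const struct toy_walk_ops toy_count_ops = {
	.leaf_entry = count_leaf,
	.pte_entry = count_pte,
	.hole = count_hole,
};
```

For example, a table whose first slot points at a lower table with two present entries, followed by a leaf, an empty slot and another leaf, yields two leaf callbacks, two pte callbacks and holes covering everything else. The real walker also has to deal with up to five levels, locking and VMAs, which this sketch ignores.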
>
> To enable a generic page table walker to walk the unusual mappings of
> the kernel we need to implement a set of functions which let us know
> when the walker has reached the leaf entry. After a suggestion from Will
> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
> the purpose (and is a new name so has no historic baggage). Some
> architectures have p?d_large macros but this is easily confused with
> "large pages".
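A minimal sketch of the override-or-fallback pattern, assuming the generic header defines each p?d_leaf() as 0 unless the architecture has already provided its own definition. The pmd_t/pud_t structs and the TOY_PRESENT/TOY_HUGE bits are hypothetical, loosely based on the x86 present and PSE bits; real definitions live in each architecture's asm/pgtable.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical entry types; real architectures define these in asm/pgtable.h. */
typedef struct { uint64_t val; } pmd_t;
typedef struct { uint64_t val; } pud_t;

/* Hypothetical bit layout, loosely modelled on x86 (present = bit 0, PSE = bit 7). */
#define TOY_PRESENT (1ULL << 0)
#define TOY_HUGE    (1ULL << 7)

/* "Architecture" override: this toy arch supports PMD-level huge mappings only. */
static inline bool arch_pmd_leaf(pmd_t pmd)
{
	return (pmd.val & (TOY_PRESENT | TOY_HUGE)) == (TOY_PRESENT | TOY_HUGE);
}
#define pmd_leaf(pmd) arch_pmd_leaf(pmd)

/*
 * Generic fallbacks: any level the architecture did not override can
 * never be a leaf, so generic walkers can call p?d_leaf() unconditionally.
 */
#ifndef pmd_leaf
#define pmd_leaf(x) 0
#endif
#ifndef pud_leaf
#define pud_leaf(x) 0
#endif
```

On this toy architecture a PMD with both bits set is a leaf, while pud_leaf() falls back to 0 because nothing overrode it.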
>
> This series ends with a generic PTDUMP implementation for arm64 and x86.
>
> Mostly this is a clean up and there should be very little functional
> change. The exceptions are:
>
> * arm64 PTDUMP debugfs now displays pages which aren't present (patch 22).
>
> * arm64 has the ability to efficiently process KASAN pages (which
> previously only x86 implemented). This means that the combination of
> KASAN and DEBUG_WX is now useable.
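As I understand it, the KASAN saving comes from large parts of the shadow region pointing at the same early shadow table, so a walker can recognise that shared table and report the range in one step instead of descending into it. The following is only a sketch of that idea: toy_zero_shadow_pte and entries_visited() are hypothetical names, and my reading is that the real generic code compares entries against the kernel's kasan_early_shadow_* tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_ENTRIES 512

/* Hypothetical stand-in for the one table every zero-shadow entry shares. */
static const uint64_t toy_zero_shadow_pte[TOY_ENTRIES];

/*
 * Count how many lower-level entries a walker visits for the given
 * upper-level table pointers. With 'shortcut' set, a pointer to the shared
 * zero-shadow table is reported as one region instead of being descended.
 */
static size_t entries_visited(const uint64_t *const *tables, size_t n, int shortcut)
{
	size_t visited = 0;

	assert(tables != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tables[i] == NULL)
			continue;			/* hole: nothing to descend into */
		if (shortcut && tables[i] == toy_zero_shadow_pte) {
			visited += 1;			/* whole range reported at once */
			continue;
		}
		visited += TOY_ENTRIES;			/* descend and visit every entry */
	}
	return visited;
}
```

With three upper-level entries pointing at the shared table, the shortcut visits 3 entries instead of 1536.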
>
> Also available as a git tree:
> git://linux-arm.org/linux-sp.git walk_page_range/v15
>
> Changes since v14:
> https://lore.kernel.org/lkml/20191028135910.33253-1-steven.price@xxxxxxx/
> * Split walk_page_range() into two functions, the existing
> walk_page_range() now still requires VMAs (and treats areas without a
> VMA as a 'hole'). The new walk_page_range_novma() ignores VMAs and
> will report the actual page table layout. This fixes the previous
> breakage of /proc/<pid>/pagemap
> * New patch at the end of the series which reduces the 'level' numbers
> by 1 to simplify the code slightly
> * Added tags
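A toy illustration of why pagemap and ptdump want different behaviour: a VMA-aware walk reports everything outside a VMA as a hole, while a VMA-less walk reports what the page table actually contains. The page-index model, struct toy_vma and both helpers are hypothetical; they only mirror the semantics described in the changelog, not the kernel signatures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical VMA: a half-open range [start, end) of page indices. */
struct toy_vma { unsigned long start, end; };

static int toy_in_vma(unsigned long page, const struct toy_vma *vmas, size_t nvmas)
{
	for (size_t i = 0; i < nvmas; i++)
		if (page >= vmas[i].start && page < vmas[i].end)
			return 1;
	return 0;
}

/*
 * VMA-aware walk (in the spirit of walk_page_range()): any page outside a
 * VMA counts as a hole, even if the page table maps it.
 */
static unsigned long toy_mapped_with_vmas(const unsigned char *present, unsigned long npages,
					  const struct toy_vma *vmas, size_t nvmas)
{
	unsigned long mapped = 0;

	assert(present != NULL);
	for (unsigned long p = 0; p < npages; p++)
		if (present[p] && toy_in_vma(p, vmas, nvmas))
			mapped++;
	return mapped;
}

/* VMA-less walk (in the spirit of walk_page_range_novma()): report the table as it is. */
static unsigned long toy_mapped_novma(const unsigned char *present, unsigned long npages)
{
	unsigned long mapped = 0;

	assert(present != NULL);
	for (unsigned long p = 0; p < npages; p++)
		if (present[p])
			mapped++;
	return mapped;
}
```

Kernel mappings have no VMAs, so in this model the VMA-aware count is 0 for them even when entries are present, which is the case the separate novma variant is for.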

Does this new version also fix this boot crash seen with v14? Presumably the
series now breaks CONFIG_EFI_PGT_DUMP=y. The full config is:

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

[   10.550313][    T0] Switched APIC routing to physical flat.
[   10.563899][    T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[   10.614633][    T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1fa6f481074, max_idle_ns: 440795311917 ns
[   10.625979][    T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 4391.73 BogoMIPS (lpj=21958690)
[   10.635990][    T0] pid_max: default: 131072 minimum: 1024
[   11.259736][    T0] ---[ User Space ]---
[   11.263737][    T0] 0x0000000000000000-0x0000000000001000           4K     RW                     x  pte
[   11.266028][    T0] 0x0000000000001000-0x0000000000200000        2044K                               pte
[   11.275992][    T0] 0x0000000000200000-0x0000000004000000          62M                               pmd
[   11.285998][    T0] 0x0000000004000000-0x0000000004076000         472K                               pte
[   11.296019][    T0] 0x0000000004076000-0x0000000004200000        1576K                               pte
[   11.305997][    T0] 0x0000000004200000-0x0000000011000000         206M                               pmd
[   11.316008][    T0] 0x0000000011000000-0x0000000011100000           1M                               pte
[   11.326008][    T0] 0x0000000011100000-0x0000000011200000           1M                               pte
[   11.335990][    T0] 0x0000000011200000-0x0000000011800000           6M                               pmd
[   11.346054][    T0] ==================================================================
[   11.354074][    T0] BUG: KASAN: wild-memory-access in ptdump_pte_entry+0x39/0x60
[   11.355975][    T0] Read of size 8 at addr 000f887fee5ff000 by task swapper/0/0
[   11.355975][    T0]
[   11.355975][    T0] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc5-mm1+ #1
[   11.355975][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.355975][    T0] Call Trace:
[   11.355975][    T0]  dump_stack+0xa0/0xea
[   11.355975][    T0]  __kasan_report.cold.7+0xb0/0xc0
[   11.355975][    T0]  ? note_page+0x7f8/0xa70
[   11.355975][    T0]  ? ptdump_pte_entry+0x39/0x60
[   11.355975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.355975][    T0]  kasan_report+0x12/0x20
[   11.355975][    T0]  __asan_load8+0x71/0xa0
[   11.355975][    T0]  ptdump_pte_entry+0x39/0x60
[   11.355975][    T0]  walk_pgd_range+0xb75/0xce0
[   11.355975][    T0]  __walk_page_range+0x206/0x230
[   11.355975][    T0]  ? vmacache_find+0x3a/0x170
[   11.355975][    T0]  walk_page_range+0x136/0x210
[   11.355975][    T0]  ? __walk_page_range+0x230/0x230
[   11.355975][    T0]  ? find_held_lock+0xca/0xf0
[   11.355975][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.355975][    T0]  ptdump_walk_pgd_level_core+0x13b/0x1b0
[   11.355975][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.355975][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.355975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.355975][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.355975][    T0]  ? __enc_copy+0x90/0x90
[   11.355975][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.355975][    T0]  efi_dump_pagetable+0x35/0x37
[   11.355975][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.355975][    T0]  start_kernel+0x607/0x6a9
[   11.355975][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.355975][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.355975][    T0]  x86_64_start_reservations+0x24/0x26
[   11.355975][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.355975][    T0]  secondary_startup_64+0xb6/0xc0
[   11.355975][    T0] ==================================================================
[   11.355975][    T0] Disabling lock debugging due to kernel taint
[   11.355991][    T0] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   11.364049][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G    B             5.4.0-rc5-mm1+ #1
[   11.365975][    T0] Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[   11.365975][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.365975][    T0] Code: 55 41 54 49 89 fc 48 8d 79 18 53 48 89 cb e8 5e 0e fa ff 48 8b 5b 18 48 89 df e8 52 0e fa ff 4c 89 e7 4c 8b 2b e8 47 0e fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 05 00 00 00 e8 03 1d 9b 00 31 c0
[   11.365975][    T0] RSP: 0000:ffffffffaf8079d0 EFLAGS: 00010282
[   11.365975][    T0] RAX: 0000000000000000 RBX: ffffffffaf807cf0 RCX: ffffffffae374306
[   11.365975][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffffafef2bf4
[   11.365975][    T0] RBP: ffffffffaf8079f0 R08: fffffbfff5fdbb22 R09: fffffbfff5fdbb22
[   11.365975][    T0] R10: fffffbfff5fdbb21 R11: ffffffffafedd90b R12: 000f887fee5ff000
[   11.365975][    T0] R13: ffffffffae2aee40 R14: 0000000011a00000 R15: 0000000011a01000
[   11.365975][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.365975][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.365975][    T0] CR2: ffff8890779ff000 CR3: 0000000baf412000 CR4: 00000000000406b0
[   11.365975][    T0] Call Trace:
[   11.365975][    T0]  walk_pgd_range+0xb75/0xce0
[   11.365975][    T0]  __walk_page_range+0x206/0x230
[   11.365975][    T0]  ? vmacache_find+0x3a/0x170
[   11.365975][    T0]  walk_page_range+0x136/0x210
[   11.365975][    T0]  ? __walk_page_range+0x230/0x230
[   11.365975][    T0]  ? find_held_lock+0xca/0xf0
[   11.365975][    T0]  ptdump_walk_pgd+0x76/0xd0
[   11.365975][    T0]  ptdump_walk_pgd_level_core+0x13b/0x1b0
[   11.365975][    T0]  ? hugetlb_get_unmapped_area+0x5b0/0x5b0
[   11.365975][    T0]  ? trace_hardirqs_on+0x3a/0x160
[   11.365975][    T0]  ? ptdump_walk_pgd_level_core+0x1b0/0x1b0
[   11.365975][    T0]  ? efi_delete_dummy_variable+0xa9/0xd0
[   11.365975][    T0]  ? __enc_copy+0x90/0x90
[   11.365975][    T0]  ptdump_walk_pgd_level+0x15/0x20
[   11.365975][    T0]  efi_dump_pagetable+0x35/0x37
[   11.365975][    T0]  efi_enter_virtual_mode+0x72a/0x737
[   11.365975][    T0]  start_kernel+0x607/0x6a9
[   11.365975][    T0]  ? thread_stack_cache_init+0xb/0xb
[   11.365975][    T0]  ? idt_setup_from_table+0xd9/0x130
[   11.365975][    T0]  x86_64_start_reservations+0x24/0x26
[   11.365975][    T0]  x86_64_start_kernel+0xf4/0xfb
[   11.365975][    T0]  secondary_startup_64+0xb6/0xc0
[   11.365975][    T0] Modules linked in:
[   11.365988][    T0] ---[ end trace 8e90dc89e2468d55 ]---
[   11.375984][    T0] RIP: 0010:ptdump_pte_entry+0x39/0x60
[   11.381335][    T0] Code: 55 41 54 49 89 fc 48 8d 79 18 53 48 89 cb e8 5e 0e fa ff 48 8b 5b 18 48 89 df e8 52 0e fa ff 4c 89 e7 4c 8b 2b e8 47 0e fa ff <49> 8b 0c 24 4c 89 f6 48 89 df ba 05 00 00 00 e8 03 1d 9b 00 31 c0
[   11.385982][    T0] RSP: 0000:ffffffffaf8079d0 EFLAGS: 00010282
[   11.395982][    T0] RAX: 0000000000000000 RBX: ffffffffaf807cf0 RCX: ffffffffae374306
[   11.403864][    T0] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffffffffafef2bf4
[   11.405982][    T0] RBP: ffffffffaf8079f0 R08: fffffbfff5fdbb22 R09: fffffbfff5fdbb22
[   11.415982][    T0] R10: fffffbfff5fdbb21 R11: ffffffffafedd90b R12: 000f887fee5ff000
[   11.425982][    T0] R13: ffffffffae2aee40 R14: 0000000011a00000 R15: 0000000011a01000
[   11.435982][    T0] FS:  0000000000000000(0000) GS:ffff888843400000(0000) knlGS:0000000000000000
[   11.445982][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.452466][    T0] CR2: ffff8890779ff000 CR3: 0000000baf412000 CR4: 00000000000406b0
[   11.455981][    T0] Kernel panic - not syncing: Fatal exception
[   11.462246][    T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

>
> Changes since v13:
> https://lore.kernel.org/lkml/20191024093716.49420-1-steven.price@xxxxxxx/
> * Fixed typo in arc definition of pmd_leaf() spotted by the kbuild test
> robot
> * Added tags
>
> Changes since v12:
> https://lore.kernel.org/lkml/20191018101248.33727-1-steven.price@xxxxxxx/
> * Correct code format in riscv pud_leaf()/pmd_leaf()
> * v12 may not have reached everyone because of mail server problems
> (which are now hopefully resolved!)
>
> Changes since v11:
> https://lore.kernel.org/lkml/20191007153822.16518-1-steven.price@xxxxxxx/
> * Use "-1" as dummy depth parameter in patch 14.
>
> Changes since v10:
> https://lore.kernel.org/lkml/20190731154603.41797-1-steven.price@xxxxxxx/
> * Rebased to v5.4-rc1 - mainly various updates to deal with the
> splitting out of ops from struct mm_walk.
> * Deal with PGD_LEVEL_MULT not always being constant on x86.
>
> Changes since v9:
> https://lore.kernel.org/lkml/20190722154210.42799-1-steven.price@xxxxxxx/
> * Moved generic macros to first patch in the series and explained the
> macro naming in the commit message.
> * mips: Moved macros to pgtable.h as they are now valid for both 32 and 64
> bit
> * x86: Dropped patch which changed the debugfs output for x86, instead
> we have...
> * new patch adding 'depth' parameter to pte_hole. This is used to
> provide the necessary information to output lines for 'holes' in the
> debugfs files
> * new patch changing arm64 debugfs output to include holes to match x86
> * generic ptdump KASAN handling has been simplified and now works with
> CONFIG_DEBUG_VIRTUAL.
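The point of the 'depth' argument is that a hole callback can say at which level the hole was found, which ptdump needs in order to print lines such as the 'pmd' and 'pte' holes in the log above. A hedged sketch, assuming the numbering 0 = PGD through 4 = PTE with -1 meaning "not known" (my reading of the pagewalk.h documentation in the merged version; the v15 numbering may differ, given the last patch renumbers levels); toy_format_hole() is a hypothetical helper, not kernel API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed depth numbering: 0 = PGD, 1 = P4D, 2 = PUD, 3 = PMD, 4 = PTE. */
static const char *const toy_level_name[] = { "pgd", "p4d", "pud", "pmd", "pte" };

#define TOY_NR_LEVELS ((int)(sizeof(toy_level_name) / sizeof(toy_level_name[0])))

/*
 * Hypothetical pte_hole-style helper: format one debugfs-like line for the
 * hole [addr, next). A depth of -1 means the walker could not tell which
 * level the hole was found at.
 */
static int toy_format_hole(char *buf, size_t len, unsigned long addr,
			   unsigned long next, int depth)
{
	const char *name = "unknown";

	assert(addr <= next);
	if (depth >= 0 && depth < TOY_NR_LEVELS)
		name = toy_level_name[depth];
	return snprintf(buf, len, "0x%016lx-0x%016lx %s", addr, next, name);
}
```

Without the depth, a walker reporting a hole could only print the address range, not whether it was a missing PMD or a missing PTE.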
>
> Changes since v8:
> https://lore.kernel.org/lkml/20190403141627.11664-1-steven.price@xxxxxxx/
> * Rename from p?d_large() to p?d_leaf()
> * Dropped patches migrating arm64/x86 custom walkers to
> walk_page_range() in favour of adding a generic PTDUMP implementation
> and migrating arm64/x86 to that instead.
> * Rebased to v5.3-rc1
>
> Steven Price (23):
> mm: Add generic p?d_leaf() macros
> arc: mm: Add p?d_leaf() definitions
> arm: mm: Add p?d_leaf() definitions
> arm64: mm: Add p?d_leaf() definitions
> mips: mm: Add p?d_leaf() definitions
> powerpc: mm: Add p?d_leaf() definitions
> riscv: mm: Add p?d_leaf() definitions
> s390: mm: Add p?d_leaf() definitions
> sparc: mm: Add p?d_leaf() definitions
> x86: mm: Add p?d_leaf() definitions
> mm: pagewalk: Add p4d_entry() and pgd_entry()
> mm: pagewalk: Allow walking without vma
> mm: pagewalk: Add test_p?d callbacks
> mm: pagewalk: Add 'depth' parameter to pte_hole
> x86: mm: Point to struct seq_file from struct pg_state
> x86: mm+efi: Convert ptdump_walk_pgd_level() to take a mm_struct
> x86: mm: Convert ptdump_walk_pgd_level_debugfs() to take an mm_struct
> x86: mm: Convert ptdump_walk_pgd_level_core() to take an mm_struct
> mm: Add generic ptdump
> x86: mm: Convert dump_pagetables to use walk_page_range
> arm64: mm: Convert mm/dump.c to use walk_page_range()
> arm64: mm: Display non-present entries in ptdump
> mm: ptdump: Reduce level numbers by 1 in note_page()
>
> arch/arc/include/asm/pgtable.h | 1 +
> arch/arm/include/asm/pgtable-2level.h | 1 +
> arch/arm/include/asm/pgtable-3level.h | 1 +
> arch/arm64/Kconfig | 1 +
> arch/arm64/Kconfig.debug | 19 +-
> arch/arm64/include/asm/pgtable.h | 2 +
> arch/arm64/include/asm/ptdump.h | 8 +-
> arch/arm64/mm/Makefile | 4 +-
> arch/arm64/mm/dump.c | 148 +++-----
> arch/arm64/mm/mmu.c | 4 +-
> arch/arm64/mm/ptdump_debugfs.c | 2 +-
> arch/mips/include/asm/pgtable.h | 5 +
> arch/powerpc/include/asm/book3s/64/pgtable.h | 30 +-
> arch/riscv/include/asm/pgtable-64.h | 7 +
> arch/riscv/include/asm/pgtable.h | 7 +
> arch/s390/include/asm/pgtable.h | 2 +
> arch/sparc/include/asm/pgtable_64.h | 2 +
> arch/x86/Kconfig | 1 +
> arch/x86/Kconfig.debug | 20 +-
> arch/x86/include/asm/pgtable.h | 10 +-
> arch/x86/mm/Makefile | 4 +-
> arch/x86/mm/debug_pagetables.c | 8 +-
> arch/x86/mm/dump_pagetables.c | 343 +++++--------------
> arch/x86/platform/efi/efi_32.c | 2 +-
> arch/x86/platform/efi/efi_64.c | 4 +-
> drivers/firmware/efi/arm-runtime.c | 2 +-
> fs/proc/task_mmu.c | 4 +-
> include/asm-generic/pgtable.h | 20 ++
> include/linux/pagewalk.h | 42 ++-
> include/linux/ptdump.h | 22 ++
> mm/Kconfig.debug | 21 ++
> mm/Makefile | 1 +
> mm/hmm.c | 8 +-
> mm/migrate.c | 5 +-
> mm/mincore.c | 1 +
> mm/pagewalk.c | 126 +++++--
> mm/ptdump.c | 151 ++++++++
> 37 files changed, 586 insertions(+), 453 deletions(-)
> create mode 100644 include/linux/ptdump.h
> create mode 100644 mm/ptdump.c
>