Re: [PATCH] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds

From: Dan Williams
Date: Mon May 01 2017 - 10:40:23 EST


On Mon, May 1, 2017 at 4:41 AM, Baoquan He <bhe@xxxxxxxxxx> wrote:
> Jeff Moyer reported that on his system with two memory regions, 0~64G and
> 1T~1T+192G, and the kernel option "memmap=192G!1024G" added, enabling kaslr
> makes the system hang intermittently during boot, while adding 'nokaslr'
> avoids the hang.
>
> This is because the for loop count calculation in sync_global_pgds is
> not correct. When a mapping area crosses pgd entries, we should
> calculate the starting address of the region which the next pgd covers and
> assign it to the next loop iteration, rather than adding PGDIR_SIZE directly.
> The old code works correctly only if the mapping area is a multiple of
> PGDIR_SIZE; otherwise the end of the region could be skipped, so that it is
> not synchronized from the kernel pgd init_mm.pgd to all other processes.
>
> In Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
> PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is 1T aligned, so
> this area is mapped inside one pgd entry. With kaslr enabled,
> this area could cross two pgd entries, and then the next pgd entry won't
> be synced to all other processes. That is why we saw an empty PGD.
>
> Fix it in this patch.
>
> The back trace is pasted below:
>
> [ 9.988867] IP: memcpy_erms+0x6/0x10
> [ 9.988868] PGD 0
> [ 9.988868]
> [ 9.988870] Oops: 0000 [#1] SMP
> [ 9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E) i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
> [ 9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G E 4.11.0-rc5+ #43
> [ 9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
> [ 9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
> [ 9.988890] RIP: 0010:memcpy_erms+0x6/0x10
> [ 9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
> [ 9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
> [ 9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
> [ 9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
> [ 9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
> [ 9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
> [ 9.988896] FS: 00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
> [ 9.988896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
> [ 9.988897] Call Trace:
> [ 9.988902] ? pmem_do_bvec+0x93/0x290 [nd_pmem]
> [ 9.988904] ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [ 9.988905] ? radix_tree_node_alloc.constprop.20+0x85/0xc0
> [ 9.988907] pmem_rw_page+0x3a/0x60 [nd_pmem]
> [ 9.988909] bdev_read_page+0x81/0xb0
> [ 9.988911] do_mpage_readpage+0x56f/0x770
> [ 9.988912] ? I_BDEV+0x20/0x20
> [ 9.988915] ? lru_cache_add+0xe/0x10
> [ 9.988917] mpage_readpages+0x148/0x1e0
> [ 9.988917] ? I_BDEV+0x20/0x20
> [ 9.988918] ? I_BDEV+0x20/0x20
> [ 9.988921] ? alloc_pages_current+0x88/0x120
> [ 9.988923] blkdev_readpages+0x1d/0x20
> [ 9.988924] __do_page_cache_readahead+0x1ce/0x2c0
> [ 9.988926] force_page_cache_readahead+0xa2/0x100
> [ 9.988927] page_cache_sync_readahead+0x3f/0x50
> [ 9.988930] generic_file_read_iter+0x60d/0x8c0
> [ 9.988931] blkdev_read_iter+0x37/0x40
> [ 9.988933] __vfs_read+0xe0/0x150
> [ 9.988934] vfs_read+0x8c/0x130
> [ 9.988936] SyS_read+0x55/0xc0
> [ 9.988939] entry_SYSCALL_64_fastpath+0x1a/0xa9
> [ 9.988940] RIP: 0033:0x7f1ee0822480
> [ 9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [ 9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
> [ 9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
> [ 9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
> [ 9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
> [ 9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
> [ 9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
> [ 9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
> [ 9.988962] CR2: ffff9387bfff0000
> [ 9.989022] ---[ end trace fe34c0fc0fe685ab ]---
> [ 9.998690] Kernel panic - not syncing: Fatal exception
> [ 10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>
> Reported-by: Jeff Moyer <jmoyer@xxxxxxxxxx>
> Signed-off-by: Baoquan He <bhe@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
> Cc: x86@xxxxxxxxxx
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Cc: Thomas Garnier <thgarnie@xxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Yasuaki Ishimatsu <yasu.isimatu@xxxxxxxxx>
> Cc: Jinbum Park <jinb.park7@xxxxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> Cc: Yinghai Lu <yinghai@xxxxxxxxxx>
> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
> Cc: Dave Young <dyoung@xxxxxxxxxx>
> ---

Good catch!

> arch/x86/mm/init_64.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 15173d3..dbf4f00 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -94,12 +94,14 @@ __setup("noexec32=", nonx32_setup);
> */
> void sync_global_pgds(unsigned long start, unsigned long end)
> {
> - unsigned long address;
> + unsigned long address, address_next;
>
> - for (address = start; address <= end; address += PGDIR_SIZE) {
> + for (address = start; address <= end; address = address_next) {
> const pgd_t *pgd_ref = pgd_offset_k(address);
> struct page *page;
>
> + address_next = (address & PGDIR_MASK) + PGDIR_SIZE;
> +

Let's change this to put the next address calculation in the for loop
directly and use the ALIGN macro. Something like:

for (address = start; address <= end; address = ALIGN(address + 1, PGDIR_SIZE))
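
To illustrate why the step matters, here is a minimal userspace sketch
(not kernel code) that counts how many pgd entries each loop form visits.
It assumes 4-level paging with PGDIR_SHIFT of 39, re-declares ALIGN with
the usual power-of-two rounding, and uses hypothetical helper names
(visits_old, visits_new, GiB). The 400G offset in the usage note is an
invented example of a randomized base, not a value from Jeff's system.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for x86-64 with 4-level paging: one pgd entry spans 512 GiB. */
#define PGDIR_SHIFT 39
#define PGDIR_SIZE  (UINT64_C(1) << PGDIR_SHIFT)
/* Userspace stand-in for the kernel's ALIGN(); 'a' must be a power of two. */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))
/* Hypothetical convenience macro for readable test values. */
#define GiB(n)      ((uint64_t)(n) << 30)

/* Old loop form: step by PGDIR_SIZE from an arbitrary start address. */
static unsigned int visits_old(uint64_t start, uint64_t end)
{
	unsigned int n = 0;
	uint64_t address;

	assert(start <= end);
	for (address = start; address <= end; address += PGDIR_SIZE)
		n++;
	return n;
}

/* Suggested loop form: jump to the start of the next pgd entry. */
static unsigned int visits_new(uint64_t start, uint64_t end)
{
	unsigned int n = 0;
	uint64_t address;

	assert(start <= end);
	for (address = start; address <= end;
	     address = ALIGN(address + 1, PGDIR_SIZE))
		n++;
	return n;
}
```

For a 192G range starting 400G into a pgd entry, i.e. [400G, 592G),
visits_old() returns 1 and visits_new() returns 2: the old step jumps from
400G to 912G and never touches the second entry. The sketch ignores
wraparound at the top of the address space, which a real caller would
need to consider.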