Re: [PATCH v3] x86/mm: Fix incorrect for loop count calculation in sync_global_pgds

From: Thomas Garnier
Date: Thu May 04 2017 - 12:28:02 EST


On Wed, May 3, 2017 at 7:35 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Wed, May 3, 2017 at 7:25 PM, Baoquan He <bhe@xxxxxxxxxx> wrote:
>> Jeff Moyer reported that on his system with two memory regions 0~64G and
>> 1T~1T+192G, and kernel option "memmap=192G!1024G" added, enabling kaslr
>> makes the system hang intermittently during boot, while adding 'nokaslr'
>> does not.
>>
>> This is because the for loop count calculation in sync_global_pgds is
>> not correct. When a mapping area crosses pgd entries, we should
>> calculate the starting address of the region which the next pgd covers
>> and assign it to the next for loop count, rather than adding PGDIR_SIZE
>> directly. The old code works right only if the mapping area is a
>> multiple of PGDIR_SIZE; otherwise the end region could be skipped, so
>> that it can't be synchronized to all other processes from the kernel
>> pgd init_mm.pgd.
>>
>> In Jeff's system, the emulated pmem area [1024G, 1216G) is smaller than
>> PGDIR_SIZE. 'nokaslr' works because PAGE_OFFSET is 1T aligned, which
>> makes this area map inside one pgd entry. With kaslr enabled, this
>> area could cross two pgd entries, and then the next pgd entry won't
>> be synced to all other processes. That is why we saw an empty PGD.
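
For anyone following along, here is a small user-space sketch of the
stepping problem described above. It is not the kernel code:
count_fixed_stride(), count_aligned_step(), count_expected(), ALIGN_UP
and GiB are names made up for this illustration, PGDIR_SHIFT is assumed
to be 39 (4-level paging, 512G per pgd entry), and the addresses are
plain offsets small enough that wraparound at the top of the address
space is ignored. Rounding up is one way to get the first address
covered by the next pgd entry; it may differ in detail from the patch.

```c
#include <stdint.h>

#define PGDIR_SHIFT 39                              /* assumed: 4-level paging */
#define PGDIR_SIZE  (UINT64_C(1) << PGDIR_SHIFT)    /* 512G per pgd entry */
#define GiB         (UINT64_C(1) << 30)
/* Round x up to the next multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Step by PGDIR_SIZE from 'start'; count the pgd entries the loop visits. */
static inline unsigned int count_fixed_stride(uint64_t start, uint64_t end)
{
	unsigned int visited = 0;

	for (uint64_t addr = start; addr <= end; addr += PGDIR_SIZE)
		visited++;
	return visited;
}

/* Step to the first address of the next pgd entry instead. */
static inline unsigned int count_aligned_step(uint64_t start, uint64_t end)
{
	unsigned int visited = 0;

	for (uint64_t addr = start; addr <= end;
	     addr = ALIGN_UP(addr + 1, PGDIR_SIZE))
		visited++;
	return visited;
}

/* Number of pgd entries that the inclusive range [start, end] touches. */
static inline unsigned int count_expected(uint64_t start, uint64_t end)
{
	return (unsigned int)((end >> PGDIR_SHIFT) - (start >> PGDIR_SHIFT)) + 1;
}
```

With a 192G area starting at offset 400G (start = 400 * GiB,
end = 592 * GiB - 1), count_expected() and count_aligned_step() both
give 2, while count_fixed_stride() gives 1: the second pgd entry is
never visited. If the same area starts at offset 0, all three give 1,
which corresponds to the 'nokaslr' case where the area sits inside a
single pgd entry.
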
>>
>> Fix it in this patch.
>>
>> The backtrace is pasted below:
>>
>> [ 9.988867] IP: memcpy_erms+0x6/0x10
>> [ 9.988868] PGD 0
>> [ 9.988868]
>> [ 9.988870] Oops: 0000 [#1] SMP
>> [ 9.988871] Modules linked in: isci(E) mgag200(E+) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) igb(E) ahci(E) ttm(E) libsas(E) libahci(E) scsi_transport_sas(E) ptp(E) pps_core(E) nd_pmem(E) dca(E) drm(E) i2c_algo_bit(E) libata(E) crc32c_intel(E) nd_btt(E)
>> i2c_core(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E)
>> [ 9.988886] CPU: 0 PID: 442 Comm: systemd-udevd Tainted: G E 4.11.0-rc5+ #43
>> [ 9.988887] Hardware name: Intel Corporation LH Pass/SVRBD-ROW_P, BIOS SE5C600.86B.02.01.SP06.050920141054 05/09/2014
>> [ 9.988888] task: ffff9267dc2f8000 task.stack: ffffba92c783c000
>> [ 9.988890] RIP: 0010:memcpy_erms+0x6/0x10
>> [ 9.988891] RSP: 0018:ffffba92c783f9b8 EFLAGS: 00010286
>> [ 9.988892] RAX: ffff925f19e27000 RBX: 0000000000000000 RCX: 0000000000001000
>> [ 9.988893] RDX: 0000000000001000 RSI: ffff9387bfff0000 RDI: ffff925f19e27000
>> [ 9.988893] RBP: ffffba92c783fa38 R08: 0000000000000000 R09: 0000000017ffff80
>> [ 9.988894] R10: 0000000000000000 R11: ffff9387bfff0000 R12: ffff925fde811ed8
>> [ 9.988895] R13: 0000002fffff0000 R14: 0000000000001000 R15: ffff925f19e27000
>> [ 9.988896] FS: 00007f1ee18e68c0(0000) GS:ffff925fdec00000(0000) knlGS:0000000000000000
>> [ 9.988896] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 9.988897] CR2: ffff9387bfff0000 CR3: 000000081ba28000 CR4: 00000000001406f0
>> [ 9.988897] Call Trace:
>> [ 9.988902] ? pmem_do_bvec+0x93/0x290 [nd_pmem]
>> [ 9.988904] ? radix_tree_node_alloc.constprop.20+0x85/0xc0
>> [ 9.988905] ? radix_tree_node_alloc.constprop.20+0x85/0xc0
>> [ 9.988907] pmem_rw_page+0x3a/0x60 [nd_pmem]
>> [ 9.988909] bdev_read_page+0x81/0xb0
>> [ 9.988911] do_mpage_readpage+0x56f/0x770
>> [ 9.988912] ? I_BDEV+0x20/0x20
>> [ 9.988915] ? lru_cache_add+0xe/0x10
>> [ 9.988917] mpage_readpages+0x148/0x1e0
>> [ 9.988917] ? I_BDEV+0x20/0x20
>> [ 9.988918] ? I_BDEV+0x20/0x20
>> [ 9.988921] ? alloc_pages_current+0x88/0x120
>> [ 9.988923] blkdev_readpages+0x1d/0x20
>> [ 9.988924] __do_page_cache_readahead+0x1ce/0x2c0
>> [ 9.988926] force_page_cache_readahead+0xa2/0x100
>> [ 9.988927] page_cache_sync_readahead+0x3f/0x50
>> [ 9.988930] generic_file_read_iter+0x60d/0x8c0
>> [ 9.988931] blkdev_read_iter+0x37/0x40
>> [ 9.988933] __vfs_read+0xe0/0x150
>> [ 9.988934] vfs_read+0x8c/0x130
>> [ 9.988936] SyS_read+0x55/0xc0
>> [ 9.988939] entry_SYSCALL_64_fastpath+0x1a/0xa9
>> [ 9.988940] RIP: 0033:0x7f1ee0822480
>> [ 9.988941] RSP: 002b:00007ffcf9e741f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> [ 9.988942] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1ee0822480
>> [ 9.988943] RDX: 0000000000000040 RSI: 0000561b7e1aabc8 RDI: 0000000000000008
>> [ 9.988943] RBP: 0000561b7e1a86a0 R08: 0000000000000005 R09: 0000000000000068
>> [ 9.988944] R10: 00007ffcf9e73f80 R11: 0000000000000246 R12: 0000000000000000
>> [ 9.988945] R13: 0000000000000001 R14: 0000561b7e1a61b0 R15: 0000561b7e1a55e0
>> [ 9.988946] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89 d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>> [ 9.988962] RIP: memcpy_erms+0x6/0x10 RSP: ffffba92c783f9b8
>> [ 9.988962] CR2: ffff9387bfff0000
>> [ 9.989022] ---[ end trace fe34c0fc0fe685ab ]---
>> [ 9.998690] Kernel panic - not syncing: Fatal exception
>> [ 10.004708] Kernel Offset: 0x11000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>
>> Reported-by: Jeff Moyer <jmoyer@xxxxxxxxxx>
>> Signed-off-by: Baoquan He <bhe@xxxxxxxxxx>
>> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
>> Cc: "H. Peter Anvin" <hpa@xxxxxxxxx>
>> Cc: x86@xxxxxxxxxx
>> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
>> Cc: Thomas Garnier <thgarnie@xxxxxxxxxx>
>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> Cc: Yasuaki Ishimatsu <yasu.isimatu@xxxxxxxxx>
>> Cc: Jinbum Park <jinb.park7@xxxxxxxxx>
>> Cc: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
>> Cc: Yinghai Lu <yinghai@xxxxxxxxxx>
>> Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
>> Cc: Dave Young <dyoung@xxxxxxxxxx>
>
> I think this needs a "Fixes:" tag and Cc: <stable@xxxxxxxxxxxxxxx>.

Agreed.

>
> Other than that:
>
> Reviewed-by: Dan Williams <dan.j.williams@xxxxxxxxx>

Thanks again!

Reviewed-by: Thomas Garnier <thgarnie@xxxxxxxxxx>
--
Thomas