Re: [tip:x86/mm] x86, mm: NX protection for kernel data

From: Siarhei Liakh
Date: Thu Mar 11 2010 - 22:12:35 EST


On Sat, Mar 6, 2010 at 2:44 PM, Siarhei Liakh <sliakh.lkml@xxxxxxxxx> wrote:
> On Mon, Feb 22, 2010 at 12:21 PM, Ingo Molnar <mingo@xxxxxxx> wrote:
>>
>> * H. Peter Anvin <hpa@xxxxxxxxx> wrote:
>>
>>> On 02/22/2010 03:01 AM, Ingo Molnar wrote:
>>> >>
>>> >>> Commit-ID:  01ab31371da90a795b774d87edf2c21bb3a64dda
>>> >>> Gitweb:     http://git.kernel.org/tip/01ab31371da90a795b774d87edf2c21bb3a64dda
[ . . . ]
> I was able to narrow down the issue to spinlock debugging. More
> specifically, DEBUG_SPINLOCK=y seems to be somehow incompatible with
> the kernel's RW data being NX.
[ . . . ]
> Kernel crash dump:
> ============================================
> [    2.844000] EXT3-fs (sda1): warning: maximal mount count reached, running e2fsck is recommended
> [    2.848000] EXT3-fs (sda1): using internal journal
> [    2.849556] EXT3-fs (sda1): recovery complete
> [    2.852000] EXT3-fs (sda1): mounted filesystem with ordered data mode
> [    2.854168] VFS: Mounted root (ext3 filesystem) on device 8:1.
> [    2.856000] Freeing unused kernel memory (init): 540k freed
> [    2.857056] NX-protecting the kernel data: 0xc15b3000 - 0xc1834000, 641 pages
> [    2.860328] do_page_fault - entry
> [    2.862554] do_page_fault: 0xc17ebdb8
> [    2.864000] do_page_fault - kernel space
> [    2.864000] do_page_fault - about to call bad_area_nosemaphore()
> [    2.864000] BUG: unable to handle kernel paging request at c17ebdb8
> [    2.864000] IP: [<c12609f7>] do_raw_spin_unlock+0x5e/0x71
> [    2.864000] *pdpt = 00000000018c0001 *pde = 80000000016001e1
> [    2.864000] Oops: 0003 [#1] SMP
> [    2.864000] last sysfs file:
> [    2.864000] Modules linked in:
> [    2.864000]
> [    2.864000] Pid: 1, comm: swapper Not tainted 2.6.33-tip+ #41 /
> [    2.864000] EIP: 0060:[<c12609f7>] EFLAGS: 00010046 CPU: 0
> [    2.864000] EIP is at do_raw_spin_unlock+0x5e/0x71
> [    2.864000] EAX: 00000000 EBX: c17ebdac ECX: 00000001 EDX: 00000c0b
> [    2.864000] ESI: 00000246 EDI: c18c0058 EBP: f780fe14 ESP: f780fe10
> [    2.864000]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [    2.864000] Process swapper (pid: 1, ti=f780f000 task=f7826000 task.ti=f780f000)
> [    2.864000] Stack:
> [    2.864000]  c17ebdac f780fe24 c15ad3f2 00000000 00000000 f780ff18 c1017a57 00000000
> [    2.864000] <0> 016001e3 00000000 016001e3 f77a8004 00000001 00000000 00000163 80000000
> [    2.864000] <0> 00000000 ffffffff ffffffff 80000000 000001e1 80000000 00000000 80000000
> [    2.864000] Call Trace:
> [    2.864000]  [<c15ad3f2>] ? _raw_spin_unlock_irqrestore+0x20/0x3c
> [    2.864000]  [<c1017a57>] ? __change_page_attr_set_clr+0x65c/0x945
> [    2.864000]  [<c1092245>] ? vm_unmap_aliases+0x17b/0x186
> [    2.864000]  [<c15b3000>] ? _etext+0x0/0x24
> [    2.864000]  [<c1017eb4>] ? change_page_attr_set_clr+0x174/0x312
> [    2.864000]  [<c15b3000>] ? _etext+0x0/0x24
> [    2.864000]  [<c10182d1>] ? set_memory_nx+0x2d/0x32
> [    2.864000]  [<c10163ab>] ? mark_nxdata_nx+0x37/0x41
> [    2.864000]  [<c15b3000>] ? _etext+0x0/0x24
> [    2.864000]  [<c1834000>] ? i386_start_kernel+0x0/0xaa
> [    2.864000]  [<c101649d>] ? free_initmem+0x1c/0x1e
> [    2.864000]  [<c1001148>] ? init_post+0xd/0x121
> [    2.864000]  [<c1834401>] ? kernel_init+0x1d5/0x1df
> [    2.864000]  [<c183422c>] ? kernel_init+0x0/0x1df
> [    2.864000]  [<c1002e66>] ? kernel_thread_helper+0x6/0x10
> [    2.864000] Code: 54 8b c1 39 43 0c 74 0c ba 74 e1 73 c1 89 d8 e8 31 ff ff ff 64 a1 d8 6b 8b c1 39 43 08 74 0c ba 80 e1 73 c1 89 d8 e8 1a ff ff ff <c7> 43 0c ff ff ff ff c7 43 08 ff ff ff ff fe 03 5b 5d c3 55 89
> [    2.864000] EIP: [<c12609f7>] do_raw_spin_unlock+0x5e/0x71 SS:ESP 0068:f780fe10
> [    2.864000] CR2: 00000000c17ebdb8
> [    2.864000] ---[ end trace 0d94f53e9dfe82f9 ]---
> [    2.948071] swapper used greatest stack depth: 1804 bytes left
> [    2.952000] Kernel panic - not syncing: Attempted to kill init!
> ============================================
>
> Looking up c17ebdb8 in System.map points to a location inside pgd_lock:
> ============================================
> $ grep c17ebd System.map
> c17ebd68 d bios_check_work
> c17ebda8 d highmem_pages
> c17ebdac D pgd_lock
> c17ebdc8 D pgd_list
> c17ebdd0 D show_unhandled_signals
> c17ebdd4 d cpa_lock
> c17ebdf0 d memtype_lock
> ============================================
>
> I've looked at the lock debugging and could not find any place that
> would look like an attempt to execute data. This would lead me to
> think that calling set_memory_nx from kernel_init somehow confuses the
> lock debugging subsystem, or set_memory_nx does not change page
> attributes in a safe manner (for example when a lock is stored inside
> the page whose attributes are being changed).

I've done some extra debugging, and it really does look like the crash
happens when we set NX on the large (2M) page that contains pgd_lock.
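
As a quick cross-check of the addresses quoted above, here is a small
sketch (Python, not kernel code) that resolves the faulting address
against the System.map excerpt and computes the large page it falls
in. The helper names are made up for illustration, and the 2M page
size is an assumption based on the PAE-style *pdpt/*pde lines in the
oops.

```python
import bisect

# Symbols copied from the System.map excerpt quoted above.
SYSTEM_MAP = [
    (0xC17EBD68, "bios_check_work"),
    (0xC17EBDA8, "highmem_pages"),
    (0xC17EBDAC, "pgd_lock"),
    (0xC17EBDC8, "pgd_list"),
    (0xC17EBDD0, "show_unhandled_signals"),
    (0xC17EBDD4, "cpa_lock"),
    (0xC17EBDF0, "memtype_lock"),
]

LARGE_PAGE_SIZE = 2 * 1024 * 1024  # assumption: 2M large pages (PAE)


def resolve(addr):
    """Return (symbol, offset) for the closest symbol at or below addr."""
    starts = [start for start, _ in SYSTEM_MAP]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        raise ValueError("address is below the first known symbol")
    start, name = SYSTEM_MAP[i]
    return name, addr - start


def large_page_base(addr):
    """Round addr down to the start of its large page."""
    return addr & ~(LARGE_PAGE_SIZE - 1)


fault = 0xC17EBDB8  # CR2 from the oops
name, offset = resolve(fault)
print(f"{fault:#x} = {name}+{offset:#x}")
print(f"large page base: {large_page_base(fault):#x}")
```

This prints pgd_lock+0xc and a page base of 0xc1600000, which is the
same address that shows up in the printk trace.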

Here is a trace from the printk calls that I added to troubleshoot this issue:
=========================
[ 3.072003] try_preserve_large_page - enter
[ 3.073185] try_preserve_large_page - address: 0xc1600000
[ 3.074513] try_preserve_large_page - 2M page
[ 3.075606] try_preserve_large_page - about to call static_protections
[ 3.076000] try_preserve_large_page - back from static_protections
[ 3.076000] try_preserve_large_page - past loop
[ 3.076000] try_preserve_large_page - new_prot != old_prot
[ 3.076000] try_preserve_large_page - the address is aligned and the number of pages covers the full range
[ 3.076000] try_preserve_large_page - about to call __set_pmd_pte
[ 3.076000] __set_pmd_pte - enter
[ 3.076000] __set_pmd_pte - address: 0xc1600000
[ 3.076000] __set_pmd_pte - about to call set_pte_atomic(*0xc18c0058(low=0x16001e3, high=0x0), (low=0x16001e1, high=0x80000000))
[lock-up here]
=========================
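
For reference, the old and new values passed to set_pte_atomic in the
trace above can be decoded bit by bit. The sketch below (Python, not
kernel code) assumes the standard x86 PAE entry layout, with NX in
bit 63, and the usual page-fault error code bits; the function names
are made up for illustration.

```python
# x86 PAE page-table entry flag bits; bit 63 is NX.
ENTRY_FLAGS = [
    (0, "P"), (1, "RW"), (2, "US"), (3, "PWT"), (4, "PCD"),
    (5, "A"), (6, "D"), (7, "PSE"), (8, "G"), (63, "NX"),
]

# x86 page-fault error code bits.
FAULT_FLAGS = [
    (0, "protection violation (page present)"),
    (1, "write access"),
    (2, "user mode"),
    (3, "reserved bit set"),
    (4, "instruction fetch"),
]


def decode_entry(low, high):
    """Return the set of flag names set in a 64-bit PAE entry."""
    entry = (high << 32) | low
    return {name for bit, name in ENTRY_FLAGS if entry & (1 << bit)}


def decode_fault(code):
    """Return the list of conditions encoded in a page-fault error code."""
    return [name for bit, name in FAULT_FLAGS if code & (1 << bit)]


old = decode_entry(0x16001E3, 0x0)
new = decode_entry(0x16001E1, 0x80000000)
print("set:    ", sorted(new - old))
print("cleared:", sorted(old - new))
print("oops 0003:", decode_fault(0x3))
```

If this reading of the bits is right, the new entry sets NX but also
drops RW, and the Oops error code 0003 describes a kernel-mode write
to a present page rather than an instruction fetch. That would be
consistent with the write to pgd_lock faulting, but it is only an
interpretation of the numbers above and has not been confirmed here.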