Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
From: Stefan Bader
Date: Fri Aug 22 2014 - 05:21:28 EST
On 21.08.2014 18:03, Kees Cook wrote:
> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@xxxxxxxxxx> wrote:
>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
>>> <stefan.bader@xxxxxxxxxxxxx> wrote:
>>>> On 12.08.2014 19:28, Kees Cook wrote:
>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@xxxxxxxxxxxxx> wrote:
>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>>>>>> as a follow up error).
>>>>>>>>
>>>>>>>> Details can be seen in [1] but basically this is always some portion of a
>>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>>>>>
>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>>>>>> to add "nokaslr" to the kernel command-line.
>>>>>>>
>>>>>>> Maybe it's overlapping with regions of the virtual address space
>>>>>>> reserved for Xen? What is the VA that fails?
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>>>>>> not sure whether it might be somewhere in the stacktraces in the report).
>>>>>>
>>>>>> The kernel-command line does not seem to be looked at. It should put something
>>>>>> into dmesg and that never shows up. Also today's random feature is other PV
>>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
>>>>>
>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>>>>> there are other reserved regions that aren't listed in the e820, it'll
>>>>> need to locate and skip them.
>>>>>
>>>>> -Kees
>>>>>
>>>> Making my little steps towards more understanding I figured out that it isn't
>>>> the code that does the relocation. Even with that completely disabled there were
>>>> the vmalloc issues. What causes it seems to be the default of the upper limit
>>>> and that this changes the split between kernel and modules to 1G+1G instead of
>>>> 512M+1.5G. That is the reason why nokaslr has no effect.
>>>
>>> Oh! That's very interesting. There must be some assumption in Xen
>>> about the kernel VM layout then?
>>
>> No. I think most of the changes that look at PTE and PMDs are all
>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
>> too aggressive
>
> (Sorry I had to cut our chat short at Kernel Summit!)
>
> It sounded like there was another region of memory that Xen was setting
> aside for page tables? But Stefan's investigation seems to show this
> isn't about layout at boot (since the kaslr=0 case means no relocation
> is done). Sounds more like the split between kernel and modules area,
> so I'm not sure how the memory area after the initrd would be part of
> this. What should next steps be, do you think?

Maybe layout, but not about placement of the kernel. Basically, leaving KASLR
enabled but shrinking the possible range back to the original kernel/module
split works fine as well.

I am bouncing between feeling close to understanding and being confused. Konrad
suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
round. The warning that occurs first indicates that a PTE that was obtained for
some vmalloc mapping is not unused (0) as expected. So it feels rather like
some cleanup has *not* been done.

Let me think aloud a bit... What seems to cause this is the change of the
kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of
vsyscalls and a 2M hole at the end). In vaddr terms this means:
Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB) kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
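
As a sanity check of the sizes quoted above, here is a small Python sketch (not
kernel code; the helper and dictionary names are made up for illustration) that
recomputes them from the address ranges:

```python
MB = 1 << 20

def size_mb(start, end_inclusive):
    """Size in MB of the inclusive virtual address range [start, end_inclusive]."""
    return (end_inclusive - start + 1) // MB

# Address ranges copied from the before/after layout listed above.
layouts = {
    "before": {
        "kernel":  (0xffffffff80000000, 0xffffffff9fffffff),
        "modules": (0xffffffffa0000000, 0xffffffffff5fffff),
    },
    "after": {
        "kernel":  (0xffffffff80000000, 0xffffffffbfffffff),
        "modules": (0xffffffffc0000000, 0xffffffffff5fffff),
    },
}

sizes = {
    name: {region: size_mb(*bounds) for region, bounds in regions.items()}
    for name, regions in layouts.items()
}

# What remains above the module area up to the top of the 64-bit address
# space: the 8M of vsyscalls plus the 2M hole.
tail_mb = ((1 << 64) - 0xffffffffff600000) // MB

for name, regions in sizes.items():
    print(name, regions)
print("tail above modules (MB):", tail_mb)
```
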
Now, *if* I got this right, this means the kernel starts at a vaddr that is
mapped through:
PGD[511]->PUD[510]->PMD[0]->PTE[0]
In the old layout the module vaddr area would start within the same PUD entry,
but with the change the kernel covers all of PUD[510], and the module vaddr
area, the vsyscalls and the hole are covered by PUD[511].
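
To double-check which page-table entries are involved, a short Python sketch
can split a vaddr into its four table indices. It assumes standard x86_64
4-level paging with 4K pages (9 index bits per level); the function name is
only illustrative:

```python
def pt_indices(vaddr):
    """Return (PGD, PUD, PMD, PTE) indices for a 64-bit vaddr.

    Assumes x86_64 4-level paging with 4K pages: each level uses 9 bits,
    starting at bit 39 (PGD), 30 (PUD), 21 (PMD) and 12 (PTE).
    """
    return tuple((vaddr >> shift) & 0x1ff for shift in (39, 30, 21, 12))

kernel_start = pt_indices(0xffffffff80000000)
modules_old  = pt_indices(0xffffffffa0000000)  # module area start, 512M:1.5G split
modules_new  = pt_indices(0xffffffffc0000000)  # module area start, 1G:1G split

print("kernel start :", kernel_start)
print("modules (old):", modules_old)
print("modules (new):", modules_new)
```

With the old split the module area begins halfway into the same PMD page as
the kernel; with the new split it begins at the first entry of the next PUD
slot.
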
xen_cleanhighmap operates only on level2_kernel_pgt which (speculating a bit
since I am not sure I understand enough of the details) I believe is the one
PMD page pointed at by PGD[511]->PUD[510]. That could mean that before the
change xen_cleanhighmap may have touched some (the initial 512M) of the module
vaddr space, but not after the change. Maybe that also means it always should
have covered more, but this would not be observed as long as modules did not
claim more than 512M? I still need to check the vaddr ranges for which
xen_cleanhighmap is actually called. The modules vaddr space would normally not
be touched (only with DEBUG set). I moved that to be done unconditionally, but
then this might be of no use when it needs to cover a different PMD...
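
To put numbers on that guess, this Python sketch computes how much of the
module vaddr area falls inside the 1G window mapped by a single PMD page
starting at ffffffff80000000. That xen_cleanhighmap can only reach this one
window is my assumption from above, not something verified, and the names are
illustrative:

```python
MB = 1 << 20
GB = 1 << 30

# Assumed reach of the cleanup: the 1G window behind one PMD page that
# starts at the kernel text mapping.
WINDOW_START = 0xffffffff80000000
WINDOW_END = WINDOW_START + GB       # exclusive

MODULES_END = 0xffffffffff600000     # exclusive end of the module area

def overlap_mb(start, end_exclusive):
    """MB of [start, end_exclusive) that lie inside the assumed 1G window."""
    lo = max(start, WINDOW_START)
    hi = min(end_exclusive, WINDOW_END)
    return max(0, hi - lo) // MB

old_overlap = overlap_mb(0xffffffffa0000000, MODULES_END)  # 512M:1.5G split
new_overlap = overlap_mb(0xffffffffc0000000, MODULES_END)  # 1G:1G split

print("module MB inside window, old split:", old_overlap)
print("module MB inside window, new split:", new_overlap)
```

If the assumption holds, the first 512M of module space used to be within
reach of the cleanup, and with the new split none of it is.
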
Really not sure here. But maybe a starter for others...
-Stefan