Re: [PATCH] xen/x86: Adjust stack pointer in xen_sysexit

From: Andy Lutomirski
Date: Mon Nov 16 2015 - 14:03:51 EST


On Mon, Nov 16, 2015 at 8:25 AM, Boris Ostrovsky
<boris.ostrovsky@xxxxxxxxxx> wrote:
> On 11/15/2015 01:02 PM, Andy Lutomirski wrote:
>>
>> On Nov 13, 2015 5:23 PM, "Boris Ostrovsky" <boris.ostrovsky@xxxxxxxxxx>
>> wrote:
>>>
>>>
>>>
>>> On 11/13/2015 06:26 PM, Andy Lutomirski wrote:
>>>>
>>>> On Fri, Nov 13, 2015 at 3:18 PM, Boris Ostrovsky
>>>> <boris.ostrovsky@xxxxxxxxxx> wrote:
>>>>>
>>>>> After 32-bit syscall rewrite, and specifically after commit
>>>>> 5f310f739b4c
>>>>> ("x86/entry/32: Re-implement SYSENTER using the new C path"), the stack
>>>>> frame that is passed to xen_sysexit is no longer a "standard" one (i.e.
>>>>> it's not pt_regs).
>>>>>
>>>>> We need to adjust it so that subsequent xen_iret can use it.
>>>>
>>>> I'm wondering if this should be more straightforward:
>>>>
>>>> movq %rsp, %rdi
>>>> call do_fast_syscall_32
>>>> testl %eax, %eax
>>>> jz .Lsyscall_32_done
>>>>
>>>> /* Opportunistic SYSRET */
>>>> sysret32_from_system_call:
>>>> XEN_DO_SYSRET32
>>>>
>>>> where XEN_DO_SYSRET32 is a simple pv op that, on Xen, jumps to a
>>>> variant of Xen's iret path that knows that the fast path is okay.
>>>
>>>
>>>
>>> This patch is for 32-bit kernel. I actually haven't looked at compat code
>>> (probably because our tests don't try that), I need to do that too.
>>
>> In 4.4, it's almost identical (which was part of the point of this
>> whole series). We use sysret32 instead of sysexit, but the underlying
>> structure is the same: munge the stack frame and register state
>> appropriately to use the fast return instruction in question and then
>> execute it. In both cases, the only real difference from the IRET
>> path is that we're willing to lose the values of some subset of cx,
>> dx, and (on 64-bit kernels) r11.
>
>
>
> So it turned out that for compat mode we don't need to do anything since
> xen_sysret32 doesn't assume any stack format (or, rather, it assumes that it
> can't be used) and builds the IRET frame itself.
>

It's still a waste of effort, though. Also, I'd eventually like the
number of places in Xen code in which rsp/esp is invalid to be exactly
zero, and this approach makes this harder or even impossible.

>
>>
>>> As for XEN_DO_SYSRET32 --- we'd presumably need to have a nop for
>>> baremetal otherwise current paravirt op will use native_usergs_sysret32 (for
>>> compat code). Which means a new pv_op, I think.
>>
>> Agreed, unless...
>>
>> Does Xen have a cpufeature? Using ALTERNATIVE instead of a pvop could
>> be easier to follow and be less code at the same time. Frankly,
>> following the control flow from asm through the pre-paravirt-patching
>> and post-paravirt-patching variants and into the final targets is
>> getting a little bit old, and ALTERNATIVE is crystal clear in
>> comparison (and has all the interesting info inline with the rest of
>> the asm). Of course, it doesn't work early in boot, but that's fine
>> for anything involving user/kernel switches.
>
>
>
> We don't currently have a Xen-specific CPU feature. We could, in principle,
> add it but we can't replace all of current paravirt patching with a single
> feature since PVH guests use a subset of existing pv ops (and in the future
> it may become even more fine-grained).
>
> And I don't think we should go ALTERNATIVE route for one set of features and
> keep pv ops for the rest --- it should be either one or the other.

Does PVH hook into the entry asm code at all? I thought it was just
boot code and drivers.

In any case, someone needs to do some serious review and cleanup on
the whole paravirt op mess. We have a bunch of paravirt ops that
serve little purpose.

The paravirt infrastructure is a bit weird, too: it seems to
effectively have four states for each patch site. There's:

1. The initial state, which is unoptimized and works on native.
Presumably any of these that happen early also need to work, if
slowly, on Xen.

2. The Xen state without text patching. I'm not actually sure why
this exists at all. Are there pvops that need to switch too early for
us to patch the text?

3. The native patched state. This is supposedly optimal, but it
results in a few more NOPs than are really needed.

4. The Xen patched state.

Alternatives have only two states, and the code is much easier to
understand. Also, alternatives avoid things like:

...
SWAPGS
...

The reader surely doesn't remember that this isn't guaranteed to be a
swapgs instruction on native. Using:

ALTERNATIVE "swapgs", "", X86_FEATURE_XENPV

would be safer (it would get rid of the SWAPGS_UNSAFE_STACK mess) and
much clearer. We could hide *that* behind a macro and no one would be
confused. (Well, they'd be confused by the fact that Xen PV handles
gsbase very differently from native, but that has nothing to do with
the macro.)
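
To make the two-state point concrete, here is a rough userspace sketch in
C of what an alternatives pass does to a patch site. It is a toy model,
not the kernel's implementation: struct toy_alt_site,
toy_apply_alternatives() and TOY_FEATURE_XENPV are all hypothetical names
made up for this example, and the real machinery does considerably more
than this. Each site is either left as the default bytes or overwritten
with the replacement and padded with NOPs, and nothing else.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP_BYTE 0x90u          /* single-byte x86 NOP */
#define TOY_FEATURE_XENPV 0     /* hypothetical feature bit, for illustration only */

/* One patch site: where the default bytes live and what may replace them. */
struct toy_alt_site {
    size_t offset;          /* offset of the site within the text buffer */
    size_t len;             /* length of the default instruction(s) */
    const uint8_t *repl;    /* replacement bytes; may be NULL when repl_len is 0 */
    size_t repl_len;        /* must not exceed len */
    unsigned feature;       /* feature bit that selects the replacement */
};

/*
 * Walk the site table once. A site whose feature bit is clear keeps its
 * default bytes (state 1). A site whose feature bit is set gets the
 * replacement, with the remainder of the site filled with NOPs (state 2).
 */
void toy_apply_alternatives(uint8_t *text,
                            const struct toy_alt_site *sites,
                            size_t nsites, uint64_t features)
{
    for (size_t i = 0; i < nsites; i++) {
        const struct toy_alt_site *s = &sites[i];

        assert(s->repl_len <= s->len);

        if (!(features & (UINT64_C(1) << s->feature)))
            continue;       /* feature absent: leave the default in place */

        if (s->repl_len)
            memcpy(text + s->offset, s->repl, s->repl_len);
        memset(text + s->offset + s->repl_len, NOP_BYTE,
               s->len - s->repl_len);
    }
}
```

With a three-byte swapgs (0f 01 f8) as the default and an empty
replacement, a run with the feature bit clear leaves the swapgs alone,
and a run with it set turns the site into three NOPs.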

I think we could convert piecemeal, and I wonder if this new patch for
32-bit native on 4.4 (this is needed for 4.4, right?) would be a good
starting point. Borislav, what do you think? Would you be okay with
adding a Xen PV pseudofeature?

--Andy