Re: [Xen-devel] HVMLite / PVHv2 - using x86 EFI boot entry

From: Luis R. Rodriguez
Date: Thu Apr 14 2016 - 14:40:57 EST


On Wed, Apr 13, 2016 at 09:01:32PM -0400, Konrad Rzeszutek Wilk wrote:
> On Thu, Apr 14, 2016 at 12:23:17AM +0200, Luis R. Rodriguez wrote:
> > On Wed, Apr 13, 2016 at 05:08:01PM -0400, Konrad Rzeszutek Wilk wrote:
> > > On Wed, Apr 13, 2016 at 10:40:55PM +0200, Luis R. Rodriguez wrote:
> > > > On Wed, Apr 13, 2016 at 02:56:29PM -0400, Konrad Rzeszutek Wilk wrote:
> > > > > On Wed, Apr 13, 2016 at 08:29:51PM +0200, Luis R. Rodriguez wrote:
> > > >
> > > > and we also want to address dead code issues which pvops simply folded
> > > > under the rug. The dead code concerns may exist still for hvmlite, so
> > > > unless someone is willing to make a bold claim there is none, its
> > > > something to consider.
> > >
> > > What is this dead code you speak of?
> >
> > Kasan is dead code to Xen. If you boot x86 Xen with Kasan enabled
>
> For Xen PV guests,

That's right. For 5 years this will be a bomb. That went unnoticed
and I feel I have to pull hair now to try to get folks to fix this.

How many other issues will go in which will explode during this 5 year
time line? How can we proactively address a solution to this now so we
avoid this in future ?

Do you believe me now its a real issue?

Fortunately I have a proactive solution for pvops now in my pipeline that
should help avoid us having to blow more things up on Xen but also that should
cause no headaches on behalf of x86 developers. But, reason I have been so
engaged on HVMLite design review is I want to ensure we take the lessons
learned from pvops and avoid this for an architecture that will be def-facto
Xen on Linux 5 years from now.

Not bringing this up or addressing this now for HVMLite / PVH2 would simply be
silly, and since it wasn't addressed in pvops I obviously have to ensure I
convince enough people it was a real issue and ensure that we have enough
semantics available to address it.

Part of the semantics question, which has made my quest hard, was the use of
semantics for virtualization for code on early boot and later in boot has been
rather sloppy, so we have recently needed to address some of these gaps. Some
of these discussions have however been productive, as I'll explain to George
soon regarding his DT questions. The discussion is not over though and we need
to ensure that if we need semantics for HVMLite we'll have them available in
*clean* way. One of the things where early semantics and design to address
these issue help in a proactive manner is to address a clean boot entry -- and
that's also why I've been so pedantic over review of the new HVMlite boot
entry.

> > Xen explodes. Quick question, will Kasan not explode with HVMLite ?
>
> .. but for HVMLite of Xen HVM guest Kasan will run.

Are you sure? Should that mean that Xen HVM should be fine as well. Does that
work? Are we sure?

> > MTRR used to be dead code concern but since we have vetted most of that code
> > now we are pretty certain that code should never run now.
> >
> > KASLR may be -- not sure as I haven't vetted that, but from
> > what I have loosely heard maybe.
> >
> > VGA code will be dead code for HVMlite for sure as the design doc
> > says it will not run VGA, the ACPI flag will be set but the check
> > for that is not yet on Linux. That means the VGA Linux code will
> > be there but we have no way to ensure it will not run nor that
> > anything will muck with it.
>
> <shrugs> The worst it will do is try to read non-existent registers.

Really ?

Is that your position on all other possible dead code that may have been
possible on old Xen PV guests as well ?

As I hinted, after thinking about this for a while I realized that dead code is
likely present on bare metal as well even without virtualization, specially if
you build large single kernels to support a wide array of features which only
late at run time can be determined. Virtualization and the pvops design just
makes this issue much more prominent. If there are other areas of code exposed
that actually may run, but we are not sure may run, I figured some other folks
with a bit more security conscience minds might even simply take the position
it may be a security risk to leave that code exposed. So to take a position
that 'the worst it will do is try to read non-existent registers' -- seems
rather shortsighted here.

Anyway for more details on thoughts on this refer to the this wiki:

http://kernelnewbies.org/KernelProjects/kernel-sandboxing

Since this is now getting off topic please send me your feedback on another
thread for the non-virtualization aspects of this if that interests you. My
point here was rather to highlight the importance of clear semantics due to
virtualization in light of possible dead code.

> The VGA code should be able to handle failures like that and
> not initialize itself when the hardware is dead (or non-existent).

That's right, its through ACPI_FADT_NO_VGA and since its part of the HVMLite
design doc we want HVMlite design to address ACPI_FADT_NO_VGA properly. I've
paved the way for this to be done cleanly and easily now, but that code should
be in place before HVMLite code gets merged.

Does domU for old Xen PV also set ACPI_FADT_NO_VGA as well ? Should it ?

> > To be clear -- dead code concerns still exist even without
> > virtualization solutions, its just that with virtualization
> > this stuff comes up more and there has been no proactive
> > measures to address this. The question of semantics here is
> > to see to what extent we need earlier boot code annotations
> > to ensure we address semantics proactively.
>
> I think what you mean by dead code is another word for
> hardware test coverage?

No, no, its very different given that with virtualization the scope of possible
dead code is significant and at run time you are certain a huge portion of code
should *never ever* run. So for instance we know once we boot bare metal none
of the Xen stuff should ever run, likewise on Xen dom0 we know none of the KVM
/ bare-metal only stuff should never run, when on Xen domU, none of the Xen
domU-only stuff should ever run.

> > > The entrace point in Linux "proper" is startup_32 or startup_64 - the same
> > > path that EFI uses.
> > >
> > > If you were to draw this (very simplified):
> > >
> > > a)- GRUB2 ---------------------\ (creates an bootparam structure)
> > > \
> > > +---- startup_32 or startup_64
> > > b) EFI -> Linux EFI stub -------/
> > > (creates bootparm) /
> > > c) GRUB2-EFI -> Linux EFI----/
> > > stub /
> > > d) HVMLite ----------------/
> > > (creates bootparm)
> >
> > b) and d) might be able to share paths there...
>
> No idea. You would have to look in the assembler code to
> figure that out.

And that's a pain, I get it.

I spotted one place already -- will note to Boris. I think Matt may have more
ideas ;)

> > d) still has its own entry, it does more than create boot params.
>
> d) purpose is to create boot params.

OK good to know that's the only thing we acknowledge it *should* do.

> It may do more as nobody likes to muck in assembler and make bootparams from
> within assembler.

OK -- it does do more and that's where we'd like to avoid duplication if
possible and yet-another-entry (TM).

> > > (I am not sure about the c) - I would have to look in source to
> > > be source). There is also LILO in this, but I am not even sure if
> > > works anymore.
> > >
> > >
> > > What you have is that every entry point creates the bootparams
> > > and ends up calling startup_X. The startup_64 then hit the rest
> > > of the kernel. The startp_X code is the one that would setup
> > > the basic pagetables, segments, etc.
> >
> > Sure.. a full diagram should include both sides and how when using
> > a custom entry one runs the risk of skipping a lot of code setup.
>
> But it does not skip a lot of code setup. It starts exactly
> at the same code startup that _all_ bootstraping code start at.

Its a fair point.

> > There is that and as others have pointed out how certain guests types
> > are assumed to not have certain peripherals, and we have no idea
> > to ensure certain old legacy code may not ever run or be accessed
> > by drivers.
>
> Ok, but that is not at code setup. That is later - when device
> drivers are initialized. This no different than booting on
> some hardware with missing functionality. ACPI, PCI and PnP
> PnP are set there to help OSes discover this.

To a certain extent this is true, but there may things which are missing still.

We really have no idea what the full list of those things are.

It may be that things may have been running for ages without notice of an issue
or that only under certain situations will certain issues or bugs trigger a
failure. For instance, just yesterday I was Cc'd on a brand-spanking new legacy
conflict [0], caused by upstream commit 8c058b0b9c34d8c ("x86/irq: Probe for
PIC presence before allocating descs for legacy IRQs") merged on v4.4 where
some new code used nr_legacy_irqs() -- one proposed solution seems to be that
for Xen code NR_IRQS_LEGACY should be used instead is as it lacks PCI [1] and
another was to peg the legacy requirements as a quirk on the new x86 platform
legacy quirk stuff [2]. Are other uses of nr_legacy_irqs() correct ? Are
we sure ?

[0] http://lkml.kernel.org/r/570F90DF.1020508@xxxxxxxxxx
[1] https://lkml.org/lkml/2016/4/14/532
[2] http://lkml.kernel.org/r/1460592286-300-1-git-send-email-mcgrof@xxxxxxxxxx

> > > > How we address semantics then is *very* important to me.
> > >
> > > Which semantics? How the CPU is going to be at startup_X ? Or
> > > how the CPU is going to be when EFI firmware invokes the EFI stub?
> > > Or when GRUB2 loads Linux?
> >
> > What hypervisor kicked me and what guest type I am.
>
> cpuid software flags have that - and that semantics has been
> there for eons.

We cannot use cpuid early in asm code, I'm looking for something we
can even use on asm early in boot code, on x86 the best option we
have is the boot_params, but I've even have had issues with that
early in code, as I can only access it after load_idt() where I
described my effort to unify Xen PV and x86_64 init paths [3].

[3] http://lkml.kernel.org/r/CAB=NE6VTCRCazcNpCdJ7pN1eD3=x_fcGOdH37MzVpxkKEN5esw@xxxxxxxxxxxxxx

> > Let me elaborate more below.
> >
> > > That (those bootloaders) is clearly defined. The URL I provided
> > > mentions the HVMLite one. The Documentation/x86/boot.c mentions
> > > what the semantics are to expected when providing an bootstrap
> > > (which is what HVMLitel stub code in Linux would write against -
> > > and what EFI stub code had been written against too).
> > > >
> > > > > > I'll elaborate on this but first let's clarify why a new entry is used for
> > > > > > HVMlite to start of with:
> > > > > >
> > > > > > 1) Xen ABI has historically not wanted to set up the boot params for Linux
> > > > > > guests, instead it insists on letting the Linux kernel Xen boot stubs fill
> > > > > > that out for it. This sticking point means it has implicated a boot stub.
> > > > >
> > > > >
> > > > > Which is b/c it has to be OS agnostic. It has nothing to do 'not wanting'.
> > > >
> > > > It can still be OS agnostic and pass on type and custom data pointer.
> > >
> > > Sure. It has that (it MUST otherwise how else would you pass data).
> > > It is documented as well http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,xen.h.html#incontents_startofday
> > > (see " Start of day structure passed to PVH guests in %ebx.")
> >
> > The design doc begs for a custom OS entry point though.
>
> That is what the ELF Note has.

Right, but I'm saying that its rather silly to be adding entry points if
all we want the code to do is copy the boot params for us. The design
doc requires a new entry, and likewise you'd need yet-another-entry if
HVMLite is thrown out the window and come back 5 years later after new
hardware solutions are in place and need to redesign HVMLite. Kind of
where we are with PVH today. Likewise if other paravirtualization
developers want to support Linux and want to copy your strategy they'd
add yet-another-entry-point as well.

This is dumb.

> > If we had a single 'type' and 'custom data' passed to the kernel that
> > should suffice for the default Linux entry point to just pivot off
> > of that and do what it needs without more entry points. Once.
>
> And what about ramdisk? What about multiple ramdisks?
> What about command line? All of that is what bootparams
> tries to unify on Linux. But 'bootparams' is unique to Linux,
> it does not exist on FreeBSD. Hence some stub code to transplant
> OS-agnostic simple data to OS-specific is neccessary.

If we had a Xen ABI option where *all* that I'm asking is you pass
first:

a) hypervisor type
b) custom data pointer

We'd be able to avoid adding *any* entry point and just address
the requirements as I noted with pre / post stubs for the type.
This would require an x86 boot protocol bump, but all the issues
creeping up randomly I think that's worth putting on the table now.

And maybe we don't want it to be hypervisor specific, perhaps there are other
*needs* for custom pre-post startup_32()/startup_64() stubs.

To avoid extending boot_params further I figured perhaps we can look
at EFI as another option instead. If we are going to drop all legacy
PV support from the kernel (not the hypervisor) and require hardware
virtualization 5 years from now on the Linux kernel, it doesn't seem
to me far fetched to at the very least consider using an EFI entry
instead, specially since all it does is set boot params and we can
make re-use this for HVMLite too.

Luis