Re: [Xen-devel] HVMLite / PVHv2 - using x86 EFI boot entry
From: Konrad Rzeszutek Wilk
Date: Thu Apr 14 2016 - 22:04:19 EST
On Thu, Apr 14, 2016 at 10:56:19PM +0200, Luis R. Rodriguez wrote:
> On Thu, Apr 14, 2016 at 03:56:53PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Thu, Apr 14, 2016 at 08:40:48PM +0200, Luis R. Rodriguez wrote:
> > > On Wed, Apr 13, 2016 at 09:01:32PM -0400, Konrad Rzeszutek Wilk wrote:
> > > > On Thu, Apr 14, 2016 at 12:23:17AM +0200, Luis R. Rodriguez wrote:
> > > > > VGA code will be dead code for HVMlite for sure as the design doc
> > > > > says it will not run VGA, the ACPI flag will be set but the check
> > > > > for that is not yet on Linux. That means the VGA Linux code will
> > > > > be there but we have no way to ensure it will not run nor that
> > > > > anything will muck with it.
> > > >
> > > > <shrugs> The worst it will do is try to read non-existent registers.
> > >
> > > Really ?
> > >
> > > Is that your position on all other possible dead code that may have been
> > > possible on old Xen PV guests as well ?
> >
> > This is not just with Xen - it is the same with other device drivers that
> > are invoked on baremetal for hardware that is not present anymore.
>
> Indeed, however virtualization makes this issue much more prominent.
I suppose - as it only exposes a certain type of platform and nothing
else.
>
> > > As I hinted, after thinking about this for a while I realized that dead code is
> > > likely present on bare metal as well even without virtualization, especially if
> >
> > Yes!
> > > you build large single kernels to support a wide array of features which only
> > > late at run time can be determined. Virtualization and the pvops design just
> > > makes this issue much more prominent. If there are other areas of code exposed
> > > that actually may run, but we are not sure may run, I figured some other folks
> > > with a bit more security-conscious minds might even simply take the position
> > > it may be a security risk to leave that code exposed. So to take a position
> > > that 'the worst it will do is try to read non-existent registers' -- seems
> > > rather shortsighted here.
> >
> > Security conscious people trim their CONFIG.
>
> Not all Linux distributions want to do this, the more binaries the
> higher the cost to test / vet.
OK, but Linux distributions have many goals - and are pulled in
different directions so they cannot always achieve the 'low footprint -
small amount of code to do inspection from security standpoint'
>
> > > Anyway for more details on thoughts on this refer to this wiki:
> > >
> > > http://kernelnewbies.org/KernelProjects/kernel-sandboxing
> > >
> > > Since this is now getting off topic please send me your feedback on another
> > > thread for the non-virtualization aspects of this if that interests you. My
> > > point here was rather to highlight the importance of clear semantics due to
> > > virtualization in light of possible dead code.
> >
> > Thank you.
> > >
> > > > The VGA code should be able to handle failures like that and
> > > > not initialize itself when the hardware is dead (or non-existent).
> > >
> > > That's right, it's through ACPI_FADT_NO_VGA and since it's part of the HVMLite
> > > design doc we want HVMlite design to address ACPI_FADT_NO_VGA properly. I've
> > > paved the way for this to be done cleanly and easily now, but that code should
> > > be in place before HVMLite code gets merged.
> > >
> > > Does domU for old Xen PV also set ACPI_FADT_NO_VGA as well ? Should it ?
> >
> > It does not. Not sure - it seems to have worked fine for the last ten
> > years?
>
> Maybe HVMLite will need it enabled then too, just for bug parity.
<shrugs> Sure.
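
For reference, a minimal sketch of the ACPI_FADT_NO_VGA check discussed above. The flag is bit 2 ("VGA Not Present") of the FADT IA-PC Boot Architecture Flags. This is standalone C, not kernel code: vga_probe_allowed() is a hypothetical helper name, and the raw flags value is assumed to have been read out of the FADT already.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 2 of the FADT IA-PC Boot Architecture Flags: "VGA Not Present".
 * Linux's ACPICA headers use the same value for ACPI_FADT_NO_VGA. */
#define FADT_NO_VGA (1u << 2)

/* Hypothetical helper: true when legacy VGA probing may proceed. */
bool vga_probe_allowed(uint16_t boot_flags)
{
    return (boot_flags & FADT_NO_VGA) == 0;
}

/* Tiny self-check showing the intended behaviour. */
void vga_probe_selfcheck(void)
{
    assert(vga_probe_allowed(0x0000));        /* no flags set: probing allowed */
    assert(!vga_probe_allowed(FADT_NO_VGA));  /* firmware says there is no VGA */
}
```

A guest type that never sets the flag (old Xen PV, per the answer above) would fall through to whatever failure handling the VGA code has on its own.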
>
> > > > > To be clear -- dead code concerns still exist even without
> > > > > virtualization solutions, it's just that with virtualization
> > > > > this stuff comes up more and there have been no proactive
> > > > > measures to address this. The question of semantics here is
> > > > > to see to what extent we need earlier boot code annotations
> > > > > to ensure we address semantics proactively.
> > > >
> > > > I think what you mean by dead code is another word for
> > > > hardware test coverage?
> > >
> > > No, no, it's very different given that with virtualization the scope of possible
> > > dead code is significant and at run time you are certain a huge portion of code
> > > should *never ever* run. So for instance we know once we boot bare metal none
> > > of the Xen stuff should ever run, likewise on Xen dom0 we know none of the KVM
> > > / bare-metal only stuff should ever run, when on Xen domU, none of the Xen
> >
> > What is this 'bare metal only stuff' you speak of? On Xen dom0 most of
> > the baremetal code is running.
>
> A lot, not all. In the past folks added stubs (used to be paravirt_enabled()
> checks) to some code, but we are simply not sure of other possible conflicts.
> This is a known unknown if you will.
>
> > In fact that is how the device drivers work. Or are you talking about low
> > level baremetal code? If so, then PVH/HVMLite does that - it skips pvops so
> > that it can run this 'low-level baremetal code'
>
> Are you telling me that HVMLite has no dead code issues ?
You said earlier that baremetal has dead code issues. Then by extension
_any_ execution path has dead code issues.
..snip..
> > > > > There is that and as others have pointed out how certain guests types
> > > > > are assumed to not have certain peripherals, and we have no idea
> > > > > to ensure certain old legacy code may not ever run or be accessed
> > > > > by drivers.
> > > >
> > > > Ok, but that is not at code setup. That is later - when device
> > > > drivers are initialized. This no different than booting on
> > > > some hardware with missing functionality. ACPI, PCI and PnP
> > > > are there to help OSes discover this.
> > >
> > > To a certain extent this is true, but there may be things which are still missing.
> >
> > Like?
>
> That's the thing, I had a list of things to look out for and then things
> I ran across over code inspection. We need more work to be sure we're
> really well covered.
>
> Are you *sure* we have no dead code concerns with HVMLite ?
> If there are dead code concerns are you sure there might not
> be differences between KVM and HVMLite ? Should cpuid be used to
> address differences ? Will that enable to distinguish between
> hybrid versions of HVMLite ? Are we sure ?
HVMLite CPU semantics will be the same as what a baremetal CPU
semantics are.
Platform wise it will be different - as in, instead of say
having a speaker (to emulate it) or RTC clock (again, another
thing to emulate), or say IDE controller (again, another
thing to emulate), or Realtek network card (again, another
thing to emulate) - it has none of those.
[Keep in mind 'another thing to emulate', means 'another
@$@() thing in QEMU that could be a security bug']
So it differs from a consumer x86 platform in that it has
none of the 'legacy' stuff. And it requires PV drivers to
function. And since it requires PV drivers to function
only OSes that have those can use this mode.
>
> > > We really have no idea what the full list of those things are.
> >
> > Ok, it sounds like you have some homework.
>
> We all do.
>
> > > It may be that things may have been running for ages without notice of an issue
> > > or that only under certain situations will certain issues or bugs trigger a
> > > failure. For instance, just yesterday I was Cc'd on a brand-spanking new legacy
> > > conflict [0], caused by upstream commit 8c058b0b9c34d8c ("x86/irq: Probe for
> > > PIC presence before allocating descs for legacy IRQs") merged on v4.4 where
> > > some new code used nr_legacy_irqs() -- one proposed solution seems to be that
> > > for Xen code NR_IRQS_LEGACY should be used instead, as it lacks PCI [1] and
> > > another was to peg the legacy requirements as a quirk on the new x86 platform
> > > legacy quirk stuff [2]. Are other uses of nr_legacy_irqs() correct ? Are
> > > we sure ?
> >
> > And how is this example related to 'early bootup' path?
> >
> > It is not.
>
> For early boot code -- it is not. HVMLite is not merged, and PVH was never
> completed.. so how are you sure we won't have any issues there ?
If we did not have issues we would be out of jobs.
But this is a separate topic - it is an issue about device drivers and
the assumptions they have. And those assumptions are not always
true (even with normal hardware).
>
> > It is in fact related to PV codepaths - which PVH/HVMLite and HVM guests
> > do not exercise.
>
> Agreed.
>
> > > [0] http://lkml.kernel.org/r/570F90DF.1020508@xxxxxxxxxx
> > > [1] https://lkml.org/lkml/2016/4/14/532
> > > [2] http://lkml.kernel.org/r/1460592286-300-1-git-send-email-mcgrof@xxxxxxxxxx
> > >
> > > > > > > How we address semantics then is *very* important to me.
> > > > > >
> > > > > > Which semantics? How the CPU is going to be at startup_X ? Or
> > > > > > how the CPU is going to be when EFI firmware invokes the EFI stub?
> > > > > > Or when GRUB2 loads Linux?
> > > > >
> > > > > What hypervisor kicked me and what guest type I am.
> > > >
> > > > cpuid software flags have that - and that semantics has been
> > > > there for eons.
> > >
> > > We cannot use cpuid early in asm code, I'm looking for something we
> >
> > ?! Why!?
>
> What existing code uses it? If there is nothing you are still certain
> it should work ? Would that work for old PV guest as well BTW ?
Yeah. For HVM/HVMLite it traps to the hypervisor.
For old PV guests it is unwise to use it as it goes straight to
the hardware (as PV guests run in ring3 - they are considered
'userspace' and neither Intel nor AMD traps on 'cpuid' in ring3
- unless you run in a VMX container).
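
As an illustration of the cpuid-based detection described above, here is a sketch that decodes the hypervisor vendor signature. It assumes the usual convention: CPUID leaf 1 reports "hypervisor present" in ECX bit 31, and leaf 0x40000000 returns a 12-byte signature in EBX, ECX, EDX. The helper names are made up for this example, and the register values are passed in rather than read, so obtaining them (for instance with the __cpuid macro from GCC's <cpuid.h>) is left out. Xen may also place its leaves at an offset from 0x40000000, which this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* CPUID leaf 1, ECX bit 31: set by hypervisors, reads as 0 on bare metal. */
#define CPUID_1_ECX_HYPERVISOR (1u << 31)

enum hv_kind { HV_NONE, HV_XEN, HV_KVM, HV_UNKNOWN };

/* Unpack one register into 4 signature bytes, lowest byte first. */
static void unpack_reg(uint32_t reg, char *dst)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (char)((reg >> (8 * i)) & 0xff);
}

/* Build the NUL-terminated signature from the leaf 0x40000000 registers. */
void hv_signature(uint32_t ebx, uint32_t ecx, uint32_t edx, char out[13])
{
    assert(out != NULL);
    unpack_reg(ebx, out + 0);
    unpack_reg(ecx, out + 4);
    unpack_reg(edx, out + 8);
    out[12] = '\0';
}

/* Classify the platform from leaf 1 ECX and the decoded signature. */
enum hv_kind hv_classify(uint32_t leaf1_ecx, const char *sig)
{
    if (!(leaf1_ecx & CPUID_1_ECX_HYPERVISOR))
        return HV_NONE;
    if (strcmp(sig, "XenVMMXenVMM") == 0)
        return HV_XEN;
    if (strcmp(sig, "KVMKVMKVM") == 0)   /* KVM pads its signature with NULs */
        return HV_KVM;
    return HV_UNKNOWN;
}
```

This only makes sense where 'cpuid' traps to the hypervisor (HVM/HVMLite). For an old PV guest the instruction reaches the hardware directly, so neither the bit nor the signature can be trusted there.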
>
> > > can even use on asm early in boot code, on x86 the best option we
> > > have is the boot_params, but I've even had issues with that
> > > early in code, as I can only access it after load_idt() where I
> > > described my effort to unify Xen PV and x86_64 init paths [3].
> >
> > Well, Xen PV skips x86_64_start_kernel..
>
> Yes, and in doing so often times people skip adding Xen PV specific
> code, as was the case with Kasan.
Right. That is an existing problem Xen PV code has.
>
> > > [3] http://lkml.kernel.org/r/CAB=NE6VTCRCazcNpCdJ7pN1eD3=x_fcGOdH37MzVpxkKEN5esw@xxxxxxxxxxxxxx
> > >
> > > > > Let me elaborate more below.
> > > > >
> > > > > > That (those bootloaders) is clearly defined. The URL I provided
> > > > > > mentions the HVMLite one. The Documentation/x86/boot.txt mentions
> > > > > > what semantics are to be expected when providing a bootstrap
> > > > > > (which is what HVMLite stub code in Linux would write against -
> > > > > > and what EFI stub code had been written against too).
> > > > > > >
> > > > > > > > > I'll elaborate on this but first let's clarify why a new entry is used for
> > > > > > > > > HVMlite to start off with:
> > > > > > > > >
> > > > > > > > > 1) Xen ABI has historically not wanted to set up the boot params for Linux
> > > > > > > > > guests, instead it insists on letting the Linux kernel Xen boot stubs fill
> > > > > > > > > that out for it. This sticking point means it has implicated a boot stub.
> > > > > > > >
> > > > > > > >
> > > > > > > > Which is b/c it has to be OS agnostic. It has nothing to do with 'not wanting'.
> > > > > > >
> > > > > > > It can still be OS agnostic and pass on type and custom data pointer.
> > > > > >
> > > > > > Sure. It has that (it MUST otherwise how else would you pass data).
> > > > > > It is documented as well http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,xen.h.html#incontents_startofday
> > > > > > (see " Start of day structure passed to PVH guests in %ebx.")
> > > > >
> > > > > The design doc begs for a custom OS entry point though.
> > > >
> > > > That is what the ELF Note has.
> > >
> > > Right, but I'm saying that it's rather silly to be adding entry points if
> > > all we want the code to do is copy the boot params for us. The design
> > > doc requires a new entry, and likewise you'd need yet-another-entry if
> > > HVMLite is thrown out the window and we come back 5 years later after new
> > > hardware solutions are in place and need to redesign HVMLite. Kind of
> >
> > Why would you need to redesign HVMLite based on hardware solutions?
>
> That's what happened to Xen PV, right ? Are we sure 5 years from now we won't
> have any new hardware virtualization features that will just obsolete HVMLite?
There was no hardware virtualization when Xen PV came about.
If there is hardware virtualization that obsoletes HVMLite that means
it would also obsolete KVM and HVM mode - as HVMLite runs in a VMX
container - the same type that KVM and Xen HVM guests run in.
>
> > The entry point and the CPU state are pretty well known - it is akin
> > to what GRUB2 bootloader path is (protected mode).
> > > where we are with PVH today. Likewise if other paravirtualization
> > > developers want to support Linux and want to copy your strategy they'd
> > > add yet-another-entry-point as well.
> > >
> > > This is dumb.
> >
> > Are you saying the EFI entry point is dumb? That instead the EFI
> > firmware should understand Linux bootparams and boot that?
>
> EFI is a standard. Xen is not. And since we are not talking about legacy
And is a standard something that has to come out of a committee?
If so, then Linux bootparams is not a standard. Nor is LILO bootup
path.
> hardware in the future, EFI seems like a sensible option to consider for an
> entry point. Especially given that it may mean that we can ultimately also help
> unify more entry points on Linux in general. I'd prefer to consider using
<chokes>
I can just see that. On non-EFI hardware GRUB2/SYSLINUX would use the EFI entry
point and create a fake firmware.
> EFI configuration tables instead of extending the x86 boot protocol.
What is that? Are you talking about EFI runtime services? Take a look
at the EFI spec and see what you have to implement to emulate this.
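
For context on the term: EFI configuration tables are not runtime services. They are an array of (GUID, pointer) pairs hanging off the EFI system table, which the OS walks to find things like the ACPI tables. A rough sketch of such a lookup follows. The struct layouts mirror the spec's EFI_GUID and EFI_CONFIGURATION_TABLE, but the type and function names here are simplified stand-ins, and the idea that a hypervisor would publish its start info under its own GUID is only an assumption about what is being proposed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for EFI_GUID (16 bytes, no padding). */
struct guid {
    uint32_t d1;
    uint16_t d2;
    uint16_t d3;
    uint8_t  d4[8];
};

/* Simplified stand-in for EFI_CONFIGURATION_TABLE. */
struct config_table_entry {
    struct guid vendor_guid;   /* identifies what the table is */
    void *vendor_table;        /* points at the table itself */
};

/* Walk the array and return the table registered under 'want', or NULL. */
void *find_config_table(const struct config_table_entry *tables, size_t n,
                        const struct guid *want)
{
    assert(tables != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (memcmp(&tables[i].vendor_guid, want, sizeof(*want)) == 0)
            return tables[i].vendor_table;
    }
    return NULL;
}
```

The lookup itself is trivial. The cost raised above is everything that has to exist before it: something must build a system table and the services around it for the guest.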
>
> > > > > If we had a single 'type' and 'custom data' passed to the kernel that
> > > > > should suffice for the default Linux entry point to just pivot off
> > > > > of that and do what it needs without more entry points. Once.
> > > >
> > > > And what about ramdisk? What about multiple ramdisks?
> > > > What about command line? All of that is what bootparams
> > > > tries to unify on Linux. But 'bootparams' is unique to Linux,
> > > > it does not exist on FreeBSD. Hence some stub code to transplant
> > > > OS-agnostic simple data to OS-specific is necessary.
> > >
> > > If we had a Xen ABI option where *all* that I'm asking is you pass
> > > first:
> > >
> > > a) hypervisor type
> >
> > Why can't you use cpuid.
>
> I'll evaluate that.
>
> > > b) custom data pointer
> >
> > What is this custom data pointer you speak of?
>
> For Xen this is the xen_start_info, the structure that Xen stuffs in
> a copy of its version of what we need to fill the boot_params.
Ok, but that is what we do provide, in some way.
I am lost here. You seem to be saying you want something that is
already there?
>
> > > We'd be able to avoid adding *any* entry point and just address
> > > the requirements as I noted with pre / post stubs for the type.
> >
> > But you need some entry point to call into Linux. Are you
> > suggesting to use the existing ones? No, the existing one
> > wouldn't understand this.
>
> If we used the boot_params, yes it would be possible.
...OS agnostic... they are not.
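
To make the two positions concrete, here is a sketch of the "hypervisor type plus custom data pointer" hand-off with a per-type pre stub, as far as it can be read from this thread. Every name below is hypothetical: none of it is an existing Linux or Xen ABI, and the two *_stub structs merely stand in for the real boot_params and Xen start info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum boot_hv_type { BOOT_HV_NONE = 0, BOOT_HV_XEN = 1, BOOT_HV_MAX };

/* What the hypervisor or loader would hand over (hypothetical). */
struct boot_handoff {
    uint32_t hv_type;    /* a) hypervisor type */
    const void *data;    /* b) custom data pointer */
};

/* Stand-in for the OS-agnostic start info the hypervisor provides. */
struct xen_start_info_stub { const char *cmdline; };

/* Stand-in for Linux's OS-specific boot_params. */
struct boot_params_stub { const char *cmdline; };

typedef int (*pre_stub_fn)(const void *data, struct boot_params_stub *bp);

/* Bare metal: nothing to translate. */
static int none_pre_stub(const void *data, struct boot_params_stub *bp)
{
    (void)data;
    (void)bp;
    return 0;
}

/* Xen: copy the OS-agnostic fields into the OS-specific structure. */
static int xen_pre_stub(const void *data, struct boot_params_stub *bp)
{
    const struct xen_start_info_stub *si = data;

    if (si == NULL)
        return -1;
    bp->cmdline = si->cmdline;
    return 0;
}

static const pre_stub_fn pre_stubs[BOOT_HV_MAX] = {
    [BOOT_HV_NONE] = none_pre_stub,
    [BOOT_HV_XEN]  = xen_pre_stub,
};

/* Single common entry: pivot on the type, then continue the normal path. */
int common_entry(const struct boot_handoff *h, struct boot_params_stub *bp)
{
    assert(h != NULL && bp != NULL);
    if (h->hv_type >= BOOT_HV_MAX)
        return -1;
    return pre_stubs[h->hv_type](h->data, bp);
}
```

Note that the translation in xen_pre_stub() is still OS-specific code on the Linux side. The sketch moves the stub behind one entry point; it does not make boot_params OS agnostic.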
>
> > > This would require an x86 boot protocol bump, but all the issues
> > > creeping up randomly I think that's worth putting on the table now.
> >
> > Aaaah, so you are saying expand the bootparams. In other words
> > make Xen ABI call into Linux using the bootparams structure, similar
> > to how GRUB2 does it.
> >
> > How is that OS agnostic?
>
> That's an issue, I understand. EFI is OS agnostic though.
>
> > > And maybe we don't want it to be hypervisor specific, perhaps there are other
> > > *needs* for custom pre-post startup_32()/startup_64() stubs.
> >
> > Multiboot?
>
> Can you elaborate?
Google Multiboot specification.
>
> > > To avoid extending boot_params further I figured perhaps we can look
> > > at EFI as another option instead. If we are going to drop all legacy
> >
> > But EFI support is _huge_.
>
> I get the sense now. Perhaps we should explore to what extent now really
> at the Hackathon.
Print out the EFI spec and carry it on the plane. The plane will tilt
to one side when trying to take off.
>
> > > PV support from the kernel (not the hypervisor) and require hardware
> > > virtualization 5 years from now on the Linux kernel, it doesn't seem
> > > to me far fetched to at the very least consider using an EFI entry
> > > instead, especially since all it does is set boot params and we can
> > > re-use this for HVMLite too.
> >
> > But to make that work you have to emulate EFI firmware in the
> > hypervisor. Is that work you are signing up for?
>
> I'll do what is needed, as I have done before. If EFI is on the long
> term roadmap for ARM perhaps there are a few birds to knock with one
> stone here. If there is also interest to support other OSes through
> EFI standard means this also should help make that easier.
>
> Luis