Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

From: Thomas Gleixner
Date: Wed Nov 11 2020 - 20:41:32 EST


Ashok,

On Wed, Nov 11 2020 at 15:03, Ashok Raj wrote:
> On Wed, Nov 11, 2020 at 11:27:28PM +0100, Thomas Gleixner wrote:
>> which is the obvious sane and safe logic. But sure, why am I asking for
>> sane and safe in the context of virtualization?
>
> We can pick how to solve this, and we're just waiting for you to tell us
> what mechanism you prefer that's less painful and architecturally
> acceptable for virtualization and Linux. We are all ears!

Obviously we can't turn back the clock. The point I was trying to make is
that the general approach of just bolting things on top of the existing
maze is bad in general.

Opt-out bits are error prone simply because anything which exists before
that point does not know that it should set that bit. Obvious, right?

CPUID bits are 'Feature available' and not 'Feature no longer
available' for a reason.

With the introduction of VT this stringent road was abandoned and the
approach became: Don't tell the guest OS that it's not running on bare
metal.

That's a perfectly fine approach for running existing legacy OSes which
do not care at all because they don't know anything about this
newfangled stuff.

But it falls flat on its nose for anything which comes after that
point, simply because there is no reliable way to tell in which context
the OS runs.

The VMM can decide not to set, or may simply lack support for setting,
the software CPUID bit which tells the guest OS that it does NOT run on
bare metal, and still hand in newfangled PCI devices for which the guest
OS happens to have a driver. That driver then falls flat on its nose
because some magic functionality is not there.

So we have the following matrix:

   VMM    Guest OS
   Old    Old       -> Fine, does not support any of that
   New    Old       -> Fine, does not support any of that
   New    New       -> Fine, works as expected
   Old    New       -> FAIL

To fix this we have to come up with heuristics again to figure out which
context we are running in and whether some magic feature can be
supported or not:

probably_on_bare_metal()
{
        if (CPUID(FEATURE_HYPERVISOR))
                return false;
        if (dmi_match_hypervisor_vendor())
                return false;

        return PROBABLY_RUNNING_ON_BARE_METAL;
}
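
For reference, a minimal user-space sketch of what that heuristic boils
down to on x86, assuming GCC's <cpuid.h>; the function names are mine,
not existing kernel API. CPUID.1:ECX[31] is the software-defined
'hypervisor present' bit, and leaf 0x40000000 carries the hypervisor
vendor signature. Both are only there if the VMM chooses to expose them,
which is exactly the problem:

#include <cpuid.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* CPUID.1:ECX[31]: only set if the VMM chooses to advertise itself. */
static bool cpuid_hypervisor_bit(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                return false;
        return ecx & (1U << 31);
}

int main(void)
{
        unsigned int eax, ebx, ecx, edx;
        char sig[13] = { 0 };

        if (!cpuid_hypervisor_bit()) {
                /* Or the VMM hides the bit. No way to tell which. */
                printf("probably on bare metal\n");
                return 0;
        }

        /* Leaf 0x40000000 is reserved for the hypervisor vendor id. */
        __cpuid(0x40000000, eax, ebx, ecx, edx);
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        printf("hypervisor signature: %s\n", sig);
        return 0;
}

The kernel's init_hypervisor_platform() does a more elaborate version of
this, and the dmi_match_hypervisor_vendor() fallback sketched above would
match vendor strings like 'QEMU' or 'VMware'. None of it is authoritative:
a VMM which hides both the bit and the signature defeats all of it.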

Yes, it probably works in most cases, but it still works by chance, and
that's what I really hate about this; indeed 'hate' is not a strong
enough word.

Why on earth did VT not introduce a reliable way (instruction, CPUID
leaf, MSR, whatever, which can't be manipulated by the VMM) to let the
OS figure out where it runs?

Just because the general approach to these problems is: We can fix that
in software.

No, you can't fix inconsistency in software at all.

This is not the first time that we tell HW folks to stop this 'Fix this
in software' attitude which has caused more problems than it solved.

And you can argue in circles until you are blue in the face; that
inconsistency is not going away.

Every time new (mis)features are added which need the OS to be aware of
whether it runs on bare metal or in a VM, we have this unsolvable dance
of requiring that the underlying VMM tell the guest OS NOT to use them,
instead of having the guest OS make the simple decision:

        if (!definitely_on_bare_metal())
                return -ENOTSUPP;

or with a newer version of the guest OS:

        if (!definitely_on_bare_metal() && !hypervisor->supportsthis())
                return -ENOTSUPP;
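
In kernel terms, and specific to the IMS case this thread is about, that
second pattern would look roughly like the sketch below.
boot_cpu_has(X86_FEATURE_HYPERVISOR) is the real feature flag behind the
CPUID check above; hypervisor_supports_ims() is invented here as a
stand-in for whatever positive opt-in mechanism the VMM would provide:

/*
 * Sketch only: hypervisor_supports_ims() does not exist, it stands in
 * for a positive capability query exposed by the hypervisor.
 */
static int pci_ims_supported(void)
{
        /* Bare metal, as far as we can tell: IMS is usable. */
        if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
                return 0;

        /* In a guest: refuse unless the VMM positively opts in. */
        if (!hypervisor_supports_ims())
                return -ENOTSUPP;

        return 0;
}

The crucial property is the default: absence of information means 'not
supported', so an old VMM which never heard of IMS lands in the safe
branch instead of the FAIL row of the matrix above.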

I'm halfway content to go with the above probably_on_bare_metal()
function as a replacement for definitely_on_bare_metal() going forward,
but only for the very simple reason that it is the only option we
have.

Thanks,

tglx