Re: [REPOST PATCH] arm/arm64: KVM: Add PSCI version selection API

From: Marc Zyngier
Date: Mon Apr 09 2018 - 09:21:07 EST


On 09/04/18 14:05, Christoffer Dall wrote:
> On Mon, Apr 09, 2018 at 01:47:50PM +0100, Marc Zyngier wrote:
>> +Drew, who's looked at the whole save/restore thing extensively
>>
>> On 09/04/18 13:30, Christoffer Dall wrote:
>>> On Thu, Mar 15, 2018 at 07:26:48PM +0000, Marc Zyngier wrote:
>>>> On 15/03/18 19:13, Peter Maydell wrote:
>>>>> On 15 March 2018 at 19:00, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>>>>> On 06/03/18 09:21, Andrew Jones wrote:
>>>>>>> On Mon, Mar 05, 2018 at 04:47:55PM +0000, Peter Maydell wrote:
>>>>>>>> On 2 March 2018 at 11:11, Marc Zyngier <marc.zyngier@xxxxxxx> wrote:
>>>>>>>>> On Fri, 02 Mar 2018 10:44:48 +0000,
>>>>>>>>> Auger Eric wrote:
>>>>>>>>>> I understand the get/set is called as part of the migration process.
>>>>>>>>>> So my understanding is the benefit of this series is migration fails in
>>>>>>>>>> those cases:
>>>>>>>>>>
>>>>>>>>>> >=0.2 source -> 0.1 destination
>>>>>>>>>> 0.1 source -> >=0.2 destination
>>>>>>>>>
>>>>>>>>> It also fails in the case where you migrate a 1.0 guest to something
>>>>>>>>> that cannot support it.
>>>>>>>>
>>>>>>>> I think it would be useful if we could write out the various
>>>>>>>> combinations of source, destination and what we expect/want to
>>>>>>>> have happen. My gut feeling here is that we're sacrificing
>>>>>>>> exact migration compatibility in favour of having the guest
>>>>>>>> automatically get the variant-2 mitigations, but it's not clear
>>>>>>>> to me exactly which migration combinations that's intended to
>>>>>>>> happen for. Marc?
>>>>>>>>
>>>>>>>> If this wasn't a mitigation issue the desired behaviour would be
>>>>>>>> straightforward:
>>>>>>>> * kernel should default to 0.2 on the basis that
>>>>>>>> that's what it did before
>>>>>>>> * new QEMU version should enable 1.0 by default for virt-2.12
>>>>>>>> and 0.2 for virt-2.11 and earlier
>>>>>>>> * PSCI version info shouldn't appear in migration stream unless
>>>>>>>> it's something other than 0.2
>>>>>>>> But that would leave some setups (which?) unnecessarily without the
>>>>>>>> mitigation, so we're not doing that. The question is, exactly
>>>>>>>> what *are* we aiming for?
>>>>>>>
>>>>>>> The reason Marc dropped this patch from the series it was first introduced
>>>>>>> in was because we didn't have the aim 100% understood. We want the
>>>>>>> mitigation by default, but also to have the least chance of migration
>>>>>>> failure, and when we must fail (because we're not doing the
>>>>>>> straightforward approach listed above, which would prevent failures), then
>>>>>>> we want to fail with the least amount of damage to the user.
>>>>>>>
>>>>>>> I experimented with a couple different approaches and provided tables[1]
>>>>>>> with my results. I even recommended an approach, but I may have changed
>>>>>>> my mind after reading Marc's follow-up[2]. The thread continues from
>>>>>>> there as well with follow-ups from Christoffer, Marc, and myself. Anyway,
>>>>>>> Marc did this repost for us to debate it and work out the best approach
>>>>>>> here.
>>>>>> It doesn't look like we've made much progress on this, which makes me
>>>>>> think that we probably don't need anything of the like.
>>>>>
>>>>> I was waiting for a better explanation from you of what we're trying to
>>>>> achieve. If you want to take the "do nothing" approach then a list
>>>>> of what migrations succeed/fail/break in that case would also
>>>>> be useful.
>>>>>
>>>>> (I am somewhat lazily trying to avoid having to spend time reverse
>>>>> engineering the "what are we trying to do and what effects are
>>>>> we accepting" parts from the patch and the code that's already gone
>>>>> into the kernel.)
>>>>
>>>> OK, let me (re)state the problem:
>>>>
>>>> For a guest that requests PSCI 0.2 (i.e. all guests from the past 4 or 5
>>>> years), we now silently upgrade the PSCI version to 1.0 allowing the new
>>>> SMCCC to be discovered, and the ARCH_WORKAROUND_1 service to be called.
>>>>
>>>> Things get funny, especially with migration (and the way QEMU works).
>>>>
>>>> If we "do nothing":
>>>>
>>>> (1) A guest migrating from an "old" host to a "new" host will silently
>>>> see its PSCI version upgraded. Not a big deal in my opinion, as 1.0 is a
>>>> strict superset of 0.2 (apart from the version number...).
>>>>
>>>> (2) A guest migrating from a "new" host to an "old" host will silently
>>>> lose its Spectre v2 mitigation. That's quite a big deal.
>>>>
>>>> (3, not related to migration) A guest having a hardcoded knowledge of
>>>> PSCI 0.2 will see that we've changed something, and may decide to catch
>>>> fire. Oh well.
>>>>
>>>> If we take this patch:
>>>>
>>>> (1) still exists
>>>
>>> No problem, IMHO.
>>>
>>>>
>>>> (2) will now fail to migrate. I see this as a feature.
>>>
>>> Yes, I agree. This is actually the most important reason for doing
>>> anything beyond what's already merged.
>>
>> Indeed, and that's the reason I wrote this patch in the first place.
>>
>>>
>>>>
>>>> (3) can be worked around by setting the "PSCI version pseudo register"
>>>> to 0.2.
>>>
>>> Nice to have, but we're probably not expecting this to be of major
>>> concern. I initially thought it was a nice debugging feature as well,
>>> but that may be a ridiculous point.
>>>
>>>>
>>>> These are the main things I can think of at the moment.
>>>
>>> So I think we should merge this patch.
>>>
>>> If userspace then wants to support "migrate from explicitly set v0.2 new
>>> kernel to old kernel", then it must add specific support to filter out
>>> the register from the register list; not that I think anyone will need
>>> that or bother to implement it.
>>>
>>> In other words, I think you should merge this:
>>>
>>> Reviewed-by: Christoffer Dall <cdall@xxxxxxxxxx>
>>>
>>
>> Thanks. One issue is that we've now missed the 4.16 train, and that this
>> effectively is an ABI change (a fairly minor one, but still). Would we
>> consider slapping this as a retrospective fix to 4.16-stable, or keep it
>> as a 4.17 feature?
>
> Given that it fixes a potentially dangerous migration, and it's a fairly
> simple patch, I think it's reasonable to apply as a fix to the next 4.16
> release. Would we be violating any hard-set rules in doing so?

I don't think so, but I'd welcome comments on it.

If nobody shouts by the end of the week, I'll send it in as a fix for
4.17, earmarked for 4.16 backport.
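
For reference, the migration cases (1) and (2) quoted above can be
sketched as a small model. This is only an illustration, not code from
the patch: host_kind, outcome, default_psci_version() and
model_migration() are made-up names, and the assumption that a patched
migration fails because the old destination cannot restore the
pseudo-register is my reading of the discussion. The version encoding
(major in bits [31:16], minor in bits [15:0]) follows the PSCI
specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* PSCI_VERSION return value: major in bits [31:16], minor in bits [15:0]. */
#define PSCI_VERSION(maj, min) \
	((((uint32_t)(maj)) << 16) | ((uint32_t)(min) & 0xffffu))
#define PSCI_0_2	PSCI_VERSION(0, 2)
#define PSCI_1_0	PSCI_VERSION(1, 0)

/* Hypothetical labels for this sketch, not kernel names. */
enum host_kind { HOST_OLD, HOST_NEW };

enum outcome {
	MIG_OK,			/* guest sees the same PSCI version */
	MIG_OK_UPGRADED,	/* case (1): 0.2 silently becomes 1.0 */
	MIG_SILENT_DOWNGRADE,	/* case (2) without the patch */
	MIG_FAIL,		/* case (2) with the patch */
};

/*
 * Version seen by a guest that requested PSCI 0.2 and whose userspace
 * never touched the version pseudo-register.
 */
static uint32_t default_psci_version(enum host_kind host)
{
	return host == HOST_NEW ? PSCI_1_0 : PSCI_0_2;
}

static enum outcome model_migration(enum host_kind src, enum host_kind dst,
				    bool patch_applied)
{
	uint32_t before = default_psci_version(src);
	uint32_t after = default_psci_version(dst);

	if (after == before)
		return MIG_OK;
	if (after > before)
		return MIG_OK_UPGRADED;

	/*
	 * New -> old. Assumption: with the patch, the version register is
	 * part of the saved state and the old host refuses to restore it.
	 */
	return patch_applied ? MIG_FAIL : MIG_SILENT_DOWNGRADE;
}
```

In this model, model_migration(HOST_NEW, HOST_OLD, true) yields MIG_FAIL
where the unpatched case yields MIG_SILENT_DOWNGRADE, which is the
behaviour change being argued for. Userspace filtering the register out
of the list, as Christoffer mentions, is deliberately left out.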

Thanks,

M.
--
Jazz is not dead. It just smells funny...