Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.

From: Alexey Kardashevskiy
Date: Wed Dec 11 2019 - 21:39:53 EST




On 12/12/2019 09:47, Alexey Kardashevskiy wrote:
>
>
> On 12/12/2019 07:31, Michael Roth wrote:
>> Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
>>>
>>>
>>> On 11/12/2019 02:35, Ram Pai wrote:
>>>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
>>>>>
>>>>>
>>>>> On 10/12/2019 16:12, Ram Pai wrote:
>>>>>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 07/12/2019 12:12, Ram Pai wrote:
>>>>>>>> H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries, as one of
>>>>>>>> its parameters. On secure VMs, hypervisor cannot access the contents of
>>>>>>>> this page since it gets encrypted. Hence share the page with the
>>>>>>>> hypervisor, and unshare when done.
>>>>>>>
>>>>>>>
>>>>>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
>>>>>>> pages. There is a small problem that when DDW is enabled,
>>>>>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
>>>>>>> about the performance on slack but this is caused by initial cleanup of
>>>>>>> the default TCE window (which we do not use anyway) and to battle this
>>>>>>> we can simply reduce its size by adding
>>>>>>
>>>>>> Something that takes hardly any time with H_PUT_TCE_INDIRECT takes
>>>>>> 13 secs per device with the H_PUT_TCE approach, during boot. This is with a
>>>>>> 30GB guest. With a larger guest, the time will further deteriorate.
>>>>>
>>>>>
>>>>> No it will not, I checked. The time is the same for 2GB and 32GB guests:
>>>>> the delay is caused by clearing the small DMA window which is small by
>>>>> the space mapped (1GB) but quite huge in TCEs as it uses 4K pages; and
>>>>> for DDW window + emulated devices the IOMMU page size will be 2M/16M/1G
>>>>> (depends on the system) so the number of TCEs is much smaller.
>>>>
>>>> I can't get your results. What changes did you make to get them?
>>>
>>>
>>> Get what? I passed "-m 2G" and "-m 32G", got the same time - 13s spent
>>> in clearing the default window and the huge window took a fraction of a
>>> second to create and map.
>>
>> Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
>> of H_PUT_TCE everywhere?
>
>
> Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when
> fixed (I have it in my local branch), this does not make a difference.
>
>
>>
>> In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
>> iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
>> instant),
>
> PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
> exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
> and H_PUT_TCE_INDIRECT are present or neither.
>
>
>> and then force H_PUT_TCE for new mappings via something like:
>>
>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>> index 6ba081dd61c9..85d092baf17d 100644
>> --- a/arch/powerpc/platforms/pseries/iommu.c
>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>> @@ -194,6 +194,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>> unsigned long flags;
>>
>> if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
>> + if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || is_secure_guest()) {
>
>
> Nobody (including myself) seems to like the idea of having
> is_secure_guest() all over the place.
>
> And with KVM acceleration enabled, it is pretty fast anyway. Just now we
> do not have H_PUT_TCE in KVM/UV for secure guests but we will have to
> fix this for secure PCI passthrough anyway.
>
>
>> return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
>> direction, attrs);
>> }
>>
>> That seems like it would avoid the extra 13s.
>
> Or move around iommu_table_clear() which imho is just the right thing to do.


Huh. It is not the right thing as the firmware could have left mappings
there so we need cleanup. Even if I fixed SLOF, there is PowerVM, and I
do not know what it does about TCEs. Thanks,



--
Alexey