Re: [PATCH v9 12/13] KVM: PPC: Add support for IOMMU in-kernel handling

From: Alexey Kardashevskiy
Date: Thu Sep 05 2013 - 19:38:33 EST


On 09/06/2013 04:10 AM, Gleb Natapov wrote:
> On Wed, Sep 04, 2013 at 02:01:28AM +1000, Alexey Kardashevskiy wrote:
>> On 09/03/2013 08:53 PM, Gleb Natapov wrote:
>>> On Mon, Sep 02, 2013 at 01:14:29PM +1000, Alexey Kardashevskiy wrote:
>>>> On 09/01/2013 10:06 PM, Gleb Natapov wrote:
>>>>> On Wed, Aug 28, 2013 at 06:50:41PM +1000, Alexey Kardashevskiy wrote:
>>>>>> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
>>>>>> and H_STUFF_TCE requests targeted an IOMMU TCE table without passing
>>>>>> them to user space which saves time on switching to user space and back.
>>>>>>
>>>>>> Both real and virtual modes are supported. The kernel tries to
>>>>>> handle a TCE request in the real mode, if fails it passes the request
>>>>>> to the virtual mode to complete the operation. If it a virtual mode
>>>>>> handler fails, the request is passed to user space.
>>>>>>
>>>>>> The first user of this is VFIO on POWER. Trampolines to the VFIO external
>>>>>> user API functions are required for this patch.
>>>>>>
>>>>>> This adds a "SPAPR TCE IOMMU" KVM device to associate a logical bus
>>>>>> number (LIOBN) with an VFIO IOMMU group fd and enable in-kernel handling
>>>>>> of map/unmap requests. The device supports a single attribute which is
>>>>>> a struct with LIOBN and IOMMU fd. When the attribute is set, the device
>>>>>> establishes the connection between KVM and VFIO.
>>>>>>
>>>>>> Tests show that this patch increases transmission speed from 220MB/s
>>>>>> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
>>>>>>
>>>>>> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Changes:
>>>>>> v9:
>>>>>> * KVM_CAP_SPAPR_TCE_IOMMU ioctl to KVM replaced with "SPAPR TCE IOMMU"
>>>>>> KVM device
>>>>>> * release_spapr_tce_table() is not shared between different TCE types
>>>>>> * reduced the patch size by moving VFIO external API
>>>>>> trampolines to separate patche
>>>>>> * moved documentation from Documentation/virtual/kvm/api.txt to
>>>>>> Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>
>>>>>> v8:
>>>>>> * fixed warnings from check_patch.pl
>>>>>>
>>>>>> 2013/07/11:
>>>>>> * removed multiple #ifdef IOMMU_API as IOMMU_API is always enabled
>>>>>> for KVM_BOOK3S_64
>>>>>> * kvmppc_gpa_to_hva_and_get also returns host phys address. Not much sense
>>>>>> for this here but the next patch for hugepages support will use it more.
>>>>>>
>>>>>> 2013/07/06:
>>>>>> * added realmode arch_spin_lock to protect TCE table from races
>>>>>> in real and virtual modes
>>>>>> * POWERPC IOMMU API is changed to support real mode
>>>>>> * iommu_take_ownership and iommu_release_ownership are protected by
>>>>>> iommu_table's locks
>>>>>> * VFIO external user API use rewritten
>>>>>> * multiple small fixes
>>>>>>
>>>>>> 2013/06/27:
>>>>>> * tce_list page is referenced now in order to protect it from accident
>>>>>> invalidation during H_PUT_TCE_INDIRECT execution
>>>>>> * added use of the external user VFIO API
>>>>>>
>>>>>> 2013/06/05:
>>>>>> * changed capability number
>>>>>> * changed ioctl number
>>>>>> * update the doc article number
>>>>>>
>>>>>> 2013/05/20:
>>>>>> * removed get_user() from real mode handlers
>>>>>> * kvm_vcpu_arch::tce_tmp usage extended. Now real mode handler puts there
>>>>>> translated TCEs, tries realmode_get_page() on those and if it fails, it
>>>>>> passes control over the virtual mode handler which tries to finish
>>>>>> the request handling
>>>>>> * kvmppc_lookup_pte() now does realmode_get_page() protected by BUSY bit
>>>>>> on a page
>>>>>> * The only reason to pass the request to user mode now is when the user mode
>>>>>> did not register TCE table in the kernel, in all other cases the virtual mode
>>>>>> handler is expected to do the job
>>>>>> ---
>>>>>> .../virtual/kvm/devices/spapr_tce_iommu.txt | 37 +++
>>>>>> arch/powerpc/include/asm/kvm_host.h | 4 +
>>>>>> arch/powerpc/kvm/book3s_64_vio.c | 310 ++++++++++++++++++++-
>>>>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 122 ++++++++
>>>>>> arch/powerpc/kvm/powerpc.c | 1 +
>>>>>> include/linux/kvm_host.h | 1 +
>>>>>> virt/kvm/kvm_main.c | 5 +
>>>>>> 7 files changed, 477 insertions(+), 3 deletions(-)
>>>>>> create mode 100644 Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>>
>>>>>> diff --git a/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>> new file mode 100644
>>>>>> index 0000000..4bc8fc3
>>>>>> --- /dev/null
>>>>>> +++ b/Documentation/virtual/kvm/devices/spapr_tce_iommu.txt
>>>>>> @@ -0,0 +1,37 @@
>>>>>> +SPAPR TCE IOMMU device
>>>>>> +
>>>>>> +Capability: KVM_CAP_SPAPR_TCE_IOMMU
>>>>>> +Architectures: powerpc
>>>>>> +
>>>>>> +Device type supported: KVM_DEV_TYPE_SPAPR_TCE_IOMMU
>>>>>> +
>>>>>> +Groups:
>>>>>> + KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE
>>>>>> + Attributes: single attribute with pair { LIOBN, IOMMU fd}
>>>>>> +
>>>>>> +This is completely made up device which provides API to link
>>>>>> +logical bus number (LIOBN) and IOMMU group. The user space has
>>>>>> +to create a new SPAPR TCE IOMMU device per a logical bus.
>>>>>> +
>>>>> Why not have one device that can handle multimple links?
>>>>
>>>>
>>>> I can do that. If I make it so, it won't even look as a device at all, just
>>>> some weird interface to KVM but ok. What bothers me is it is just a
>>> May be I do not understand usage pattern here. Why do you feel that device
>>> that can handle multiple links is worse than device per link? How many logical
>>> buses is there usually? How often they created/destroyed? I am not insisting
>>> on the change, just trying to understand why you do not like it.
>>
>>
>> Is it usually one PCI host bus adapter per IOMMU group which is usually
>> one PCI card or 2-3 cards if it is a legacy PCI-X, and they are created
>> when QEMU-KVM starts. Not many. And they live till KVM ends.
>>
>> My point is why would I want to put all links to one device? It all is just
>> a matter of taste and nothing more. Or I am missing something but I do not
>> see what. If it is all about making thing to be kosher/halal/orthodox, then
>> I have more stuff to do, like reworking the emulated TCEs. But if is it for
>> (I do not know, just guessing) performance or something like that - then
>> I'll fix it, I just need to know what I am fixing.
>>
> Each device creates an fd, if you can have a lot of them eventually this
> will be a bottleneck. You are saying this is not the case, so lets go
> with proposed interface.


Did you decide not to answer the email which Ben sent yesterday or you just
did not see it? Just checking :)



--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/