Re: [PATCH 1/4] KVM: PPC: Add support for multiple-TCE hcalls

From: Alexander Graf
Date: Mon Jun 17 2013 - 04:02:51 EST



On 17.06.2013, at 09:55, Alexey Kardashevskiy wrote:

> On 06/17/2013 08:06 AM, Alexander Graf wrote:
>>
>> On 05.06.2013, at 08:11, Alexey Kardashevskiy wrote:
>>
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI. These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>>>
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in virtual
>>> mode in case the real mode handler fails for some reason.
>>>
>>> This adds a guest physical to host real address converter
>>> and calls the existing H_PUT_TCE handler. The converting function
>>> is going to be fully utilized by upcoming VFIO supporting patches.
>>>
>>> This also implements the KVM_CAP_PPC_MULTITCE capability,
>>> so in order to support the functionality of this patch, QEMU
>>> needs to query for this capability and set the "hcall-multi-tce"
>>> hypertas property only if the capability is present, otherwise
>>> there will be serious performance degradation.
>>>
>>> Cc: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
>>> Signed-off-by: Alexey Kardashevskiy <aik@xxxxxxxxx>
>>> Signed-off-by: Paul Mackerras <paulus@xxxxxxxxx>
>>
>> Only a few minor nits. Ben already commented on implementation details.
>>
>>>
>>> ---
>>> Changelog:
>>> 2013/06/05:
>>> * fixed typo about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>>
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed; instead we do not even start
>>> writing to the TCE table if we cannot get TCEs from the user or they
>>> are invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce
>>> and kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with fallthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)
>>> ---
>>> Documentation/virtual/kvm/api.txt | 17 ++
>>> arch/powerpc/include/asm/kvm_host.h | 2 +
>>> arch/powerpc/include/asm/kvm_ppc.h | 16 +-
>>> arch/powerpc/kvm/book3s_64_vio.c | 118 ++++++++++++++
>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 266 +++++++++++++++++++++++++++----
>>> arch/powerpc/kvm/book3s_hv.c | 39 +++++
>>> arch/powerpc/kvm/book3s_hv_rmhandlers.S | 6 +
>>> arch/powerpc/kvm/book3s_pr_papr.c | 37 ++++-
>>> arch/powerpc/kvm/powerpc.c | 3 +
>>> include/uapi/linux/kvm.h | 1 +
>>> 10 files changed, 473 insertions(+), 32 deletions(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>> index 5f91eda..6c082ff 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,23 @@ calls by the guest for that service will be passed to userspace to be
>>> handled.
>>>
>>>
>>> +4.83 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability tells the guest that the kernel supports handling of
>>> +hypercalls that add/remove multiple TCE entries at once. This significantly
>>> +accelerates DMA operations for PPC KVM guests.
>>> +
>>> +Unlike other capabilities in this section, this one does not have an ioctl.
>>> +Instead, when the capability is present, the H_PUT_TCE_INDIRECT and
>>> +H_STUFF_TCE hypercalls are to be handled in the host kernel and not passed to
>>> +the guest. Otherwise it might be better for the guest to continue using the
>>> +H_PUT_TCE hypercall (if KVM_CAP_SPAPR_TCE or KVM_CAP_SPAPR_TCE_IOMMU are present).
>>
>
>> While this describes perfectly well what the consequences are of the
>> patches, it does not describe properly what the CAP actually expresses.
>> The CAP only says "this kernel is able to handle H_PUT_TCE_INDIRECT and
>> H_STUFF_TCE hypercalls directly". All other consequences are nice to
>> document, but the semantics of the CAP are missing.
>
>
> ? It expresses ability to handle 2 hcalls. What is missing?

You don't describe the kvm <-> qemu interface; you only describe some of the decisions qemu can take based on this cap.

>
>
>> We also usually try to keep KVM behavior unchanged with regards to older
>> versions until a CAP is enabled. In this case I don't think it matters
>> all that much, so I'm fine with declaring it as enabled by default.
>> Please document that this is a change in behavior versus older KVM
>> versions though.
>
>
> Ok!
>
>
>>> +
>>> +
>>> 5. The kvm_run structure
>>> ------------------------
>>>
>>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>>> index af326cd..85d8f26 100644
>>> --- a/arch/powerpc/include/asm/kvm_host.h
>>> +++ b/arch/powerpc/include/asm/kvm_host.h
>>> @@ -609,6 +609,8 @@ struct kvm_vcpu_arch {
>>> spinlock_t tbacct_lock;
>>> u64 busy_stolen;
>>> u64 busy_preempt;
>>> +
>>> + unsigned long *tce_tmp; /* TCE cache for H_PUT_TCE_INDIRECT hcall */
>>> #endif
>>> };
>>
>> [...]
>>>
>>>
>>
>> [...]
>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>>> index 550f592..a39039a 100644
>>> --- a/arch/powerpc/kvm/book3s_hv.c
>>> +++ b/arch/powerpc/kvm/book3s_hv.c
>>> @@ -568,6 +568,30 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>> ret = kvmppc_xics_hcall(vcpu, req);
>>> break;
>>> } /* fallthrough */
>>
>> The fallthrough comment isn't accurate anymore.
>>
>>> + return RESUME_HOST;
>>> + case H_PUT_TCE:
>>> + ret = kvmppc_virtmode_h_put_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> + kvmppc_get_gpr(vcpu, 5),
>>> + kvmppc_get_gpr(vcpu, 6));
>>> + if (ret == H_TOO_HARD)
>>> + return RESUME_HOST;
>>> + break;
>>> + case H_PUT_TCE_INDIRECT:
>>> + ret = kvmppc_virtmode_h_put_tce_indirect(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> + kvmppc_get_gpr(vcpu, 5),
>>> + kvmppc_get_gpr(vcpu, 6),
>>> + kvmppc_get_gpr(vcpu, 7));
>>> + if (ret == H_TOO_HARD)
>>> + return RESUME_HOST;
>>> + break;
>>> + case H_STUFF_TCE:
>>> + ret = kvmppc_virtmode_h_stuff_tce(vcpu, kvmppc_get_gpr(vcpu, 4),
>>> + kvmppc_get_gpr(vcpu, 5),
>>> + kvmppc_get_gpr(vcpu, 6),
>>> + kvmppc_get_gpr(vcpu, 7));
>>> + if (ret == H_TOO_HARD)
>>> + return RESUME_HOST;
>>> + break;
>>> default:
>>> return RESUME_HOST;
>>> }
>>> @@ -958,6 +982,20 @@ struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned int id)
>>> vcpu->arch.cpu_type = KVM_CPU_3S_64;
>>> kvmppc_sanity_check(vcpu);
>>>
>>> + /*
>>> + * As we want to minimize the chance of having H_PUT_TCE_INDIRECT
>>> + * half executed, we first read TCEs from the user, check them and
>>> + * return error if something went wrong and only then put TCEs into
>>> + * the TCE table.
>>> + *
>>> + * tce_tmp is a cache for TCEs to avoid stack allocation or a
>>> + * kmalloc on every hcall, as the whole TCE list can take up to
>>> + * 512 items of 8 bytes each (4096 bytes).
>>> + */
>>> + vcpu->arch.tce_tmp = kmalloc(4096, GFP_KERNEL);
>>> + if (!vcpu->arch.tce_tmp)
>>> + goto free_vcpu;
>>> +
>>> return vcpu;
>>>
>>> free_vcpu:
>>> @@ -980,6 +1018,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>>> unpin_vpa(vcpu->kvm, &vcpu->arch.slb_shadow);
>>> unpin_vpa(vcpu->kvm, &vcpu->arch.vpa);
>>> spin_unlock(&vcpu->arch.vpa_update_lock);
>>> + kfree(vcpu->arch.tce_tmp);
>>> kvm_vcpu_uninit(vcpu);
>>> kmem_cache_free(kvm_vcpu_cache, vcpu);
>>> }
>>> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> index b02f91e..d35554e 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
>>> @@ -1490,6 +1490,12 @@ hcall_real_table:
>>> .long 0 /* 0x11c */
>>> .long 0 /* 0x120 */
>>> .long .kvmppc_h_bulk_remove - hcall_real_table
>>> + .long 0 /* 0x128 */
>>> + .long 0 /* 0x12c */
>>> + .long 0 /* 0x130 */
>>> + .long 0 /* 0x134 */
>>> + .long .kvmppc_h_stuff_tce - hcall_real_table
>>> + .long .kvmppc_h_put_tce_indirect - hcall_real_table
>>> hcall_real_table_end:
>>>
>>> ignore_hdec:
>>> diff --git a/arch/powerpc/kvm/book3s_pr_papr.c b/arch/powerpc/kvm/book3s_pr_papr.c
>>> index da0e0bc..91d4b45 100644
>>> --- a/arch/powerpc/kvm/book3s_pr_papr.c
>>> +++ b/arch/powerpc/kvm/book3s_pr_papr.c
>>> @@ -220,7 +220,38 @@ static int kvmppc_h_pr_put_tce(struct kvm_vcpu *vcpu)
>>> unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>> long rc;
>>>
>>> - rc = kvmppc_h_put_tce(vcpu, liobn, ioba, tce);
>>> + rc = kvmppc_virtmode_h_put_tce(vcpu, liobn, ioba, tce);
>>> + if (rc == H_TOO_HARD)
>>> + return EMULATE_FAIL;
>>> + kvmppc_set_gpr(vcpu, 3, rc);
>>> + return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_put_tce_indirect(struct kvm_vcpu *vcpu)
>>> +{
>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>> + unsigned long tce = kvmppc_get_gpr(vcpu, 6);
>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>> + long rc;
>>> +
>>> + rc = kvmppc_virtmode_h_put_tce_indirect(vcpu, liobn, ioba,
>>> + tce, npages);
>>> + if (rc == H_TOO_HARD)
>>> + return EMULATE_FAIL;
>>> + kvmppc_set_gpr(vcpu, 3, rc);
>>> + return EMULATE_DONE;
>>> +}
>>> +
>>> +static int kvmppc_h_pr_stuff_tce(struct kvm_vcpu *vcpu)
>>> +{
>>> + unsigned long liobn = kvmppc_get_gpr(vcpu, 4);
>>> + unsigned long ioba = kvmppc_get_gpr(vcpu, 5);
>>> + unsigned long tce_value = kvmppc_get_gpr(vcpu, 6);
>>> + unsigned long npages = kvmppc_get_gpr(vcpu, 7);
>>> + long rc;
>>> +
>>> + rc = kvmppc_virtmode_h_stuff_tce(vcpu, liobn, ioba, tce_value, npages);
>>> if (rc == H_TOO_HARD)
>>> return EMULATE_FAIL;
>>> kvmppc_set_gpr(vcpu, 3, rc);
>>> @@ -247,6 +278,10 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
>>> return kvmppc_h_pr_bulk_remove(vcpu);
>>> case H_PUT_TCE:
>>> return kvmppc_h_pr_put_tce(vcpu);
>>> + case H_PUT_TCE_INDIRECT:
>>> + return kvmppc_h_pr_put_tce_indirect(vcpu);
>>> + case H_STUFF_TCE:
>>> + return kvmppc_h_pr_stuff_tce(vcpu);
>>> case H_CEDE:
>>> vcpu->arch.shared->msr |= MSR_EE;
>>> kvm_vcpu_block(vcpu);
>>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>>> index 6316ee3..8465c2a 100644
>>> --- a/arch/powerpc/kvm/powerpc.c
>>> +++ b/arch/powerpc/kvm/powerpc.c
>>> @@ -395,6 +395,9 @@ int kvm_dev_ioctl_check_extension(long ext)
>>> r = 1;
>>> break;
>>> #endif
>>> + case KVM_CAP_SPAPR_MULTITCE:
>>> + r = 1;
>>
>> This should only be true for book3s.
>
>
> We had this discussion with v2.
>
> David:
> ===
> So, in the case of MULTITCE, that's not quite right. PR KVM can
> emulate a PAPR system on a BookE machine, and there's no reason not to
> allow TCE acceleration as well. We can't make it dependent on PAPR
> mode being selected, because that's enabled per-vcpu, whereas these
> capabilities are queried on the VM before the vcpus are created.
> ===
>
> Wrong?

Partially. BookE cannot emulate a PAPR system as it stands today.

The code should of course be generic and available generically. But your patch only wires these hypercalls up in the book3s hypercall handlers, so this particular kernel version can only handle them on book3s.

Whether a later version of the kernel will be able to handle them is a different question.


Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/