Re: [PATCH v10 11/16] s390/vfio-ap: allow hot plug/unplug of AP resources using mdev device

From: Tony Krowiak
Date: Mon Oct 05 2020 - 17:49:01 EST




On 10/5/20 2:30 PM, Halil Pasic wrote:
On Mon, 5 Oct 2020 12:24:39 -0400
Tony Krowiak <akrowiak@xxxxxxxxxxxxx> wrote:


On 9/27/20 9:01 PM, Halil Pasic wrote:
On Fri, 21 Aug 2020 15:56:11 -0400
Tony Krowiak<akrowiak@xxxxxxxxxxxxx> wrote:

Let's hot plug/unplug adapters, domains and control domains assigned to or
unassigned from an AP matrix mdev device while it is in use by a guest per
the following:

* When the APID of an adapter is assigned to a matrix mdev in use by a KVM
guest, the adapter will be hot plugged into the KVM guest as long as each
APQN derived from the Cartesian product of the APID being assigned and
the APQIs already assigned to the guest's CRYCB references a queue device
bound to the vfio_ap device driver.

* When the APID of an adapter is unassigned from a matrix mdev in use by a
KVM guest, the adapter will be hot unplugged from the KVM guest.

* When the APQI of a domain is assigned to a matrix mdev in use by a KVM
guest, the domain will be hot plugged into the KVM guest as long as each
APQN derived from the Cartesian product of the APQI being assigned and
the APIDs already assigned to the guest's CRYCB references a queue device
bound to the vfio_ap device driver.

* When the APQI of a domain is unassigned from a matrix mdev in use by a
KVM guest, the domain will be hot unplugged from the KVM guest
Hm, I suppose this means that what your guest effectively gets may depend
on whether assign_domain or assign_adapter is done first.

Suppose we have the queues
0.0 0.1
1.0
bound to vfio_ap, i.e. 1.1 is missing for a reason different than
belonging to the default drivers (for what exact reason no idea).
I'm not quite sure what you mean be "we have queue". I will
assume you mean those queues are bound to the vfio_ap
device driver.
Yes, this is exactly what I've meant.


The only way this could happen is if somebody
manually unbinds queue 1.1.

Assuming that:
1) every time we observe ap_perm the ap subsystem in in a settled state
(i.e. not in a middle of pushing things left and right
because of an ap_perm change,
2) the only non-default driver is vfio_ap, and that
3) queues handle non-operational states by other means than dissapearing
(should be the case with the latest reworks)
I agree what is left is manual unbind, which I lean towards considering
an edge case.

If this is indeed just about that edge case, maybe we can live with a
simpler algorithm than this one.

The simplest algorithm is:
* Forego hot plug of an adapter if any APQN derived from the adapter's
   APID and the APQIs of the domains assigned to the matrix mdev references a
   queue device not bound to the vfio_ap device driver. In your scenario, this
   would result in no queues for the guest at 3) despite the fact that 1.0 is bound
   to the vfio_ap dd.

As you stated, this is an edge case, so maybe this would be sufficient. Do you
have any alternative algorithm that makes sense?

Also, keep in mind that one of my goals with this design was to avoid leaving
the guest without any queues if possible. Maybe that is not a critical requirement.



Let's suppose we started with the matix containing only adapter
0 (0.) and domain 0 (.0).

After echo 1 > assign_adapter && echo 1 > assign_domain we end up with
matrix:
0.0 0.1
1.0 1.1
guest_matrix:
0.0 0.1
while after echo 1 > assign_domain && echo 1 > assign_adapter we end up
with:
matrix:
0.0 0.1
1.0 1.1
guest_matrix:
0.0
0.1

That means, the set of bound queues and the set of assigned resources do
not fully determine the set of resources passed through to the guest.

Is that a deliberate design choice?
Yes, it is a deliberate choice to only allow guest access to queues
represented by queue devices bound to the vfio_ap device driver.
The idea here is to adhere to the linux device model.

This is not what I've asked. My question was about he fact that
reordering assignments gives different results. Well this was kind
of the case before as well, with the notable difference, that in a
past we always had an error. So if a full sequence of assignments could
be performed without an error, than any permutation would be performed
with the exact same result.

I'm not sure that the exact same result can be achieved regardless
of the order of assignments - other than possibly the simple algorithm
I suggested above - it is something I would have to think about some
more.

Another option is to call the filtering mechanism introduced
in patch 8/16 in which case the results will be consistent with the
configuration of the shadow_apcb. This would be the same
configuration as if the guest were started after the assignment.


I'm all for only allowing guest access to queues represented by queue
devices bound to the vfio_ap device driver. I'm concerned with the
permutation (and calculus).

* When the domain number of a control domain is assigned to a matrix mdev
in use by a KVM guest, the control domain will be hot plugged into the
KVM guest.

* When the domain number of a control domain is unassigned from a matrix
mdev in use by a KVM guest, the control domain will be hot unplugged
from the KVM guest.

Signed-off-by: Tony Krowiak<akrowiak@xxxxxxxxxxxxx>
---
[..]

+static bool vfio_ap_mdev_assign_guest_apid(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long apid)
+{
+ unsigned long apqi, apqn;
+
+ if (!vfio_ap_mdev_has_crycb(matrix_mdev) ||
+ !test_bit_inv(apid, (unsigned long *)matrix_dev->info.apm))
+ return false;
+
+ if (bitmap_empty(matrix_mdev->shadow_apcb.aqm, AP_DOMAINS))
+ return vfio_ap_mdev_assign_apqis_4_apid(matrix_mdev, apid);
Hm. Let's say we have the same situation regarding the bound queues as
above but we start with the empty matrix, and do all the assignments
while the guest is running.

Consider the following sequence of actions.

1) echo 0 > assign_domain
matrix:            .0
guest_matrix: no APQNs

2) echo 1 > assign_domain
matrix:            .0, .1
guest_matrix: no APQNs

3) echo 1 > assign_adapter
matrix:           1.0, 1.1
guest_matrix: 1.0

4) echo 0 > assign_adapter
matrix:           0.0, 0.1, 1.0, 1.1
guest_matrix: 0.0, 1.0
5) echo 1 > unassign_adapter
matrix:           0.0, 0.1
guest_matrix: 0.0

I understand that at 3), because
bitmap_empty(matrix_mdev->shadow_apcb.aqm)we would end up with a shadow
aqm containing just domain 0, as queue 1.1 ain't bound to us.
True

Thus at the end we would have
matrix:
0.0 0.1
guest_matrix:
0.0
At the end I had:
matrix:            0.0, 0.1
guest_matrix: 0.0

And if we add in an adapter 2. into the mix with the queues 2.0 and 2.1
then after
6) echo 2 > assign_adapter
we get
Thus at the end we would have
matrix:
0.0 0.1
2.0 2.1
guest_matrix:
0.0
2.0

This looks very quirky to me. Did I read the code wrong? Opinions?
You read the code correctly and I agree, this is a bit quirky. I would say
that after adding adapter 2, we should end up with guest matrix:
0.0, 0.1
2.0, 2.1

If you agree, I'll make the adjustment.

I do agree, but maybe we should discuss what adjustments do you have in
mind.

[..]

+static bool vfio_ap_mdev_unassign_guest_apid(struct ap_matrix_mdev *matrix_mdev,
+ unsigned long apid)
+{
+ if (vfio_ap_mdev_has_crycb(matrix_mdev)) {
+ if (test_bit_inv(apid, matrix_mdev->shadow_apcb.apm)) {
+ clear_bit_inv(apid, matrix_mdev->shadow_apcb.apm);
+
+ /*
+ * If there are no APIDs assigned to the guest, then
+ * the guest will not have access to any queues, so
+ * let's also go ahead and unassign the APQIs. Keeping
+ * them around may yield unpredictable results during
+ * a probe that is not related to a host AP
+ * configuration change (i.e., an AP adapter is
+ * configured online).
+ */
I don't quite understand this comment. Clearing out the other mask when
the other one becomes empty, does allow us to recover the full possible guest
matrix in the scenario described above. I don't see any shadow
manipulation in the probe handler at this stage. Are we maybe
talking about the same effect as I described for assign?
Patch 15/16 is for the probe.

I still don't understand the logic, but I guess we want to make
adjustments anyways, so maybe I don't have to.

Regards,
Halil