Re: [PATCH 3/3] vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
From: Alex Williamson
Date: Tue May 05 2020 - 13:12:39 EST
On Mon, 4 May 2020 17:01:23 -0300
Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> On Mon, May 04, 2020 at 01:35:52PM -0600, Alex Williamson wrote:
>
> > Ok, this all makes a lot more sense with memory_lock still in the
> > picture. And it looks like you're not insisting on the wait_event, we
> > can block on memory_lock so long as we don't have an ordering issue.
> > I'll see what I can do. Thanks,
>
> Right, you can block on the rwsem if it is ordered properly vs
> mmap_sem.
This is what I've come up with; please see if you agree with the logic:
void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_device *vdev)
{
        struct vfio_pci_mmap_vma *mmap_vma, *tmp;

        /*
         * Lock ordering:
         * vma_lock is nested under mmap_sem for vm_ops callback paths.
         * The memory_lock semaphore is used by both code paths calling
         * into this function to zap vmas and the vm_ops.fault callback
         * to protect the memory enable state of the device.
         *
         * When zapping vmas we need to maintain the mmap_sem => vma_lock
         * ordering, which requires using vma_lock to walk vma_list to
         * acquire an mm, then dropping vma_lock to get the mmap_sem and
         * reacquiring vma_lock.  This logic is derived from similar
         * requirements in uverbs_user_mmap_disassociate().
         *
         * mmap_sem must always be the top-level lock when it is taken.
         * Therefore we can only hold the memory_lock write lock when
         * vma_list is empty, as we'd need to take mmap_sem to clear
         * entries.  vma_list can only be guaranteed empty when holding
         * vma_lock, thus memory_lock is nested under vma_lock.
         *
         * This enables the vm_ops.fault callback to acquire vma_lock,
         * followed by memory_lock read lock, while already holding
         * mmap_sem without risk of deadlock.
         */
        while (1) {
                struct mm_struct *mm = NULL;

                mutex_lock(&vdev->vma_lock);
                while (!list_empty(&vdev->vma_list)) {
                        mmap_vma = list_first_entry(&vdev->vma_list,
                                                    struct vfio_pci_mmap_vma,
                                                    vma_next);
                        mm = mmap_vma->vma->vm_mm;
                        if (mmget_not_zero(mm))
                                break;

                        list_del(&mmap_vma->vma_next);
                        kfree(mmap_vma);
                        mm = NULL;
                }
                if (!mm)
                        break;
                mutex_unlock(&vdev->vma_lock);

                down_read(&mm->mmap_sem);
                if (mmget_still_valid(mm)) {
                        mutex_lock(&vdev->vma_lock);
                        list_for_each_entry_safe(mmap_vma, tmp,
                                                 &vdev->vma_list, vma_next) {
                                struct vm_area_struct *vma = mmap_vma->vma;

                                if (vma->vm_mm != mm)
                                        continue;

                                list_del(&mmap_vma->vma_next);
                                kfree(mmap_vma);

                                zap_vma_ptes(vma, vma->vm_start,
                                             vma->vm_end - vma->vm_start);
                        }
                        mutex_unlock(&vdev->vma_lock);
                }
                up_read(&mm->mmap_sem);
                mmput(mm);
        }

        down_write(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
}
As noted in the comment, the fault handler can simply do:
        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);
This should be deadlock-free now, so we can drop the retry handling.
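
To make that concrete, here's a rough sketch of what the fault handler
might look like under this scheme.  __vfio_pci_memory_enabled() and
__vfio_pci_add_vma() are stand-ins for whatever memory-enable test and
vma_list bookkeeping we settle on, and this assumes vm_pgoff was
adjusted to the BAR pfn at mmap time, so treat it as a sketch rather
than final code:

static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        struct vfio_pci_device *vdev = vma->vm_private_data;
        vm_fault_t ret = VM_FAULT_NOPAGE;

        /* vma_lock first, memory_lock nested under it, per above */
        mutex_lock(&vdev->vma_lock);
        down_read(&vdev->memory_lock);

        /* stand-in: test the command register memory enable state */
        if (!__vfio_pci_memory_enabled(vdev)) {
                ret = VM_FAULT_SIGBUS;
                goto out;
        }

        /* stand-in: record this vma on vdev->vma_list for later zapping */
        if (__vfio_pci_add_vma(vdev, vma)) {
                ret = VM_FAULT_OOM;
                goto out;
        }

        /* assumes vm_pgoff was set to the BAR pfn in vfio_pci_mmap() */
        if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                            vma->vm_end - vma->vm_start, vma->vm_page_prot))
                ret = VM_FAULT_SIGBUS;
out:
        up_read(&vdev->memory_lock);
        mutex_unlock(&vdev->vma_lock);
        return ret;
}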
Paths needing to acquire memory_lock with vmas zapped (device reset,
memory bit *->0 transition) call this function, perform their
operation, then simply release with up_write(&vdev->memory_lock). Both
the read and write versions of acquiring memory_lock can still occur
outside this function for operations that don't require flushing all
vmas or otherwise touching vma_lock or mmap_sem (e.g. read/write, MSI-X
vector table access, writing *->1 to the memory enable bit).
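
For instance, the *->0 transition in the config write path would
follow this pattern (sketch only; the transition test is a placeholder
for however we detect it in vfio_basic_config_write(), and the actual
emulation detail is elided):

        /* on a memory enable 1->0 write in the config write path */
        if (old_mem_enabled && !new_mem_enabled) {
                vfio_pci_zap_and_down_write_memory_lock(vdev);
                /* update the emulated/virtual COMMAND register state */
                up_write(&vdev->memory_lock);
        }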
I still need to work on the bus reset path, as acquiring memory_lock
write locks across multiple devices seems to require try-lock
behavior, which is clearly complicated, or at least messy, in the
above function.
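
One direction I'm considering, again only a sketch: take the write
lock on each affected device with a trylock and back out entirely on
contention, so the caller can release everything and retry or fail.
The function name here is hypothetical, and the zapping itself would
need a trylock-based variant of the function above:

/*
 * Sketch: try to acquire memory_lock for write across all devices
 * affected by the bus reset, unwinding on any contention.
 */
static bool vfio_pci_try_down_write_memory_all(struct vfio_pci_device **devs,
                                               int count)
{
        int i;

        for (i = 0; i < count; i++) {
                if (!down_write_trylock(&devs[i]->memory_lock)) {
                        while (i--)
                                up_write(&devs[i]->memory_lock);
                        return false;
                }
        }
        return true;
}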
Does this seem like it's going in a reasonable direction? Thanks,
Alex