Re: [PATCH] uacce: fix concurrency of fops_open and uacce_remove

From: Zhangfei Gao
Date: Fri Jun 17 2022 - 04:20:47 EST




On 2022/6/17 下午2:05, Zhangfei Gao wrote:


On 2022/6/16 下午4:14, Jean-Philippe Brucker wrote:
On Thu, Jun 16, 2022 at 12:10:18PM +0800, Zhangfei Gao wrote:
diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index 281c54003edc..b6219c6bfb48 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -136,9 +136,16 @@ static int uacce_fops_open(struct inode *inode, struct file *filep)
       if (!q)
           return -ENOMEM;
+    mutex_lock(&uacce->queues_lock);
+
+    if (!uacce->parent->driver) {
I don't think this is useful, because the core clears parent->driver after
having run uacce_remove():

    rmmod hisi_zip        open()
     ...                 uacce_fops_open()
     __device_release_driver()      ...
      pci_device_remove()
       hisi_zip_remove()
        hisi_qm_uninit()
         uacce_remove()
          ...              ...
                       mutex_lock(uacce->queues_lock)
      ...                  if (!uacce->parent->driver)
      device_unbind_cleanup()      /* driver still valid, proceed */
       dev->driver = NULL
The check  if (!uacce->parent->driver) is required, otherwise NULL pointer
may happen.
I agree we need something, what I mean is that this check is not
sufficient.

iommu_sva_bind_device
const struct iommu_ops *ops = dev_iommu_ops(dev);  ->
dev->iommu->iommu_dev->ops

rmmod has no issue, but remove parent pci device has the issue.
Ah right, relying on the return value of bind() wouldn't be enough even if
we mandated SVA.

[...]
I think we need the global uacce_mutex to serialize uacce_remove() and
uacce_fops_open(). uacce_remove() would do everything, including
xa_erase(), while holding that mutex. And uacce_fops_open() would try to
obtain the uacce object from the xarray while holding the mutex, which
fails if the uacce object is being removed.
Since fops_open get char device refcount, uacce_release will not happen
until open returns.
The refcount only ensures that the uacce_device object is not freed as
long as there are open fds. But uacce_remove() can run while there are
open fds, or fds in the process of being opened. And atfer uacce_remove()
runs, the uacce_device object still exists but is mostly unusable. For
example once the module is freed, uacce->ops is not valid anymore. But
currently uacce_fops_open() may dereference the ops in this case:

    uacce_fops_open()
     if (!uacce->parent->driver)
     /* Still valid, keep going */
     ...                    rmmod
                         uacce_remove()
     ...                     free_module()
     uacce->ops->get_queue() /* BUG */

uacce_remove should wait for uacce->queues_lock, until fops_open release the lock.
If open happen just after the uacce_remove: unlock, uacce_bind_queue in open should fail.

Accessing uacce->ops after free_module() is a use-after-free. We need all
you men parent release the resources.
the fops to synchronize with uacce_remove() to ensure they don't use any
resource of the parent after it's been freed.
After fops_open, currently we are counting on parent driver stop all dma first, then call uacce_remove, which is assumption.
Like drivers/crypto/hisilicon/zip/zip_main.c: hisi_qm_wait_task_finish, which will wait uacce_release.
If comments this , there may other issue,
Unable to handle kernel paging request at virtual address ffff80000b700204
pc : hisi_qm_cache_wb.part.0+0x2c/0xa0

I see uacce_fops_poll() may have the same problem, and should be inside
uacce_mutex.
Do we need consider this, uacce_remove can happen anytime but not waiting dma stop?

Not sure uacce_mutex can do this.
Currently the sequence is
mutex_lock(&uacce->queues_lock);
mutex_lock(&uacce_mutex);

Or we set all the callbacks of uacce_ops to NULL?
How about in uacce_remove
mutex_lock(&uacce_mutex);
uacce->ops = NULL;
mutex_unlock(&uacce_mutex);

And check uacce->ops  first when using.

Or set all ops of uacce->ops to NULL.

Module_get/put only works for module, but not for removing device.

Thanks


Thanks,
Jean