On Tue, Apr 28, 2020 at 09:06:51AM +0200, Javier GonzÃlez wrote:
CAUTION: This email originated from outside of Western Digital. Do not click on links or open attachments unless you recognize the sender and know that the content is safe.
On 27.04.2020 18:22, Niklas Cassel wrote:
> On Mon, Apr 27, 2020 at 08:03:11PM +0200, Javier GonzÃlez wrote:
> > On 27.04.2020 14:34, Niklas Cassel wrote:
> > > When jumping to the out_put_disk label, we will call put_disk(), which will
> > > trigger a call to disk_release(), which calls blk_put_queue().
> > >
> > > Later in the cleanup code, we do blk_cleanup_queue(), which will also call
> > > blk_put_queue().
> > >
> > > Putting the queue twice is incorrect, and will generate a KASAN splat.
> > >
> > > Set the disk->queue pointer to NULL, before calling put_disk(), so that the
> > > first call to blk_put_queue() will not free the queue.
> > >
> > > The second call to blk_put_queue() uses another pointer to the same queue,
> > > so this call will still free the queue.
> > >
> > > Fixes: 85136c010285 ("lightnvm: simplify geometry enumeration")
> > > Signed-off-by: Niklas Cassel <niklas.cassel@xxxxxxx>
> > > ---
> > > drivers/nvme/host/core.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 91c1bd659947..f2adea96b04c 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3642,6 +3642,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> > >
> > > return;
> > > out_put_disk:
> > > + /* prevent double queue cleanup */
> > > + ns->disk->queue = NULL;
> > > put_disk(ns->disk);
> > > out_unlink_ns:
> > > mutex_lock(&ctrl->subsys->lock);
> > > --
> > > 2.25.3
> > >
> > What about delaying the assignment of ns->disk?
> >
> > diff --git i/drivers/nvme/host/core.c w/drivers/nvme/host/core.c
> > index a4d8c90ee7cc..6da4a9ced945 100644
> > --- i/drivers/nvme/host/core.c
> > +++ w/drivers/nvme/host/core.c
> > @@ -3541,7 +3541,6 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> > disk->queue = ns->queue;
> > disk->flags = flags;
> > memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
> > - ns->disk = disk;
> >
> > __nvme_revalidate_disk(disk, id);
> >
> > @@ -3553,6 +3552,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
> > }
> > }
> >
> > + ns->disk = disk;
> > +
>
> Hello Javier!
>
>
> The only case where we jump to the out_put_disk label, is if the
> nvme_nvm_register() call failed.
>
> In that case, we want to undo the alloc_disk_node() operation, i.e.,
> decrease the refcount.
>
> If we don't set "ns->disk = disk;" before the call to nvme_nvm_register(),
> then, if register fails, and we jump to the put_disk(ns->disk) label,
> ns->disk will be NULL, so the recount will not be decreased, so I assume
> that this memory would then be a memory leak.
>
>
> I think that the problem is that the block functions are a bit messy.
> Most drivers seem to do blk_cleanup_queue() first and then do put_disk(),
> but some drivers do it in the opposite way, so I think that we might have
> some more use-after-free bugs in some of these drivers that do it in the
> opposite way.
>
Hi Niklas,
Yes, the out_put_disk label was introduced at the same time as the
LightNVM entry point. We can do a better job at separating the cleanup
functions, but as far as I can see ns->disk is not used in the LightNVM
initialization, so delaying the initialization should be ok. Part of
this should be also changing the out_put_disk to put_disk(disk).
Hello Javier,
If I understand correctly, you are suggesting this:
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3616,7 +3616,6 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
disk->queue = ns->queue;
disk->flags = flags;
memcpy(disk->disk_name, disk_name, DISK_NAME_LEN);
- ns->disk = disk;
__nvme_revalidate_disk(disk, id);
@@ -3628,6 +3627,8 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
}
}
+ ns->disk = disk;
+
down_write(&ctrl->namespaces_rwsem);
list_add_tail(&ns->list, &ctrl->namespaces);
up_write(&ctrl->namespaces_rwsem);
@@ -3642,7 +3643,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
return;
out_put_disk:
- put_disk(ns->disk);
+ put_disk(disk);
out_unlink_ns:
mutex_lock(&ctrl->subsys->lock);
list_del_rcu(&ns->siblings);
That would not solve the double free error when the registration fails,
since disk->queue is still set, so this call will still free the queue.
Later in the cleanup code blk_cleanup_queue(ns->queue);
is called, which will still try to free what is already freed.
To remove the double free by delaying assignments, we would need to
delay the assignment of disk->queue until after the LightNVM entry point,
but I don't think that is possible, since __nvme_revalidate_disk(), which
is called before the LightNVM entry point uses disk->queue.
Christoph said that he would clean up this mess with better block layer
APIs. So perhaps the fix that is already queued is good enough until then.