Re: [PATCH] infiniband: fix a subtle race condition
From: Cong Wang
Date: Thu Jun 14 2018 - 19:14:42 EST
On Thu, Jun 14, 2018 at 10:24 AM, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
> On Thu, Jun 14, 2018 at 10:03:09AM -0700, Cong Wang wrote:
>> On Thu, Jun 14, 2018 at 7:24 AM, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
>> >
>> > This was my brief reaction too, this code path almost certainly has a
>> > use-after-free, and we should fix the concurrency between the two
>> > places in some correct way..
>>
>> First of all, why would a use-after-free trigger an unlock imbalance?
>> IOW, why do we have to solve the use-after-free to fix this unlock
>> imbalance?
>
> The issue syzkaller hit is that accessing ctx->file does not seem
> locked in any way and can race with other manipulations of ctx->file.
>
> So.. for this patch to be correct we need to understand how this
> statement:
>
> f = ctx->file
>
> Avoids f becoming a dangling pointer - and without locking, my
It doesn't, but that is beside the point; it is not the cause
of the unlock imbalance either. syzbot did not report a use-after-free
or a kernel segfault here.
> suspicion is that it doesn't - because missing locking around
> ctx->file is probably the actual bug syzkaller found.
Does my patch make it lockless or dangling? Apparently not.

Before my patch:

        mutex_lock(&ctx->file->mut);

After my patch:

        cur_file = ctx->file;
        mutex_lock(&cur_file->mut);

The dereference is the same as before: it was lockless then, and it is
still lockless after my patch.
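Spelled out in C (a simplified sketch; the ucma_event_handler() body is
paraphrased here, not copied verbatim), the difference is:

        /* Before: the lock and the unlock each re-read ctx->file.  If
         * ucma_migrate_id() changes ctx->file in between, we unlock a
         * different mutex than the one we locked -> unlock imbalance.
         */
        mutex_lock(&ctx->file->mut);
        /* ... queue the event ... */
        mutex_unlock(&ctx->file->mut);

        /* After: read ctx->file once (this is what sits in %r14 in the
         * assembly below), so the lock and the unlock always operate
         * on the same struct ucma_file.
         */
        struct ucma_file *cur_file = ctx->file;

        mutex_lock(&cur_file->mut);
        /* ... queue the event ... */
        mutex_unlock(&cur_file->mut);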
Look at the assembly code *without* my patch:
ffffffff819354f0: 49 8b 7c 24 78          mov    0x78(%r12),%rdi
ffffffff819354f5: 48 89 c3                mov    %rax,%rbx
ffffffff819354f8: 31 f6                   xor    %esi,%esi
ffffffff819354fa: e8 d8 dd 40 00          callq  ffffffff81d432d7 <mutex_lock_nested>
Apparently the pointer is dereferenced before the lock is taken.
What difference does my patch make?
ffffffff819354f2: 4d 8b 74 24 78          mov    0x78(%r12),%r14
ffffffff819354f7: 48 89 c3                mov    %rax,%rbx
ffffffff819354fa: 31 f6                   xor    %esi,%esi
ffffffff819354fc: 4c 89 f7                mov    %r14,%rdi
ffffffff819354ff: e8 9b df 40 00          callq  ffffffff81d4349f <mutex_lock_nested>
...
ffffffff8193567d: 4c 89 f7                mov    %r14,%rdi
ffffffff81935680: e8 98 dd 40 00          callq  ffffffff81d4341d <mutex_unlock>
The %r14 here is the whole point of my patch: the same pointer is passed
to both mutex_lock_nested() and mutex_unlock().
>
> If this is not the case, then add a comment explaining how f's
> lifetime is OK.
>
> Otherwise, we need some kind of locking and guessing we need to hold a
> kref for f?
I agree with you, but again, that is not necessary for fixing the
unlock imbalance.
>
>> Third of all, the use-after-free I can see (a race with ->close) exists
>> before my patch; this patch doesn't make it better or worse, nor
>> do I have any intent to fix it.
>
> I'm not sure that race exists, there should be something that flushes
> the WQ on the path to close... (though I have another email that
> perhaps that is broken, sigh)
>
This is not related to my patch, but to convince you, let me explain:
struct ucma_file is not refcounted. I know you cancel the work in
rdma_destroy_id(), but after ucma_migrate_id() the ctx has already
been moved to the new file, so closing the old file won't cancel the
ctx's work still in flight on the workqueue. So I think the following
use-after-free could happen:
ucma_event_handler():
        cur_file = ctx->file; // old file

ucma_migrate_id():
        lock();
        list_move_tail(&ctx->list, &new_file->ctx_list);
        ctx->file = new_file;
        unlock();

ucma_close():
        // retrieve the old file via filp->private_data
        // the loop won't cover the ctx moved to new_file
        kfree(file);

ucma_event_handler():
        // continued from above
        lock(&cur_file->mut); // already freed!
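In (heavily simplified) code, with the function bodies paraphrased from
my reading of ucma.c rather than quoted:

        /* ucma_migrate_id(): the ctx is moved to the new file and
         * ctx->file is repointed, but nothing cancels event work that
         * is already running against the old file pointer.
         */
        list_move_tail(&ctx->list, &new_file->ctx_list);
        ctx->file = new_file;

        /* ucma_close() on the old file: it only walks its own ctx_list,
         * so the migrated ctx is never visited here, and the old
         * struct ucma_file is freed while the event handler above still
         * holds cur_file pointing at it.
         */
        list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                /* ... rdma_destroy_id(ctx->cm_id); ucma_free_ctx(ctx); ... */
        }
        kfree(file);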
This is _not_ the cause of the unlock imbalance, and it is _not_
expected to be solved by my patch either.