Re: [PATCH] infiniband: fix a subtle race condition

From: Cong Wang
Date: Thu Jun 14 2018 - 19:14:42 EST

On Thu, Jun 14, 2018 at 10:24 AM, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
> On Thu, Jun 14, 2018 at 10:03:09AM -0700, Cong Wang wrote:
>> On Thu, Jun 14, 2018 at 7:24 AM, Jason Gunthorpe <jgg@xxxxxxxxxxxx> wrote:
>> >
>> > This was my brief reaction too, this code path almost certainly has a
>> > use-after-free, and we should fix the concurrency between the two
>> > places in some correct way..
>> First of all, why use-after-free could trigger an imbalance unlock?
>> IOW, why do we have to solve use-after-free to fix this imbalance
>> unlock?
> The issue syzkaller hit is that accessing ctx->file does not seem
> locked in any way and can race with other manipulations of ctx->file.
> So.. for this patch to be correct we need to understand how this
> statement:
> f = ctx->file
> Avoids f becoming a dangling pointer - and without locking, my

It doesn't, but that is not the point, and it is not the cause
of the unlock imbalance either. syzbot didn't report a use-after-free
or a kernel segfault here.

> suspicion is that it doesn't - because missing locking around
> ctx->file is probably the actual bug syzkaller found.

Does my patch make it lockless or dangling? Clearly not.

Before my patch:


After my patch:

cur_file = ctx->file;

The dereference is the same as before: it was lockless before my
patch and it is lockless after it.

Look at the assembly code *without* my patch:

ffffffff819354f0: 49 8b 7c 24 78 mov 0x78(%r12),%rdi
ffffffff819354f5: 48 89 c3 mov %rax,%rbx
ffffffff819354f8: 31 f6 xor %esi,%esi
ffffffff819354fa: e8 d8 dd 40 00 callq ffffffff81d432d7 <mutex_lock_nested>

As you can see, the pointer is loaded from ctx before the lock is taken.

What difference does my patch make?

ffffffff819354f2: 4d 8b 74 24 78 mov 0x78(%r12),%r14
ffffffff819354f7: 48 89 c3 mov %rax,%rbx
ffffffff819354fa: 31 f6 xor %esi,%esi
ffffffff819354fc: 4c 89 f7 mov %r14,%rdi
ffffffff819354ff: e8 9b df 40 00 callq ffffffff81d4349f <mutex_lock_nested>
ffffffff8193567d: 4c 89 f7 mov %r14,%rdi
ffffffff81935680: e8 98 dd 40 00 callq ffffffff81d4341d <mutex_unlock>

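The %r14 here is the whole point of my patch: the pointer is loaded
once and the same value is passed to both mutex_lock_nested and
mutex_unlock.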
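
To illustrate why reusing one saved pointer matters, here is a minimal
userspace sketch, not the kernel code. The toy mutex, the struct
layouts and the helper names are all hypothetical, and it assumes the
pre-patch code reads ctx->file again at unlock time. The "migrate" is
simulated inline instead of on another thread so the result is
deterministic.

```c
#include <assert.h>

/* Hypothetical toy mutex: only tracks whether it is held. */
struct toy_mutex { int held; };

static void toy_lock(struct toy_mutex *m) { m->held++; }

/* Returns 0 on success, -1 if the mutex was not held (imbalance). */
static int toy_unlock(struct toy_mutex *m)
{
    if (m->held == 0)
        return -1;
    m->held--;
    return 0;
}

/* Simplified stand-ins for the real structures (assumed layout). */
struct file { struct toy_mutex mut; };
struct ctx  { struct file *file; };

/* Assumed pre-patch pattern: ctx->file is read again at unlock time. */
int handler_reread(struct ctx *ctx, struct file *new_file)
{
    toy_lock(&ctx->file->mut);
    ctx->file = new_file;               /* stands in for a concurrent migrate */
    return toy_unlock(&ctx->file->mut); /* unlocks the new file's mutex */
}

/* Post-patch pattern: the pointer is read once and reused. */
int handler_cached(struct ctx *ctx, struct file *new_file)
{
    struct file *cur_file = ctx->file;

    toy_lock(&cur_file->mut);
    ctx->file = new_file;               /* same simulated migrate */
    return toy_unlock(&cur_file->mut);  /* unlocks what was locked */
}

int demo_reread(void)
{
    struct file a = { { 0 } }, b = { { 0 } };
    struct ctx c = { &a };

    return handler_reread(&c, &b);
}

int demo_cached(void)
{
    struct file a = { { 0 } }, b = { { 0 } };
    struct ctx c = { &a };

    return handler_cached(&c, &b);
}
```

In this sketch demo_reread() reports the imbalance and demo_cached()
does not. Neither variant says anything about the lifetime of the file
itself, which is the separate issue discussed below.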

> If this is not the case, then add a comment explaining how f's
> lifetime is OK.
> Otherwise, we need some kind of locking and guessing we need to hold a
> kref for f?

I agree with you, but again, this is not necessary to fix the unlock
imbalance.
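
For what the kref idea could look like, here is a rough userspace
sketch under stated assumptions: C11 atomics stand in for kref, and
file_alloc(), file_get(), file_put() and the freed_flag field are
hypothetical names invented for this example, not functions in ucma.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for the file. */
struct file {
    atomic_int refcnt;
    int *freed_flag;    /* demo only: lets the caller observe the free */
};

struct file *file_alloc(int *freed_flag)
{
    struct file *f = malloc(sizeof(*f));

    if (!f)
        return NULL;
    atomic_init(&f->refcnt, 1);         /* owner's reference */
    f->freed_flag = freed_flag;
    return f;
}

void file_get(struct file *f)
{
    atomic_fetch_add(&f->refcnt, 1);
}

void file_put(struct file *f)
{
    /* atomic_fetch_sub returns the previous value; 1 means last ref. */
    if (atomic_fetch_sub(&f->refcnt, 1) == 1) {
        *f->freed_flag = 1;
        free(f);
    }
}

/* Returns 1 if the object outlives the close and is freed afterwards. */
int demo_refcount(void)
{
    int freed = 0;
    struct file *f = file_alloc(&freed);
    int alive_during_handler;

    assert(f != NULL);
    file_get(f);                        /* handler takes its own reference */
    file_put(f);                        /* close path drops the owner's ref */
    alive_during_handler = !freed;      /* handler can still use f here */
    file_put(f);                        /* handler is done, object is freed */
    return alive_during_handler && freed;
}
```

Note that taking the reference must itself be done while something
still guarantees the object is alive (a lock, or an existing
reference), otherwise the get races with the final put.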

>> Third of all, the use-after-free I can see (race with ->close) exists
>> before my patch; this patch doesn't make it better or worse, nor
>> do I have any intent to fix it.
> I'm not sure that race exists, there should be something that flushes
> the WQ on the path to close... (though I have another email that
> perhaps that is broken, sigh)

This is not related to my patch, but to convince you, let me explain:

struct ucma_file is not refcounted. I know you cancel the work in
rdma_destroy_id(), but after ucma_migrate_id() the ctx has already
been moved to the new file, so closing the old file won't cancel the
work for that ctx still in flight on the workqueue. So, I think the
following use-after-free could happen:

cur_file = ctx->file; // old file

list_move_tail(&ctx->list, &new_file->ctx_list);
ctx->file = new_file;

// retrieve old file via filp->private_data
// the loop won't cover the ctx moved to the new_file

// continued from above
lock(&cur_file->mux); // already freed!

This is _not_ the cause of the unlock imbalance, and my patch is _not_
expected to fix it either.
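
For reference, the interleaving above can be replayed step by step in
a small userspace model. This is only a sketch: nothing is really
freed (a flag records the free so the program stays well defined), and
migrate(), close_file() and the struct fields are hypothetical
simplifications of the real paths.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: "freed" is a flag, no memory is released. */
struct file { int freed; };
struct ctx  { struct file *file; int work_cancelled; };

/* Stand-in for the migrate step: ctx now belongs to new_file. */
void migrate(struct ctx *ctx, struct file *new_file)
{
    ctx->file = new_file;
}

/*
 * Stand-in for the close path: only contexts still attached to this
 * file get their pending work cancelled, then the file goes away.
 */
void close_file(struct file *f, struct ctx *ctxs[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ctxs[i]->file == f)
            ctxs[i]->work_cancelled = 1;
    f->freed = 1;
}

/* Returns 1 if the handler ends up holding a pointer to a freed file. */
int replay_race(void)
{
    struct file old_file = { 0 }, new_file = { 0 };
    struct ctx ctx = { &old_file, 0 };
    struct ctx *all[] = { &ctx };
    struct file *cur_file;

    cur_file = ctx.file;                /* handler: reads the old file */
    migrate(&ctx, &new_file);           /* ctx moves to the new file */
    close_file(&old_file, all, 1);      /* loop misses the migrated ctx */

    /* handler continues: it would now lock through cur_file */
    return cur_file->freed && !ctx.work_cancelled;
}
```

In this model replay_race() returns 1: the work was never cancelled
and the saved pointer refers to a file that has been released.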