Re: bio linked list corruption.

From: Linus Torvalds
Date: Mon Oct 24 2016 - 20:00:44 EST


On Mon, Oct 24, 2016 at 3:42 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> Here's my theory: I think you're looking at the right code but the
> wrong stack. shmem_fault_wait is fine, but shmem_fault_waitq looks
> really dicey.

Hmm.

> Consider:
>
> fallocate calls wake_up_all(), which calls autoremove_wait_function().
> That wakes up the shmem_fault thread. Suppose that the shmem_fault
> thread gets moving really quickly (presumably because it never went to
> sleep in the first place -- suppose it hadn't made it to schedule(),
> so schedule() turns into a no-op). It calls finish_wait() *before*
> autoremove_wake_function() does the list_del_init(). finish_wait()
> gets to the list_empty_careful() call and it returns true.

All of this happens under inode->i_lock, so the different parts are
serialized. So if this happens before the wakeup, then finish_wait()
will se that the wait-queue entry is not empty (it points to the wait
queue head in shmem_falloc_waitq.

But then it will just remove itself carefully with list_del_init()
under the waitqueue lock, and things are fine.

Yes, it uses the waitiqueue lock on the other stack, but that stack is
still ok since
(a) we're serialized by inode->i_lock
(b) this code ran before the fallocate thread catches up and exits.

In other words, your scenario seems to be dependent on those two
threads being able to race. But they both hold inode->i_lock in the
critical region you are talking about.

> Now the fallocate thread catches up and *exits*. Dave's test makes a
> new thread that reuses the stack (the vmap area or the backing store).
>
> Now the shmem_fault thread continues on its merry way and takes
> q->lock. But oh crap, q->lock is pointing at some random spot on some
> other thread's stack. Kaboom!

Note that q->lock should be entirely immaterial, since inode->i_lock
nests outside of it in all uses.

Now, if there is some code that runs *without* the inode->i_lock, then
that would be a big bug.

But I'm not seeing it.

I do agree that some race on some stack data structure could easily be
the cause of these issues. And yes, the vmap code obviously starts
reusing the stack much earlier, and would trigger problems that would
essentially be hidden by the fact that the kernel stack used to stay
around not just until exit(), but until the process was reaped.

I just think that in this case i_lock really looks like it should
serialize things correctly.

Or are you seeing something I'm not?

Linus