Re: [PATCH 2/2] mm: soft_dirty: userfaultfd: introduce wrprotect_tlb_flush_pending

From: Linus Torvalds
Date: Thu Jan 07 2021 - 19:34:37 EST


On Thu, Jan 7, 2021 at 3:48 PM Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
>
> > The alternate fix remains "make sure we always flush the TLB before
> > releasing the page table lock, and make COW do the copy under the page
> > table lock". But I really would prefer to just have this code follow

> The copy under PT lock isn't enough.
>
> Flush TLB before releasing is enough of course.

Right. That's why I said "and". You need both, afaik.

But if we just do the mmap write lock, you need neither - then you
just need to flush before you release the write lock.

> Note also the patch in 2/2 patch that I sent:

Yes, yes, and that's what I'm objecting to.

All these stupid games with "flush_pending(" counts are complete garbage.

They all come from the fact that this code doesn't hold the right lock.

I don't understand you: in one breath you seem to say "yes, taking the
write lock is the right thing to do", and then in the next one you
point to this patch that adds all this garbage *because* it's not
holding the write lock.

All of those "tlb_flush_pending" things are wrong. They should not
exist. The code in clear_refs_write() should hold the mmap_sem for
writing, and do the TLB flush before it releases the mmap sem, and
then it *cannot* race with the page faults.

See what I'm saying? I refuse to apply your patch 2/2, because it all
seems entirely wrong.

What's doubly ludicrous about that is that the coide already _took_
the mmap_sem for writing, and spent extra cycles to downgrade it -
INCORRECTLY - to a read-lock. And as far as I can tell, it doesn't
even do anything expensive inside that (now downgraded) region, so the
downgrading was

(a) buggy

(b) slower than just keeping the lock the way it had

and (b) is because it had already done the expensive part (which was
to get the lock in the first place).

Just as an example, the whole "Rollback wrprotect_tlb_flush_pending"
is all because it got the lock - again wrongly - as a read-lock
initially, then it says "oh, I need to get a write lock", releases it,
re-takes it as a write lock, does a tiny amount of work, and then -
again incorrectly - downgrades it to a read-lock.

To make it even worse (if that is possible) it actually had YET
ANOTHER case - that CLEAR_REFS_MM_HIWATER_RSS - where it took the mmap
sem for writing, did its thing, and then released it.

So there's like *four* different locking mistakes in that single
function. And it's not even an important function to begin with.

It shgould just have done a single

mmap_write_lock_killable(mm);
...
mmap_write_unlock(mm);

around the whole thing, instead of _any_ of that crazy stuff.

That code is WRONG.

And your PATCH 2/2 makes that insane code EVEN WORSE. Why the heck is
that magic fs/proc/ interface allowed to get VM internals so wrong,
and make things so much worse?

Can you not see why I'm arguing with you?

Please. Why is the correct patch not the attached one (apart from the
obvious fact that I haven't tested it and maybe just screwed up
completely - but you get the idea)?

Linus

Attachment: patch
Description: Binary data