Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

From: Linus Torvalds
Date: Fri Oct 28 2022 - 20:44:19 EST


On Fri, Oct 28, 2022 at 4:57 PM Nadav Amit <nadav.amit@xxxxxxxxx> wrote:
>
> The problem is in the following code of zap_pte_range():
>
> if (!PageAnon(page)) {
>         if (pte_dirty(ptent)) {
>                 force_flush = 1;
>                 set_page_dirty(page);
>         }
>
> }
> page_remove_rmap(page, vma, false);
>
> Once we remove the rmap, rmap_walk() would not acquire the page-table lock
> anymore. As a result, nothing prevents the kernel from performing writeback
> and cleaning the page-struct dirty-bit before the TLB is actually flushed.

Hah.

The original reason for force_flush there was similar: a race wrt
page_mkclean(), because this code doesn't take the page lock that
would normally serialize things (the page lock is so painful, and has
such nasty interactions with slow IO operations, that we really don't
want to take it here).

So all the page-dirty handling really *wants* to use the page lock,
and for the IO side (writeback) that ends up being acceptable and
works well, but from that "serialize VM state" angle it's horrendous.

So instead the code intentionally serialized on the rmap data
structures which page_mkclean() also walks, and as you point out,
that's broken. It's not broken at the point where we do
set_page_dirty(), but it *becomes* broken when we drop the rmap, and the
problem is exactly that "we still have the dirty bit hidden in the TLB
state" issue that you pinpoint.

I think the proper fix (or at least _a_ proper fix) would be to
actually carry the dirty bit along to the __tlb_remove_page() point,
and treat it exactly the same way as the page pointer itself
- set the page dirty after the TLB flush, the same way we can free the
page after the TLB flush.
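
On the freeing side that would be something like this, say in
tlb_batch_pages_flush(), which is where the batched pages get freed
today (a rough sketch only, not a real patch; it assumes the dirty bit
has been stuffed into bit 0 of each batch->pages[] entry, see below):

static void tlb_batch_pages_flush(struct mmu_gather *tlb)
{
        struct mmu_gather_batch *batch;

        for (batch = &tlb->local; batch && batch->nr; batch = batch->next) {
                unsigned int i;

                for (i = 0; i < batch->nr; i++) {
                        unsigned long entry = (unsigned long)batch->pages[i];
                        struct page *page = (struct page *)(entry & ~1UL);

                        /*
                         * The TLB flush has already happened by the
                         * time we get here, so nobody can write to the
                         * page through a stale TLB entry any more:
                         * only now is it safe to tell the VM about the
                         * dirty state we carried along.
                         */
                        if (entry & 1)
                                set_page_dirty(page);
                        batch->pages[i] = page; /* untag for freeing */
                }
                free_pages_and_swap_cache(batch->pages, batch->nr);
                batch->nr = 0;
        }
        tlb->active = &tlb->local;
}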

We could easily hide said dirty bit in the low bits of the
"batch->pages[]" array or something like that. We'd just have to add
the 'dirty' argument to __tlb_remove_page_size() and friends.
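
Roughly (again only a sketch; the real function also tries to chain a
new batch before telling the caller to flush, which I'm ignoring
here):

bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
                            bool dirty, int page_size)
{
        struct mmu_gather_batch *batch = tlb->active;

        /*
         * struct page pointers are well-aligned, so bit 0 of each
         * batch->pages[] entry is free to mean "this page was
         * pte_dirty() when we zapped the mapping".
         */
        batch->pages[batch->nr++] =
                (struct page *)((unsigned long)page | dirty);

        /* Tell the caller to flush once the batch is full. */
        return batch->nr == batch->max;
}

and then zap_pte_range() would just pass pte_dirty(ptent) down instead
of doing the set_page_dirty()+force_flush dance:

        __tlb_remove_page(tlb, page, pte_dirty(ptent));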

Hmm?

Your idea of "do the page_remove_rmap() late instead" would also work,
but the reason I think just squirrelling away the dirty bit is the
"proper" fix is that it would get rid of the whole need for
'force_flush' in this area entirely. So we'd not only fix the race
you noticed, we'd also reduce the number of TLB flushes while we're
at it.

I don't know. Maybe I'm missing something fundamental, and my idea is
just stupid.

Linus