Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit

From: Baolin Wang
Date: Tue Oct 24 2023 - 22:43:16 EST




On 10/25/2023 9:58 AM, Alistair Popple wrote:

Barry Song <21cnbao@xxxxxxxxx> writes:

On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple <apopple@xxxxxxxxxx> wrote:


Barry Song <21cnbao@xxxxxxxxx> writes:

On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@xxxxxxxxx> wrote:

On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:

[...]

(A). Constant flush cost vs. (B). very very occasional reclaimed hot
page, B might
be a correct choice.

Plus, I doubt B is really going to happen. as after a page is promoted to
the head of lru list or new generation, it needs a long time to slide back
to the inactive list tail or to the candidate generation of mglru. the time
should have been large enough for tlb to be flushed. If the page is really
hot, the hardware will get second, third, fourth etc opportunity to set an
access flag in the long time in which the page is re-moved to the tail
as the page can be accessed multiple times if it is really hot.

This might not be true if you have external hardware sharing the page
tables with software through either HMM or hardware supported ATS
though.

In those cases I think it's much more likely hardware can still be
accessing the page even after a context switch on the CPU say. So those
pages will tend to get reclaimed even though hardware is still actively
using them which would be quite expensive and I guess could lead to
thrashing as each page is reclaimed and then immediately faulted back
in.

That's possible, but the chance should be relatively low. At least on x86, I have not heard of this issue.

i am not quite sure i got your point. has the external hardware sharing cpu's
pagetable the ability to set access flag in page table entries by
itself? if yes,
I don't see how our approach will hurt as folio_referenced can notify the
hardware driver and the driver can flush its own tlb. If no, i don't see
either as the external hardware can't set access flags, that means we
have ignored its reference and only knows cpu's access even in the current
mainline code. so we are not getting worse.

so the external hardware can also see cpu's TLB? or cpu's tlb flush can
also broadcast to external hardware, then external hardware sees the
cleared access flag, thus, it can set access flag in page table when the
hardware access it? If this is the case, I feel what you said is true.

Perhaps it would help if I gave a concrete example. Take for example the
ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
two ways depending on the specific HW implementation.

If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
invalidations. If BTM is not supported it relies on SW to explicitly
forward TLB invalidations via MMU notifiers.

On our ARM64 hardware, we rely on BTM to maintain TLB coherency.

Now consider the case where some external device is accessing mappings
via the SMMU. The access flag will be cached in the SMMU TLB. If we
clear the access flag without a TLB invalidate the access flag in the
CPU page table will not get updated because it's already set in the SMMU
TLB.

As an aside access flag updates happen in one of two ways. If the SMMU
HW supports hardware translation table updates (HTTU) then hardware will
manage updating access/dirty flags as required. If this is not supported
then SW is relied on to update these flags which in practice means
taking a minor fault. But I don't think that is relevant here - in
either case without a TLB invalidate neither of those things will
happen.

I suppose drivers could implement the clear_flush_young() MMU notifier
callback (none do at the moment AFAICT) but then won't that just lead to
the opposite problem - that every page ever used by an external device
remains active and unavailable for reclaim because the access flag never
gets cleared? I suppose they could do the flush then which would ensure

Yes, I think so too. The reason there is currently no problem, perhaps I think, there are no actual use cases at the moment? At least on our Alibaba's fleet, SMMU and MMU do not share page tables now.

the page is marked inactive if it's not referenced between the two
folio_referenced calls().

But that requires changes to those drivers. SMMU from memory doesn't
even register for notifiers if BTM is supported.

- Alistair


Of course TLB flushes are equally (perhaps even more) expensive for this
kind of external HW so reducing them would still be beneficial. I wonder
if there's some way they could be deferred until the page is moved to
the inactive list say?


[1] https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@xxxxxxxxx/
Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
arch/arm64/include/asm/pgtable.h | 31 ++++++++++++++++---------------
1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0bd18de9fd97..2979d796ba9d 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -905,21 +905,22 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
- int young = ptep_test_and_clear_young(vma, address, ptep);
-
- if (young) {
- /*
- * We can elide the trailing DSB here since the worst that can
- * happen is that a CPU continues to use the young entry in its
- * TLB and we mistakenly reclaim the associated page. The
- * window for such an event is bounded by the next
- * context-switch, which provides a DSB to complete the TLB
- * invalidation.
- */
- flush_tlb_page_nosync(vma, address);
- }
-
- return young;
+ /*
+ * This comment is borrowed from x86, but applies equally to ARM64:
+ *
+ * Clearing the accessed bit without a TLB flush doesn't cause
+ * data corruption. [ It could cause incorrect page aging and
+ * the (mistaken) reclaim of hot pages, but the chance of that
+ * should be relatively low. ]
+ *
+ * So as a performance optimization don't flush the TLB when
+ * clearing the accessed bit, it will eventually be flushed by
+ * a context switch or a VM operation anyway. [ In the rare
+ * event of it not getting flushed for a long time the delay
+ * shouldn't really matter because there's no real memory
+ * pressure for swapout to react to. ]
+ */
+ return ptep_test_and_clear_young(vma, address, ptep);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--
2.39.3


Thanks
Barry

Thanks
Barry