Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64

From: David Hildenbrand
Date: Mon Apr 08 2024 - 04:36:38 EST


On 03.04.24 14:59, Ryan Roberts wrote:
On 27/03/2024 09:34, David Hildenbrand wrote:
On 26.03.24 18:51, Ryan Roberts wrote:
On 26/03/2024 17:39, David Hildenbrand wrote:
On 26.03.24 18:32, Ryan Roberts wrote:
On 26/03/2024 17:04, David Hildenbrand wrote:

Likely, we just want to read "the real deal" on both sides of the
pte_same()
handling.

Sorry I'm not sure I understand? You mean read the full pte including
access/dirty? That's the same as dropping the patch, right? Of course if
we do
that, we still have to keep pte_get_lockless() around for this case. In an
ideal
world we would convert everything over to ptep_get_lockless_norecency() and
delete ptep_get_lockless() to remove the ugliness from arm64.

Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
any
architecture.

I do wonder if pte_same_norecency() should be defined per architecture
and the
default would be pte_same(). So we could avoid the mkold etc on all other
architectures.

Wouldn't that break it's semantics? The "norecency" of
ptep_get_lockless_norecency() means "recency information in the returned pte
may
be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
access and dirty bits when you do the comparison".

My idea was that ptep_get_lockless_norecency() would return the actual
result on
these architectures. So e.g., on x86, there would be no actual change in
generated code.

I think this is a bad plan... You'll end up with subtle differences between
architectures.


But yes, the documentation of these functions would have to be improved.

Now I wonder if ptep_get_lockless_norecency() should actively clear
dirty/accessed bits to more easily find any actual issues where the bits still
matter ...

I did a version that took that approach. Decided it was not as good as this way
though. Now for the life of me, I can't remember my reasoning.

Maybe because there are some code paths that check accessed/dirty without
"correctness" implications? For example, if the PTE is already dirty, no need to
set it dirty etc?

I think I decided I was penalizing the architectures that don't care because all
their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
that I could feed its result into pte_same().

True. With ptep_get_norecency() you're also penalizing other architectures.
Therefore my original thought about making the behavior arch-specific, but the
arch has to make sure to get the combination of
ptep_get_lockless_norecency()+ptep_same_norecency() is right.

So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
make sure to also ignore them in ptep_same_norecency(), and must be able to
handle access/dirty bit changes differently.

Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
accessed or dirty". Only the former would end up ignoring stuff here, the latter
would not.

But again, just some random thoughts how this affects other architectures and
how we could avoid it. The issue I describe in patch #3 would be gone if
ptep_same_norecency() would just do a ptep_same() check on other architectures
-- and would make it easier to sell :)


I've been thinking some more about this. I think your proposal is the following:


// ARM64
ptep_get_lockless_norecency()
{
- returned access/dirty may be incorrect
- returned access/dirty may be differently incorrect between 2 calls
}
pte_same_norecency()
{
- ignore access/dirty when doing comparison
}
ptep_set_access_flags(ptep, pte)
{
- must not assume access/dirty in pte are "more permissive" than
access/dirty in *ptep
- must only set access/dirty in *ptep, never clear
}


// Other arches: no change to generated code
ptep_get_lockless_norecency()
{
return ptep_get_lockless();
}
pte_same_norecency()
{
return pte_same();
}
ptep_set_access_flags(ptep, pte)
{
- may assume access/dirty in pte are "more permissive" than access/dirty
in *ptep
- if no HW access/dirty updates, "*ptep = pte" always results in "more
permissive" change
}

An arch either specializes all 3 or none of them.

This would allow us to get rid of ptep_get_lockless().

And it addresses the bug you found with ptep_set_access_flags().


BUT, I still have a nagging feeling that there are likely to be other similar
problems caused by ignoring access/dirty during pte_same_norecency(). I can't
convince myself that its definitely all safe and robust.

Right, we'd have to identify the other possible cases and document what an arch + common code must stick to to make it work.

Some rules would be: if an arch implements ptep_get_lockless_norecency():

(1) Passing the result from ptep_get_lockless_norecency() to pte_same()
is wrong.
(2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after
ptep_get_lockless_norecency() is very likely wrong.



So I'm leaning towards dropping patch 3 and therefore keeping
ptep_get_lockless() around.

Let me know if you have any insight that might help me change my mind :)

I'm wondering if it would help if we could find a better name (or concept) for "norecency" here, that expresses that only on some archs we'd have that fuzzy handling.

Keeping ptep_get_lockless() around for now sounds like the best alternative.

--
Cheers,

David / dhildenb