Re: [PATCH 06/12] x86/mm: use INVLPGB for kernel TLB flushes
From: Rik van Riel
Date: Thu Jan 09 2025 - 15:21:29 EST
On Mon, 2025-01-06 at 09:21 -0800, Dave Hansen wrote:
> On 12/30/24 09:53, Rik van Riel wrote:
>
>
> > +static void broadcast_kernel_range_flush(unsigned long start,
> > unsigned long end)
> > +{
> > + unsigned long addr;
> > + unsigned long maxnr = invlpgb_count_max;
> > + unsigned long threshold = tlb_single_page_flush_ceiling *
> > maxnr;
>
> The 'tlb_single_page_flush_ceiling' value was determined by looking
> at
> _local_ invalidation cost. Could you talk a bit about why it's also a
> good value to use for remote invalidations? Does it hold up for
> INVLPGB
> the same way it did for good ol' INVLPG? Has there been any explicit
> testing here to find a good value?
>
> I'm also confused by the multiplication here. Let's say
> invlpgb_count_max==20 and tlb_single_page_flush_ceiling==30.
>
> You would need to switch away from single-address invalidation when
> the
> number of addresses is >20 for INVLPGB functional reasons. But you'd
> also need to switch away when >30 for performance reasons
> (tlb_single_page_flush_ceiling).
>
> But I don't understand how that would make the threshold 20*30=600
> invalidations.
I have not done any measurement to see how
flushing with INVLPGB stacks up versus
local TLB flushes.
What makes INVLPGB potentially slower:
- These flushes are done globally
What makes INVLPGB potentially faster:
- Multiple flushes can be pending simultaneously,
and executed in any convenient order by the CPUs.
- Wait once on completion of all the queued flushes.
Another thing that makes things interesting is the
TLB entry coalescing done by AMD CPUs.
When multiple pages are both virtually and physically
contiguous in memory (which is fairly common), the
CPU can use a single TLB entry to map up to 8 of them.
That means if we issue eg. 20 INVLPGB flushes for
8 4kB pages each, instead of the CPUs needing to
remove 160 TLB entries, there might only be 50.
I just guessed at the numbers used in my code,
while trying to sort out the details elsewhere
in the code.
How should we go about measuring the tradeoffs
between invalidation time, and the time spent
in TLB misses from flushing unnecessary stuff?
>
> > + /*
> > + * TLBSYNC only waits for flushes originating on the same
> > CPU.
> > + * Disabling migration allows us to wait on all flushes.
> > + */
>
> Imperative voice here too, please:
>
> Disable migration to wait on all flushes.
>
> > + guard(preempt)();
> > +
> > + if (end == TLB_FLUSH_ALL ||
> > + (end - start) > threshold << PAGE_SHIFT) {
>
> This is basically a copy-and-paste of the "range vs. global" flush
> logic, but taking 'invlpgb_count_max' into account.
>
> It would be ideal if those limit checks could be consolidated. I
> suspect
> that when the 'threshold' calculation above gets clarified that they
> may
> be easier to consolidate.
Maybe?
I implemented another suggestion in the code,
and the start of flush_tlb_kernel_range() now
looks like this:
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
if (broadcast_kernel_range_flush(start, end))
return;
If we are to consolidate the limit check, it
should probably be in a helper function somewhere,
and not by spreading out the broadcast flush calls.
>
> BTW, what is a typical value for 'invlpgb_count_max'? Is it more or
> less
> than the typical value for 'tlb_single_page_flush_ceiling'?
>
> Maybe we should just lower 'tlb_single_page_flush_ceiling' if
> 'invlpgb_count_max' falls below it so we only have _one_ runtime
> value
> to consider.
The value for invlpgb_count_max on both Milan
and Bergamo CPUs appears to be 8. That is, the
CPU reports we can flush 7 additional pages
beyond a single page.
This matches the number of PTEs that can be
cached in one TLB entry if they are contiguous
and aligned, and matches one cache line worth
of PTE entries.
>
>
> > + invlpgb_flush_all();
> > + } else {
> > + unsigned long nr;
> > + for (addr = start; addr < end; addr += nr <<
> > PAGE_SHIFT) {
> > + nr = min((end - addr) >> PAGE_SHIFT,
> > maxnr);
> > + invlpgb_flush_addr(addr, nr);
> > + }
> > + }
> > +
> > + tlbsync();
> > +}
> > +
> > static void do_kernel_range_flush(void *info)
> > {
> > struct flush_tlb_info *f = info;
> > @@ -1089,6 +1115,11 @@ static void do_kernel_range_flush(void
> > *info)
> >
> > void flush_tlb_kernel_range(unsigned long start, unsigned long
> > end)
> > {
> > + if (cpu_feature_enabled(X86_FEATURE_INVLPGB)) {
> > + broadcast_kernel_range_flush(start, end);
> > + return;
> > + }
> > +
> > /* Balance as user space task's flush, a bit conservative
> > */
> > if (end == TLB_FLUSH_ALL ||
> > (end - start) > tlb_single_page_flush_ceiling <<
> > PAGE_SHIFT) {
>
> I also wonder if this would all get simpler if we give in and
> *always*
> call get_flush_tlb_info(). That would provide a nice single place to
> consolidate the "all vs. ranged" flush logic.
Possibly. That might be a good way to unify that
threshold check?
That should probably be a separate patch, though.
--
All Rights Reversed.