Re: [PATCH v4 29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

From: Dave Hansen
Date: Thu Feb 20 2025 - 12:40:47 EST


On 2/20/25 09:10, Valentin Schneider wrote:
>> The LDT and maybe the PEBS buffers are the only implicit supervisor
>> accesses to vmalloc()'d memory that I can think of. But those are both
>> handled specially and shouldn't ever get zapped while in use. The LDT
>> replacement has its own IPIs separate from TLB flushing.
>>
>> But I'm actually not all that worried about accesses while actually
>> running userspace. It's that "danger zone" in the kernel between entry
>> and when the TLB might have dangerous garbage in it.
>>
> So say we have kPTI, thus no vmalloc() mapped in CR3 when running
> userspace, and do a full TLB flush right before switching to userspace -
> could the TLB still end up with vmalloc()-range-related entries when we're
> back in the kernel and going through the danger zone?

Yes, because the danger zone includes the switch back to the kernel CR3
with vmalloc() fully mapped. All bets are off about what's in the TLB
the moment that CR3 write occurs.

Actually, you could probably use that.

If a mapping is in the PTI user page table, you can't defer the flushes
for it. It's basically the same rule as for text poking in the danger
zone.

If there's a deferred flush pending, make sure that all of the
SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar
to user_pcid_flush_mask.
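
To make that a bit more concrete, here's a rough userspace model of the
idea, not a patch. The names kernel_flush_deferred, defer_kernel_flush()
and kernel_cr3_value() are made up for illustration and don't exist in
the tree. The real thing would live in the entry asm next to
SWITCH_TO_KERNEL_CR3 and would have to get the ordering against the
flusher right, which this sketch ignores. It only models the PCID case,
where bit 63 of the value written to CR3 says whether the CPU keeps
(set) or drops (clear) the non-global TLB entries for the PCID being
loaded.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS      4            /* arbitrary size for this model */
#define CR3_NOFLUSH  (1ULL << 63) /* x86: keep TLB entries for the loaded PCID */

/* Hypothetical per-CPU flag: "a kernel-range flush was deferred for this CPU". */
static atomic_bool kernel_flush_deferred[NR_CPUS];

/*
 * Hypothetical flusher side: instead of sending an IPI to a NOHZ_FULL CPU
 * that is running userspace, just record that it owes a flush.
 */
static void defer_kernel_flush(int cpu)
{
	atomic_store(&kernel_flush_deferred[cpu], true);
}

/*
 * Model of the value SWITCH_TO_KERNEL_CR3 would write on kernel entry.
 * The flag is tested and cleared in one step, the same way the user PCID
 * flush mask is consumed on the way out to userspace.
 */
static uint64_t kernel_cr3_value(int cpu, uint64_t kernel_cr3)
{
	if (atomic_exchange(&kernel_flush_deferred[cpu], false))
		return kernel_cr3 & ~CR3_NOFLUSH; /* flush on this CR3 write */

	return kernel_cr3 | CR3_NOFLUSH;          /* nothing pending, keep TLB */
}
```

The point of doing it in the CR3 write itself is that nothing between
entry and the switch gets a chance to consume a stale vmalloc()
translation out of the kernel PCID.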

But, honestly, I'm still not sure this is worth all the trouble. If
folks want to avoid IPIs for TLB flushes, there are hardware features
that *DO* that. Just get new hardware instead of adding this complicated
pile of software that we have to maintain forever. In 10 years, we'll
still have this software *and* 95% of our hardware will have the hardware
feature too.