Re: [PATCH v3] mm/filemap: Allow arch to request folio size for exec memory
From: Ryan Roberts
Date: Fri Mar 28 2025 - 09:10:07 EST
On 27/03/2025 20:07, Zi Yan wrote:
> On 27 Mar 2025, at 12:44, Matthew Wilcox wrote:
>
>> On Thu, Mar 27, 2025 at 04:06:58PM +0000, Ryan Roberts wrote:
>>> So let's special-case the read(ahead) logic for executable mappings. The
>>> trade-off is performance improvement (due to more efficient storage of
>>> the translations in iTLB) vs potential read amplification (due to
>>> reading too much data around the fault which won't be used), and the
>>> latter is independent of base page size. I've chosen 64K folio size for
>>> arm64 which benefits both the 4K and 16K base page size configs and
>>> shouldn't lead to any read amplification in practice since the old
>>> read-around path was (usually) reading blocks of 128K. I don't
>>> anticipate any write amplification because text is always RO.
>>
>> Is there not also the potential for wasted memory due to ELF alignment?
>> Kalesh talked about it in the MM BOF at the same time that Ted and I
>> were discussing it in the FS BOF. Some coordination required (like
>> maybe Kalesh could have mentioned it to me rather than assuming I'd be
>> there?)
>>
>>> +#define arch_exec_folio_order() ilog2(SZ_64K >> PAGE_SHIFT)
>>
>> I don't think the "arch" really adds much value here.
>>
>> #define exec_folio_order() get_order(SZ_64K)
>
> How about AMD’s PTE coalescing, which does PTE compression at the
> 16KB or 32KB level? 64KB covers four 16KB or two 32KB ranges, so at
> least it will not hurt AMD PTE coalescing. Starting with 64KB across
> all arches might make it simpler to see the performance impact. Just
> a comment, no objection. :)

exec_folio_order() is defined per-architecture and SZ_64K is the arm64 preferred
size. At the moment x86 is not opted in, but they could choose to opt in with
32K (or whatever else makes sense) if the HW supports coalescing.

I'm not sure whether you thought this was global and are arguing against that,
or whether you are arguing for it to be global so that any performance
regressions show up earlier because x86 would be doing this too?
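
For reference, the two spellings of the macro quoted above should give the same
order for a 64K folio on the arm64 page sizes. Below is a minimal user-space
sketch of that arithmetic, not the kernel code: sketch_ilog2() and
sketch_get_order() are hypothetical stand-ins for the kernel's ilog2() and
get_order(), and they take the page shift as a parameter instead of using
PAGE_SHIFT.

```c
#include <stddef.h>

#define SZ_64K 0x10000UL

/* Hypothetical stand-in for the kernel's ilog2(): floor(log2(n)) for n > 0. */
static unsigned int sketch_ilog2(unsigned long n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/*
 * Hypothetical stand-in for the kernel's get_order(): the smallest order
 * such that (page size << order) is at least size bytes.
 */
static unsigned int sketch_get_order(unsigned long size, unsigned int page_shift)
{
	unsigned int order = 0;

	while (((1UL << page_shift) << order) < size)
		order++;
	return order;
}

/* Spelling from the v3 patch: ilog2(SZ_64K >> PAGE_SHIFT). */
static unsigned int exec_order_ilog2(unsigned int page_shift)
{
	return sketch_ilog2(SZ_64K >> page_shift);
}

/* Spelling suggested in review: get_order(SZ_64K). */
static unsigned int exec_order_get_order(unsigned int page_shift)
{
	return sketch_get_order(SZ_64K, page_shift);
}
```

With a page shift of 12 (4K pages) both helpers return order 4, with 14 (16K
pages) both return order 2, and with 16 (64K pages) both return order 0, so
under these assumptions the choice between the two spellings is cosmetic.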
>
> Best Regards,
> Yan, Zi