Re: [PATCH v1 2/2] mm: mTHP stats for pagecache folio allocations

From: Ryan Roberts
Date: Tue Jul 16 2024 - 04:31:22 EST


On 13/07/2024 11:45, Ryan Roberts wrote:
> On 13/07/2024 02:08, David Hildenbrand wrote:
>> On 12.07.24 14:22, Lance Yang wrote:
>>> On Fri, Jul 12, 2024 at 11:00 AM Baolin Wang
>>> <baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>> On 2024/7/11 15:29, Ryan Roberts wrote:
>>>>> Expose 3 new mTHP stats for file (pagecache) folio allocations:
>>>>>
>>>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>>>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>>>>>     /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
>>>>>
>>>>> This will provide some insight on the sizes of large folios being
>>>>> allocated for file-backed memory, and how often allocation is failing.
>>>>>
>>>>> All non-order-0 (and most order-0) folio allocations are currently done
>>>>> through filemap_alloc_folio(), and folios are charged in a subsequent
>>>>> call to filemap_add_folio(). So count file_fallback when allocation
>>>>> fails in filemap_alloc_folio() and count file_alloc or
>>>>> file_fallback_charge in filemap_add_folio(), based on whether charging
>>>>> succeeded or not. There are some users of filemap_add_folio() that
>>>>> allocate their own order-0 folio by other means, so we would not count
>>>>> an allocation failure in this case, but we also don't care about order-0
>>>>> allocations. This approach feels like it should be good enough and
>>>>> doesn't require any (impractically large) refactoring.
>>>>>
>>>>> The existing mTHP stats interface is reused to provide consistency to
>>>>> users. And because we are reusing the same interface, we can reuse the
>>>>> same infrastructure on the kernel side. The one small wrinkle is that
>>>>> the set of folio sizes supported by the pagecache is not identical to
>>>>> those supported by anon and shmem; pagecache supports order-1, unlike
>>>>> anon and shmem, and the max pagecache order may be less than PMD-size
>>>>> (see arm64 with 64K base pages), again unlike anon and shmem. So we now
>>>>> create a hugepages-*kB directory for the union of the sizes supported by
>>>>> all 3 memory types and populate it with the relevant stats and controls.
>>>>
>>>> Personally, I like this idea, as it can help analyze the allocation of
>>>> large folios for the page cache.
>>>>
>>>> However, I have a slight concern about the consistency of the interface.
>>>>
>>>> For 64K, the field layout is:
>>>> ├── hugepages-64kB
>>>> │   ├── enabled
>>>> │   ├── shmem_enabled
>>>> │   └── stats
>>>> │       ├── anon_fault_alloc
>>>> │       ├── anon_fault_fallback
>>>> │       ├── anon_fault_fallback_charge
>>>> │       ├── file_alloc
>>>> │       ├── file_fallback
>>>> │       ├── file_fallback_charge
>>>> │       ├── shmem_alloc
>>>> │       ├── shmem_fallback
>>>> │       ├── shmem_fallback_charge
>>>> │       ├── split
>>>> │       ├── split_deferred
>>>> │       ├── split_failed
>>>> │       ├── swpout
>>>> │       └── swpout_fallback
>>>>
>>>> But for 8K (for pagecache), you removed some fields (of course, I
>>>> understand why they are not supported).
>>>>
>>>> ├── hugepages-8kB
>>>> │   └── stats
>>>> │       ├── file_alloc
>>>> │       ├── file_fallback
>>>> │       └── file_fallback_charge
>>>>
>>>> This might not be user-friendly for some user-space parsing tools, as
>>>> some directories matching the same pattern lack certain fields. Of course,
>>>> this might not be an issue if we have clear documentation describing the
>>>> differences here:)
>>>>
>>>> Another possible approach is to maintain the same field layout to keep
>>>> consistent, but prohibit writing to the fields that are not supported by
>>>> the pagecache, and any stats read from them would be 0.
>>>
>>> I agree that maintaining a uniform field layout, especially at the stats
>>> level, might be necessary ;)
>>>
>>> Keeping a consistent interface could future-proof the design. It allows
>>> for the possibility that features not currently supported for 8kB pages
>>> might be enabled in the future.
>>
>> I'll just note that, with shmem/file effectively being disabled for order > 11,
>> we'll also have entries there that are effectively unused.
>
> Indeed, I mentioned that in the commit log :)
>
>>
>> Good question how we want to deal with that (stats are easy, but what about when
>> we enable something? Maybe we should document that "enabled" is only effective
>> when supported).
>
> The documentation already says "If enabling multiple hugepage sizes, the kernel
> will select the most appropriate enabled size for a given allocation." for anon
> THP (and I've added similar wording for my as-yet-unposted patch to add controls
> for page cache folio sizes). So I think we could easily add dummy *enabled
> controls for all sizes, that can be written to and read back consistently, but
> the kernel just ignores them when deciding what size to use. It would also
> simplify the code that populates the controls.
>
> Personally though, I'm not convinced of the value of trying to make the controls
> for every size look identical. What's the real value to the user to pretend that
> they can select a size that they cannot? What happens when we inevitably want to
> add some new control in future which only applies to select sizes and there is
> no good way to fake it for the other sizes? Why can't user space just be
> expected to rely on the existence of the files rather than on the existence of
> the directories?
>
> As always, I'll go with the majority, but just wanted to register my opinion.

Should I assume from the lack of reply on this that everyone else is in favour
of adding dummy controls so that all sizes have the same set of controls? If I
don't hear anything further, I'll post v2 with dummy controls today or tomorrow.
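
For illustration only, here is a minimal sketch of what "rely on the existence
of the files" could look like for a user-space parser. It assumes the stats
live under hugepages-*kB/stats/ as described in the commit log; the helper
names read_mthp_stats() and get_stat() are hypothetical and not part of any
existing tool.

```python
from pathlib import Path

# Location of the per-size directories, as named in the commit log.
THP_ROOT = Path("/sys/kernel/mm/transparent_hugepage")


def read_mthp_stats(root=THP_ROOT):
    """Return {size_dir_name: {stat_name: int}} for the stat files that exist.

    Hypothetical helper: nothing is assumed about which counters a given
    size exposes; each size simply reports whatever files are present.
    """
    result = {}
    for size_dir in sorted(Path(root).glob("hugepages-*kB")):
        stats_dir = size_dir / "stats"
        if not stats_dir.is_dir():
            # A size directory without a stats subdirectory is skipped.
            continue
        stats = {}
        for stat_file in sorted(stats_dir.iterdir()):
            try:
                stats[stat_file.name] = int(stat_file.read_text().strip())
            except (OSError, ValueError):
                # Unreadable or non-integer entries are ignored.
                continue
        result[size_dir.name] = stats
    return result


def get_stat(all_stats, size, name):
    """Return the counter, or None if this size does not expose it."""
    return all_stats.get(size, {}).get(name)
```

With this shape, get_stat(stats, "hugepages-8kB", "anon_fault_alloc") returns
None when the file is absent, so the caller can tell "not supported for this
size" apart from a counter that is present and reads 0. With a uniform layout
that distinction would not be available from the stats alone.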

>
> Thanks,
> Ryan
>
>>
>> Hmmmmm
>>
>