Re: [PATCH v2 3/3] mm: mTHP stats for pagecache folio allocations

From: Ryan Roberts
Date: Wed Jul 24 2024 - 04:12:13 EST


On 23/07/2024 23:07, Barry Song wrote:
> On Fri, Jul 19, 2024 at 8:56 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>
>> On 19/07/2024 01:12, Barry Song wrote:
>>> On Wed, Jul 17, 2024 at 1:59 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> Expose 3 new mTHP stats for file (pagecache) folio allocations:
>>>>
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_alloc
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback
>>>> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/file_fallback_charge
>>>>
>>>> This will provide some insight on the sizes of large folios being
>>>> allocated for file-backed memory, and how often allocation is failing.
>>>>
>>>> All non-order-0 (and most order-0) folio allocations are currently done
>>>> through filemap_alloc_folio(), and folios are charged in a subsequent
>>>> call to filemap_add_folio(). So count file_fallback when allocation
>>>> fails in filemap_alloc_folio() and count file_alloc or
>>>> file_fallback_charge in filemap_add_folio(), based on whether charging
>>>> succeeded or not. There are some users of filemap_add_folio() that
>>>> allocate their own order-0 folio by other means, so we would not count
>>>> an allocation failure in this case, but we also don't care about order-0
>>>> allocations. This approach feels like it should be good enough and
>>>> doesn't require any (impractically large) refactoring.
>>>>
>>>> The existing mTHP stats interface is reused to provide consistency to
>>>> users. And because we are reusing the same interface, we can reuse the
>>>> same infrastructure on the kernel side.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
>>>> ---
>>>> Documentation/admin-guide/mm/transhuge.rst | 13 +++++++++++++
>>>> include/linux/huge_mm.h | 3 +++
>>>> include/linux/pagemap.h | 16 ++++++++++++++--
>>>> mm/filemap.c | 6 ++++--
>>>> mm/huge_memory.c | 7 +++++++
>>>> 5 files changed, 41 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
>>>> index 058485daf186..d4857e457add 100644
>>>> --- a/Documentation/admin-guide/mm/transhuge.rst
>>>> +++ b/Documentation/admin-guide/mm/transhuge.rst
>>>> @@ -512,6 +512,19 @@ shmem_fallback_charge
>>>>         falls back to using small pages even though the allocation was
>>>>         successful.
>>>>
>>>> +file_alloc
>>>> +       is incremented every time a file huge page is successfully
>>>> +       allocated.
>>>> +
>>>> +file_fallback
>>>> +       is incremented if a file huge page is attempted to be allocated
>>>> +       but fails and instead falls back to using small pages.
>>>> +
>>>> +file_fallback_charge
>>>> +       is incremented if a file huge page cannot be charged and instead
>>>> +       falls back to using small pages even though the allocation was
>>>> +       successful.
>>>> +
>>>
>>> I realized that when we talk about fallback, it doesn't necessarily mean
>>> small pages; it could also refer to smaller huge pages.
>>
>> Yes, good point. I'll update the documentation to reflect that for all memory types.
>>
>>>
>>> anon_fault_alloc
>>>         is incremented every time a huge page is successfully
>>>         allocated and charged to handle a page fault.
>>>
>>> anon_fault_fallback
>>>         is incremented if a page fault fails to allocate or charge
>>>         a huge page and instead falls back to using huge pages with
>>>         lower orders or small pages.
>>>
>>> anon_fault_fallback_charge
>>>         is incremented if a page fault fails to charge a huge page and
>>>         instead falls back to using huge pages with lower orders or
>>>         small pages even though the allocation was successful.
>>>
>>> This also applies to files, right?
>>
>> It does in the place you highlight below, but page_cache_ra_order() just falls
>> back immediately to order-0. Regardless, I think we should word the user-facing
>> docs to allow for falling back to a lower (but still non-zero) order. That
>> gives the implementation more flexibility.
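>>
>> (For reference, the tail of page_cache_ra_order() is roughly:
>>
>>         if (!err)
>>                 return;
>> fallback:
>>         do_page_cache_ra(ractl, ra->size, ra->async_size);
>>
>> i.e. any failure there drops straight back to the order-0 readahead path.)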
>>
>>>
>>> do {
>>>         gfp_t alloc_gfp = gfp;
>>>
>>>         err = -ENOMEM;
>>>         if (order > 0)
>>>                 alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
>>>         folio = filemap_alloc_folio(alloc_gfp, order);
>>>         if (!folio)
>>>                 continue;
>>>
>>>         /* Init accessed so avoid atomic mark_page_accessed later */
>>>         if (fgp_flags & FGP_ACCESSED)
>>>                 __folio_set_referenced(folio);
>>>
>>>         err = filemap_add_folio(mapping, folio, index, gfp);
>>>         if (!err)
>>>                 break;
>>>         folio_put(folio);
>>>         folio = NULL;
>>> } while (order-- > 0);
>>>
>>>
>>>> split
>>>>         is incremented every time a huge page is successfully split into
>>>>         smaller orders. This can happen for a variety of reasons but a
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index b8c63c3e967f..4f9109fcdded 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -123,6 +123,9 @@ enum mthp_stat_item {
>>>>         MTHP_STAT_SHMEM_ALLOC,
>>>>         MTHP_STAT_SHMEM_FALLBACK,
>>>>         MTHP_STAT_SHMEM_FALLBACK_CHARGE,
>>>> +       MTHP_STAT_FILE_ALLOC,
>>>> +       MTHP_STAT_FILE_FALLBACK,
>>>> +       MTHP_STAT_FILE_FALLBACK_CHARGE,
>>>>         MTHP_STAT_SPLIT,
>>>>         MTHP_STAT_SPLIT_FAILED,
>>>>         MTHP_STAT_SPLIT_DEFERRED,
>>>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>>>> index 6e2f72d03176..95a147b5d117 100644
>>>> --- a/include/linux/pagemap.h
>>>> +++ b/include/linux/pagemap.h
>>>> @@ -562,14 +562,26 @@ static inline void *detach_page_private(struct page *page)
>>>> }
>>>>
>>>> #ifdef CONFIG_NUMA
>>>> -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
>>>> +struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order);
>>>> #else
>>>> -static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>>>> +static inline struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>>>> {
>>>>         return folio_alloc_noprof(gfp, order);
>>>> }
>>>> #endif
>>>>
>>>> +static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>>>> +{
>>>> +       struct folio *folio;
>>>> +
>>>> +       folio = __filemap_alloc_folio_noprof(gfp, order);
>>>> +
>>>> +       if (!folio)
>>>> +               count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);
>>>> +
>>>> +       return folio;
>>>> +}
>>>
>>> Do we need to add and export __filemap_alloc_folio_noprof()?
>>
>> It is exported. See the below change in filemap.c. Previously
>> filemap_alloc_folio_noprof() was exported, but that is now an inline and
>> __filemap_alloc_folio_noprof() is exported in its place.
>>
>>> In any case,
>>> we won't call count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK) and
>>> will only allocate the folio instead?
>>
>> Sorry, I don't understand what you mean by this?
>
> Ryan, my question is whether exporting __filemap_alloc_folio_noprof() might
> lead to situations where people call __filemap_alloc_folio_noprof() and forget
> to call count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK). Currently, it seems
> that counting is always necessary?

OK, to make sure I've understood: I think you are saying that there is a risk
that people will call __filemap_alloc_folio_noprof() directly and bypass the
stat counting? But it's the same problem with all "_noprof" functions; if those
are called directly, they bypass the memory allocation profiling
infrastructure. So this just needs to be a case of "don't do that" IMHO.
filemap_alloc_folio() is the public API.
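
To spell out the layering from the patch (nothing new here, just restating the
pagemap.h change):

    /* The public API; routes through the allocation profiling hooks. */
    #define filemap_alloc_folio(...) \
            alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))

    /* Counts the fallback stat; callers shouldn't reach below this. */
    static inline struct folio *filemap_alloc_folio_noprof(gfp_t gfp,
                                                           unsigned int order)
    {
            struct folio *folio = __filemap_alloc_folio_noprof(gfp, order);

            if (!folio)
                    count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);

            return folio;
    }

So calling __filemap_alloc_folio_noprof() directly is off the supported path,
in the same way that calling any other double-underscore helper directly is.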

> Another option is that we still call count_mthp_stat() within
> filemap_alloc_folio_noprof() and make it noinline if
> __filemap_alloc_folio_noprof() is not inline?

I could certainly make filemap_alloc_folio_noprof() always out-of-line and then
handle the counting privately in the compilation unit. But before my change,
filemap_alloc_folio_noprof() was inline for the !CONFIG_NUMA case, and I was
trying not to change that. I think what you're suggesting would be tidier
though, so I'll make the change. But I don't think it solves the wider problem
that people can still call private APIs directly. That's what review is for.
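
Concretely, this untested sketch is what I have in mind for v3; the counting
moves out-of-line into mm/filemap.c and __filemap_alloc_folio_noprof() stops
being exported:

    struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
    {
            struct folio *folio;

            folio = __filemap_alloc_folio_noprof(gfp, order);
            if (!folio)
                    count_mthp_stat(order, MTHP_STAT_FILE_FALLBACK);

            return folio;
    }
    EXPORT_SYMBOL(filemap_alloc_folio_noprof);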

Thanks,
Ryan


>
>>
>>>
>>>> +
>>>> #define filemap_alloc_folio(...) \
>>>>         alloc_hooks(filemap_alloc_folio_noprof(__VA_ARGS__))
>>>>
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index 53d5d0410b51..131d514fca29 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -963,6 +963,8 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
>>>>         int ret;
>>>>
>>>>         ret = mem_cgroup_charge(folio, NULL, gfp);
>>>> +       count_mthp_stat(folio_order(folio),
>>>> +                       ret ? MTHP_STAT_FILE_FALLBACK_CHARGE : MTHP_STAT_FILE_ALLOC);
>>>>         if (ret)
>>>>                 return ret;
>>>
>>> Would the following be better?
>>>
>>> ret = mem_cgroup_charge(folio, NULL, gfp);
>>> if (ret) {
>>>         count_mthp_stat(folio_order(folio),
>>>                         MTHP_STAT_FILE_FALLBACK_CHARGE);
>>>         return ret;
>>> }
>>> count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);
>>>
>>> Anyway, it's up to you. The code just feels a bit off to me :-)
>>
>> Yes, agreed, your version is better. I also noticed that anon and shmem
>> increment FALLBACK whenever FALLBACK_CHARGE is incremented, so I should apply
>> the same semantics here.
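>>
>> i.e. something like this (sketch only), so that FALLBACK counts every
>> fallback, including the charge-failure case:
>>
>>         ret = mem_cgroup_charge(folio, NULL, gfp);
>>         if (ret) {
>>                 count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_FALLBACK);
>>                 count_mthp_stat(folio_order(folio),
>>                                 MTHP_STAT_FILE_FALLBACK_CHARGE);
>>                 return ret;
>>         }
>>         count_mthp_stat(folio_order(folio), MTHP_STAT_FILE_ALLOC);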
>>
>> Thanks,
>> Ryan
>>
>>
>>>
>>>>
>>>> @@ -990,7 +992,7 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio,
>>>> EXPORT_SYMBOL_GPL(filemap_add_folio);
>>>>
>>>> #ifdef CONFIG_NUMA
>>>> -struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>>>> +struct folio *__filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>>>> {
>>>>         int n;
>>>>         struct folio *folio;
>>>> @@ -1007,7 +1009,7 @@ struct folio *filemap_alloc_folio_noprof(gfp_t gfp, unsigned int order)
>>>>         }
>>>>         return folio_alloc_noprof(gfp, order);
>>>> }
>>>> -EXPORT_SYMBOL(filemap_alloc_folio_noprof);
>>>> +EXPORT_SYMBOL(__filemap_alloc_folio_noprof);
>>>> #endif
>>>>
>>>> /*
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 578ac212c172..26d558e3e80f 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -608,7 +608,14 @@ static struct attribute_group anon_stats_attr_grp = {
>>>>         .attrs = anon_stats_attrs,
>>>> };
>>>>
>>>> +DEFINE_MTHP_STAT_ATTR(file_alloc, MTHP_STAT_FILE_ALLOC);
>>>> +DEFINE_MTHP_STAT_ATTR(file_fallback, MTHP_STAT_FILE_FALLBACK);
>>>> +DEFINE_MTHP_STAT_ATTR(file_fallback_charge, MTHP_STAT_FILE_FALLBACK_CHARGE);
>>>> +
>>>> static struct attribute *file_stats_attrs[] = {
>>>> +       &file_alloc_attr.attr,
>>>> +       &file_fallback_attr.attr,
>>>> +       &file_fallback_charge_attr.attr,
>>>> #ifdef CONFIG_SHMEM
>>>>         &shmem_alloc_attr.attr,
>>>>         &shmem_fallback_attr.attr,
>>>> --
>>>> 2.43.0
>>>>
>>>
>
> Thanks
> Barry