Re: [PATCH v3 1/2] slub: introduce count_partial_free_approx()

From: Vlastimil Babka
Date: Mon Apr 22 2024 - 03:49:18 EST


On 4/20/24 2:18 AM, David Rientjes wrote:
> On Fri, 19 Apr 2024, Jianfeng Wang wrote:
>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 1bb2a93cf7b6..993cbbdd2b6c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3213,6 +3213,43 @@ static inline bool free_debug_processing(struct kmem_cache *s,
>>  #endif /* CONFIG_SLUB_DEBUG */
>>
>>  #if defined(CONFIG_SLUB_DEBUG) || defined(SLAB_SUPPORTS_SYSFS)
>> +#define MAX_PARTIAL_TO_SCAN 10000
>> +
>> +static unsigned long count_partial_free_approx(struct kmem_cache_node *n)
>> +{
>> +	unsigned long flags;
>> +	unsigned long x = 0;
>> +	struct slab *slab;
>> +
>> +	spin_lock_irqsave(&n->list_lock, flags);
>> +	if (n->nr_partial <= MAX_PARTIAL_TO_SCAN) {
>> +		list_for_each_entry(slab, &n->partial, slab_list)
>> +			x += slab->objects - slab->inuse;
>> +	} else {
>> +		/*
>> +		 * For a long list, approximate the total count of objects in
>> +		 * it to meet the limit on the number of slabs to scan.
>> +		 * Scan from both the list's head and tail for better accuracy.
>> +		 */
>> +		unsigned long scanned = 0;
>> +
>> +		list_for_each_entry(slab, &n->partial, slab_list) {
>> +			x += slab->objects - slab->inuse;
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN / 2)
>> +				break;
>> +		}
>> +		list_for_each_entry_reverse(slab, &n->partial, slab_list) {
>> +			x += slab->objects - slab->inuse;
>> +			if (++scanned == MAX_PARTIAL_TO_SCAN)
>> +				break;
>> +		}
>> +		x = mult_frac(x, n->nr_partial, scanned);
>> +		x = min(x, node_nr_objs(n));
>> +	}
>> +	spin_unlock_irqrestore(&n->list_lock, flags);
>> +	return x;
>> +}
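
Just as a reading aid: the fallback path samples at most MAX_PARTIAL_TO_SCAN
slabs, half from each end of the list, and then scales the sampled free count
by nr_partial / scanned. The scaling step is mult_frac(), which IIRC expands
to roughly the following (standalone sketch, names made up):

/*
 * Sketch of the extrapolation step: scale the free count seen in the
 * sample up to the whole partial list, splitting the division the way
 * mult_frac() does so the intermediate multiplication is less likely
 * to overflow.
 */
static unsigned long scale_sampled_free(unsigned long sampled_free,
					unsigned long nr_partial,
					unsigned long scanned)
{
	unsigned long q = sampled_free / scanned;
	unsigned long r = sampled_free % scanned;

	return q * nr_partial + r * nr_partial / scanned;
}

So with, say, 10000 slabs scanned out of a 1000000-entry list, the sampled
count is simply multiplied by 100, and the min() against node_nr_objs(n)
keeps an unlucky sample from reporting more free objects than the node can
possibly hold.
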
>
> Creative :)
>
> The default value of MAX_PARTIAL_TO_SCAN seems to work well in practice
> while being large enough to bias for actual values?
>
> I can't think of a better way to avoid the disruption that very long
> partial lists cause. If the actual value is needed, it will need to be
> read from the sysfs file for that slab cache.
>
> It does raise the question of whether we want to extend slabinfo to indicate
> that some fields are approximations, however. Adding a suffix such as
> " : approx" to a slab cache line may be helpful if the disparity in the
> estimates would actually make a difference in practice.

I'm afraid that changing the layout of /proc/slabinfo has a much higher
chance of breaking some consumer than the imprecision due to the
approximation does, so I would rather not risk it.
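
To illustrate the kind of consumer I have in mind (a made-up parser, not
taken from any real tool), anything that tokenizes a line and checks it
against the field count of the "slabinfo - version: 2.1" format would trip
over an appended marker:

#include <string.h>

/*
 * Hypothetical /proc/slabinfo consumer: split a data line into
 * whitespace-separated tokens and insist on the token count of the 2.1
 * format.  An appended " : approx" adds two tokens and makes a parser
 * like this reject the line.
 */
static int parse_slabinfo_line(char *line)
{
	int fields = 0;
	char *tok;

	for (tok = strtok(line, " \t\n"); tok; tok = strtok(NULL, " \t\n"))
		fields++;

	return fields == 16 ? 0 : -1;	/* 16 tokens per 2.1 data line, if I count right */
}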

> I have a hard time believing that this approximation will not be "close
> enough" for all practical purposes, given that the value could very well
> substantially change the instant after the iteration is done anyway.
>
> So for that reason, this sounds good to me!
>
> Acked-by: David Rientjes <rientjes@xxxxxxxxxx>
>
>> +
>>  static unsigned long count_partial(struct kmem_cache_node *n,
>>  					int (*get_count)(struct slab *))
>>  {
>> @@ -7089,7 +7126,7 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
>>  	for_each_kmem_cache_node(s, node, n) {
>>  		nr_slabs += node_nr_slabs(n);
>>  		nr_objs += node_nr_objs(n);
>> -		nr_free += count_partial(n, count_free);
>> +		nr_free += count_partial_free_approx(n);
>>  	}
>>
>>  	sinfo->active_objs = nr_objs - nr_free;
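
One thing worth noting: the active_objs column is derived directly from the
estimate right here, so whatever error the sample has carries through one to
one. A toy example with made-up numbers:

/*
 * Made-up numbers, purely illustrative: if the node really has 200000
 * free objects out of 1000000 but the sample extrapolates to 195000,
 * the reported active count is off by the same 5000.
 */
static unsigned long active_objs_estimate(void)
{
	unsigned long nr_objs = 1000000;
	unsigned long approx_free = 195000;

	return nr_objs - approx_free;	/* 805000 instead of the true 800000 */
}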