Re: [PATCH 3/3] debugobjects: Use hlist_cut_number() to optimize performance and improve readability

From: Leizhen (ThunderTown)
Date: Wed Sep 11 2024 - 03:44:46 EST

On 2024/9/10 19:44, Thomas Gleixner wrote:
> On Tue, Sep 10 2024 at 12:00, Leizhen wrote:
>> On 2024/9/10 2:41, Thomas Gleixner wrote:
>>> All related functions have this problem and all of this code is very
>>> strict about boundaries. Instead of accurately doing the refill, purge
>>> etc. we should look into proper batch mode mechanisms. Let me think
>>> about it.
>> It may be helpful to add several arrays to record the first node of each batch
>> in each free list. Take 'percpu_pool' as an example:
>>
>> struct debug_percpu_free {
>> 	struct hlist_head	free_objs;
>> 	int			obj_free;
>> +	int			batch_idx;
>> +	struct hlist_node	*batch_first[4]; // ODEBUG_POOL_PERCPU_SIZE / ODEBUG_BATCH_SIZE
>> };
>>
>> A new free node is added at the head of the list, and a batch is cut
>> from the tail of the list.
>> NodeA<-->...<-->NodeB<-->...<-->NodeC<-->NodeD<--> free_objs
>> |---one batch---|---one batch---|
>> |               |
>> batch_first[0]  batch_first[1]
> The current data structures are not fit for the purpose. Glueing
> workarounds into the existing mess makes it just worse.
>
> So the data structures need to be redesigned from ground up to be fit
> for the purpose.
>
> allocation:
>
> 1) Using the global pool for single object allocations is wrong
>
> During boot this can be a completely disconnected list, which does
> not need any accounting, does not need pool_lock and can be just
> protected with irqsave like the per CPU pools. It's effectively a
> per CPU pool because at that point there is only one CPU and
> everything is single threaded.
>
> 2) The per CPU pool handling is backwards
>
> If the per CPU pool is empty, then the pool needs to be refilled
> with a batch from the global pool and allocated from there.
>
> Allocation then always happens from the active per CPU batch slot.
>
> free:
>
> 1) Early boot
>
> Just put it back on the dedicated boot list and be done
>
> 2) After obj_cache is initialized
>
> Put it back to the per CPU pool into the active batch slot. If
> the slot becomes full then make the next slot the active slot. If
> the full slot was the topmost slot then move that slot either
> into the global pool when there is a free slot, or move it to the
> to_free pool.
>
> That means the per CPU pool is different from the global pools as it can
> allocate/free single objects, while the global pools are strictly stacks
> of batches. Any movement between per CPU pools and global pools is batch
> based and just moves lists from one head to another.
>
> That minimizes the pool lock contention and the cache foot print. The
> global to_free pool must have an extra twist to accommodate non-batch
> sized drops and to handle the all slots are full case, but that's just a
> trivial detail.

That's great. I really admire you for completing the refactor in such
a short time. But I have a few minor comments.
1. When kmem_cache_zalloc() is called to allocate objs for refilling,
   and fewer than one batch of objs is allocated, all of them can be
   pushed to the local CPU, that is, freed one by one with pcpu_free().
   In this way, the number of free objs cached by pool_global and
   pool_to_free is always an integer multiple of ODEBUG_BATCH_SIZE.
   (See the first sketch below this list.)
2. Member tot_cnt of struct global_pool can be deleted. Its value can
   be obtained simply and quickly as (slot_idx * ODEBUG_BATCH_SIZE),
   which avoids redundant maintenance. (Second sketch below.)
3. debug_objects_pool_min_level also needs to be adjusted accordingly,
   so that it expresses the minimum fill level as a number of batches
   rather than objects. (Third sketch below.)
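
For point 1, a rough sketch of what I mean for the refill path. Only
pcpu_free(), kmem_cache_zalloc(), obj_cache and ODEBUG_BATCH_SIZE are
names from your patch or the existing code; fill_pool_slab() and
push_batch_to_global() are made up here for illustration:

	static void fill_pool_slab(void)
	{
		struct debug_obj *new[ODEBUG_BATCH_SIZE];
		int cnt;

		for (cnt = 0; cnt < ODEBUG_BATCH_SIZE; cnt++) {
			new[cnt] = kmem_cache_zalloc(obj_cache, __GFP_HIGH | __GFP_NOWARN);
			if (!new[cnt])
				break;
		}

		if (cnt == ODEBUG_BATCH_SIZE) {
			/* A complete batch goes to the global pool as one unit */
			push_batch_to_global(new, cnt);
			return;
		}

		/* Partial batch: keep the strays on the local CPU */
		while (cnt--)
			pcpu_free(new[cnt]);
	}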
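
Point 2 in code form, assuming slot_idx counts the filled slots (the
helper name is mine):

	/* tot_cnt of struct global_pool can be derived instead of stored */
	static inline unsigned int global_pool_cnt(struct global_pool *pool)
	{
		return pool->slot_idx * ODEBUG_BATCH_SIZE;
	}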
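
And for point 3, something along these lines, if the min level keeps
being derived from the current object-based constant:

	/* Minimum fill level of the global pool, in batches, not objs */
	debug_objects_pool_min_level = ODEBUG_POOL_MIN_LEVEL / ODEBUG_BATCH_SIZE;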

>
> See the completely untested combo patch against tip core/debugobjects
> below.
>
> Thanks,
>
> tglx

--
Regards,
Zhen Lei