Re: [RFC v11 01/14] mm: page_frag: add a test module for page_frag

From: Yunsheng Lin
Date: Tue Jul 23 2024 - 09:19:22 EST


On 2024/7/22 1:34, Alexander Duyck wrote:
> On Fri, Jul 19, 2024 at 2:36 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>>
>> Basing on the lib/objpool.c, change it to something like a
>> ptrpool, so that we can utilize that to test the correctness
>> and performance of the page_frag.
>>
>> The testing is done by ensuring that the fragment allocated
>> from a frag_frag_cache instance is pushed into a ptrpool
>> instance in a kthread binded to a specified cpu, and a kthread
>> binded to a specified cpu will pop the fragment from the
>> ptrpool and free the fragment.
>>
>> We may refactor out the common part between objpool and ptrpool
>> if this ptrpool thing turns out to be helpful for other place.
>>
>> CC: Alexander Duyck <alexander.duyck@xxxxxxxxx>
>> Signed-off-by: Yunsheng Lin <linyunsheng@xxxxxxxxxx>
>> ---
>> mm/Kconfig.debug | 8 +
>> mm/Makefile | 1 +
>> mm/page_frag_test.c | 393 ++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 402 insertions(+)
>> create mode 100644 mm/page_frag_test.c
>
> I might have missed it somewhere. Is there any reason why this isn't
> in the selftests/mm/ directory? Seems like that would be a better fit
> for this.
>
>> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
>> index afc72fde0f03..1ebcd45f47d4 100644
>> --- a/mm/Kconfig.debug
>> +++ b/mm/Kconfig.debug
>> @@ -142,6 +142,14 @@ config DEBUG_PAGE_REF
>> kernel code. However the runtime performance overhead is virtually
>> nil until the tracepoints are actually enabled.
>>
>> +config DEBUG_PAGE_FRAG_TEST
>
> This isn't a "DEBUG" feature. This is a test feature.
>
>> + tristate "Test module for page_frag"
>> + default n
>> + depends on m && DEBUG_KERNEL
>
> I am not sure it is valid to have a tristate depend on being built as a module.

Perhaps I was copying the wrong pattern from TEST_OBJPOOL in lib/Kconfig.debug.
Perhaps mm/dmapool_test.c and DMAPOOL_TEST* *was more appropriate pattern
for test module for page_frag?

>
> I think if you can set it up as a selftest it will have broader use as
> you could compile it against any target kernel going forward and add
> it as a module rather than having to build it as a part of a debug
> kernel.

It seems tools/testing/selftests/mm/* are all about userspace testing
tool, and testing kernel module seems to be in the same directory with
the code to be tested?

>
>> + help
>> + This builds the "page_frag_test" module that is used to test the
>> + correctness and performance of page_frag's implementation.
>> +
>> config DEBUG_RODATA_TEST
>> bool "Testcase for the marking rodata read-only"

...

>> +
>> + /*
>> + * here we allocate percpu-slot & objs together in a single
>> + * allocation to make it more compact, taking advantage of
>> + * warm caches and TLB hits. in default vmalloc is used to
>> + * reduce the pressure of kernel slab system. as we know,
>> + * minimal size of vmalloc is one page since vmalloc would
>> + * always align the requested size to page size
>> + */
>> + if (gfp & GFP_ATOMIC)
>> + slot = kmalloc_node(size, gfp, cpu_to_node(i));
>> + else
>> + slot = __vmalloc_node(size, sizeof(void *), gfp,
>> + cpu_to_node(i),
>> + __builtin_return_address(0));
>
> When would anyone ever call this with atomic? This is just for your
> test isn't it?
>
>> + if (!slot)
>> + return -ENOMEM;
>> +
>> + memset(slot, 0, size);
>> + pool->cpu_slots[i] = slot;
>> +
>> + objpool_init_percpu_slot(pool, slot);
>> + }
>> +
>> + return 0;
>> +}

...

>> +/* release whole objpool forcely */
>> +static void objpool_free(struct objpool_head *pool)
>> +{
>> + if (!pool->cpu_slots)
>> + return;
>> +
>> + /* release percpu slots */
>> + objpool_fini_percpu_slots(pool);
>> +}
>> +
>
> Why add all this extra objpool overhead? This seems like overkill for
> what should be a simple test. Seems like you should just need a simple
> array located on one of your CPUs. I'm not sure what is with all the
> extra overhead being added here.

As mentioned in the commit log:
"We may refactor out the common part between objpool and ptrpool
if this ptrpool thing turns out to be helpful for other place."

The next thing I am trying to do is to use ptrpool to optimization
the pcp for mm subsystem. so I would rather not tailor the ptrpool
for page_frag_test, and it doesn't seem to affect the testing that
much.

>
>> +static struct objpool_head ptr_pool;
>> +static int nr_objs = 512;
>> +static atomic_t nthreads;
>> +static struct completion wait;
>> +static struct page_frag_cache test_frag;
>> +
>> +static int nr_test = 5120000;
>> +module_param(nr_test, int, 0);
>> +MODULE_PARM_DESC(nr_test, "number of iterations to test");
>> +
>> +static bool test_align;
>> +module_param(test_align, bool, 0);
>> +MODULE_PARM_DESC(test_align, "use align API for testing");
>> +
>> +static int test_alloc_len = 2048;
>> +module_param(test_alloc_len, int, 0);
>> +MODULE_PARM_DESC(test_alloc_len, "alloc len for testing");
>> +
>> +static int test_push_cpu;
>> +module_param(test_push_cpu, int, 0);
>> +MODULE_PARM_DESC(test_push_cpu, "test cpu for pushing fragment");
>> +
>> +static int test_pop_cpu;
>> +module_param(test_pop_cpu, int, 0);
>> +MODULE_PARM_DESC(test_pop_cpu, "test cpu for popping fragment");
>> +
>> +static int page_frag_pop_thread(void *arg)
>> +{
>> + struct objpool_head *pool = arg;
>> + int nr = nr_test;
>> +
>> + pr_info("page_frag pop test thread begins on cpu %d\n",
>> + smp_processor_id());
>> +
>> + while (nr > 0) {
>> + void *obj = objpool_pop(pool);
>> +
>> + if (obj) {
>> + nr--;
>> + page_frag_free(obj);
>> + } else {
>> + cond_resched();
>> + }
>> + }
>> +
>> + if (atomic_dec_and_test(&nthreads))
>> + complete(&wait);
>> +
>> + pr_info("page_frag pop test thread exits on cpu %d\n",
>> + smp_processor_id());
>> +
>> + return 0;
>> +}
>> +
>> +static int page_frag_push_thread(void *arg)
>> +{
>> + struct objpool_head *pool = arg;
>> + int nr = nr_test;
>> +
>> + pr_info("page_frag push test thread begins on cpu %d\n",
>> + smp_processor_id());
>> +
>> + while (nr > 0) {
>> + void *va;
>> + int ret;
>> +
>> + if (test_align) {
>> + va = page_frag_alloc_align(&test_frag, test_alloc_len,
>> + GFP_KERNEL, SMP_CACHE_BYTES);
>> +
>> + WARN_ONCE((unsigned long)va & (SMP_CACHE_BYTES - 1),
>> + "unaligned va returned\n");
>> + } else {
>> + va = page_frag_alloc(&test_frag, test_alloc_len, GFP_KERNEL);
>> + }
>> +
>> + if (!va)
>> + continue;
>> +
>> + ret = objpool_push(va, pool);
>> + if (ret) {
>> + page_frag_free(va);
>> + cond_resched();
>> + } else {
>> + nr--;
>> + }
>> + }
>> +
>> + pr_info("page_frag push test thread exits on cpu %d\n",
>> + smp_processor_id());
>> +
>> + if (atomic_dec_and_test(&nthreads))
>> + complete(&wait);
>> +
>> + return 0;
>> +}
>> +
>
> So looking over these functions they seem to overlook how the network
> stack works in many cases. One of the main motivations for the page
> frags approach is page recycling. For example with GRO enabled the
> headers allocated to record the frags might be freed for all but the
> first. As such you can end up with 17 fragments being allocated, and
> 16 freed within the same thread as NAPI will just be recycling the
> buffers.
>
> With this setup it doesn't seem very likely to be triggered since you
> are operating in two threads. One test you might want to look at
> adding is a test where you are allocating and freeing in the same
> thread at a fairly constant rate to test against the "ideal" scenario.

I am not sure if the above is still the "ideal" scenario, as you mentioned
that most drivers are turning to use page_pool for rx, the page frag is really
mostly for tx or skb->data for rx.

>