Re: [PATCH] x86: Run checksumming in parallel across multiple ALUs

From: Neil Horman
Date: Fri Oct 18 2013 - 12:51:35 EST


>
> Your benchmark uses a single 4K page, so data is _super_ hot in cpu
> caches.
> ( prefetch should give no speedups, I am surprised it makes any
> difference)
>
> Try now with 32 huge pages, to get 64 MBytes of working set.
>
> Because in reality we never csum_partial() data in cpu cache.
> (Unless the NIC preloaded the data into cpu cache before sending the
> interrupt)
>
> Really, if Sebastien got a speed up, it means that something fishy was
> going on, like :
>
> - A copy of data into some area of memory, prefilling cpu caches
> - csum_partial() done while data is hot in cache.
>
> This is exactly a "should not happen" scenario, because the csum in this
> case should happen _while_ doing the copy, for 0 ns.
>
>
>
>


So, I took your suggestion and modified my test module to allocate 32 huge
pages instead of a single 4k page. The module code and the results are
included below. Contrary to your assertion above, the results show the same
pattern as in my first run. See below:

base results (each line is the total ns for 100000 iterations; AVG is ns per iteration):
80381491
85279536
99537729
80398029
121385411
109478429
85369632
99242786
80250395
98170542

AVG=939 ns

prefetch only results:
86803812
101891541
85762713
95866956
102316712
93529111
90473728
79374183
93744053
90075501

AVG=919 ns

parallel only results:
68994797
63503221
64298412
63784256
75350022
66398821
77776050
79158271
91006098
67822318

AVG=718 ns

both prefetch and parallel results:
68852213
77536525
63963560
67255913
76169867
80418081
63485088
62386262
75533808
57731705

AVG=693 ns
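
For anyone wanting to check the AVG figures: each raw number is the total
time in ns for one run of 100000 csum_partial() calls, and each AVG is the
mean of the ten runs divided by 100000. A minimal user-space sketch of that
arithmetic follows. The helper names (avg_ns_per_call, reduction_pct) and the
array names are made up for illustration; only the numbers come from the runs
above.

```c
#include <stddef.h>
#include <stdint.h>

#define RUNS       10
#define ITERATIONS 100000

/* Per-run totals in ns, copied from the results above. */
static const uint64_t base_runs[RUNS] = {
	80381491, 85279536, 99537729, 80398029, 121385411,
	109478429, 85369632, 99242786, 80250395, 98170542,
};
static const uint64_t prefetch_runs[RUNS] = {
	86803812, 101891541, 85762713, 95866956, 102316712,
	93529111, 90473728, 79374183, 93744053, 90075501,
};
static const uint64_t parallel_runs[RUNS] = {
	68994797, 63503221, 64298412, 63784256, 75350022,
	66398821, 77776050, 79158271, 91006098, 67822318,
};
static const uint64_t both_runs[RUNS] = {
	68852213, 77536525, 63963560, 67255913, 76169867,
	80418081, 63485088, 62386262, 75533808, 57731705,
};

/* Mean ns per csum_partial() call; integer division truncates. */
static uint64_t avg_ns_per_call(const uint64_t totals[], size_t runs)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < runs; i++)
		sum += totals[i];
	return sum / runs / ITERATIONS;
}

/* Percentage by which 'ns' is lower than 'base_ns' (negative if slower). */
static double reduction_pct(uint64_t base_ns, uint64_t ns)
{
	return 100.0 * ((double)base_ns - (double)ns) / (double)base_ns;
}
```

Fed the four result sets, avg_ns_per_call() gives 939, 919, 718 and 693.
reduction_pct(939, 919) is roughly 2.1, while reduction_pct(939, 718) is
roughly 23.5 and reduction_pct(939, 693) roughly 26.2.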


So based on these, it seems that your assertion that prefetching is the key to
the speedup here isn't quite correct. Either that or the testing continues to be
invalid. I'm going to try some of Ingo's microbenchmarking just to see if
that provides any further details. Any other thoughts about what might be
going awry are appreciated.
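
As a sanity check on the sizes involved (32 huge pages, 64 MBytes, 4K
chunks), here is a small user-space sketch of the arithmetic. It assumes 4 KB
base pages and 2 MB huge pages as on x86-64. The macro and helper names are
hypothetical and are not kernel APIs.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE 4096ULL       /* assumption: 4 KB base pages */
#define HUGE_PAGE_SIZE (2ULL << 20)  /* assumption: 2 MB huge pages */

/* Bytes covered by an allocation of 2^order contiguous base pages. */
static uint64_t order_to_bytes(unsigned int order)
{
	return BASE_PAGE_SIZE << order;
}

/* Bytes covered by 'count' huge pages. */
static uint64_t huge_pages_to_bytes(unsigned int count)
{
	return (uint64_t)count * HUGE_PAGE_SIZE;
}

/* Number of distinct base-page-sized chunks in a working set. */
static uint64_t chunks_in(uint64_t working_set_bytes)
{
	return working_set_bytes / BASE_PAGE_SIZE;
}
```

huge_pages_to_bytes(32) is 64 MB, i.e. chunks_in() reports 16384 distinct 4K
chunks, so 100000 iterations would wrap around such a working set about six
times. For comparison, order_to_bytes(4) is only 64 KB, since alloc_pages()
with order n returns 2^n contiguous base pages.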

My module code:



#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/preempt.h>
#include <linux/time.h>
#include <net/checksum.h>

static char *buf;

#define BUFSIZ_ORDER 4
/* Intended working set: (2 << 4) = 32 chunks of 2 MB each = 64 MB */
#define BUFSIZ ((2 << BUFSIZ_ORDER) * (1024*1024*2))

static int __init csum_init_module(void)
{
	int i;
	__wsum sum = 0;
	struct timespec start, end, delta;
	u64 time;
	struct page *page;
	u32 offset = 0;

	/*
	 * Note: order BUFSIZ_ORDER yields 2^4 = 16 contiguous base pages
	 * (64 KB with 4 KB pages), which is smaller than BUFSIZ, so the
	 * loop below steps beyond the pages allocated here.
	 */
	page = alloc_pages((GFP_TRANSHUGE & ~__GFP_MOVABLE), BUFSIZ_ORDER);
	if (!page) {
		printk(KERN_CRIT "NO MEMORY FOR ALLOCATION\n");
		return -ENOMEM;
	}
	buf = page_address(page);

	printk(KERN_CRIT "INITIALIZING BUFFER\n");

	preempt_disable();
	printk(KERN_CRIT "STARTING ITERATIONS\n");
	getnstimeofday(&start);

	/* Checksum one PAGE_SIZE chunk per iteration, stepping through BUFSIZ */
	for (i = 0; i < 100000; i++) {
		sum = csum_partial(buf + offset, PAGE_SIZE, sum);
		offset = (offset < BUFSIZ - PAGE_SIZE) ? offset + PAGE_SIZE : 0;
	}
	getnstimeofday(&end);
	preempt_enable();

	/* Elapsed time has to include tv_sec, not only the tv_nsec fields */
	delta = timespec_sub(end, start);
	time = timespec_to_ns(&delta);

	printk(KERN_CRIT "COMPLETED 100000 iterations of csum in %llu nanosec\n",
	       (unsigned long long)time);
	__free_pages(page, BUFSIZ_ORDER);
	return 0;
}

static void __exit csum_cleanup_module(void)
{
}

module_init(csum_init_module);
module_exit(csum_cleanup_module);
MODULE_LICENSE("GPL");
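
One note on the timing step: the difference between two timestamps has to
account for tv_sec as well as tv_nsec, otherwise a run that crosses a second
boundary reports a bogus value. A minimal user-space sketch of the
computation, assuming the POSIX struct timespec from <time.h>. elapsed_ns and
NSEC_PER_SEC_LL are hypothetical names used only for this illustration.

```c
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC_LL 1000000000LL

/* Nanoseconds from *start to *end; negative if end precedes start. */
static int64_t elapsed_ns(const struct timespec *start,
			  const struct timespec *end)
{
	int64_t sec  = (int64_t)end->tv_sec - (int64_t)start->tv_sec;
	int64_t nsec = (int64_t)end->tv_nsec - (int64_t)start->tv_nsec;

	/* nsec may be negative here; the seconds term compensates for it. */
	return sec * NSEC_PER_SEC_LL + nsec;
}
```

For example, a run starting at 1.9 s and ending at 2.1 s has a smaller
tv_nsec at the end than at the start, yet the elapsed time is still
200000000 ns.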