[PATCH bpf-next 2/2] use prefetch function in bpf_map_lookup_batch()
From: Sagarika Sharma
Date: Tue Aug 16 2022 - 11:36:21 EST

This patch adds a module parameter, n_prefetch, that enables
prefetching within bpf_map_lookup_batch() for faster lookups of
per-cpu hash maps. A value of 0 (the default) disables prefetching.
The benefit depends on the platform, the relative density of the
map, and the setting of the module parameter, as described below.

In a per-cpu bpf map, each possible cpu has its own value for a
given key. This patch's change is as follows: while these values
are copied for userspace in bpf_map_lookup_batch(), the value for
the cpu n_prefetch ahead of the one being copied is prefetched.

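The look-ahead pattern can be sketched in userspace as follows. This
is only an illustration, not the kernel code: copy_with_prefetch(),
vals, and n_slots are hypothetical names, each vals[i] stands in for
the value slot of cpu i, and __builtin_prefetch() (a GCC/Clang
builtin) stands in for the kernel's prefetch(). The sketch only
prefetches slots that exist.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of the per-cpu copy loop.
 * vals[i] plays the role of the value slot of cpu i; n_prefetch is
 * the look-ahead distance, and 0 disables prefetching.
 */
static void copy_with_prefetch(char *dst, char *const *vals, size_t n_slots,
                               size_t size, size_t n_prefetch)
{
    size_t off = 0;

    assert(dst != NULL && vals != NULL);

    for (size_t i = 0; i < n_slots; i++) {
        /* Hint the slot n_prefetch ahead, but only if it exists. */
        if (n_prefetch > 0 && i + n_prefetch < n_slots)
            __builtin_prefetch(vals[i + n_prefetch]);

        /* Copy the current slot into the flat output buffer. */
        memcpy(dst + off, vals[i], size);
        off += size;
    }
}
```

The prefetch is only a hint, so the copied data is identical for any
value of n_prefetch; only the timing changes.
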
MEASUREMENTS:

The benchmark test added in this patch series was used to measure
the effect of prefetching and to determine the optimal setting of
n_prefetch. The test was run on many different platforms (with
varying numbers of cpus), with a range of n_prefetch settings, and
with saturated, dense, and sparse maps (density =
num_entries/capacity_of_map).

The benchmark test measures the average time for a single entry
lookup (t = total_time/num_entries_looked_up) for each combination
of the factors above. The overhead of the bpf_map_lookup_batch
syscall introduces some error.

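For reference, the reported ns/op figure is the total elapsed time
divided by the number of entries looked up. A minimal sketch, where
ns_per_lookup() is a hypothetical helper and not part of the
benchmark in this series:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average nanoseconds per looked-up entry. */
static double ns_per_lookup(uint64_t total_time_ns,
                            uint64_t num_entries_looked_up)
{
    assert(num_entries_looked_up > 0);
    return (double)total_time_ns / (double)num_entries_looked_up;
}
```
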
Here are the experimental results:
amd machine with 256 cores (rome zen 2)

Density of map    n_prefetch    single entry lookup time (ns/op)
--------------------------------------------------------------------
40k / 40k          0            16176.471
                   1            13095.238
                   5             7432.432
                  12             5188.679
                  20             9482.759
10k / 40k          0            13253.012
                   5             7482.993
                  12             5164.319
                  20             9649.123
2.5k / 40k         0             7394.958
                   5             7201.309
                  13             4721.030
                  20             8118.081

For denser maps, the experiments suggest that as n_prefetch
increases, there is a significant time benefit (~66% decrease)
until a certain point after which the time benefit begins to
decrease. For sparser maps, there is a less pronounced speedup
from prefetching. Additionally, this experiment seems to suggest
the optimal n_prefetch range on this particular machine is 12-13,
but a setting of n_prefetch = 5 can still improve the single
entry lookup time.
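
The relative improvement can be recomputed from the table above. A
small sketch, where percent_decrease() is a hypothetical helper used
only for this arithmetic:

```c
#include <assert.h>

/* Hypothetical helper: percent reduction in lookup time vs. baseline. */
static double percent_decrease(double baseline_ns, double prefetch_ns)
{
    assert(baseline_ns > 0.0);
    return 100.0 * (baseline_ns - prefetch_ns) / baseline_ns;
}
```

With the amd numbers, percent_decrease(16176.471, 5188.679) is about
67.9 for the saturated map (n_prefetch = 12),
percent_decrease(13253.012, 5164.319) is about 61.0 for the
10k / 40k map (n_prefetch = 12), and
percent_decrease(7394.958, 4721.030) is about 36.2 for the
2.5k / 40k map (n_prefetch = 13).
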
intel-skylake (with 112 cores)

Density of map    n_prefetch    single entry lookup time (ns/op)
------------------------------------------------------------------
40k / 40k          0             5729.167
                   1             5092.593
                   5             3395.062
                  20             6875.000
10k / 40k          0             2029.520
                   5             2989.130
                  20             5820.106
2.5k / 40k         0             1598.256
                   5             2935.290
                  20             4867.257

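For this particular machine, the experimental results suggest that
prefetching gives a significant benefit only with the saturated map
(40k / 40k). With the sparser maps, both n_prefetch = 5 and
n_prefetch = 20 increased the single entry lookup time.
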
Prefetching within bpf_map_lookup_batch can provide significant
benefit depending on the use case. Across the many different
platforms experiments were performed on, a setting of n_prefetch = 5,
although not the optimal setting, significantly decreased the single
entry lookup time for denser maps.
Signed-off-by: Sagarika Sharma <sharmasagarika@xxxxxxxxxx>
---
kernel/bpf/hashtab.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 8392f7f8a8ac..eb70c4bbe246 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -15,6 +15,9 @@
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
 
+static uint n_prefetch;
+module_param(n_prefetch, uint, 0644);
+
 #define HTAB_CREATE_FLAG_MASK						\
 	(BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |	\
 	 BPF_F_ACCESS_MASK | BPF_F_ZERO_SEED)
@@ -1743,9 +1746,13 @@ __htab_map_lookup_and_delete_batch(struct bpf_map *map,
 		if (is_percpu) {
 			int off = 0, cpu;
 			void __percpu *pptr;
+			int num_cpus = num_possible_cpus();
 
 			pptr = htab_elem_get_ptr(l, map->key_size);
 			for_each_possible_cpu(cpu) {
+				if (n_prefetch > 0 && (cpu + n_prefetch) < num_cpus)
+					prefetch(per_cpu_ptr(pptr, cpu + n_prefetch));
+
 				bpf_long_memcpy(dst_val + off,
 						per_cpu_ptr(pptr, cpu), size);
 				off += size;
--
2.37.1.595.g718a3a8f04-goog