Re: [PATCH bpf-next v8 06/34] bpf: prepare for memcg-based memory accounting for bpf maps

From: Daniel Borkmann
Date: Wed Nov 25 2020 - 19:22:02 EST


On 11/25/20 4:00 AM, Roman Gushchin wrote:
In the vast majority of cases, if a process makes a kernel
allocation, its memory cgroup is charged.

Bpf maps can be updated from an interrupt context, and in such a
case there is no process which can be charged. This makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commit 4127c6504f25 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and b87d8cefe43c
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To do it, a pointer to the memory cgroup of the process, which created
the map, is saved, and this cgroup can be charged for all allocations
made from an interrupt context. This commit introduces 2 helpers:
bpf_map_kmalloc_node() and bpf_map_alloc_percpu(). They can be used in
the bpf code for accounted memory allocations, both in the process and
interrupt contexts. In the interrupt context they're using the saved
memory cgroup, otherwise the current cgroup is getting charged.

Signed-off-by: Roman Gushchin <guro@xxxxxx>

Thanks for updating the cover letter. Replying in this series instead,
on one more item that came to mind:

[...]
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f3fe9f53f93c..4154c616788c 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -31,6 +31,8 @@
#include <linux/poll.h>
#include <linux/bpf-netns.h>
#include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
#define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
(map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \
@@ -456,6 +458,77 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
__release(&map_idr_lock);
}
+#ifdef CONFIG_MEMCG_KMEM
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+ map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+ mem_cgroup_put(map->memcg);
+}
+
+void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
+ int node)
+{
+ struct mem_cgroup *old_memcg;
+ bool in_interrupt;
+ void *ptr;
+
+ /*
+ * If the memory allocation is performed from an interrupt context,
+ * the memory cgroup to charge can't be determined from the context
+ * of the current task. Instead, we charge the memory cgroup that
+ * contained the process which created the map.
+ */
+ in_interrupt = in_interrupt();
+ if (in_interrupt)
+ old_memcg = set_active_memcg(map->memcg);
+
+ ptr = kmalloc_node(size, flags, node);
+
+ if (in_interrupt)
+ set_active_memcg(old_memcg);
+
+ return ptr;
+}
+
+void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
+ size_t align, gfp_t gfp)
+{
+ struct mem_cgroup *old_memcg;
+ bool in_interrupt;
+ void *ptr;
+
+ /*
+ * If the memory allocation is performed from an interrupt context,
+ * the memory cgroup to charge can't be determined from the context
+ * of the current task. Instead, we charge the memory cgroup that
+ * contained the process which created the map.
+ */
+ in_interrupt = in_interrupt();
+ if (in_interrupt)
+ old_memcg = set_active_memcg(map->memcg);
+
+ ptr = __alloc_percpu_gfp(size, align, gfp);
+
+ if (in_interrupt)
+ set_active_memcg(old_memcg);

For this one and bpf_map_kmalloc_node() above, wouldn't it make more sense to
set the temporary active memcg unconditionally?

old_memcg = set_active_memcg(map->memcg);
ptr = kmalloc_node(size, flags, node);
set_active_memcg(old_memcg);

I think the semantics are otherwise a bit weird and the charging unpredictable;
this way it would /always/ be accounted against the memcg of the process that
originally created the map.

E.g. maps could be shared between progs attached to, say, XDP/tc, where in_interrupt()
holds true, and progs attached to skb-cgroup/egress, where we're still in process
context. So some part of the memory is charged against the original map's memcg and
some other part against the current process' memcg, which seems odd, no? Or, for example,
if we start to run a tracing BPF prog which updates state in a BPF map ... that tracing
prog now interferes with processes in other memcgs, which may not be intentional & could
lead to potential failures there, as opposed to when the tracing prog is not run. My concern
is that the semantics are not quite clear and the behavior is unpredictable compared to always
charging against map->memcg.
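
To make the split concrete, here is a minimal userspace model of the two
variants. It is only a sketch, not kernel code: struct model_memcg,
model_set_active(), model_charge() and the two model_alloc_*() functions are
hypothetical stand-ins that merely mirror the save/restore calling convention
of set_active_memcg() used in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mem_cgroup: just a byte counter. */
struct model_memcg {
	size_t charged;
};

/* Hypothetical stand-in for the "active memcg" override. */
static struct model_memcg *model_active;

/* Install a new override and return the previous one. */
static struct model_memcg *model_set_active(struct model_memcg *memcg)
{
	struct model_memcg *old = model_active;

	model_active = memcg;
	return old;
}

/* Charge the override if one is set, otherwise the caller's memcg. */
static void model_charge(struct model_memcg *current_memcg, size_t size)
{
	struct model_memcg *target = model_active ? model_active : current_memcg;

	target->charged += size;
}

/* Variant as in the patch: override only when in interrupt context. */
static void model_alloc_conditional(struct model_memcg *map_memcg,
				    struct model_memcg *current_memcg,
				    bool in_irq, size_t size)
{
	struct model_memcg *old = NULL;

	if (in_irq)
		old = model_set_active(map_memcg);

	model_charge(current_memcg, size);

	if (in_irq)
		model_set_active(old);
}

/* Suggested variant: always charge the memcg saved at map creation. */
static void model_alloc_unconditional(struct model_memcg *map_memcg,
				      struct model_memcg *current_memcg,
				      size_t size)
{
	struct model_memcg *old = model_set_active(map_memcg);

	model_charge(current_memcg, size);
	model_set_active(old);
}
```

In this model, one update from interrupt context plus one from process
context splits the charge between the two memcgs in the conditional variant,
while the unconditional variant puts everything on the map creator's memcg.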

Similarly, what if an orchestration prog creates dedicated memcg(s) for maps with
individual limits ... the assumed behavior (imho) would be that whatever memory is
accounted to the map can be accurately retrieved from there, and limits enforced
likewise, no? It seems that would not be the case currently.

Thoughts?

+ return ptr;
+}
+
+#else
+static void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+#endif
+
/* called from workqueue */
static void bpf_map_free_deferred(struct work_struct *work)
{
@@ -464,6 +537,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
bpf_map_charge_move(&mem, &map->memory);
security_bpf_map_free(map);
+ bpf_map_release_memcg(map);
/* implementation dependent freeing */
map->ops->map_free(map);
bpf_map_charge_finish(&mem);
@@ -875,6 +949,8 @@ static int map_create(union bpf_attr *attr)
if (err)
goto free_map_sec;
+ bpf_map_save_memcg(map);
+
err = bpf_map_new_fd(map, f_flags);
if (err < 0) {
/* failed to allocate fd.