Re: [PATCH 3/3] bpf: Implement prealloc for task_local_storage

From: Martin KaFai Lau
Date: Fri Oct 22 2021 - 18:48:04 EST


On Wed, Oct 20, 2021 at 10:18:13AM -1000, Tejun Heo wrote:
> From 5e3ad0d4a0b0732e7ebe035582d282ab752397ed Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@xxxxxxxxxx>
> Date: Wed, 20 Oct 2021 08:56:53 -1000
>
> task_local_storage currently does not support pre-allocation and the memory
> is allocated on demand using the GFP_ATOMIC mask. While atomic allocations
> succeed most of the time and the occasional failures aren't a problem for
> many use-cases, there are some which can benefit from reliable allocations -
> e.g. tracking acquisitions and releases of specific resources to root cause
> long-term reference leaks.
>
> Prealloc semantics for task_local_storage:
>
> * When a prealloc map is created, the map's elements for all existing tasks
> are allocated.
>
> * Afterwards, whenever a new task is forked, it automatically allocates the
> elements for the existing preallocated maps.
>
> To synchronize against concurrent forks, CONFIG_BPF_SYSCALL now enables
> CONFIG_THREADGROUP_RWSEM and prealloc task_local_storage creation path
> write-locks threadgroup_rwsem, and the rest of the implementation is
> straight-forward.

[ ... ]

> +static int task_storage_map_populate(struct bpf_local_storage_map *smap)
> +{
> + struct bpf_local_storage *storage = NULL;
> + struct bpf_local_storage_elem *selem = NULL;
> + struct task_struct *p, *g;
> + int err = 0;
> +
> + lockdep_assert_held(&threadgroup_rwsem);
> +retry:
> + if (!storage)
> + storage = bpf_map_kzalloc(&smap->map, sizeof(*storage),
> + GFP_USER);
> + if (!selem)
> + selem = bpf_map_kzalloc(&smap->map, smap->elem_size, GFP_USER);
> + if (!storage || !selem) {
> + err = -ENOMEM;
> + goto out_free;
> + }
> +
> + rcu_read_lock();
> + bpf_task_storage_lock();
> +
> + for_each_process_thread(g, p) {
I am thinking if this loop can be done in bpf iter.

If the bpf_local_storage_map is sleepable safe (not yet done but there is
an earlier attempt [0]), bpf_local_storage_update() should be able to
alloc without GFP_ATOMIC by sleepable bpf prog and this potentially
will be useful in general for other sleepable use cases.

For example, if a sleepable bpf iter prog can run in this loop (or the existing
bpf task iter loop is as good?), the iter bpf prog can call
bpf_task_storage_get(BPF_SK_STORAGE_GET_F_CREATE) on a sleepable
bpf_local_storage_map.

This pre-alloc then can be done similarly on the tcp/udp socket side
by running a bpf prog at the existing bpf tcp/udp iter.

[0]: https://lore.kernel.org/bpf/20210826235127.303505-1-kpsingh@xxxxxxxxxx/

> + struct bpf_local_storage_data *sdata;
> +
> + /* Try inserting with atomic allocations. On failure, retry with
> + * the preallocated ones.
> + */
> + sdata = bpf_local_storage_update(p, smap, NULL, BPF_NOEXIST);
> +
> + if (PTR_ERR(sdata) == -ENOMEM && storage && selem) {
> + sdata = __bpf_local_storage_update(p, smap, NULL,
> + BPF_NOEXIST,
> + &storage, &selem);
> + }
> +
> + /* Check -EEXIST before need_resched() to guarantee forward
> + * progress.
> + */
> + if (PTR_ERR(sdata) == -EEXIST)
> + continue;
> +
> + /* If requested or alloc failed, take a breather and loop back
> + * to preallocate.
> + */
> + if (need_resched() ||
> + PTR_ERR(sdata) == -EAGAIN || PTR_ERR(sdata) == -ENOMEM) {
> + bpf_task_storage_unlock();
> + rcu_read_unlock();
> + cond_resched();
> + goto retry;
> + }
> +
> + if (IS_ERR(sdata)) {
> + err = PTR_ERR(sdata);
> + goto out_unlock;
> + }
> + }
> +out_unlock:
> + bpf_task_storage_unlock();
> + rcu_read_unlock();
> +out_free:
> + if (storage)
> + kfree(storage);
> + if (selem)
> + kfree(selem);
> + return err;
> +}

[ ... ]

> +int bpf_task_storage_fork(struct task_struct *task)
> +{
> + struct bpf_local_storage_map *smap;
> +
> + percpu_rwsem_assert_held(&threadgroup_rwsem);
> +
> + list_for_each_entry(smap, &prealloc_smaps, prealloc_node) {
Mostly a comment here from the networking side, I suspect the common use case
is going to be more selective based on different protocol (tcp or udp)
and even port. There is some existing bpf hooks during inet_sock creation
time, bind time ...etc. The bpf prog can be selective on what bpf_sk_storage
it needs by inspecting different fields of a sk.

e.g. in inet_create(), there is BPF_CGROUP_RUN_PROG_INET_SOCK(sk).
Would a similar hook be useful on the fork side?

> + struct bpf_local_storage *storage;
> + struct bpf_local_storage_elem *selem;
> + struct bpf_local_storage_data *sdata;
> +
> + storage = bpf_map_kzalloc(&smap->map, sizeof(*storage),
> + GFP_USER);
> + selem = bpf_map_kzalloc(&smap->map, smap->elem_size, GFP_USER);
> +
> + rcu_read_lock();
> + bpf_task_storage_lock();
> + sdata = __bpf_local_storage_update(task, smap, NULL, BPF_NOEXIST,
> + &storage, &selem);
> + bpf_task_storage_unlock();
> + rcu_read_unlock();
> +
> + if (storage)
> + kfree(storage);
> + if (selem)
> + kfree(selem);
> +
> + if (IS_ERR(sdata)) {
> + bpf_task_storage_free(task);
> + return PTR_ERR(sdata);
> + }
> + }
> +
> + return 0;
> +}