Re: [RFC patch 14/19] bpf: Use migrate_disable() in hashtab code

From: Mathieu Desnoyers
Date: Fri Feb 14 2020 - 14:11:35 EST


On 14-Feb-2020 02:39:31 PM, Thomas Gleixner wrote:
> The required protection is that the caller cannot be migrated to a
> different CPU as these places take either a hash bucket lock or might
> trigger a kprobe inside the memory allocator. Both scenarios can lead to
> deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
> variable which temporarily blocks the invocation of BPF programs from perf
> and kprobes.
>
> Replace the preempt_disable/enable() pairs with migrate_disable/enable()
> pairs to prepare BPF to work on PREEMPT_RT enabled kernels. On a non-RT
> kernel this maps to preempt_disable/enable(), i.e. no functional change.

Will that _really_ work on RT?

I'm puzzled about what will happen in the following scenario on RT:

Thread A is preempted within e.g. htab_elem_free_rcu, and Thread B is
scheduled and runs through a bunch of tracepoints. Both are on the
same CPU's runqueue:

CPU 1

Thread A is scheduled
(Thread A) htab_elem_free_rcu()
(Thread A) migrate disable
(Thread A) __this_cpu_inc(bpf_prog_active); -> per-cpu variable for
deadlock prevention.
Thread A is preempted
Thread B is scheduled
(Thread B) Runs through various tracepoints:
trace_call_bpf()
if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
-> will skip any instrumentation that happens to be on
this CPU until...
Thread B is preempted
Thread A is scheduled
(Thread A) __this_cpu_dec(bpf_prog_active);
(Thread A) migrate enable

Having all those events randomly and silently discarded might be quite
unexpected from a user standpoint. This turns the deadlock-prevention
mechanism into a random tracepoint-dropping facility, which is
unsettling. One alternative approach to consider here would be to make
this deadlock-prevention nesting counter per-thread rather than
per-CPU.

Also, I don't think using __this_cpu_inc() is safe without disabling
preemption or interrupts: it is a non-atomic read-modify-write, so a
preemption in the middle of it can lose a concurrent update. You'll
probably want to move to this_cpu_inc/dec instead, which can be
heavier on some architectures.

Thanks,

Mathieu


>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> kernel/bpf/hashtab.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -698,11 +698,11 @@ static void htab_elem_free_rcu(struct rc
> * we're calling kfree, otherwise deadlock is possible if kprobes
> * are placed somewhere inside of slub
> */
> - preempt_disable();
> + migrate_disable();
> __this_cpu_inc(bpf_prog_active);
> htab_elem_free(htab, l);
> __this_cpu_dec(bpf_prog_active);
> - preempt_enable();
> + migrate_enable();
> }
>
> static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
> @@ -1327,7 +1327,7 @@ static int
> }
>
> again:
> - preempt_disable();
> + migrate_disable();
> this_cpu_inc(bpf_prog_active);
> rcu_read_lock();
> again_nocopy:
> @@ -1347,7 +1347,7 @@ static int
> raw_spin_unlock_irqrestore(&b->lock, flags);
> rcu_read_unlock();
> this_cpu_dec(bpf_prog_active);
> - preempt_enable();
> + migrate_enable();
> goto after_loop;
> }
>
> @@ -1356,7 +1356,7 @@ static int
> raw_spin_unlock_irqrestore(&b->lock, flags);
> rcu_read_unlock();
> this_cpu_dec(bpf_prog_active);
> - preempt_enable();
> + migrate_enable();
> kvfree(keys);
> kvfree(values);
> goto alloc;
> @@ -1406,7 +1406,7 @@ static int
>
> rcu_read_unlock();
> this_cpu_dec(bpf_prog_active);
> - preempt_enable();
> + migrate_enable();
> if (bucket_cnt && (copy_to_user(ukeys + total * key_size, keys,
> key_size * bucket_cnt) ||
> copy_to_user(uvalues + total * value_size, values,
>

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com