Re: [patch V2 01/20] bpf: Enforce preallocation for all instrumentation programs

From: Alexei Starovoitov
Date: Fri Feb 21 2020 - 23:29:28 EST


On Thu, Feb 20, 2020 at 09:45:18PM +0100, Thomas Gleixner wrote:
> The assumption that only programs attached to perf NMI events can deadlock
> on memory allocators is wrong. Assume the following simplified callchain:
>
> kmalloc() from regular non BPF context
> cache empty
> freelist empty
> lock(zone->lock);
> tracepoint or kprobe
> BPF()
> update_elem()
> lock(bucket)
> kmalloc()
> cache empty
> freelist empty
> lock(zone->lock); <- DEADLOCK
>
> There are other ways which do not involve locking to create wreckage:
>
> kmalloc() from regular non BPF context
> local_irq_save();
> ...
> obj = percpu_slab_first();
> kprobe()
> BPF()
> update_elem()
> lock(bucket)
> kmalloc()
> local_irq_save();
> ...
> obj = percpu_slab_first(); <- Same object as above ...
>
> So preallocation _must_ be enforced for all variants of intrusive
> instrumentation.
>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> ---
> V2: New patch
> ---
> kernel/bpf/verifier.c | 18 +++++++++++-------
> 1 file changed, 11 insertions(+), 7 deletions(-)
>
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8144,19 +8144,23 @@ static int check_map_prog_compatibility(
> struct bpf_prog *prog)
>
> {
> - /* Make sure that BPF_PROG_TYPE_PERF_EVENT programs only use
> - * preallocated hash maps, since doing memory allocation
> - * in overflow_handler can crash depending on where nmi got
> - * triggered.
> + /*
> + * Make sure that trace type programs only use preallocated hash
> + * maps. Perf programs obviously can't do memory allocation in NMI
> + * context and all other types can deadlock on a memory allocator
> + * lock when a tracepoint/kprobe triggers a BPF program inside a
> + * lock held region or create inconsistent state when the probe is
> + * within an interrupts disabled critical region in the memory
> + * allocator.
> */
> - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) {
> + if ((is_tracing_prog_type(prog->type)) {

This doesn't build.
I assumed the typo somehow sneaked in and proceeded, but it broke
a bunch of tests:
Summary: 1526 PASSED, 0 SKIPPED, 54 FAILED
One can argue that the tests are unsafe and broken.
We used to test all those tests with and without prealloc:
map_flags = 0;
run_all_tests();
map_flags = BPF_F_NO_PREALLOC;
run_all_tests();
Then 4 years ago commit 5aa5bd14c5f866 switched hashmap to be no_prealloc
always, and that's how it has stayed since then. We can adjust the tests to use
prealloc with tracing progs, but this breakage shows that there could be plenty
of bpf users that also use BPF_F_NO_PREALLOC with tracing. It could simply
be because they know that their kprobes are in a safe spot (and kmalloc is ok)
and they want to save memory. They could be using large max_entries parameter
for worst case hash map usage, but typical load is low. In general hashtables
don't perform well above 50% load, so prealloc is wasting half of the memory. Since
we cannot control where kprobes are placed, I'm not sure what the right fix is
here. It feels that if we proceed with this patch somebody will complain and we
would have to revert, but I'm willing to take this risk if we cannot come up
with an alternative fix.
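
To make the effect of the check concrete, here is a minimal userspace sketch of what the hunk appears to intend once the extra opening parenthesis is dropped. Everything prefixed with sketch_ or SKETCH_ is a stand-in invented for illustration so that it compiles outside the kernel tree; the set of program types treated as "tracing" is an assumption, and the real verifier code may check more than this (for example the map type).

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's program types and map flag. */
enum sketch_prog_type {
	SKETCH_PROG_SOCKET_FILTER,
	SKETCH_PROG_KPROBE,
	SKETCH_PROG_TRACEPOINT,
	SKETCH_PROG_PERF_EVENT,
	SKETCH_PROG_RAW_TRACEPOINT,
};

#define SKETCH_F_NO_PREALLOC (1U << 0)

struct sketch_map {
	unsigned int map_flags;
};

/* Assumed classification: which program types count as instrumentation. */
static bool sketch_is_tracing_prog_type(enum sketch_prog_type type)
{
	switch (type) {
	case SKETCH_PROG_KPROBE:
	case SKETCH_PROG_TRACEPOINT:
	case SKETCH_PROG_PERF_EVENT:
	case SKETCH_PROG_RAW_TRACEPOINT:
		return true;
	default:
		return false;
	}
}

/*
 * Reject a non-preallocated map for any tracing program type.
 * Note the balanced parentheses in the condition.
 */
static int sketch_check_map_prog_compat(const struct sketch_map *map,
					enum sketch_prog_type type)
{
	if (sketch_is_tracing_prog_type(type) &&
	    (map->map_flags & SKETCH_F_NO_PREALLOC))
		return -EINVAL;
	return 0;
}
```

With a check of this shape, a kprobe program using a map created with the no-prealloc flag is rejected at load time, which is the behaviour that would explain the test failures above, while non-tracing program types are unaffected.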

Going further with the patchset.

Patch 9 "bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites."
adds a new warning:
../kernel/seccomp.c: In function ‘seccomp_run_filters’:
../kernel/seccomp.c:272:50: warning: passing argument 2 of ‘bpf_prog_run_pin_on_cpu’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
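
For reference, this class of warning goes away when the callee takes its context argument as a pointer to const. The sketch below is a userspace illustration only: all sketch_ names are hypothetical stand-ins, and it is an assumption that const-qualifying the second parameter is the appropriate fix for the helper in the series.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the seccomp data and the BPF program. */
struct sketch_seccomp_data {
	int nr;
};

struct sketch_prog {
	uint32_t (*run)(const void *ctx);
};

/*
 * Taking ctx as 'const void *' lets a caller that only holds a
 * 'const struct ... *' pass it through without discarding the qualifier.
 */
static uint32_t sketch_prog_run(const struct sketch_prog *prog,
				const void *ctx)
{
	return prog->run(ctx);
}

/* Toy filter: returns 1 for non-negative syscall numbers, 0 otherwise. */
static uint32_t sketch_filter(const void *ctx)
{
	const struct sketch_seccomp_data *sd = ctx;

	return sd->nr >= 0 ? 1 : 0;
}

/* Mirrors the shape of the call site in the warning: sd is const. */
static uint32_t sketch_run_filters(const struct sketch_seccomp_data *sd)
{
	struct sketch_prog prog = { .run = sketch_filter };

	return sketch_prog_run(&prog, sd);
}
```

If the second parameter of sketch_prog_run() were declared as plain 'void *ctx', gcc would emit the same -Wdiscarded-qualifiers diagnostic at the call in sketch_run_filters().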

I fixed it up and proceeded, but patch 16 failed to apply:

Applying: bpf: Factor out hashtab bucket lock operations
error: sha1 information is lacking or useless (kernel/bpf/hashtab.c).
error: could not build fake ancestor
Patch failed at 0001 bpf: Factor out hashtab bucket lock operations

I patched it in manually:
patch -p1 < a.patch
patching file kernel/bpf/hashtab.c
Hunk #1 succeeded at 1333 (offset 14 lines).
Hunk #2 succeeded at 1361 with fuzz 1 (offset 24 lines).
Hunk #3 succeeded at 1372 with fuzz 1 (offset 27 lines).
Hunk #4 succeeded at 1442 (offset 48 lines).
patching file kernel/bpf/syscall.c

and it looks correct.

But patch 17 failed to apply even manually:
patch -p1 < b.patch
patching file kernel/bpf/hashtab.c
Hunk #1 succeeded at 88 (offset 1 line).
Hunk #2 succeeded at 374 (offset 12 lines).
Hunk #3 succeeded at 437 (offset 12 lines).
Hunk #4 succeeded at 645 (offset 12 lines).
Hunk #5 succeeded at 653 (offset 12 lines).
Hunk #6 succeeded at 919 (offset 12 lines).
Hunk #7 succeeded at 960 (offset 12 lines).
Hunk #8 succeeded at 998 (offset 12 lines).
Hunk #9 succeeded at 1017 (offset 12 lines).
Hunk #10 succeeded at 1052 (offset 12 lines).
Hunk #11 succeeded at 1075 (offset 12 lines).
Hunk #12 succeeded at 1115 (offset 12 lines).
Hunk #13 succeeded at 1137 (offset 12 lines).
Hunk #14 succeeded at 1175 (offset 12 lines).
Hunk #15 succeeded at 1185 (offset 12 lines).
Hunk #16 succeeded at 1207 (offset 12 lines).
Hunk #17 succeeded at 1216 (offset 12 lines).
Hunk #18 FAILED at 1349.
Hunk #19 FAILED at 1358.
Hunk #20 FAILED at 1366.
Hunk #21 FAILED at 1407.
4 out of 21 hunks FAILED -- saving rejects to file kernel/bpf/hashtab.c.rej

That's where I gave up.

I pulled the sched-for-bpf-2020-02-20 branch from tip and pushed it into bpf-next.
Could you please rebase your set on top of bpf-next and repost?
The logic in all patches looks good.

For now I propose to drop patch 1 and get the rest merged while we're
figuring out what to do.

Thanks!