Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

From: Alexei Starovoitov
Date: Tue Apr 09 2024 - 17:41:56 EST


On Tue, Apr 9, 2024 at 1:11 AM Marco Elver <elver@xxxxxxxxxx> wrote:
>
> On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Mon, Apr 8, 2024 at 2:30 AM Marco Elver <elver@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> > > >
> > > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver <elver@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > > > <alexei.starovoitov@xxxxxxxxx> wrote:
> > > [...]
> > > > > > and the tasks can use mmaped array shared across all or unique to each
> > > > > > process.
> > > > > > And both bpf and user space can read/write them with a single instruction.
> > > > >
> > > > > That's BPF_F_MMAPABLE, right?
> > > > >
> > > > > That does not work because the mmapped region is global. Our requirements are:
> >
> > It sounds not like "requirements", but a description of the proposed
> > solution.
>
> These requirements can also be implemented differently, e.g. with the
> proposed task-local storage maps if done right (the devil appears to
> be in the implementation-details - these details are currently beyond
> my knowledge of the BPF subsystem internals). They really are the bare
> minimum requirements for the use case. The proposed solution probably
> happens to be the simplest one, and mapping requirements is relatively
> straightforward for it.
>
> > Pls share the actual use case.
> > This "tracing prog" sounds more like a ghost scheduler that
> > wants to interact with known user processes.
>
> It's tcmalloc - we have a BPF program provide a simpler variant of the
> "thread re-scheduling" notifications discussed here:
> https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3m1A@xxxxxxxxxxxxxx/
>
> Various synchronization algorithms can be optimized if they know about
> scheduling events. This has been proposed in [1] to implement an
> adaptive mutex. However, we think that there are many more
> possibilities, but it really depends on the kind of scheduler state
> being exposed ("thread on CPU" as in [1], or more details, like
> details about which thread was switched in, which was switched out,
> where was the thread migrated to, etc.). Committing to these narrow
> use case ABIs and extending the main kernel with more and more such
> information does not scale. Instead, BPF easily allows to expose this
> information where it's required.
>
> [1] https://lore.kernel.org/all/20230529191416.53955-1-mathieu.desnoyers@xxxxxxxxxxxx/

Thanks for the links. That's very helpful.
I was about to mention rseq.
I think it's a better fit, but Mathieu doesn't want non-uapi bits in it.
Ideally we could have added pointer + size to rseq as a scratch area
that bpf prog can read/write.
User space would register such area, kernel would pin pages,
and bpf would directly access that.
Pretty much the same idea of bpf local storage, but without bpf map.
rseq has its pros and cons.
Having a bpf local storage map to manage such pinned pages provides
separation of concerns. And it's up to tcmalloc whether to use
one such map for all tasks or map for every tcmalloc instance.
Such scratch areas are per-task per-map.
rseq dictates per-task only, but no concerns with setup thanks
to glibc integration.

> > The main issue with bpf_probe_read/write_user() is their non-determinism.
> > They will error when memory is swapped out.
>
> Right. Although, user space mlock'ing and prefaulting the memory
> solves it in the common case (some corner cases like after fork() are
> still tricky).

yeah. gets tricky indeed.
With bpf local storage map managing pinned pages the map_fd with
auto-close property addresses unpinning. No need to hack into
fork-ing exit-ing paths, but tcmalloc would need to re-register
the areas into bpf local storage after fork... guessing.

> > These helpers are ok-ish for observability when consumers understand
> > that some events might be lost, but for 24/7 production use
> > losing reads becomes a problem that bpf prog cannot mitigate.
> > What do bpf prog suppose to do when this safer bpf_probe_write_user errors?
> > Use some other mechanism to communicate with user space?
>
> Right, for use cases where these errors aren't ok it does not work.
> But if the algorithm is tolerant to lossy reads/writes from the BPF
> program side, it's not an issue.

Reading Dmitry Vyukov's comments in that thread... looks like
he's also concerned with the 'probe' part of rseq api.

> > A mechanism with such builtin randomness in behavior is a footgun for
> > bpf users.
> > We have bpf_copy_from_user*() that don't have this non-determinism.
> > We can introduce bpf_copy_to_user(), but it will be usable
> > from sleepable bpf prog.
> > While it sounds you need it somewhere where scheduler makes decisions,
> > so I suspect bpf array or arena is a better fit.
>
> Correct, it can't sleep because it wants scheduler events.
>
> > Or something that extends bpf local storage map.
> > See long discussion:
> > https://lore.kernel.org/bpf/45878586-cc5f-435f-83fb-9a3c39824550@xxxxxxxxx/
> >
> > I still like the idea to let user tasks register memory in
> > bpf local storage map, the kernel will pin such pages,
> > and then bpf prog can read/write these regions directly.
> > In bpf prog it will be:
> > ptr = bpf_task_storage_get(&map, task, ...);
> > if (ptr) { *ptr = ... }
> > and direct read/write into the same memory from user space.
>
> I would certainly welcome something like this - I agree this looks
> better than bpf_probe_read/write.

Great to hear. It's on the todo list, but no one started to work on it.
If you have cycles to prototype it would be great.