Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers

From: Benjamin Tissoires
Date: Thu Apr 04 2024 - 11:27:13 EST


On Thu, Apr 4, 2024 at 4:44 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
> > <alexei.starovoitov@xxxxxxxxx> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> > > <benjamin.tissoires@xxxxxxxxxx> wrote:
> > > > > > goto out;
> > > > > > }
> > > > > > + spin_lock(&t->sleepable_lock);
> > > > > > drop_prog_refcnt(t);
> > > > > > + spin_unlock(&t->sleepable_lock);
> > > > >
> > > > > this also looks odd.
> > > >
> > > > I basically need to protect "t->prog = NULL;" from happening while
> > > > bpf_timer_work_cb is setting up the bpf program to be run.
> > >
> > > Ok. I think I understand the race you're trying to fix.
> > > The bpf_timer_cancel_and_free() is doing
> > > cancel_work()
> > > and proceeds with
> > > kfree_rcu(t, rcu);
> > >
> > > That's the only race and these extra locks don't help.


Thanks a lot for pinpointing the location of the race. Indeed, when I
read your email this morning I said "of course, this was obvious" :(

>
> > >
> > > The t->prog = NULL is nothing to worry about.
> > > The bpf_timer_work_cb() might still see callback_fn == NULL
> > > "when it's being setup" and it's ok.
> > > These locks don't help that.
> > >
> > > I suggest to drop sleepable_lock everywhere.
> > > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> > > Add rcu_read_lock_trace() before calling bpf prog.
> > >
> > > The race to fix is above 'cancel_work + kfree_rcu'
> > > since kfree_rcu might free 'struct bpf_hrtimer *t'
> > > while the work is pending and work_queue internal
> > > logic might UAF struct work_struct work.
> > > By the time it may luckily enter bpf_timer_work_cb() it's too late.
> > > The argument 'struct work_struct *work' might already be freed.
> > >
> > > To fix this problem, how about the following:
> > > don't call kfree_rcu and instead queue the work to free it.
> > > After cancel_work(&t->work); the work_struct can be reused.
> > > So set it up to call "freeing callback" and do
> > > schedule_work(&t->work);
> > >
> > > There is a big assumption here that new work won't be
> > > executed before cancelled work completes.
> > > Need to check with wq experts.
> > >
> > > Another approach is to do something smart with
> > > cancel_work() return code.
> > > If it returns true set a flag inside bpf_hrtimer and
> > > make bpf_timer_work_cb() free(t) after bpf prog finishes.
> >
> > Looking through wq code... I think I have to correct myself.
> > cancel_work and immediate free is probably fine from wq pov.
> > It has this comment:
> > worker->current_func(work);
> > /*
> > * While we must be careful to not use "work" after this, the trace
> > * point will only record its address.
> > */
> > trace_workqueue_execute_end(work, worker->current_func);
> >
> > the bpf_timer_work_cb() might still be running bpf prog.
> > So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
> > since kfree_rcu(t, rcu); could have freed it by then.
> > There is also this code in net/rxrpc/rxperf.c
> > cancel_work(&call->work);
> > kfree(call);
>
> Correction to correction.
> Above piece in rxrpc is buggy.
> The following race is possible:
> cpu 0
> process_one_work()
> set_work_pool_and_clear_pending(work, pool->id, 0);
>
> cpu 1
> cancel_work()
> kfree_rcu(work)
>
> worker->current_func(work);
>
> Here 'work' is a pointer to freed memory.
> Though wq code will not be touching it, callback will UAF.
>
> Also what I proposed earlier as:
> INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work();
> won't guarantee the ordering.
> Since the callback function is different,
> find_worker_executing_work() will consider it a separate work item.
>
> Another option is to to keep bpf_timer_work_cb callback
> and add a 'bool free_me;' to struct bpf_hrtimer
> and let the callback free it.
> But it's also racy.
> cancel_work() may return false, though worker->current_func(work)
> wasn't called yet.
> So we cannot set 'free_me' in bpf_timer_cancel_and_free()
> in race free maner.
>
> After brainstorming with Tejun it seems the best is to use
> another work_struct to call a different callback and do
> cancel_work_sync() there.

Works for me. I should be able to spina v6 soon enough, but I have a
couple of remaining questions below:

>
> So we need something like:
>
> struct bpf_hrtimer {
> union {
> struct hrtimer timer;
> + struct work_struct work;
> };
> struct bpf_map *map;
> struct bpf_prog *prog;
> void __rcu *callback_fn;
> void *value;
> union {

Are you sure we need an union here? If we get to call kfree_rcu() we
need to have both struct rcu_head and sync_work, not one or the other.

> struct rcu_head rcu;
> + struct work_struct sync_work;
> };
> + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE

If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init()
(like in my v2 or v3 IIRC). But that means that once a timer is
initialized it needs to be of one or the other type (especially true
with the first union in this struct).

So should we reject during run time bpf_timer_set_callback() for
sleepable timers and only allow bpf_timer_set_sleepable_cb() for
those? (and the invert in the other case).

This version of the patch allows for one timer to be used as softIRQ
or WQ, depending on the timer_set_callback that is used. But it might
be simpler for the kfree_rcu race to define the bpf_timer to be of one
kind, so we are sure to call the correct kfree method.

> };
>
> 'work' will be used to call bpf_timer_work_cb.
> 'sync_work' will be used to call cancel_work_sync() + kfree_rcu().
>
> And, of course,
> schedule_work(&t->sync_work); from bpf_timer_cancel_and_free()
> instead of kfree_rcu.
>

Cheers,
Benjamin