Re: [PATCH bpf-next v5 1/6] bpf/helpers: introduce sleepable bpf_timers
From: Alexei Starovoitov
Date: Wed Apr 03 2024 - 22:44:11 EST
On Wed, Apr 3, 2024 at 6:01 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Wed, Apr 3, 2024 at 11:50 AM Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> >
> > On Wed, Mar 27, 2024 at 10:02 AM Benjamin Tissoires
> > <benjamin.tissoires@xxxxxxxxxx> wrote:
> > > > > goto out;
> > > > > }
> > > > > + spin_lock(&t->sleepable_lock);
> > > > > drop_prog_refcnt(t);
> > > > > + spin_unlock(&t->sleepable_lock);
> > > >
> > > > this also looks odd.
> > >
> > > I basically need to protect "t->prog = NULL;" from happening while
> > > bpf_timer_work_cb is setting up the bpf program to be run.
> >
> > Ok. I think I understand the race you're trying to fix.
> > The bpf_timer_cancel_and_free() is doing
> > cancel_work()
> > and proceeds with
> > kfree_rcu(t, rcu);
> >
> > That's the only race and these extra locks don't help.
> >
> > The t->prog = NULL is nothing to worry about.
> > The bpf_timer_work_cb() might still see callback_fn == NULL
> > "when it's being setup" and it's ok.
> > These locks don't help that.
> >
> > I suggest to drop sleepable_lock everywhere.
> > READ_ONCE of callback_fn in bpf_timer_work_cb() is enough.
> > Add rcu_read_lock_trace() before calling bpf prog.
> >
> > The race to fix is above 'cancel_work + kfree_rcu'
> > since kfree_rcu might free 'struct bpf_hrtimer *t'
> > while the work is pending and work_queue internal
> > logic might UAF struct work_struct work.
> > By the time it may luckily enter bpf_timer_work_cb() it's too late.
> > The argument 'struct work_struct *work' might already be freed.
> >
> > To fix this problem, how about the following:
> > don't call kfree_rcu and instead queue the work to free it.
> > After cancel_work(&t->work); the work_struct can be reused.
> > So set it up to call "freeing callback" and do
> > schedule_work(&t->work);
> >
> > There is a big assumption here that new work won't be
> > executed before cancelled work completes.
> > Need to check with wq experts.
> >
> > Another approach is to do something smart with
> > cancel_work() return code.
> > If it returns true set a flag inside bpf_hrtimer and
> > make bpf_timer_work_cb() free(t) after bpf prog finishes.
>
> Looking through wq code... I think I have to correct myself.
> cancel_work and immediate free is probably fine from wq pov.
> It has this comment:
> worker->current_func(work);
> /*
> * While we must be careful to not use "work" after this, the trace
> * point will only record its address.
> */
> trace_workqueue_execute_end(work, worker->current_func);
>
> the bpf_timer_work_cb() might still be running bpf prog.
> So it shouldn't touch 'struct bpf_hrtimer *t' after bpf prog returns,
> since kfree_rcu(t, rcu); could have freed it by then.
> There is also this code in net/rxrpc/rxperf.c
> cancel_work(&call->work);
> kfree(call);
Correction to correction.
Above piece in rxrpc is buggy.
The following race is possible:
cpu 0
process_one_work()
set_work_pool_and_clear_pending(work, pool->id, 0);
cpu 1
cancel_work()
kfree_rcu(work)
worker->current_func(work);
Here 'work' is a pointer to freed memory.
Though wq code will not be touching it, callback will UAF.
Also what I proposed earlier as:
INIT_WORK(A); schedule_work(); cancel_work(); INIT_WORK(B); schedule_work();
won't guarantee the ordering.
Since the callback function is different,
find_worker_executing_work() will consider it a separate work item.
Another option is to to keep bpf_timer_work_cb callback
and add a 'bool free_me;' to struct bpf_hrtimer
and let the callback free it.
But it's also racy.
cancel_work() may return false, though worker->current_func(work)
wasn't called yet.
So we cannot set 'free_me' in bpf_timer_cancel_and_free()
in race free maner.
After brainstorming with Tejun it seems the best is to use
another work_struct to call a different callback and do
cancel_work_sync() there.
So we need something like:
struct bpf_hrtimer {
union {
struct hrtimer timer;
+ struct work_struct work;
};
struct bpf_map *map;
struct bpf_prog *prog;
void __rcu *callback_fn;
void *value;
union {
struct rcu_head rcu;
+ struct work_struct sync_work;
};
+ u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE
};
'work' will be used to call bpf_timer_work_cb.
'sync_work' will be used to call cancel_work_sync() + kfree_rcu().
And, of course,
schedule_work(&t->sync_work); from bpf_timer_cancel_and_free()
instead of kfree_rcu.