Re: [bug] kernel: bpf: syscall: a possible sleep-in-atomic bug in __bpf_prog_put()

From: Yonghong Song
Date: Wed May 24 2023 - 15:35:06 EST




On 5/24/23 5:42 AM, Teng Qi wrote:
Thank you.

We cannot use rcu_read_lock_held() in the 'if' statement. The return
value rcu_read_lock_held() could be 1 for some configurations regardless
whether rcu_read_lock() is really held or not. In most cases,
rcu_read_lock_held() is used in issuing potential warnings.
Maybe there are other ways to record whether rcu_read_lock() is held or not?

Sorry. I was not aware of the dependency of configurations of
rcu_read_lock_held().

If we cannot resolve rcu_read_lock() presence issue, maybe the condition
can be !in_interrupt(), so any process-context will go to a workqueue.

I agree that using !in_interrupt() as a condition is an acceptable solution.

This should work although it could be conservative.


Alternatively, we could have another solution. We could add another
function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
will be done in rcu context.

Implementing a new function like bpf_prog_put_rcu() is a solution that involves
more significant changes.

Maybe we can change signature of bpf_prog_put instead? Like
void bpf_prog_put(struct bpf_prog *prog, bool in_rcu)
and inside bpf_prog_put we can add
WARN_ON_ONCE(in_rcu && !bpf_rcu_lock_held());


So if in_interrupt(), do kvfree, otherwise,
put into a workqueue.

Shall we proceed with submitting a patch following this approach?

You could choose either of the above although I think with newer bpf_prog_put() is better.

BTW, please do create a test case, e.g, sockmap test case which
can show the problem with existing code base.


I would like to mention something unrelated to the possible bug. At this
moment, things seem to be more puzzling. vfree() is safe under in_interrupt()
but not safe under other atomic contexts.
This disorder challenges our conventional belief, a monotonic incrementation
of limitations of the hierarchical atomic contexts, that programer needs
to be more and more careful to write code under rcu read lock, spin lock,
bh disable, interrupt...
This disorder can lead to unexpected consequences, such as code being safe
under interrupts but not safe under spin locks.
The disorder makes kernel programming more complex and may result in more bugs.
Even though we find a way to resolve the possible bug about the bpf_prog_put(),
I feel sad for undermining of kernel`s maintainability and disorder of
hierarchy of atomic contexts.

-- Teng Qi

On Tue, May 23, 2023 at 12:33 PM Yonghong Song <yhs@xxxxxxxx> wrote:



On 5/21/23 6:39 AM, Teng Qi wrote:
Thank you.

> Your above analysis makes sense if indeed that kvfree cannot appear
> inside a spin lock region or RCU read lock region. But is it true?
> I checked a few code paths in kvfree/kfree. It is either guarded
> with local_irq_save/restore or by
> spin_lock_irqsave/spin_unlock_
> irqrestore, etc. Did I miss
> anything? Are you talking about RT kernel here?

To see the sleepable possibility of kvfree, it is important to analyze the
following calling stack:
mm/util.c: 645 kvfree()
mm/vmalloc.c: 2763 vfree()

In kvfree(), to call vfree, if the pointer addr points to memory
allocated by
vmalloc(), it calls vfree().
void kvfree(const void *addr)
{
if (is_vmalloc_addr(addr))
vfree(addr);
else
kfree(addr);
}

In vfree(), in_interrupt() and might_sleep() need to be considered.
void vfree(const void *addr)
{
// ...
if (unlikely(in_interrupt()))
{
vfree_atomic(addr);
return;
}
// ...
might_sleep();
// ...
}

Sorry. I didn't check vfree path. So it does look like that
we need to pay special attention to non interrupt part.


The vfree() may sleep if in_interrupt() == false. The RCU read lock region
could have in_interrupt() == false and spin lock region which only disables
preemption also has in_interrupt() == false. So the kvfree() cannot appear
inside a spin lock region or RCU read lock region if the pointer addr points
to memory allocated by vmalloc().

> > Therefore, we propose modifying the condition to include
> > in_atomic(). Could we
> > update the condition as follows: "in_irq() || irqs_disabled() ||
> > in_atomic()"?
> Thank you! We look forward to your feedback.

We now think that ‘irqs_disabled() || in_atomic() ||
rcu_read_lock_held()’ is
more proper. irqs_disabled() is for irq flag reg, in_atomic() is for
preempt count and rcu_read_lock_held() is for RCU read lock region.

We cannot use rcu_read_lock_held() in the 'if' statement. The return
value rcu_read_lock_held() could be 1 for some configuraitons regardless
whether rcu_read_lock() is really held or not. In most cases,
rcu_read_lock_held() is used in issuing potential warnings.
Maybe there are other ways to record whether rcu_read_lock() is held or not?

I agree with your that 'irqs_disabled() || in_atomic()' makes sense
since it covers process context local_irq_save() and spin_lock() cases.

If we cannot resolve rcu_read_lock() presence issue, maybe the condition
can be !in_interrupt(), so any process-context will go to a workqueue.

Alternatively, we could have another solution. We could add another
function e.g., bpf_prog_put_rcu(), which indicates that bpf_prog_put()
will be done in rcu context. So if in_interrupt(), do kvfree, otherwise,
put into a workqueue.



-- Teng Qi

On Sun, May 21, 2023 at 11:45 AM Yonghong Song <yhs@xxxxxxxx
<mailto:yhs@xxxxxxxx>> wrote:



On 5/19/23 7:18 AM, Teng Qi wrote:
> Thank you for your response.
> > Looks like you only have suspicion here. Could you find a real
violation
> > here where __bpf_prog_put() is called with !in_irq() &&
> > !irqs_disabled(), but inside spin_lock or rcu read lock? I
have not seen
> > things like that.
>
> For the complex conditions to call bpf_prog_put() with 1 refcnt,
we have
> been
> unable to really trigger this atomic violation after trying to
construct
> test cases manually. But we found that it is possible to show
cases with
> !in_irq() && !irqs_disabled(), but inside spin_lock or rcu read lock.
> For example, even a failed case, one of selftest cases of bpf,
netns_cookie,
> calls bpf_sock_map_update() and may indirectly call bpf_prog_put()
> only inside rcu read lock: The possible call stack is:
> net/core/sock_map.c: 615 bpf_sock_map_update()
> net/core/sock_map.c: 468 sock_map_update_common()
> net/core/sock_map.c: 217 sock_map_link()
> kernel/bpf/syscall.c: 2111 bpf_prog_put()
>
> The files about netns_cookie include
> tools/testing/selftests/bpf/progs/netns_cookie_prog.c and
> tools/testing/selftests/bpf/prog_tests/netns_cookie.c. We
inserted the
> following code in
> ‘net/core/sock_map.c: 468 sock_map_update_common()’:
> static int sock_map_update_common(..)
> {
> int inIrq = in_irq();
> int irqsDisabled = irqs_disabled();
> int preemptBits = preempt_count();
> int inAtomic = in_atomic();
> int rcuHeld = rcu_read_lock_held();
> printk("in_irq() %d, irqs_disabled() %d, preempt_count() %d,
> in_atomic() %d, rcu_read_lock_held() %d", inIrq,
irqsDisabled,
> preemptBits, inAtomic, rcuHeld);
> }
>
> The output message is as follows:
> root@(none):/root/bpf# ./test_progs -t netns_cookie
> [ 137.639188] in_irq() 0, irqs_disabled() 0, preempt_count() 0,
> in_atomic() 0,
> rcu_read_lock_held() 1
> #113 netns_cookie:OK
> Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED
>
> We notice that there are numerous callers in kernel/, net/ and
drivers/,
> so we
> highly suggest modifying __bpf_prog_put() to address this gap.
The gap
> exists
> because __bpf_prog_put() is only safe under in_irq() ||
irqs_disabled()
> but not in_atomic() || rcu_read_lock_held(). The following code
snippet may
> mislead developers into thinking that bpf_prog_put() is safe in all
> contexts.
> if (in_irq() || irqs_disabled()) {
> INIT_WORK(&aux->work, bpf_prog_put_deferred);
> schedule_work(&aux->work);
> } else {
> bpf_prog_put_deferred(&aux->work);
> }
>
> Implicit dependency may lead to issues.
>
> > Any problem here?
> We mentioned it to demonstrate the possibility of kvfree() being
> called by __bpf_prog_put_noref().
>
> Thanks.
> -- Teng Qi
>
> On Wed, May 17, 2023 at 1:08 AM Yonghong Song <yhs@xxxxxxxx
<mailto:yhs@xxxxxxxx>
> <mailto:yhs@xxxxxxxx <mailto:yhs@xxxxxxxx>>> wrote:
>
>
>
> On 5/16/23 4:18 AM, starmiku1207184332@xxxxxxxxx
<mailto:starmiku1207184332@xxxxxxxxx>
> <mailto:starmiku1207184332@xxxxxxxxx
<mailto:starmiku1207184332@xxxxxxxxx>> wrote:
> > From: Teng Qi <starmiku1207184332@xxxxxxxxx
<mailto:starmiku1207184332@xxxxxxxxx>
> <mailto:starmiku1207184332@xxxxxxxxx
<mailto:starmiku1207184332@xxxxxxxxx>>>
> >
> > Hi, bpf developers,
> >
> > We are developing a static tool to check the matching between
> helpers and the
> > context of hooks. During our analysis, we have discovered some
> important
> > findings that we would like to report.
> >
> > ‘kernel/bpf/syscall.c: 2097 __bpf_prog_put()’ shows that
function
> > bpf_prog_put_deferred() won`t be called in the condition of
> > ‘in_irq() || irqs_disabled()’.
> > if (in_irq() || irqs_disabled()) {
> > INIT_WORK(&aux->work, bpf_prog_put_deferred);
> > schedule_work(&aux->work);
> > } else {
> >
> > bpf_prog_put_deferred(&aux->work);
> > }
> >
> > We suspect this condition exists because there might be
sleepable
> operations
> > in the callees of the bpf_prog_put_deferred() function:
> > kernel/bpf/syscall.c: 2097 __bpf_prog_put()
> > kernel/bpf/syscall.c: 2084 bpf_prog_put_deferred()
> > kernel/bpf/syscall.c: 2063 __bpf_prog_put_noref()
> > kvfree(prog->aux->jited_linfo);
> > kvfree(prog->aux->linfo);
>
> Looks like you only have suspicion here. Could you find a real
> violation
> here where __bpf_prog_put() is called with !in_irq() &&
> !irqs_disabled(), but inside spin_lock or rcu read lock? I
have not seen
> things like that.
>
> >
> > Additionally, we found that array prog->aux->jited_linfo is
> initialized in
> > ‘kernel/bpf/core.c: 157 bpf_prog_alloc_jited_linfo()’:
> > prog->aux->jited_linfo = kvcalloc(prog->aux->nr_linfo,
> > sizeof(*prog->aux->jited_linfo),
bpf_memcg_flags(GFP_KERNEL |
> __GFP_NOWARN));
>
> Any problem here?
>
> >
> > Our question is whether the condition 'in_irq() ||
> irqs_disabled() == false' is
> > sufficient for calling 'kvfree'. We are aware that calling
> 'kvfree' within the
> > context of a spin lock or an RCU lock is unsafe.

Your above analysis makes sense if indeed that kvfree cannot appear
inside a spin lock region or RCU read lock region. But is it true?
I checked a few code paths in kvfree/kfree. It is either guarded
with local_irq_save/restore or by
spin_lock_irqsave/spin_unlock_irqrestore, etc. Did I miss
anything? Are you talking about RT kernel here?


> >
> > Therefore, we propose modifying the condition to include
> in_atomic(). Could we
> > update the condition as follows: "in_irq() ||
irqs_disabled() ||
> in_atomic()"?
> >
> > Thank you! We look forward to your feedback.
> >
> > Signed-off-by: Teng Qi <starmiku1207184332@xxxxxxxxx
<mailto:starmiku1207184332@xxxxxxxxx>
> <mailto:starmiku1207184332@xxxxxxxxx
<mailto:starmiku1207184332@xxxxxxxxx>>>
>