Re: call_rcu from trace_preempt

From: Paul E. McKenney
Date: Wed Jun 17 2015 - 17:37:03 EST

On Wed, Jun 17, 2015 at 01:53:17PM -0700, Alexei Starovoitov wrote:
> On 6/17/15 1:37 PM, Paul E. McKenney wrote:
> >On Wed, Jun 17, 2015 at 11:39:29AM -0700, Alexei Starovoitov wrote:
> >>On 6/17/15 2:05 AM, Daniel Wagner wrote:
> >>>>>Steven's suggestion deferring the work via irq_work results in the same
> >>>>>stack trace. (Now I get cold feet, without the nice heat from the CPU
> >>>>>busy looping...)
> >>>That one is still not working. It also makes the system really really slow.
> >>>I guess I still do something completely wrong.
> >>
> >>tried your irq_work patch. It indeed makes the whole system
> >>unresponsive. Ctrl-C of hwlathist no longer works and
> >>it runs out of memory in 20 sec or so of running hwlathist
> >>on idle system (without parallel hackbench).
> >>It looks that free_pending flag is racy, so I removed it,
> >>but it didn't help.
> >>
> >>Also I've tried all sorts of other things in rcu including
> >>adding rcu_bpf similar to rcu_sched to make sure that recursive
> >>call into call_rcu will not be messing rcu_preempt or rcu_sched
> >>states and instead will be operating on rcu_bpf per-cpu states.
> >>In theory that should have worked flawlessly and it sort-of did.
> >>But multiple hackbench runs still managed to crash it.
> >>So far I think the temp workaround is to stick with array maps
> >>for probing such low level things like trace_preempt.
> >>Note that pre-allocation of all elements in hash map also won't
> >>help, since the problem here is some collision of call_rcu and
> >>rcu_process_callbacks. I'm pretty sure that kfree_rcu with
> >>rcu_is_watching patch is ready for this type of abuse.
> >>The rcu_process_callbacks() path - not yet. I'm still analyzing it.
> >
> >How about if I just gave you a hook in __call_rcu() itself, just before
> >it returns, just after the local_irq_restore()? You could maintain
> >recursion flags and test the environment, at some point handling any
> >memory that needed freeing.
> >
> >The idea would be to use an atomic queue to accumulate the to-be-freed
> >data, then kfree_rcu() it in the hook if it was safe to do so.
> I'm not yet seeing how it will look. You mean I'll just locklessly
> enqueue into some global llist and it will get kfree-d from
> rcu_process_callbacks() ?

Locklessly enqueue onto a per-CPU list, but yes. The freeing is up to
you -- you get called just before exit from __call_rcu(), and get to
figure out what to do.

My guess would be if not in interrupt and not recursively invoked,
atomically remove all the elements from the list, then pass each to
kfree_rcu(), and finally let things take their course from there.
The llist APIs look like they would work.
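
A rough userspace sketch of that guess, in C11, might look like the
following. It is an illustration only, not kernel code: defer_free(),
free_hook(), release() and the in_hook flag are hypothetical names,
the single global list stands in for a per-CPU llist, release() stands
in for kfree_rcu(), and the "not in interrupt" test has no userspace
equivalent so it is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct pending {                        /* hypothetical to-be-freed element */
    struct pending *next;
};

/* Stands in for one CPU's list; a real version would be per-CPU. */
static _Atomic(struct pending *) pending_head;
static _Thread_local bool in_hook;      /* recursion flag */
static int freed_count;                 /* demonstration only */

/* Lock-free push, playing the role of llist_add(). */
static void defer_free(struct pending *p)
{
    assert(p != NULL);
    struct pending *old = atomic_load(&pending_head);
    do {
        p->next = old;                  /* link in front of the current head */
    } while (!atomic_compare_exchange_weak(&pending_head, &old, p));
}

/* Stand-in for kfree_rcu(): this sketch simply frees immediately. */
static void release(struct pending *p)
{
    free(p);
    freed_count++;
}

/*
 * Hypothetical hook run on the way out of the enqueue path.
 * Returns false if it declined to run because it was re-entered.
 */
static bool free_hook(void)
{
    if (in_hook)
        return false;                   /* nested call: leave work queued */
    in_hook = true;

    /* Atomically take the whole list, as llist_del_all() would. */
    struct pending *p = atomic_exchange(&pending_head, NULL);
    while (p) {
        struct pending *next = p->next; /* read before the element goes away */
        release(p);
        p = next;
    }

    in_hook = false;
    return true;
}
```

Anything queued while the hook is already running stays on the list
and is picked up by a later, non-nested invocation.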

> Something like kfree_rcu_lockless(ptr_to_be_freed) that
> llist_adds and lets rcu core know that something has to be freed?
> I think such feature would be very useful in general.
> Or maybe kfree_rcu_this_cpu(ptr_to_be_freed) that uses
> per-cpu llist of 'to-be-kfreed' objects?
> Performance will be great and there is no need to embed rcu_head in
> every datastructure.

Well, you do need to have something in each element to allow them to be
tracked. You could indeed use llist_add() to maintain the per-CPU list,
and then use llist_del_all() to bulk-remove all the elements from the per-CPU
list. You can then pass each element in turn to kfree_rcu(). And yes,
I am suggesting that you open-code this, as it is going to be easier to
handle your special case than to provide a fully general solution. For
one thing, the general solution would require a full rcu_head to track
offset and next. In contrast, you can special-case the offset. And
ignore the overload special cases.
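
To illustrate the "special-case the offset" point, here is a small
userspace C11 sketch, under the assumption that every element has the
same type. The names struct elem, struct free_node, free_list_add(),
free_list_del_all(), node_to_elem() and free_all() are made up for the
example; the first two list helpers only mimic the roles of llist_add()
and llist_del_all(), and plain free() stands in for whatever the kernel
code would do with each element (for instance kfree_rcu()).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

struct free_node {                  /* tracking field: just a link */
    struct free_node *next;
};

struct elem {                       /* hypothetical map element */
    int key;
    int value;
    struct free_node fnode;         /* same offset in every element */
};

static _Atomic(struct free_node *) free_list;

/* Lock-free push of an element's embedded node. */
static void free_list_add(struct elem *e)
{
    assert(e != NULL);
    struct free_node *old = atomic_load(&free_list);
    do {
        e->fnode.next = old;
    } while (!atomic_compare_exchange_weak(&free_list, &old, &e->fnode));
}

/* Detach the entire list in one atomic step. */
static struct free_node *free_list_del_all(void)
{
    return atomic_exchange(&free_list, NULL);
}

/* The offset is a compile-time constant, so nothing is stored per node. */
static struct elem *node_to_elem(struct free_node *n)
{
    return (struct elem *)((char *)n - offsetof(struct elem, fnode));
}

/* Walk the detached list and release each element; returns the count. */
static int free_all(void)
{
    int n = 0;
    struct free_node *node = free_list_del_all();

    while (node) {
        struct free_node *next = node->next;    /* read before freeing */
        free(node_to_elem(node));
        n++;
        node = next;
    }
    return n;
}
```

Because only one element type is handled, the node carries nothing but
the link; a general-purpose interface would have to record the offset
somewhere per element as well.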

Thanx, Paul

> btw, irq_work suffers the same re-entrancy problem:
> [ 19.914910] [<ffffffff8117c63c>] free_work_cb+0x2c/0x50
> [ 19.914910] [<ffffffff81176624>] irq_work_run_list+0x44/0x70
> [ 19.914910] [<ffffffff8117667e>] irq_work_run+0x2e/0x50
> [ 19.914910] [<ffffffff81008d0e>] smp_irq_work_interrupt+0x2e/0x40
> [ 19.914910] [<ffffffff8178dd20>] irq_work_interrupt+0x70/0x80
> [ 19.914910] <EOI> [<ffffffff813d3cec>] ? debug_object_active_state+0xfc/0x150
> [ 19.914910] [<ffffffff813d3cec>] ? debug_object_active_state+0xfc/0x150
> [ 19.914910] [<ffffffff8178c21b>] ? _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 19.914910] [<ffffffff8115e0a7>] ? trace_preempt_on+0x7/0x100
> [ 19.914910] [<ffffffff810991c3>] ? preempt_count_sub+0x73/0xf0
> [ 19.914910] [<ffffffff8178c21b>] _raw_spin_unlock_irqrestore+0x4b/0x80
> [ 19.914910] [<ffffffff813d3cec>] debug_object_active_state+0xfc/0x150
> [ 19.914910] [<ffffffff812035f0>] ? get_max_files+0x20/0x20
> [ 19.914910] [<ffffffff810e2d2f>] __call_rcu.constprop.67+0x5f/0x350
> [ 19.914910] [<ffffffff810e3097>] call_rcu+0x17/0x20
> [ 19.914910] [<ffffffff81203843>] __fput+0x183/0x200
> so if I do call_rcu from free_work_cb, it's equally bad
> as calling call_rcu from trace_preempt_on
