Re: [PATCH] tracepoints: Update static_call before tp_funcs when adding a tracepoint

From: Mathieu Desnoyers
Date: Mon Jul 26 2021 - 15:12:08 EST

----- On Jul 26, 2021, at 2:49 PM, rostedt rostedt@xxxxxxxxxxx wrote:

> On Mon, 26 Jul 2021 13:39:18 -0400 (EDT)
> Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> ----- On Jul 26, 2021, at 12:56 PM, rostedt rostedt@xxxxxxxxxxx wrote:
>> > On Mon, 26 Jul 2021 11:46:41 -0400 (EDT)
>> > Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>> [...]
>> >


>> AFAIU, none of the synchronization mechanisms you refer to here (memory barrier,
>> IPIs..) will change the fact that this CPU may still be delayed across the
>> entire
>> 1->0->1 transition sequence, and may end up calling the new callback with the
>> old data. Unless an explicit RCU-sync is done.
> OK. I see the issue you are saying. And this came from my assumption
> that the tracepoint code did a synchronization when unregistering the
> last callback. But of course it wont because that would make a lot of
> back to back synchronizations of a large number of tracepoints being
> unregistered at once.

Something along the lines of the work approach you propose should work.

>> >
>> >>
>> >> My third conclusion is that we'd need synchronize RCU whenever tp_funcs[0].data
>> >> changes for transitions 1->2, 2->1, and 1->2 because the priorities don't
>> >> guarantee
>> >> that the first callback stays in the first position, and we also need to rcu
>> >> sync
>> >> unconditionally on transition 1->0. We currently only have sync RCU on
>> >> transition
>> >> from 2->1 when tp_funcs[0].func changes, which is bogus in many ways.
>> >
>> > Going from 1 to 2, there's no issue. We switch to the iterator, which
>> > is the old method anyway. It looks directly at the array and matches
>> > the data with the func for each element of that array, and the data
>> > read initially (before calling the iterator) is ignored.
>> This relies on ordering guarantees between RCU assign/dereference and
>> static_call
>> updates/call. It may well be the case, but I'm asking anyway.
>> Are we guaranteed of the following ordering ?
>> static_call_update()
> The static_call_update() triggers an IPI on all CPUs that perform a full memory
> barrier.
> That is, nothing on any CPU will cross the static_call_update().
>> y = rcu_dereference(x) rcu_assign_pointer(x, ...)
>> do_static_call(y)
>> That load of "x" should never happen after the CPU fetches the new static call
>> instruction.
> The 'y' will always be the new static call (which is the iterator in
> this case), and it doesn't matter which x it read, because the iterator
> will read the array just like it was done without static calls.

Here by "y" I mean the pointer loaded by rcu_dereference, which is then passed to the
call. I do not mean the call per se.

I suspect there is something like a control dependency between loading "x" through
rcu_dereference and passing the loaded pointer as argument to the static call (and
the instruction fetch of the static call). But I don't recall reading any documentation
of this specific ordering.

>> Also, I suspect that transition 2->1 needs an unconditional rcu-sync because you
>> may have a sequence of 3->2->1 (or 1->2->1) where the element 0 data is
>> unchanged
>> between 2->1, but was changed from 3->2 (or from 1->2), which may be observed by
>> the
>> static call.
> I'll agree that we need to add synchronization between 1->0->1, but you
> have not convinced me on this second part.

In addition to 1->0->1, there are actually 2 other scenarios here:

Transition 1->2 to which the ordering question between RCU and static call is

Transition 2->1 would need an unconditional rcu-sync, because of transitions
3->2->1 and 1->2->1 which can lead the static call to observe wrong data if the
rcu_dereference happens when there are e.g. 3 callbacks registered, and then the
static call to the function (single callback) is called on the 3-callbacks array.



Mathieu Desnoyers
EfficiOS Inc.