Re: [PATCH 4/7] powerpc/ftrace: Additionally nop out the preceding mflr with -mprofile-kernel

From: Naveen N. Rao
Date: Wed Jun 19 2019 - 05:58:44 EST

Nicholas Piggin wrote:
Michael Ellerman's on June 19, 2019 3:14 pm:
Hi Naveen,

Sorry I meant to reply to this earlier .. :/

No problem. Thanks for the questions.

"Naveen N. Rao" <naveen.n.rao@xxxxxxxxxxxxxxxxxx> writes:
With -mprofile-kernel, gcc emits 'mflr r0', followed by 'bl _mcount' to
enable function tracing and profiling. So far, with dynamic ftrace, we
used to only patch out the branch to _mcount(). However, mflr is
executed by the branch unit that can only execute one per cycle on
POWER9 and shared with branches, so it would be nice to avoid it where

We cannot simply nop out the mflr either. When enabling function
tracing, there can be a race if tracing is enabled when some thread was
interrupted after executing a nop'ed out mflr. In this case, the thread
would execute the now-patched-in branch to _mcount() without having
executed the preceding mflr.

To solve this, we now enable function tracing in 2 steps: patch in the
mflr instruction, use synchronize_rcu_tasks() to ensure all existing
threads make progress, and then patch in the branch to _mcount(). We
override ftrace_replace_code() with a powerpc64 variant for this

According to the ISA we're not allowed to patch mflr at runtime. See the
section on "CMODX".

According to "quasi patch class" engineering note, we can patch
anything with a preferred nop. But that's written as an optional
facility, which we don't have a feature to test for.

Hmm... I wonder what the implications are. We've been patching in a 'trap' for kprobes for a long time now, along with having to patch back the original instruction (which can be anything), when the probe is removed.

I'm also not convinced the ordering between the two patches is
guaranteed by the ISA, given that there's possibly no isync on the other

Will they go through a context synchronizing event?

synchronize_rcu_tasks() should ensure a thread is scheduled away, but
I'm not actually sure it guarantees CSI if it's kernel->kernel. Could
do a smp_call_function to do the isync on each CPU to be sure.

Good point. Per Documentation/RCU/Design/Requirements/Requirements.html#Tasks RCU:
"The solution, in the form of Tasks RCU, is to have implicit read-side critical sections that are delimited by voluntary context switches, that is, calls to schedule(), cond_resched(), and synchronize_rcu_tasks(). In addition, transitions to and from userspace execution also delimit tasks-RCU read-side critical sections."

I suppose transitions to/from userspace, as well as calls to schedule() result in context synchronizing instruction being executed. But, if some tasks call cond_resched() and synchronize_rcu_tasks(), we probably won't have a CSI executed.

"In CONFIG_PREEMPT=n kernels, trampolines cannot be preempted, so these APIs map to call_rcu(), synchronize_rcu(), and rcu_barrier(), respectively."

In this scenario as well, I think we won't have a CSI executed in case of cond_resched().

Should we enhance patch_instruction() to handle that?

- Naveen