Re: ARC Assembler: bundle_align_mode directive support

From: Eugeniy Paltsev
Date: Tue Jun 11 2019 - 14:52:08 EST

Hi Vineet,

On Mon, 2019-06-10 at 15:55 +0000, Vineet Gupta wrote:
> On 6/8/19 11:21 AM, Eugeniy Paltsev wrote:
> > Hi Cupertino,
> >
> > I tried to use ".bundle_align_mode" directive in ARC assembly, but I got following error:
> > ----------------->8--------------
> > Assembler messages:
> > Error: unknown pseudo-op: `.bundle_align_mode'
> > ----------------->8--------------
> >
> > Is it possible to implement it in ARC assembler?
> > There is some context about the reason we want to have it:
> >
> > I'm trying to add support of jump labels for ARC in linux kernel. Jump labels
> > provide an interface to generate static branches using self-modifying code.
> > This allows us to implement conditional branches where changing branch
> > direction is expensive but branch selection is basically 'free'.
> >
> > There is nuance in current implementation:
> > We need to patch code by rewriting 32-bit NOP by 32-bit BRANCH instruction (or vice versa).
> > It can be easily done with following code:
> > ----------------->8--------------
> > write_32_bit(new_instruction)
> > flush_l1_dcache_range_this_cpu
> > invalidate_l1_icache_range_all_cpu
> > ----------------->8--------------
> >
> > I$ update will be atomic in most of cases except the patched instruction share
> > two L1 cache lines (so first 16 bits of instruction are in the one cache line and
> > last 16 bit are in another cache line).
> > In such case we can execute half-updated instruction if we are patching live code (and we are unlucky enough :)
> While I understand your need for alignment, I don't see how you can possibly
> execute stray lines.
> dcache flush will be propagated by hardware (SCU) to all cores (as applicable) and
> the icache cache flush xcall is synchronous and will have to finish on all cores
> before we proceed to execute the cod eitself.

It's easy as hell - we may execute some code on one CPU and patch it on another CPU at the same moment :)

And here is the example of the situation when we can get the issue:
- Let's imagine we have SMP system with CPU#0 and CPU#1.
- We have instruction which share two L1 cache lines:
~ ---------------------------------|--------------------------------- ~
~ |instruction-Z (32-bit instruction we patch)| ~
~ ---------------------------------|--------------------------------- ~
~ cache-line-X | cache-line-Y ~
~ ---------------------------------|--------------------------------- ~

CPU#0 is trying to switch our static branch, so it will patch the code (instruction-Z)
CPU#1 is executing this code with our static branch, so it execute the code (with instruction-Z) that will be patched.

1) CPU#0: we generate new instruction (to replace 'instruction-Z')
and write it with 32-bit store (st). It is written to CPU#0 L1 data cache.
2) CPU#0: we call our function flush_icache_range.
It firstly will flush L1 data cache region on CPU#0.
In our example it will flush CPU#0 L1 'cache-line-X' and 'cache-line-Y' to L2 cache.
3) CPU#1: we are executing code which is in 'cache-line-X'.
'cache-line-Y' is __NOT__ in the L1 instruction cache of CPU#1.
We need to execute 'instruction-Z', but only half of the instruction is in L1 instruction cache.
CPU#1 fetch 'cache-line-Y' to L1 instruction cache.
4) Ooops: now we have corrupted 'instruction-Z' in L1 instruction cache of CPU#1:
First 16 bit of this instruction (which belongs to 'cache-line-X') are old (not patched).
Last 16 bit of this instruction (which belongs to 'cache-line-Y') are new (patched).

> > As of today I simply align by 4 byte instruction which can be patched with ".balign 4" directive:
> > ----------------->8--------------
> > static __always_inline bool arch_static_branch_jump(struct static_key *key,
> > bool branch)
> > {
> > asm_volatile_goto(".balign 4\n"
> > "1:\n"
> > "b %l[l_yes]\n" // <- instruction which can be patched
> > ".pushsection __jump_table, \"aw\"\n"
> > ".word 1b, %l[l_yes], %c0\n"
> > ".popsection\n"
> > : : "i" (&((char *)key)[branch]) : : l_yes);
> >
> > return false;
> > l_yes:
> > return true;
> > }
> > ----------------->8--------------
> >
> > In that case patched instruction is aligned with one 16-bit NOP if this is required.
> > However 'align by 4' directive is much stricter than it actually required.
> I don't quite understand. Can u write a couple of lines of pseudo assembly to show
> what the issue is.

All instructions on ARCv2 are aligned by 2 byte (even if the instruction is 4-byte long).

Using '.balign' directive we align instruction which can be
patched by 4 byte.
So compiler will add one 'nop_s' before our instruction if it is not 4 byte aligned.

So this code
.balign 4
b %l[l_yes] #<- instruction which can be patched
may turn into this:
b l_yes #<- instruction which can be patched
or that:
nop_s # padding
b l_yes #<- instruction which can be patched

In most of the cases this extra 'nop_s' is unnecessary as it is fine to have our instruction
not 4 byte aligned if it doesn't cross l1 cache line boundary.

It is't the huge problem - but we are adding one unnecessary 'nop' while we try to optimize hot code.

> > It's enough
> > that our 32-bit instruction don't cross l1 cache line boundary.
> > That will save us from adding useless NOP padding in most of the cases.
> > It can be implemented with ".bundle_align_mode" directive which isn't supported by ARC AS unfortunately.
> This seems like a reasonable request (contingent to the difficulty of
> implementation in binutils). but I can't comprehend why you would need it.

If we will have ".bundle_align_mode" directive supported we can add extra 'nop_s' only if it's really required
(if our instruction _really_ cross L1 cache line boundary). So we won't add useless NOP to hot code.

Eugeniy Paltsev