Re: [PATCH] ARC: ARCv2: jump label: implement jump label patching
From: Peter Zijlstra
Date: Thu Jun 20 2019 - 17:31:04 EST
On Thu, Jun 20, 2019 at 06:34:55PM +0000, Eugeniy Paltsev wrote:
> On Thu, 2019-06-20 at 09:01 +0200, Peter Zijlstra wrote:
> > In particular we do not need the alignment.
> >
> > So what the x86 code does is:
> >
> > - overwrite the first byte of the instruction with a single-byte trap
> > instruction
> >
> > - machine-wide IPI that synchronizes I$
> >
> > At this point, any CPU that encounters this instruction will trap; and
> > the trap handler will emulate the 'new' instruction -- typically a jump.
> >
> > - overwrite the tail of the instruction (if there is a tail)
> >
> > - machine-wide IPI that synchronizes I$
> >
> > At this point, nobody will execute the tail, because we'll still trap on
> > that first single-byte instruction, but if they were to read the
> > instruction stream, the tail must be there.
> >
> > - overwrite the first byte of the instruction to now have a complete
> > instruction.
> >
> > - machine-wide IPI that synchronizes I$
> >
> > At this point, any CPU will encounter the new instruction as a whole,
> > irrespective of alignment.
> >
> >
> > So the benefit of this scheme is that it works irrespective of the
> > instruction fetch window size and doesn't need the 'funny' alignment
> > stuff.
> >
>
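The three-step sequence above can be sketched as a small user-space simulation. This is not the kernel's code: `patch_with_trap()` and `sync_all_cpus()` are hypothetical names, the "IPI" is only a counter that records byte 0 at each synchronization point, and the trap handler that would emulate the new instruction is not modeled. The opcodes assume x86: 0xCC is the single-byte `int3`, `0f 1f 44 00 00` is a 5-byte NOP, and `e9` starts a 5-byte `jmp rel32`.

```c
#include <stddef.h>
#include <string.h>

#define TRAP_OPCODE 0xCC /* x86 int3, a single-byte breakpoint */
#define MAX_SYNCS 3

/* Hypothetical stand-in for the machine-wide IPI that makes every CPU
 * resynchronize its instruction stream (I$). It records what byte 0
 * looks like at each synchronization point so the sequence can be checked. */
static unsigned char first_byte_at_sync[MAX_SYNCS];
static int sync_count;

static void sync_all_cpus(const unsigned char *text)
{
    if (sync_count < MAX_SYNCS)
        first_byte_at_sync[sync_count] = text[0];
    sync_count++;
}

/* Replace the len-byte instruction at text with new_insn so that no CPU
 * ever executes a torn mix of old and new bytes. Returns the number of
 * synchronizations performed. */
static int patch_with_trap(unsigned char *text, const unsigned char *new_insn,
                           size_t len)
{
    sync_count = 0;

    /* 1. Turn the first byte into a trap; a real trap handler would
     *    emulate the new instruction for any CPU that hits it. */
    text[0] = TRAP_OPCODE;
    sync_all_cpus(text);

    /* 2. Write the tail, if there is one. Nobody executes it, since every
     *    CPU still traps on byte 0, but readers now see the new tail. */
    if (len > 1) {
        memcpy(text + 1, new_insn + 1, len - 1);
        sync_all_cpus(text);
    }

    /* 3. Replace the trap with the real first byte, completing the new
     *    instruction regardless of its alignment. */
    text[0] = new_insn[0];
    sync_all_cpus(text);

    return sync_count;
}
```

Patching the 5-byte NOP with a 5-byte jump performs three synchronizations, and byte 0 still reads 0xCC at the first two of them, so a CPU can only ever see the trap or the complete jump.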
> Thanks for the explanation. Now I understand how this x86 magic works.
>
> However, it looks even more complex than the ARM implementation.
> As I understand it, on ARM they do something like this:
> ---------------------------->8-------------------------
> on_each_cpu {
> write_instruction
> flush_data_cache_region
> invalidate_instruction_cache_region
> }
> ---------------------------->8-------------------------
>
> https://elixir.bootlin.com/linux/v5.1/source/arch/arm/kernel/patch.c#L121
>
> Yes, there is some overhead, as we don't need to do the write and D$ flush on each CPU,
> but that keeps the code simple and avoids additional checks.
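For comparison, the ARM-style sequence quoted above can be modeled the same way. This is again a hypothetical simulation, not arch/arm code: `on_each_cpu_sim()` stands in for a single machine-wide IPI, each CPU has a one-word "I$" copy, the D$ flush is a no-op because no separate data cache is modeled, and the instruction values are placeholders. It assumes the instruction is fixed-size and aligned, so one store replaces it whole.

```c
#include <stdint.h>

#define NR_CPUS 4 /* hypothetical CPU count for the simulation */

static uint32_t text_word;                /* the instruction in memory */
static uint32_t icache[NR_CPUS];          /* each CPU's cached copy of it */
static uint32_t writes_per_cpu[NR_CPUS];  /* how often each CPU wrote it */
static int ipi_count;

struct patch_req {
    uint32_t *addr;
    uint32_t insn;
};

/* What each CPU runs: write the instruction, then make it visible to this
 * CPU's instruction fetch. */
static void patch_on_this_cpu(int cpu, const struct patch_req *req)
{
    *req->addr = req->insn;   /* write_instruction (redundant on all but one CPU) */
    writes_per_cpu[cpu]++;
    /* flush_data_cache_region: nothing to do, no D$ in this model */
    icache[cpu] = *req->addr; /* invalidate_instruction_cache_region + refetch */
}

/* Hypothetical stand-in for one machine-wide IPI: run fn on every CPU. */
static void on_each_cpu_sim(void (*fn)(int, const struct patch_req *),
                            const struct patch_req *req)
{
    ipi_count++;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        fn(cpu, req);
}
```

After one call, `ipi_count` is 1 and every CPU's cached copy holds the new word, but each CPU has also written it once, which is the redundant write/flush overhead mentioned above.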
>
> And I don't understand in which cases the x86 trap approach is better.
> The ARM implementation does one machine-wide IPI instead of the three in the x86 trap approach.
>
> Probably there are some x86 specifics I don't get?
It's about variable instruction length; ARM (and RISC in general) doesn't
have that, but ARC does.
Your current proposal works by keeping the instruction inside the
i-fetch window, but that then results in instruction padding (extra
NOPs). And that is fine; it really should work.
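The fetch-window constraint that the current proposal relies on can be written as a simple check. A minimal sketch, assuming an aligned, power-of-two fetch window and an instruction no longer than the window; `fits_in_fetch_window()` and `padding_needed()` are hypothetical helpers, and the sizes used are illustrative, not ARC's actual values.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if an instruction of len bytes starting at addr lies entirely inside
 * one aligned fetch window of window bytes (window must be a power of two). */
static bool fits_in_fetch_window(uintptr_t addr, size_t len, size_t window)
{
    uintptr_t mask = ~(uintptr_t)(window - 1);
    uintptr_t first = addr & mask;            /* window holding the first byte */
    uintptr_t last = (addr + len - 1) & mask; /* window holding the last byte */
    return first == last;
}

/* Bytes of NOP padding needed before the instruction so it starts at the
 * next window boundary and no longer straddles two windows; 0 if it already
 * fits. Assumes len <= window. */
static size_t padding_needed(uintptr_t addr, size_t len, size_t window)
{
    if (fits_in_fetch_window(addr, len, window))
        return 0;
    return window - (addr & (window - 1));
}
```

For example, a 4-byte instruction at 0x1002 straddles a 4-byte window, and 2 bytes of padding move it to 0x1004, where it fits. That padding is what the x86 trap scheme avoids.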
The x86 approach, however, allows you to get rid of that padding and
should work for unaligned variable-length instructions (we have 1-15
byte instructions).
I just wanted to make sure you were aware of the possibilities so that
you can make an informed decision; I'm not trying to force complexity on you
:-)