Re: [RFC PATCH v7 7/7] Restartable sequences: self-tests

From: Mathieu Desnoyers
Date: Mon Aug 15 2016 - 14:06:51 EST


----- On Aug 14, 2016, at 8:56 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:

> On Sun, Aug 14, 2016 at 03:02:20PM +0000, Mathieu Desnoyers wrote:
>> ----- On Aug 12, 2016, at 9:28 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
>>
>> > On Fri, Aug 12, 2016 at 06:11:45PM +0000, Mathieu Desnoyers wrote:
>> >> ----- On Aug 12, 2016, at 12:35 PM, Boqun Feng boqun.feng@xxxxxxxxx wrote:
>> >>
>> >> > On Fri, Aug 12, 2016 at 01:30:15PM +0800, Boqun Feng wrote:
>> >> > [snip]
>> >> >> > > Besides, do we allow userspace programs do read-only access to the
>> >> >> > > memory objects modified by do_rseq(). If so, we have a problem when
>> >> >> > > there are two writes in a do_rseq()(either in the rseq critical section
>> >> >> > > or in the asm block), because in current implemetation, these two writes
>> >> >> > > are unordered, which makes the readers outside a do_rseq() could observe
>> >> >> > > the ordering of writes differently.
>> >> >> > >
>> >> >> > > For rseq_finish2(), a simple solution would be making the "final" write
>> >> >> > > a RELEASE.
>> >> >> >
>> >> >> > Indeed, we would need a release semantic for the final store here if this
>> >> >> > is the common use. Or we could duplicate the "flavors" of rseq_finish2 and
>> >> >> > add a rseq_finish2_release. We should find a way to eliminate code duplication
>> >> >>
>> >> >> I'm in favor of a separate rseq_finish2_release().
>> >> >>
>> >> >> > there. I suspect we'll end up doing macros.
>> >> >> >
>> >> >>
>> >> >> Me too. Lemme have a try ;-)
>> >> >>
>> >> >
>> >> > How about this? Although a little messy, I separated the asm block into
>> >> > several parts and implemented each part in a arch-diagnose way.
>> >>
>> >> I find it rather hard to follow the per-arch assembly with this approach.
>> >> It might prove to be troublesome if we want to do arch-specific optimizations
>> >> in the future.
>> >>
>> >
>> > It might be, but I was just trying to kill as much duplicate code as
>> > possible, because the more duplicate we have, the more maintain effort
>> > we need.
>> >
>> > For example, PPC32 and PPC64 may have the same asm code to check the
>> > event counter, but different code to do the final store. Having the
>> > same RSEQ_CHECK_COUNTER() for PPC32 and PPC64 actually makes it easy if
>> > we come up a way to optimize the counter check code on PPC.
>> >
>> > And if some arch wants to have some very specifical optimizations,
>> > it could always write the whole asm block again rather than use the
>> > helpers macros.
>>
>> Creating macros for each assembly "operation" done in the restartable
>> sequence ends up requiring that people learn a new custom mini-language,
>> and implement those macros for each architecture.
>>
>> I'd rather prefer to let each architecture maintainer express the
>> restartable sequence directly in assembly, which is already known to
>> them, than require them to learn a new small macro-based language.
>>
>> Eliminating duplicated code is a goal I agree with, but there are
>> ways to achieve this which don't end up creating a macro-based custom
>> mini-language (such as what I proposed below).
>>
>
> Fair point ;-)
>
> One more thing, do we want to use arch-specific header files to put
> arch-specific assembly code? For example, rseq-x86.h, rseq-powerpc.h,
> etc. This may save readers a lot of time if he or she is only interested
> in a particular arch, and also make maintaining a little easier(no need
> to worry about breaking other archs accidentally)
>
> [...]

Good point. I wanted to wait until we had enough architectures before
doing this, but now that we have x86 32/64, ppc 32/64 and arm 32, it
appears to be the right time. Done and pushed.

>>
>> In terms of fast-path, you would be trading:
>>
>> (1)
>> "ldr r0, %[current_event_counter]\n\t" \
>> "mov r1, #0\n\t"
>> "cmp %[start_event_counter], r0\n\t" \
>> "bne %l[failure]\n\t" \
>> "str %[to_write_final], [%[target_final]]\n\t" \
>> "2:\n\t" \
>> "str r1, [%[rseq_cs]]\n\t" \
>> for
>>
>> (2)
>> "ldr r0, %[current_event_counter]\n\t" \
>> "cmp %[start_event_counter], r0\n\t" \
>> "bne %l[failure]\n\t" \
>> "str %[to_write_final], [%[target_final]]\n\t" \
>> "2:\n\t" \
>> "mov r0, #0\n\t"
>> "str r0, [%[rseq_cs]]\n\t" \
>>
>> Your proposal (2) saves a register (does not clobber r1), but this
>> is at the expense of a slower fast-path. In (1), loading the constant
>> 0 is done while the processor is stalled on the current_event_counter
>> load, which is needed by a following comparison. Therefore, we can
>> increase instruction-level parallelism by placing the immediate value
>> 0 load right after the ldr instruction. This, however, requires that
>> we use a different register than r0, because r0 is already used by the
>> ldr/cmp instructions.
>>
>> Since this is a fast-path, achieving higher instruction throughput
>> is more important than saving a register.
>>
>> I came up with this as an optimization while doing benchmarking
>> on a ARM32 Cubietruck as a reference architecture.
>>
>
> Nice ;-) Better to put a comment there?

Done.

>
> I should try to investigate something similar for powerpc.
>

Yes, you could try clobbering one extra register to move the
"li %%r17, 0\n\t" right after the lwz instruction. Depending on
the architecture characteristics, it may speed it up a bit. I
would expect that benchmarks on older architectures (e.g. old ppc32)
might be more affected by such tweak than newer POWER8.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com