Re: [PATCH] convert cnt32_to_63 to inline
From: Mathieu Desnoyers
Date: Tue Nov 11 2008 - 22:53:42 EST
* Russell King (rmk+lkml@xxxxxxxxxxxxxxxx) wrote:
> On Tue, Nov 11, 2008 at 03:11:30PM -0500, Mathieu Desnoyers wrote:
> > I think the added barrier() calls are causing these pipeline stalls.
> > They don't allow the compiler to read variables such as oscr2ns_scale
> > before the barrier, because gcc cannot assume they won't be modified.
> > However, to ensure that the OSCR read is done after the __m_cnt_hi
> > read, this barrier seems required to be safe against gcc
> > optimizations.
> >
> > Have you compared my patch to Nicolas' patch, which adds an smp_rmb()
> > in the macro, or to a vanilla tree?
>
> Nicolas' patch compared to unmodified - there are fewer side effects,
> which come down to two pipeline stalls, whereas we had none with the
> unmodified code.
>
> One pipeline stall for loading the address of __m_cnt_hi and reading
> its value, followed by the same thing for oscr2ns_scale.
>
> I think this is showing the problem with compiler barriers - they are
> indiscriminate. They are total and complete barriers - not only do
> they act on the data, but also on the compiler's ability to emit code
> for generating the addresses of the data to be loaded.
>
> Clearly, none of the addresses of OSCR, __m_cnt_hi or oscr2ns_scale is
> ever going to change at run time - they are all stored in the literal
> pool - but by putting compiler barriers in, the compiler is being
> prevented from reading from the literal pool at the most appropriate
> point.
>
> So, I've tried this:
>
> unsigned long long sched_clock(void)
> {
> +	unsigned long *oscr2ns_ptr = &oscr2ns_scale;
> 	unsigned long long v = cnt32_to_63(OSCR);
> -	return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
> +	return (v * *oscr2ns_ptr) >> OSCR2NS_SCALE_FACTOR;
> }
>
> to try to explicitly code the loads. Unfortunately, this results in
> three pipeline stalls. I also tried swapping the two lines starting
> 'unsigned long', without any improvement over not having those extra
> hacks to work around the barrier.
>
> So, let's summarise this:
>
> 1. the existing code works, is correct on ARM, and is efficient.
> 2. throwing barriers into the function makes it less efficient.
> 3. re-engineering the code appears to make things worse.
>
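For reference, barrier() is just an empty asm with a "memory" clobber
(include/linux/compiler-gcc.h); gcc must assume the asm may read or
write any memory, so every value it had cached in registers has to be
reloaded afterwards, which is also what gets in the way of scheduling
the address loads:

#define barrier() __asm__ __volatile__("" : : : "memory")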
Hrm, if we want to obtain results similar to what gcc currently
generates, casting to (volatile u32) or doing a *(volatile u32 *) access
could probably be used to force gcc to perform specific memory accesses
in order. Such casts are generally discouraged because they are not
sufficient for SMP systems, but we are talking here about UP-specific
optimizations that will end up in UP-only code, so why not?
http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Volatiles.html
"The minimum either standard specifies is that at a sequence point all
previous accesses to volatile objects have stabilized and no subsequent
accesses have occurred. Thus an implementation is free to reorder and
combine volatile accesses which occur between sequence points, but
cannot do so for accesses across a sequence point."
Therefore, regarding program order, each volatile access is ensured to
be performed between the sequence points (semicolons) that surround it.
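To make this concrete, here is a minimal, UP-only sketch of the idea
(this is not the actual cnt32_to_63.h: counter_reg and
cnt32_to_63_sketch are made-up names, and the half-period update only
follows the same principle as Nicolas' code):

#include <stdint.h>

/* Illustrative stand-in for a free-running hardware counter register
 * such as OSCR on PXA: reads of it are volatile accesses. */
extern volatile uint32_t counter_reg;

static uint32_t __m_cnt_hi;	/* upper half of the synthesized counter */

static uint64_t cnt32_to_63_sketch(void)
{
	uint32_t hi, lo;

	/*
	 * Both loads below are volatile accesses separated by a sequence
	 * point, so gcc must perform the __m_cnt_hi read before the
	 * counter read, without a barrier() invalidating everything else
	 * it has cached in registers.  Nothing here orders the CPU or the
	 * bus, hence UP-only.
	 */
	hi = *(volatile uint32_t *)&__m_cnt_hi;
	lo = counter_reg;

	/*
	 * Bit 31 of hi mirrors bit 31 of lo; a mismatch means lo crossed
	 * a half-period boundary since the last update.  This relies on
	 * being called at least once per half period of the counter.
	 */
	if ((int32_t)(hi ^ lo) < 0) {
		hi = (hi ^ 0x80000000) + (hi >> 31);
		__m_cnt_hi = hi;
	}

	/* Bit 63 merely mirrors bit 31: mask it to yield 63 bits. */
	return (((uint64_t)hi << 32) | lo) & ~(1ULL << 63);
}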
So that would be an argument for leaving the variable read in
architecture-specific code, because it heavily depends on the
architecture context (whether it's UP-only or must support SMP, whether
it has unordered memory reads...).
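The PXA user would then look just like Russell's current code, with all
the ordering knowledge kept inside the architecture's counter read (a
sketch again, reusing the names from above):

unsigned long long sched_clock(void)
{
	/* cnt32_to_63_sketch() orders its own two reads internally;
	 * the oscr2ns_scale read needs no particular ordering on UP. */
	unsigned long long v = cnt32_to_63_sketch();
	return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
}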
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68