Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed

From: Evan Green
Date: Tue Jun 27 2023 - 15:12:23 EST

Next message: Phil Auld: "[PATCH v2] Sched/fair: Block nohz tick_stop when cfs bandwidth in use"
Previous message: Evan Green: "Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed"
In reply to: Jessica Clarke: "Re: [PATCH 1/2] RISC-V: Probe for unaligned access speed"
Next in thread: David Laight: "RE: [PATCH 1/2] RISC-V: Probe for unaligned access speed"

On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27@xxxxxxxxxx> wrote:
>
> On 23 Jun 2023, at 23:20, Evan Green <evan@xxxxxxxxxxxx> wrote:
> >
> > Rather than deferring misaligned access speed determinations to a vendor
> > function, let's probe them and find out how fast they are. If we
> > determine that a misaligned word access is faster than N byte accesses,
> > mark the hardware's misaligned access as "fast".
>
> How sure are you that your measurements can be extrapolated and aren’t
> an artefact of the testing process? For example, off the top of my head:
>
> * The first run will potentially be penalised by data cache misses,
> untrained prefetchers, TLB misses, branch predictors, etc. compared
> with later runs. You have one warmup, but who knows how many
> iterations it will take to converge?

I'd expect the cache penalties to be reasonably covered by a single
warmup. You're right about branch prediction, which is why I tried to
use a large-ish buffer size, minimize the ratio of conditionals to
loads/stores, and run the test for a decent number of iterations (on my
THead, about 1800 for words and 400 for bytes).

When I ran the test a handful of times, I did see run-to-run variation
of about 5%. But the gap between the two numbers is nowhere near that
margin (THead C906 was ~4x faster doing misaligned word accesses, and
other machines with slow misaligned accesses also reported numbers that
were nowhere near each other).
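
For illustration only, here is a rough userspace sketch of the kind of
comparison described above: one untimed warmup, then the best time of
several runs for a misaligned word copy versus a byte copy, with "fast"
meaning the word copy won. This is not the code from the patch. Every
name here (copy_words, copy_bytes, best_time_ns, classify_misaligned,
probe_misaligned_speed), the buffer size, and the iteration count are
assumptions made up for the example, and a real probe would likely use
hand-written assembly so the compiler cannot reshape the loops.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE 8192   /* assumed "large-ish" buffer, not from the patch */
#define MISALIGN 1      /* byte offset used to knock pointers off alignment */
#define ITERS    100    /* assumed number of timed runs */

enum access_speed { ACCESS_SLOW, ACCESS_FAST };

typedef void (*copy_fn)(unsigned char *, const unsigned char *, size_t);

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Copy one word at a time. memcpy() of a fixed word size keeps this
 * well-defined C; whether the compiler emits a single misaligned
 * load/store or falls back to byte accesses depends on the target.
 */
static void copy_words(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t i;

	assert(len % sizeof(unsigned long) == 0);
	for (i = 0; i < len; i += sizeof(unsigned long)) {
		unsigned long w;

		memcpy(&w, src + i, sizeof(w));
		memcpy(dst + i, &w, sizeof(w));
	}
}

/* Copy one byte at a time (an optimizer may still merge these). */
static void copy_bytes(unsigned char *dst, const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dst[i] = src[i];
}

/* One untimed warmup run, then the best (lowest) time over iters runs. */
static uint64_t best_time_ns(copy_fn fn, unsigned char *dst,
			     const unsigned char *src, size_t len, int iters)
{
	uint64_t best = UINT64_MAX;
	int i;

	fn(dst, src, len);
	for (i = 0; i < iters; i++) {
		uint64_t start = now_ns();
		uint64_t elapsed;

		fn(dst, src, len);
		elapsed = now_ns() - start;
		if (elapsed < best)
			best = elapsed;
	}
	return best;
}

/* "Fast" only if the word copy strictly beat the byte copy. */
static enum access_speed classify_misaligned(uint64_t word_ns, uint64_t byte_ns)
{
	return word_ns < byte_ns ? ACCESS_FAST : ACCESS_SLOW;
}

static enum access_speed probe_misaligned_speed(void)
{
	static _Alignas(unsigned long) unsigned char src[BUF_SIZE + sizeof(unsigned long)];
	static _Alignas(unsigned long) unsigned char dst[BUF_SIZE + sizeof(unsigned long)];
	uint64_t word_ns, byte_ns;

	word_ns = best_time_ns(copy_words, dst + MISALIGN, src + MISALIGN,
			       BUF_SIZE, ITERS);
	byte_ns = best_time_ns(copy_bytes, dst + MISALIGN, src + MISALIGN,
			       BUF_SIZE, ITERS);
	return classify_misaligned(word_ns, byte_ns);
}
```

Calling probe_misaligned_speed() returns ACCESS_FAST or ACCESS_SLOW for
the machine it runs on. Taking the best time rather than the mean is one
way to damp the few-percent run-to-run noise, and an exact tie falls on
the "slow" side.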

>
> * The code being benchmarked isn’t the code being run, so differences
> in access patterns, loop unrolling, loop alignment, etc. may cause the
> real code to behave differently (and perhaps change which is better).

I'm not trying to make statements about memcpy specifically, but
(only) about misaligned accesses, which is why I tried to write loops
that isolated that element as much as possible.

>
> The non-determinism that could in theory result from this also seems
> like a not great idea to have.

This is fair: if we have machines where the result waffles from boot to
boot, that's not great. In theory, if misaligned word accesses come out
almost exactly equal to N byte accesses, then it doesn't matter which
you choose, though of course it could still make a difference in
practice. The alternative of providing no info, though, just pushes the
same problem out into userspace, which seems worse.

-Evan
