Re: [PATCH v4 1/2] RISC-V: Probe for unaligned access speed

From: Evan Green
Date: Fri Sep 15 2023 - 12:49:26 EST


On Fri, Sep 15, 2023 at 12:57 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
>
> From: Evan Green
> > Sent: 14 September 2023 17:37
> >
> > On Thu, Sep 14, 2023 at 8:55 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > >
> > > From: Evan Green
> > > > Sent: 14 September 2023 16:01
> > > >
> > > > On Thu, Sep 14, 2023 at 1:47 AM David Laight <David.Laight@xxxxxxxxxx> wrote:
> > > > >
> > > > > From: Geert Uytterhoeven
> > > > > > Sent: 14 September 2023 08:33
> > > > > ...
> > > > > > > > rzfive:
> > > > > > > > cpu0: Ratio of byte access time to unaligned word access is
> > > > > > > > 1.05, unaligned accesses are fast
> > > > > > >
> > > > > > > Hrm, I'm a little surprised to be seeing this number come out so close
> > > > > > > to 1. If you reboot a few times, what kind of variance do you get on
> > > > > > > this?
> > > > > >
> > > > > > Rock-solid at 1.05 (even with increased resolution: 1.05853 on 3 tries)
> > > > >
> > > > > Would that match zero overhead unless the access crosses a
> > > > > cache line boundary?
> > > > > (I can't remember whether the test is using increasing addresses.)
> > > >
> > > > Yes, the test does use increasing addresses, it copies across 4 pages.
> > > > We start with a warmup, so caching effects beyond L1 are largely not
> > > > taken into account.
> > >
> > > That seems entirely excessive.
> > > If you want to avoid data cache issues (which you probably do)
> > > then just repeating a single access would almost certainly
> > > suffice.
> > > Repeatedly using a short buffer (say 256 bytes) won't add
> > > much loop overhead.
> > > Although you may want to do a test that avoids transfers
> > > that cross cache line and especially page boundaries.
> > > Either of those could easily be much slower than a read
> > > that is entirely within a cache line.
> >
> > We won't be faulting on any of these pages, and they should remain in
> > the TLB, so I don't expect many page boundary specific effects. If
> > there is a steep penalty for misaligned loads across a cache line,
> > such that it's worse than doing byte accesses, I want the test results
> > to be dinged for that.
>
> That is an entirely different issue.
>
> Are you absolutely certain that the reason 8 byte loads take
> as long as a 64-bit mis-aligned load isn't because the entire
> test is limited by L1 cache fills?

Fair question. I hacked up a little code [1] to rerun the test at
several different sizes, as well as print out the best and worst
times. I only have one piece of real hardware, the THead C906, which
has a 32KB L1 D-cache.
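
For reference, the hack is roughly shaped like the sketch below (not
the literal pastebin contents; __riscv_copy_words_unaligned() and
__riscv_copy_bytes_unaligned() are meant to be the asm copy helpers
from the patch, while the iteration count and everything else here is
illustrative only):

    #include <linux/kernel.h>
    #include <linux/minmax.h>
    #include <asm/timex.h>

    static void probe_one_size(void *dst, void *src, size_t size)
    {
            u64 word_best = ULLONG_MAX, word_worst = 0;
            u64 byte_best = ULLONG_MAX, byte_worst = 0;
            u64 start, cycles;
            int i;

            for (i = 0; i < 100; i++) {
                    start = get_cycles();
                    /* Word-sized copies from a misaligned source address. */
                    __riscv_copy_words_unaligned(dst, src + 1, size);
                    cycles = get_cycles() - start;
                    word_best = min(word_best, cycles);
                    word_worst = max(word_worst, cycles);
            }

            for (i = 0; i < 100; i++) {
                    start = get_cycles();
                    /* Byte-at-a-time copies of the same region. */
                    __riscv_copy_bytes_unaligned(dst, src + 1, size);
                    cycles = get_cycles() - start;
                    byte_best = min(byte_best, cycles);
                    byte_worst = max(byte_worst, cycles);
            }

            pr_info("EVAN size 0x%zx word cycles best %llx worst %llx, byte cycles best %llx worst %llx\n",
                    size, word_best, word_worst, byte_best, byte_worst);
    }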

Here are the results at various sizes, starting with the original:
[ 0.047556] cpu0: Ratio of byte access time to unaligned word access is 4.35, unaligned accesses are fast
[ 0.047578] EVAN size 0x1f80 word cycles best 69 worst 29e, byte cycles best 1c9 worst 3b7
[ 0.071549] cpu0: Ratio of byte access time to unaligned word access is 4.29, unaligned accesses are fast
[ 0.071566] EVAN size 0x1000 word cycles best 36 worst 210, byte cycles best e8 worst 2b2
[ 0.095540] cpu0: Ratio of byte access time to unaligned word access is 4.14, unaligned accesses are fast
[ 0.095556] EVAN size 0x200 word cycles best 7 worst 1d9, byte cycles best 1d worst 1d5
[ 0.119539] cpu0: Ratio of byte access time to unaligned word access is 5.00, unaligned accesses are fast
[ 0.119555] EVAN size 0x100 word cycles best 3 worst 1a8, byte cycles best f worst 1b5
[ 0.143538] cpu0: Ratio of byte access time to unaligned word access is 3.50, unaligned accesses are fast
[ 0.143556] EVAN size 0x80 word cycles best 2 worst 1a5, byte cycles best 7 worst 1aa

[1] https://pastebin.com/uwwU2CVn

I don't see any cliffs as the numbers get smaller, so it seems to me
there are no working set issues. Geert, it might be interesting to see
these same results on the rzfive. The thing that made me uncomfortable
with the smaller buffer sizes is that they start to bump up against the
resolution of the timer. Another option would have been to time several
iterations, but I went with the larger buffer instead, as I'd hoped it
would minimize other overhead like function calls, branch prediction, C
loop management, etc.
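
If it's useful, the "time several iterations" alternative I have in
mind would look roughly like this (a sketch only; ITERS is a made-up
knob, and the copy helper is the same one the probe already uses):

    #include <linux/kernel.h>
    #include <asm/timex.h>

    #define ITERS   64      /* illustrative, not tuned */

    /*
     * Amortize the timer granularity by timing ITERS back-to-back passes
     * over a small buffer and reporting the average cycles per pass.
     */
    static u64 time_word_copies(void *dst, void *src, size_t size)
    {
            u64 start;
            int i;

            start = get_cycles();
            for (i = 0; i < ITERS; i++)
                    __riscv_copy_words_unaligned(dst, src + 1, size);

            /* Loop and call overhead gets spread across ITERS passes. */
            return (get_cycles() - start) / ITERS;
    }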

-Evan