Re: [PATCH] powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs

From: Christophe Leroy
Date: Tue Aug 24 2021 - 01:54:31 EST

On 23/08/2021 at 20:46, Segher Boessenkool wrote:
On Mon, Aug 23, 2021 at 03:29:12PM +0000, Christophe Leroy wrote:
Instructions lmw/stmw are interesting for functions that are rarely
used and not in the cache, because only one instruction is to be
copied into the instruction cache instead of 19. However those
instructions are less performant than 19x raw lwz/stw as they require
synchronisation plus one additional cycle.

lmw takes N+2 cycles for loading N words on 603/604/750/7400, and N+3 on
7450. stmw takes N+1 cycles for storing N words on 603, N+2 on 604/750/
7400, and N+3 on 7450 (load latency is 3 instead of 2 on 7450).

There is no synchronisation needed, although there is some serialisation,
which of course doesn't mean much since there can be only 6 or 8 or so
insns executing at once anyway.

Yes, I meant serialisation. Isn't it the same as synchronisation?

So, these insns are almost never slower; they can easily win cycles back
because of the smaller code, too.

What 32-bit core do you see where load/store multiple are more than a
fraction of a cycle (per memory access) slower?

SAVE_NVGPRS / REST_NVGPRS are used in only a few places, which are
mostly in interrupt entries/exits and in task switch, so they are
likely already in the cache.

Nothing is likely in the cache on the older cores (except in
microbenchmarks), the caches are not big enough for that!

Even the syscall entry/exit paths and/or the most frequent interrupt entries and exits?

Using standard lwz improves null_syscall selftest by:
- 10 cycles on mpc832x.
- 2 cycles on mpc8xx.

And in real benchmarks?

I don't know. What benchmark should I use to evaluate syscall entry/exit if the 'null_syscall' selftest is not relevant?

On mpccore both lmw and stmw are only N+1 btw. But the serialization
might cost another cycle here?

That's consistent with the MPC8xx, where the difference is only 2 cycles.
But on the mpc832x, which has an e300c2 core, it looks like I have a 10-cycle difference. Is anything wrong?