Re: [PATCH v2] powerpc: warn on emulation of dcbz instruction in kernel mode

From: Christian Lamparter
Date: Fri Aug 23 2024 - 19:43:32 EST


On 8/23/24 9:19 PM, Segher Boessenkool wrote:
> Hi!
>
> On Fri, Aug 23, 2024 at 03:54:59PM +0200, Christoph Hellwig wrote:
>> On Fri, Aug 23, 2024 at 08:06:00AM -0500, Segher Boessenkool wrote:
>>> What does "uncached memory" even mean here? Literally it would be
>>> I=1 memory (uncachEABLE memory), but more likely you want M=0 memory
>>> here ("non-memory memory", "not well-behaved memory", MMIO often).
>>
>> Regular kernel memory vmapped with pgprot_noncached().
>
> So, I=1 (and G=1). Caching inhibited and guarded. But M=1 (memory
> coherence required) as with any other real memory :-)
>
>>> If memset() is expected to be used with M=0, you cannot do any serious
>>> optimisations to it at all. If memset() is expected to be used with I=1
>>> it should use a separate code path for it, probably the caller should
>>> make the distinction.
>>
>> DMA coherent memory which uses uncached memory for platforms that
>> do not provide hardware dma coherence can end up just about anywhere
>> in the kernel. We could use special routines for a few places in
>> the DMA subsystem, but there might be plenty of others.
>
> Yeah. It will just be plenty slow, as we see here, that's what the
> warning is for; but it works just fine :-)
>
> The memset() code itself could check for the storage attributes, but
> that is probably more expensive than just assuming the happy case.
> Maybe someone could try it out though!

Hmm, Ok! For what it's worth, I can at least test memset with dcbz+trap
against what it was in 2015, without dcbz in the code path. How about that?

I figured that, out of all the offenders (ethernet, crypto and sata),
the sata/hard drive would be the most sensitive device for measuring any
performance difference. The MyBook Live already had a hard drive
(Seagate ST380815AS, very old) installed, so I went with that.

I tested with OpenWrt, since it has fully working PowerPC images for
the device, I can boot it from initramfs (so the HDD/SSD is otherwise idle)
and it provides a bare-minimum "benchmark" in hdparm -t.
(hdparm -t ... just reads for three seconds and tells you how much it read).
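
For anyone who wants to reproduce the idea without hdparm, a minimal sketch
of "read for three seconds and report the rate" could look like the following.
It is not hdparm's implementation: the function names and the 2 MiB chunk size
are assumptions made for illustration, and the sketch does nothing about the
page cache, so repeated runs on the same data may report inflated numbers.

```python
import time

def timed_read(path, seconds=3.0, chunk_bytes=2 * 1024 * 1024):
    """Read sequentially from `path` for about `seconds`.

    Returns (bytes_read, elapsed_seconds). Stops early at end of file.
    """
    total = 0
    start = time.monotonic()
    with open(path, "rb", buffering=0) as f:
        while time.monotonic() - start < seconds:
            data = f.read(chunk_bytes)
            if not data:  # end of file or device reached before the deadline
                break
            total += len(data)
    elapsed = time.monotonic() - start
    return total, elapsed

def mb_per_sec(total_bytes, elapsed):
    """Throughput in MB/sec, using 1 MB = 1024 * 1024 bytes."""
    return total_bytes / (1024 * 1024) / elapsed if elapsed > 0 else 0.0
```

Calling timed_read("/dev/sda") and passing the result to mb_per_sec() would
give a comparable figure; reading a block device normally needs root.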

The unmodified 6.6.47 kernel scored:

| Timing buffered disk reads: 220 MB in 3.02 seconds = 72.93 MB/sec
| Timing buffered disk reads: 222 MB in 3.02 seconds = 73.50 MB/sec
| Timing buffered disk reads: 216 MB in 3.00 seconds = 71.94 MB/sec

From what I can tell, each hdparm -t /dev/sda run causes ~77000 fix_alignment traps.
(/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size says it's 32 and
type is obviously "Data". If I'm not mistaken, this means ~2400 KiB of dcbz
emulated by the trap handler per run.)
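
The estimate is just traps times cache line size. A small sketch of the
arithmetic, assuming every trap emulates exactly one dcbz and therefore zeroes
exactly one 32-byte line (the function name is made up for illustration):

```python
TRAPS_PER_RUN = 77_000    # approximate trap count per hdparm -t run, from above
CACHE_LINE_BYTES = 32     # coherency_line_size reported by sysfs

def emulated_dcbz_bytes(traps, line_bytes):
    """Bytes zeroed via emulation, assuming one cache line per trap."""
    return traps * line_bytes

total = emulated_dcbz_bytes(TRAPS_PER_RUN, CACHE_LINE_BYTES)
print(f"{total} bytes = {total / 1024:.0f} KiB")
```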

For the test, I added the "old" memset from
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/lib/copy_32.S?id=df087e450d7ddc0b15bd8824206d964720b4f5e4#n120>
and used it in place of the memset() call in 6.6.47's dma_pool_alloc():
<https://elixir.bootlin.com/linux/v6.6.47/source/mm/dmapool.c#L435>

Now no WARNINGs are triggered and hdparm -t /dev/sda produces:

| Timing buffered disk reads: 220 MB in 3.00 seconds = 73.32 MB/sec
| Timing buffered disk reads: 218 MB in 3.02 seconds = 72.28 MB/sec
| Timing buffered disk reads: 224 MB in 3.03 seconds = 74.02 MB/sec

Virtually no benefit?! Well, the HDD could be too slow. Let's try an old SSD:
a Samsung 840 Evo 120 GB. This one manages to read 1276 MB in 3.06 seconds = ~416 MB/sec
in the same hdparm -t test on a reasonably modern PC when connected via a
usb3<->sata adapter.

unmodified 6.6.47 kernel:

| Timing buffered disk reads: 356 MB in 3.00 seconds = 118.61 MB/sec
| Timing buffered disk reads: 358 MB in 3.01 seconds = 119.12 MB/sec
| Timing buffered disk reads: 358 MB in 3.01 seconds = 119.03 MB/sec

modified 6.6.47 kernel:

| Timing buffered disk reads: 380 MB in 3.01 seconds = 126.30 MB/sec
| Timing buffered disk reads: 374 MB in 3.00 seconds = 124.61 MB/sec
| Timing buffered disk reads: 382 MB in 3.02 seconds = 126.62 MB/sec

Ok! There's something there: ~6% on the SSD.
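
To put a number on both drives, the three runs per kernel can be averaged and
compared. This is only a sketch over the MB/sec figures quoted above; the
dictionary layout and function name are illustrative, and three runs per
configuration is too few to say much about variance.

```python
from statistics import mean

# MB/sec figures copied from the hdparm -t runs above
RUNS = {
    "HDD": {"unmodified": [72.93, 73.50, 71.94],
            "modified":   [73.32, 72.28, 74.02]},
    "SSD": {"unmodified": [118.61, 119.12, 119.03],
            "modified":   [126.30, 124.61, 126.62]},
}

def relative_gain(before, after):
    """Percent change of mean throughput from `before` to `after`."""
    return (mean(after) - mean(before)) / mean(before) * 100.0

for drive, r in RUNS.items():
    gain = relative_gain(r["unmodified"], r["modified"])
    print(f"{drive}: {gain:+.1f}%")
```

On these figures the HDD difference is about +0.6%, which is within the
run-to-run spread, while the SSD difference is about +5.8%.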

Cheers,
Christian