Re: [PATCH v2] powerpc: warn on emulation of dcbz instruction in kernel mode

From: Christian Lamparter
Date: Thu Aug 22 2024 - 14:19:51 EST


Hi!

Thank you all for your insightful insights :-) .
... and derp, used the wrong mail.

On 8/22/24 9:14 AM, Christoph Hellwig wrote:
On Thu, Aug 22, 2024 at 06:39:33AM +0000, LEROY Christophe wrote:
powerpc has a magic instruction 'dcbz' which clears a full cacheline in
one go. It is far more efficient than a loop to store zeros, and since
2015 memset(0) has been implemented with that instruction (commit
5b2a32e80634 ("powerpc/32: memset(0): use cacheable_memzero"))

But that instruction generates an alignment exception when used on
non-cached memory (whether it is RAM or not doesn't matter). It is then
emulated by the kernel but it of course leads to a serious performance
degradation, hence the warning added by commit cbe654c77961 ("powerpc:
warn on emulation of dcbz instruction in kernel mode"). Until now it
helped identify and fix use of memset() on IO memory.

But if memset() is expected to be used with non-cached RAM, then I don't
know what to do. Any suggestion ?

I'd suggest two things:

1) remove the warning. The use case is perfectly valid and everything
using uncached memory is already slow, so people will just have to
deal with it. Maybe offer a trace point instead if people care about
it.
2) figure out a way to avoid this case in the dma-coherent allocator,
which is probably the only case where it happens frequently
(a few drivers also zero or re-zero coherent memory, but most of the
time that is cargo cult programming and not actually needed)

I tested your patch below and got the next warning. This time from
dma_alloc_from_pool about a similar memset in:
https://elixir.bootlin.com/linux/v6.6.47/source/kernel/dma/pool.c#L261

it triggers for the sata_dwc_ex460 driver when it access the harddrive.
For anyone interested, please look below for the full warning splat.

(I switched to the MyBook Live, it's easier to disassemble+reassemble
than the MX60... but with similar results).

For 2 I can think of two options:

a) provide a arch hook for zeroing the dma memory that defaults to
memset, but which powerpc can override
a) figure out a way to clear the memory before marking it uncached
if we can

a) it obviously easier to verify, but b) is probably going to give
way better performance.

Below is an untested implementation of b) for dma-direct, I just need to
find out if there is any architecture that requires the memory to be
zeroed after it іt has been remapped. The iommu drivers might also
need similar treatment.

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 4480a3cd92e087..66e94b32ab0081 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -275,6 +275,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
if (force_dma_unencrypted(dev))
prot = pgprot_decrypted(prot);
+ if (!PageHighMem(page))
+ memset(page_address(page), 0, size);
+
/* remove any dirty cache lines on the kernel alias */
arch_dma_prep_coherent(page, size);
@@ -283,14 +286,15 @@ void *dma_direct_alloc(struct device *dev, size_t size,
__builtin_return_address(0));
if (!ret)
goto out_free_pages;
+ if (PageHighMem(page))
+ memset(ret, 0, size);
} else {
ret = page_address(page);
if (dma_set_decrypted(dev, ret, size))
goto out_leak_pages;
+ memset(ret, 0, size);
}
- memset(ret, 0, size);
-
if (set_uncached) {
arch_dma_prep_coherent(page, size);
ret = arch_dma_set_uncached(ret, size);

I added a counter in fix_alignment to count the number of traps
issued by the kernel code. From what I measured in a quick test,
With this patch, the count during boot went down from ~21000
(unmodified) to 1200 (patched)... but from what I can tell, it
still triggers for the dma_direct_alloc in emac.

if there are more patches, please. I can test them during the next
few days while I'm on holiday.

Cheers,
Christian

---

[ 2.596136] ------------[ cut here ]------------
[ 2.600759] WARNING: CPU: 0 PID: 24 at fix_alignment+0x17c/0x1b4
[ 2.606760] Modules linked in:
[ 2.609813] CPU: 0 PID: 24 Comm: kworker/u2:2 Not tainted 6.6.47 #0
[ 2.616057] Hardware name: MyBook Live APM821XX 0x12c41c83 PowerPC 44x Platform
[ 2.623337] Workqueue: events_unbound async_run_entry_fn
[ 2.628646] NIP: c0003190 LR: c0003070 CTR: c0014744
[ 2.633674] REGS: c11a94b0 TRAP: 0700 Not tainted (6.6.47)
[ 2.639402] MSR: 00021000 <CE,ME> CR: 22008808 XER: 20000000
[ 2.645312]
[ 2.645312] GPR00: c0008640 c11a95a0 c10de300 00000000 00000001 00000018 01f01cdf 00000004
[ 2.645312] GPR08: c0f30000 00000000 0147ae14 c11a9610 c0013abc 00000000 c004c1e4 00000002
[ 2.645312] GPR16: 00000000 00000001 00000000 00000000 d1000000 c0910000 c1182010 c0f50000
[ 2.645312] GPR24: c11a9774 c00804a8 c0f30000 c0910000 c0910000 00001000 c108e3a0 c11a9620
[ 2.680114] NIP [c0003190] fix_alignment+0x17c/0x1b4
[ 2.685065] LR [c0003070] fix_alignment+0x5c/0x1b4
[ 2.689844] Call Trace:
[ 2.692279] [c11a95a0] [c11a9734] 0xc11a9734 (unreliable)
[ 2.697670] [c11a95f0] [c0008640] alignment_exception+0xf0/0x164
[ 2.703666] [c11a9610] [c0000a30] Alignment+0x130/0x180
[ 2.708877] --- interrupt: 600 at memset+0x60/0xc0
[ 2.713672] NIP: c0013abc LR: c0081df8 CTR: 0000007f
[ 2.718700] REGS: c11a9620 TRAP: 0600 Not tainted (6.6.47)
[ 2.724420] MSR: 00021000 <CE,ME> CR: 42008808 XER: 20000000
[ 2.730330] DEAR: d1000020 ESR: 00000000
[ 2.730330] GPR00: 00000007 c11a9710 c10de300 d1000000 00000000 00001000 d100001c 00000004
[ 2.730330] GPR08: 00001000 0000007f 00000001 53c20800 28008808 00000000 c004c1e4 00000002
[ 2.730330] GPR16: 00000000 00000001 00000000 00000000 d1000000 c0910000 c1182010 c0f50000
[ 2.730330] GPR24: c11a9774 c00804a8 c0f30000 c0910000 c0910000 00001000 c108e3a0 000010a0
[ 2.767645] NIP [c0013abc] memset+0x60/0xc0
[ 2.771819] LR [c0081df8] dma_alloc_from_pool+0x118/0x204
[ 2.777201] --- interrupt: 600
[ 2.780242] [c11a9710] [c0081eb0] dma_alloc_from_pool+0x1d0/0x204 (unreliable)
[ 2.787449] [c11a9760] [c00807e0] dma_direct_alloc+0x90/0x2a4
[ 2.793178] [c11a97a0] [c007f850] dma_alloc_attrs+0xa8/0xf8
[ 2.798758] [c11a97e0] [c019d180] dma_pool_alloc+0x11c/0x2ac
[ 2.804409] [c11a9810] [c0434cec] dwc_desc_get+0x2c/0x98
[ 2.809722] [c11a9840] [c04355cc] dwc_prep_slave_sg+0x180/0x524
[ 2.815633] [c11a98b0] [c04d7938] sata_dwc_qc_issue+0x1d8/0x268
[ 2.821559] [c11a9920] [c04bbfb0] ata_qc_issue+0x174/0x2b0
[ 2.827046] [c11a9940] [c04c674c] __ata_scsi_queuecmd+0x200/0x4fc
[ 2.833129] [c11a9960] [c04c6a84] ata_scsi_queuecmd+0x3c/0x88
[ 2.838865] [c11a9980] [c04a7bb0] scsi_queue_rq+0x6fc/0xb40
[ 2.844429] [c11a99c0] [c03801e8] __blk_mq_issue_directly+0x40/0xe0
[ 2.850694] [c11a99f0] [c038462c] blk_mq_try_issue_directly+0xa8/0x10c
[ 2.857208] [c11a9a10] [c03854e4] blk_mq_submit_bio+0x5f4/0x67c
[ 2.863118] [c11a9a70] [c0374ef4] submit_bio_noacct_nocheck+0x210/0x2d0
[ 2.869728] [c11a9aa0] [c0206850] block_read_full_folio+0x1f4/0x3a4
[ 2.875983] [c11a9b10] [c01353d8] filemap_read_folio+0x40/0x240
[ 2.881910] [c11a9b60] [c01382d0] do_read_cache_folio+0xb4/0x228
[ 2.887906] [c11a9b90] [c039144c] read_part_sector+0x50/0x10c
[ 2.893643] [c11a9bb0] [c0392358] read_lba+0xb8/0x198
[ 2.898680] [c11a9bf0] [c0392854] efi_partition+0xd8/0xc30
[ 2.904150] [c11a9cc0] [c0390be4] bdev_disk_changed+0x2ac/0x710
[ 2.910051] [c11a9d30] [c036be5c] blkdev_get_whole+0xc4/0xd8
[ 2.915693] [c11a9d50] [c036cb7c] blkdev_get_by_dev.part.0+0x288/0x380
[ 2.922199] [c11a9d90] [c038d49c] disk_scan_partitions+0x90/0x150
[ 2.928272] [c11a9dc0] [c038d9ac] device_add_disk+0x450/0x484
[ 2.934001] [c11a9df0] [c04b7718] sd_probe+0x360/0x4fc
[ 2.939132] [c11a9e20] [c0479db0] really_probe+0x2a0/0x370
[ 2.944628] [c11a9e50] [c0479f10] __driver_probe_device+0x90/0x200
[ 2.950796] [c11a9e70] [c047a0c8] driver_probe_device+0x48/0x104
[ 2.956793] [c11a9ea0] [c047a268] __device_attach_driver+0xe4/0x130
[ 2.963048] [c11a9ec0] [c04772e0] bus_for_each_drv+0x88/0xe4
[ 2.968699] [c11a9ef0] [c047939c] __device_attach_async_helper+0xa0/0xe8
[ 2.975386] [c11a9f20] [c005096c] async_run_entry_fn+0x40/0x11c
[ 2.981296] [c11a9f40] [c0044b70] process_one_work+0x1bc/0x35c
[ 2.987111] [c11a9f70] [c0045088] worker_thread+0x378/0x4e8
[ 2.992667] [c11a9fc0] [c004c2c8] kthread+0xe4/0xe8
[ 2.997548] [c11a9ff0] [c000d210] start_kernel_thread+0x10/0x14
[ 3.003449] Code: 39400000 7d500ba6 4c00012c 2c090000 41a2fec4 83e1004c 3860fff2 38210050 4e800020 38a00000 3920fff2 4bffffd4 <0fe00000> 4bffff8c 80010054 7c0803a6
[ 3.018138] ---[ end trace 0000000000000000 ]---