Re: [PATCH 06/10] swiotlb: use swiotlb_map_page in swiotlb_map_sg_attrs

From: Robin Murphy
Date: Mon Nov 19 2018 - 14:36:53 EST


On 09/11/2018 16:37, Robin Murphy wrote:
On 09/11/2018 07:49, Christoph Hellwig wrote:
On Tue, Nov 06, 2018 at 05:27:14PM -0800, John Stultz wrote:
But at that point if I just re-apply "swiotlb: use swiotlb_map_page in
swiotlb_map_sg_attrs", I reproduce the hangs.

Any suggestions for how to further debug what might be going wrong
would be appreciated!

Very odd. In the end map_sg and map_page are defined to do the same
things to start with. The only real issue we had in this area was:

"[PATCH v2] of/device: Really only set bus DMA mask when appropriate"

so with current mainline + that you still see a problem, and if you
revert the commit we are replying to it still goes away?

OK, after quite a bit of trying I have managed to provoke a similar-looking problem with straight 4.20-rc1 on my Juno board - so far my "reproducer" is to decompress a ~10GB .tar.xz off an external USB hard disk, wherein after somewhere between 5 minutes and half an hour or so it tends to fall over with xz choking on corrupt data and/or a USB error.

From the presentation, this really smells like there's some corner in which we're either missing cache maintenance or doing it to the wrong address - I've not seen any issues with Juno's main PCIe-attached I/O, but the EHCI here is non-coherent (and 32-bit, so the bus_dma_mask thing doesn't matter) as are the HiKey UFS and SD controller.

I'll keep digging...

OK, having brought my Hikey to life and reproduced John's stall with rc1, what's going on is that at some point dma_map_sg() returns 0, which causes the SCSI/UFS layer to go round in circles repeatedly trying to map the same list(s) equally unsuccessfully.

Why does dma_map_sg() fail? Turns out what we all managed to overlook is that this patch *does* introduce a subtle change in behaviour, in that previously the non-bounced case assigned dev_addr to sg->dma_address without looking at it; now with the swiotlb_map_page() call we check the return value against DIRECT_MAPPING_ERROR regardless of whether it was bounced or not.

Flash back to the other thread when I said "...but I suspect there may well be non-IOMMU platforms where DMA to physical address 0 is a thing :("? I have the 3GB Hikey where all the RAM is below 32 bits so SWIOTLB never ever bounces, but sure enough, guess where that RAM starts...
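
For illustration only, here is a minimal userspace sketch of that change in behaviour. It is a model, not the kernel code: it assumes DIRECT_MAPPING_ERROR is 0 and that the non-bounced path is an identity phys-to-DMA translation, and struct sg_entry plus the model_* functions are names made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Assumption: the error sentinel is 0, as the discussion above implies. */
#define DIRECT_MAPPING_ERROR ((dma_addr_t)0)

/* Hypothetical, simplified stand-in for struct scatterlist. */
struct sg_entry {
	uint64_t   phys;        /* physical address of the buffer */
	dma_addr_t dma_address; /* filled in by the map functions below */
};

/* Stand-in for the non-bounced path: identity phys-to-DMA translation. */
dma_addr_t model_map_page(uint64_t phys)
{
	return (dma_addr_t)phys;
}

/* Old behaviour: a non-bounced address is stored without being inspected. */
int model_map_sg_old(struct sg_entry *sgl, int nents)
{
	assert(sgl != NULL && nents > 0);
	for (int i = 0; i < nents; i++)
		sgl[i].dma_address = model_map_page(sgl[i].phys);
	return nents;
}

/* New behaviour: every result is compared against the error sentinel. */
int model_map_sg_new(struct sg_entry *sgl, int nents)
{
	assert(sgl != NULL && nents > 0);
	for (int i = 0; i < nents; i++) {
		sgl[i].dma_address = model_map_page(sgl[i].phys);
		if (sgl[i].dma_address == DIRECT_MAPPING_ERROR)
			return 0; /* the caller sees "nothing mapped" */
	}
	return nents;
}
```

In this model a list whose first entry sits at physical address 0 maps fine through model_map_sg_old() but makes model_map_sg_new() return 0, which is the kind of failure a caller would then keep retrying.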

So in fact it looks like patch #4 technically introduces the first instance of this problem, we're just getting lucky not to hit it with a map_page/map_single case such that direct_mapping_error() would wrongly report failure for page 0. The bad news (for me) is that that can't have anything to do with my apparent memory corruption thing above, so now I still need to figure out what the hell is going on there.

Robin.