Re: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Robin Murphy
Date: Tue Feb 27 2024 - 08:23:00 EST


On 26/02/2024 9:11 pm, Michael Kelley wrote:
> From: Robin Murphy <robin.murphy@xxxxxxx> Sent: Monday, February 26, 2024 11:36 AM

>> On 21/02/2024 11:35 am, Will Deacon wrote:
>>> From: Nicolin Chen <nicolinc@xxxxxxxxxx>

>>> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
>>> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
>>> an NVMe device can request a mapping size between 300KB and 512KB, which
>>> then fails the swiotlb mapping even though the default swiotlb pool has
>>> plenty of free slots:
>>>     systemd[1]: Started Journal Service.
>>> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
>>>     note: journal-offline[392] exited with irqs disabled
>>>     note: journal-offline[392] exited with preempt_count 1

>>> Call trace:
>>> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
>>> [ 3.099921] iommu_dma_map_page+0x218/0x328
>>> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
>>> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
>>> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
>>> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
>>> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
>>> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
>>> [ 3.103463] __submit_bio+0x44/0x220
>>> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
>>> [ 3.103470] submit_bio_noacct+0x180/0x6c8
>>> [ 3.103471] submit_bio+0x34/0x130
>>> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
>>> [ 3.104766] mpage_submit_folio+0xa0/0x100
>>> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
>>> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
>>> [ 3.105615] ext4_writepages+0x80/0x118
>>> [ 3.105616] do_writepages+0x90/0x1e8
>>> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
>>> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
>>> [ 3.106656] file_write_and_wait_range+0x84/0x120
>>> [ 3.106658] ext4_sync_file+0x7c/0x4c0
>>> [ 3.106660] vfs_fsync_range+0x3c/0xa8
>>> [ 3.106663] do_fsync+0x44/0xc0

>>> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
>>> these devices should not map a size larger than swiotlb_max_mapping_size().

>>> To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices,
>>> taking swiotlb_max_mapping_size() into account alongside the
>>> iova_rcache_range() limit from iommu_dma_opt_mapping_size().
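
For reference, the shape of the fix is roughly the following: a sketch
of the new dma-iommu callback, wired up as .max_mapping_size in
iommu_dma_ops (the guard condition in the actual patch may check more
than dev_is_untrusted()):

    static size_t iommu_dma_max_mapping_size(struct device *dev)
    {
            /*
             * Untrusted devices may be bounced through SWIOTLB, so cap
             * their mapping size at what one bounce buffer can hold.
             */
            if (dev_is_untrusted(dev))
                    return swiotlb_max_mapping_size(dev);

            return SIZE_MAX;
    }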

>> On the basis that this is at least far closer to correct than doing nothing,

>> Acked-by: Robin Murphy <robin.murphy@xxxxxxx>

>> TBH I'm scared to think about theoretical correctness for all the
>> interactions between the IOVA granule and min_align_mask, since just the
>> SWIOTLB stuff is bad enough, even before you realise the ways that the
>> IOVA allocation isn't necessarily right either. However, I reckon as long
>> as we don't ever see a granule smaller than IO_TLB_SIZE, and/or a
>> min_align_mask larger than a granule, then this should probably work
>> well enough as-is.
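
(Spelled out as code, that working assumption is roughly the following
invariant; a sketch in dma-iommu terms only, not an assertion that
exists anywhere in the kernel:)

    /*
     * Sketch of the assumed invariant: the IOVA granule is at least
     * one swiotlb slot, and min_align_mask fits within one granule.
     */
    WARN_ON(iovad->granule < IO_TLB_SIZE ||
            dma_get_min_align_mask(dev) > iovad->granule);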


> I'm not convinced. The conditions you describe are reasonable
> and reflect upstream code today. But there can still be a failure
> due to attempting to allocate a "too large" swiotlb buffer. It
> happens with a granule of 64K and a min_align_mask of 4K - 1 (the
> NVMe case) when the offset passed to iommu_dma_map_page()
> is non-zero modulo 4K. With this patch, the size passed into
> iommu_dma_map_page() is limited to 252K, but it gets rounded
> up to the granule size, i.e. 256K. Then swiotlb_tbl_map_single()
> adds the offset modulo 4K. The result exceeds the 256K limit in
> swiotlb and the mapping fails.

> The case I describe is realistic and does happen in the real
> world. As-is, this patch will work most of the time, because for
> large transfers the offset is typically at least 4K aligned. But
> not always.
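
For concreteness, the arithmetic in that failure case works out as
follows (illustrative numbers, using the default swiotlb pool
constants IO_TLB_SIZE = 2K and IO_TLB_SEGSIZE = 128):

    /*
     * Worked example of the failure mode described above:
     *
     *   max contiguous swiotlb allocation = IO_TLB_SEGSIZE * IO_TLB_SIZE
     *                                     = 128 * 2K  = 256K
     *   swiotlb_max_mapping_size(dev)     = 256K - 4K = 252K
     *   252K request rounded up to the 64K IOVA granule = 256K
     *   swiotlb_tbl_map_single() then adds (orig_addr & min_align_mask),
     *   e.g. 1K, giving 257K > 256K, so the allocation fails.
     */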

Indeed, that's what my "probably [...] well enough" was meant to imply.

I think there's proving to be sufficient complexity here that, if it
turns out we do still manage to hit real-world issues with the coarse
approximation, that will be the point at which any further effort is
better spent on finally tackling the thing that's always been on the
to-do list: bouncing only the unaligned start and/or end granules,
while mapping the rest of the buffer in place.
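
In other words, something along these lines (a rough sketch with
hypothetical helpers, none of which exist in dma-iommu today):

    /* Hypothetical shape of the "bounce only the ragged edges" idea. */
    static dma_addr_t iommu_dma_map_partial_bounce(struct device *dev,
                                                   phys_addr_t phys,
                                                   size_t size)
    {
            size_t granule = iommu_dma_granule(dev);       /* hypothetical */

            /* Bounce at most one granule at each unaligned end... */
            if (!IS_ALIGNED(phys, granule))
                    bounce_head(dev, &phys, &size);        /* hypothetical */
            if (!IS_ALIGNED(phys + size, granule))
                    bounce_tail(dev, phys, &size);         /* hypothetical */

            /* ...and map the now granule-aligned middle in place. */
            return iommu_dma_map_aligned(dev, phys, size); /* hypothetical */
    }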

Cheers,
Robin.