RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Michael Kelley
Date: Tue Feb 27 2024 - 09:31:50 EST


From: Robin Murphy <robin.murphy@xxxxxxx> Sent: Tuesday, February 27, 2024 5:23 AM
>
> On 26/02/2024 9:11 pm, Michael Kelley wrote:
> > From: Robin Murphy <robin.murphy@xxxxxxx> Sent: Monday, February 26, 2024 11:36 AM
> >>
> >> On 21/02/2024 11:35 am, Will Deacon wrote:
> >>> From: Nicolin Chen <nicolinc@xxxxxxxxxx>
> >>>
> >>> The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
> >>> On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that
> >>> an NVMe device can map a size between 300KB and 512KB, which certainly fails
> >>> the swiotlb mapping, even though the default swiotlb pool has many slots:
> >>> systemd[1]: Started Journal Service.
> >>> => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
> >>> note: journal-offline[392] exited with irqs disabled
> >>> note: journal-offline[392] exited with preempt_count 1
> >>>
> >>> Call trace:
> >>> [ 3.099918] swiotlb_tbl_map_single+0x214/0x240
> >>> [ 3.099921] iommu_dma_map_page+0x218/0x328
> >>> [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0
> >>> [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme]
> >>> [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme]
> >>> [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600
> >>> [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0
> >>> [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8
> >>> [ 3.103463] __submit_bio+0x44/0x220
> >>> [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360
> >>> [ 3.103470] submit_bio_noacct+0x180/0x6c8
> >>> [ 3.103471] submit_bio+0x34/0x130
> >>> [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8
> >>> [ 3.104766] mpage_submit_folio+0xa0/0x100
> >>> [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400
> >>> [ 3.104771] ext4_do_writepages+0x6a0/0xd78
> >>> [ 3.105615] ext4_writepages+0x80/0x118
> >>> [ 3.105616] do_writepages+0x90/0x1e8
> >>> [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0
> >>> [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8
> >>> [ 3.106656] file_write_and_wait_range+0x84/0x120
> >>> [ 3.106658] ext4_sync_file+0x7c/0x4c0
> >>> [ 3.106660] vfs_fsync_range+0x3c/0xa8
> >>> [ 3.106663] do_fsync+0x44/0xc0
> >>>
> >>> Since untrusted devices might go down the swiotlb pathway with dma-iommu,
> >>> these devices should not map a size larger than swiotlb_max_mapping_size().
> >>>
> >>> To fix this bug, add iommu_dma_max_mapping_size() so that untrusted devices
> >>> take swiotlb_max_mapping_size() into account, rather than only the
> >>> iova_rcache_range() used by iommu_dma_opt_mapping_size().
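
[For reference, a minimal sketch of the shape such a callback might take --
a sketch only, not the literal patch, and the exact guard conditions in the
posted version may differ:]

/*
 * Sketch: cap the advertised max mapping size for untrusted devices,
 * which dma-iommu may bounce through swiotlb, at what swiotlb can map
 * in one contiguous allocation. Wired up via the .max_mapping_size
 * member of iommu_dma_ops.
 */
static size_t iommu_dma_max_mapping_size(struct device *dev)
{
        /* dev_is_untrusted() is an existing helper in dma-iommu.c */
        if (dev_is_untrusted(dev))
                return swiotlb_max_mapping_size(dev);
        return SIZE_MAX;
}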
> >>
> >> On the basis that this is at least far closer to correct than doing nothing,
> >>
> >> Acked-by: Robin Murphy <robin.murphy@xxxxxxx>
> >>
> >> TBH I'm scared to think about theoretical correctness for all the
> >> interactions between the IOVA granule and min_align_mask, since just the
> >> SWIOTLB stuff is bad enough, even before you realise the ways that the
> >> IOVA allocation isn't necessarily right either. However, I reckon that as
> >> long as we never see a granule smaller than IO_TLB_SIZE, and/or a
> >> min_align_mask larger than a granule, this should probably work
> >> well enough as-is.
> >>
> >
> > I'm not convinced. The conditions you describe are reasonable
> > and reflect upstream code today. But there can still be a failure
> > due to attempting to allocate a "too large" swiotlb buffer. It
> > happens with a granule of 64K and min_align_mask of 4K - 1 (the
> > NVMe case) when the offset passed to iommu_dma_map_page()
> > is non-zero modulo 4K. With this patch, the size passed into
> > iommu_dma_map_page() is limited to 252K, but it gets rounded
> > up to 256K. Then swiotlb_tbl_map_single() adds the offset
> > modulo 4K. The result exceeds the 256K limit in swiotlb and
> > the mapping fails.
> >
> > The case I describe is a reasonable case that happens in the real
> > world. As is, this patch will work most of the time because for
> > large xfers the offset is typically at least 4K aligned. But not always.
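
[To make that arithmetic concrete, here is a worked example of the failure --
a standalone sketch, not kernel code, just redoing the numbers above with the
usual pool geometry of IO_TLB_SEGSIZE (128) slots of IO_TLB_SIZE (2K) each:]

#include <stdio.h>

int main(void)
{
        unsigned long granule = 64 * 1024UL;   /* 64K IOVA granule */
        unsigned long seg_max = 128 * 2048UL;  /* 256K swiotlb segment limit */
        unsigned long min_align = 4 * 1024UL;  /* min_align_mask + 1 (NVMe) */

        /* swiotlb_max_mapping_size(): 256K - 4K = 252K */
        unsigned long max_map = seg_max - min_align;

        /* iommu_dma_map_page() rounds the size up to the granule: 256K */
        unsigned long rounded = (max_map + granule - 1) & ~(granule - 1);

        /* an orig_addr offset that is non-zero modulo 4K must be preserved */
        unsigned long offset = 1024;

        printf("capped size : %luK\n", max_map / 1024);
        printf("rounded size: %luK\n", rounded / 1024);
        printf("%luK + %luK offset > %luK segment limit -> mapping fails\n",
               rounded / 1024, offset / 1024, seg_max / 1024);
        return 0;
}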
>
> Indeed that's what my "probably [...] well enough" meant to imply.
>
> I think there's proving to be sufficient complexity here that, if we do
> still manage to hit real-world issues with this coarse approximation,
> that will be the point at which any further effort is better spent on
> finally tackling the thing that's always been on the to-do list: only
> bouncing the unaligned start and/or end granules while mapping the rest
> of the buffer in place.
>
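
[For illustration, the granule-splitting in that to-do item might look
roughly like the following -- a standalone sketch with made-up numbers,
nothing like real dma-iommu code, and it assumes the buffer crosses at
least one granule boundary:]

#include <stdio.h>

/*
 * Split a mapping at IOVA-granule boundaries: only the unaligned head
 * and tail fragments would be bounced through swiotlb, while the
 * aligned middle is mapped in place with no copy.
 */
int main(void)
{
        unsigned long granule = 64 * 1024UL;  /* 64K IOVA granule */
        unsigned long start = 0x10400;        /* unaligned start address */
        unsigned long end = start + 300 * 1024UL;

        unsigned long head_end = (start + granule - 1) & ~(granule - 1);
        unsigned long tail_start = end & ~(granule - 1);

        printf("bounce head : %lu bytes\n", head_end - start);
        printf("map in place: %lu bytes\n", tail_start - head_end);
        printf("bounce tail : %lu bytes\n", end - tail_start);
        return 0;
}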

Fair enough. It's your call to make. :-)

Michael