RE: [PATCH v4 5/5] iommu/dma: Force swiotlb_max_mapping_size on an untrusted device

From: Michael Kelley
Date: Sun Feb 25 2024 - 16:17:47 EST


From: Michael Kelley <mhklinux@xxxxxxxxxxx> Sent: Friday, February 23, 2024 1:11 PM
>
> From: Nicolin Chen <nicolinc@xxxxxxxxxx> Sent: Friday, February 23, 2024 11:58 AM
> >
> > On Wed, Feb 21, 2024 at 11:39:29PM +0000, Michael Kelley wrote:
> > > From: Will Deacon <will@xxxxxxxxxx> Sent: Wednesday, February 21, 2024 3:35 AM
> > > > +static size_t iommu_dma_max_mapping_size(struct device *dev)
> > > > +{
> > > > +	if (is_swiotlb_active(dev) && dev_is_untrusted(dev))
> > > > +		return swiotlb_max_mapping_size(dev);
> > > > +	return SIZE_MAX;
> > > > +}
> > > > +
> > >
> > > In this [1] email, Nicolin had a version of this function that incorporated
> > > the IOMMU granule. For granules bigger than 4K, I think that's needed
> > > so that when IOMMU code sets the swiotlb alloc_align_mask to the
> > > IOMMU granule - 1, the larger offset plus the size won't exceed the
> > > max number of slots. swiotlb_max_mapping_size() by itself may
> > > return a value that's too big when alloc_align_mask is used.
> > >
> > > Michael
> > >
> > > [1] https://lore.kernel.org/linux-iommu/SN6PR02MB415727E61B5295C259CCB268D4512@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#m14dd359c5a4dd13e6cb0c35cf94f8d746257ae48
> >
> > Yea, I just confirmed that with 64KB PAGE_SIZE the alloc_size
> > can be over swiotlb_max_mapping_size, i.e. 256KB > 252KB. Yet,
> > the max size corresponding to the max number of slots should
> > be 256KB. So, I feel that this is marginally safe?
>
> Yes, in the specific case you tested. But I don't think the general
> case is safe. In your specific case, the "size" argument to
> iommu_dma_map_page() is presumably 252 Kbytes because that's
> what your new iommu_dma_max_mapping_size() returns.
> iommu_dma_map_page() calls iova_align() to round up the 252K
> to 256K. Also in your specific case, the "offset" argument to
> iommu_dma_map_page() is 0, so the "phys" variable (which embeds
> the offset) passed to swiotlb_tbl_map_single() is aligned on a
> 64 Kbyte page boundary. swiotlb_tbl_map_single() then
> computes an offset in the orig_addr (i.e., "phys") based on the
> DMA min_align_mask (4 Kbytes), and that value is 0 in your specific
> case. swiotlb_tbl_map_single() then calls swiotlb_find_slots()
> specifying a size that is 256K bytes plus an offset of 0, so everything
> works.
>
> But what if the original offset passed to iommu_dma_map_page()
> is 3 Kbytes (for example)? swiotlb_tbl_map_single() will have an
> orig_addr that is offset from a page boundary by 3 Kbytes. The
> value passed to swiotlb_find_slots() will be 256 Kbytes plus an
> offset of 3 Kbytes, which is too big.
>
> >
> > With that being said, there seems to be a 4KB size waste, due
> > to aligning with the iommu_domain granule, in this particular
> > alloc_size=256KB case?
> >
> > On the other hand, if swiotlb_max_mapping_size was subtracted
> > by 64KB (granule=64KB), i.e. alloc_size=192KB, which seems to
> > generate more swiotlb fragments?
>
> dma_max_mapping_size() returns a fixed value for a particular
> device, so the fixed value must be conservative enough to handle
> dma_map_page() calls with a non-zero offset (or the similar
> dma_map_sg() where the scatter/gather list has non-zero
> offsets). So yes, the higher layers must limit I/O size, which
> can produce more separate I/Os and swiotlb fragments to
> fulfill an original request that is 256 Kbytes or larger. The
> same thing happens with 4K page size in that a 256K I/O
> gets broken into a 252K I/O and a 4K I/O if the swiotlb is
> being used.
>
> I'm trying to think through whether there's a way to make
> things work with iommu_dma_max_mapping_size() being less
> conservative than subtracting the full granule size. I'm
> doubtful that there is.
>

With the current interface between iommu_dma_map_page()
and swiotlb_tbl_map_single(), iommu_dma_max_mapping_size()
really has no choice but to be conservative and subtract
the full granule size.
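To make the arithmetic concrete, here's a quick user-space
sketch (not kernel code -- the constants assume 2K swiotlb
slots, 128 slots per allocation, the 64K granule and 4K
min_align_mask from the case above, and the helper name is
mine) showing why anything less conservative than subtracting
the whole granule can overflow the slot limit:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed constants: 2 KB swiotlb slots, 128 slots per allocation,
 * 64 KB IOMMU granule, 4 KB DMA min_align_mask. */
#define IO_TLB_SIZE      2048UL
#define IO_TLB_SEGSIZE   128UL
#define SLOT_LIMIT       (IO_TLB_SIZE * IO_TLB_SEGSIZE)   /* 256 KB */
#define GRANULE          (64 * 1024UL)
#define MIN_ALIGN_MASK   ((4 * 1024UL) - 1)

#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))

/* Bytes of swiotlb the current code path consumes: the size is
 * rounded up to the granule in iommu_dma_map_page(), and then
 * swiotlb_tbl_map_single() adds the min_align_mask offset on top. */
static size_t bytes_needed(size_t size, uint64_t orig_addr)
{
	return ALIGN_UP(size, GRANULE) + (orig_addr & MIN_ALIGN_MASK);
}
```

With offset 0, a 252K mapping consumes exactly 256K and fits;
with a 3K offset it needs 259K and overflows. Only capping the
size at 256K minus the full 64K granule keeps every offset safe.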

The underlying problem is how iommu_dma_map_page() and
swiotlb_tbl_map_single() interact. iommu_dma_map_page()
rounds up the size to a granule boundary, and then
swiotlb_tbl_map_single() adds the offset modulo DMA
min_align_mask. These steps should be done in the opposite
order -- first add the offset, then do the rounding up. With
that opposite order, iommu_dma_max_mapping_size() could be
less conservative and be based solely on min_align_mask.
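The two orderings can be compared directly (again a user-space
sketch with assumed constants -- 64K granule, 4K min_align_mask,
256K of slots -- and helper names of my own invention):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRANULE        (64 * 1024UL)        /* assumed IOMMU granule */
#define MIN_ALIGN_MASK ((4 * 1024UL) - 1)   /* assumed DMA min_align_mask */
#define SLOT_LIMIT     (128UL * 2048UL)     /* 256 KB of swiotlb slots */

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Current order: round the size up to the granule first, then add
 * the min_align_mask offset on top of the rounded size. */
static size_t round_then_add(size_t size, uint64_t orig_addr)
{
	return ALIGN_UP(size, GRANULE) + (orig_addr & MIN_ALIGN_MASK);
}

/* Opposite order: add the offset first, then do a single round-up.
 * The offset is absorbed by the round-up instead of spilling past it. */
static size_t add_then_round(size_t size, uint64_t orig_addr)
{
	return ALIGN_UP(size + (orig_addr & MIN_ALIGN_MASK), GRANULE);
}
```

For the 252K/3K-offset case, round_then_add() needs 259K and
busts the 256K slot limit, while add_then_round() needs exactly
256K -- and stays within the limit even for the worst-case
offset of min_align_mask.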

I could see using the following approach, where alignment
and rounding up are both done in swiotlb_tbl_map_single(),
based on the alloc_align_mask parameter. Conceptually it
makes sense to align the start and the end of the allocated
buffer in the same function rather than splitting them.

1. Remove the alloc_size parameter from
swiotlb_tbl_map_single(). Fix up the two other call sites,
where alloc_size is redundant anyway.

2. swiotlb_tbl_map_single() calculates its own local
alloc_size as mapping_size + offset, rounded up based on
alloc_align_mask. If alloc_align_mask is zero, this
rounding up is a no-op, and alloc_size equals mapping_size
+ offset. Pass this local alloc_size to swiotlb_find_slots().

3. iommu_dma_map_page() removes the iova_align() call
just prior to invoking swiotlb_tbl_map_single().
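Step 2 might look something like the following sketch (the
helper name, parameter list, and exact mask handling are my
assumptions, not a worked-out patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Hypothetical helper: swiotlb_tbl_map_single() derives alloc_size
 * itself from the mapping size, the orig_addr offset modulo
 * min_align_mask, and alloc_align_mask, instead of receiving a
 * pre-rounded alloc_size from the caller. */
static size_t tbl_alloc_size(size_t mapping_size, uint64_t orig_addr,
			     uint64_t min_align_mask,
			     uint64_t alloc_align_mask)
{
	size_t offset = orig_addr & min_align_mask;

	/* alloc_align_mask == 0 makes the round-up a no-op, so the
	 * non-IOMMU callers see no behavior change. */
	return ALIGN_UP(mapping_size + offset, alloc_align_mask + 1);
}
```

For a 252K mapping at a 3K offset with a 64K alloc_align_mask,
this yields exactly 256K, so the granule alignment and the
min_align_mask offset no longer stack on top of each other.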

Looking at the code, there are also problems in how
iommu_dma_map_page() zero's out portions of the
allocated swiotlb buffer. Evidently the zero'ing is done
to avoid giving an untrusted device DMA access to kernel
memory areas that could contain sensitive information.
But I see these issues in the code:

1. The zero'ing doesn't cover any "padding" at the
beginning of the swiotlb buffer due to min_align_mask
offsetting. The cover letter for the last set of changes in
this area explicitly says that such padding isn't zero'ed [1]
but I can't reconcile why. The swiotlb *does* allocate
padding slots at the beginning, even in the 5.16 kernel
when those commits first went in. Maybe I'm missing
something.
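To spell out where I think the uncovered padding comes from
(a sketch under the same assumed constants; my reading of the
code, so treat it as a hypothesis): swiotlb returns
slot_base + offset, so everything below the returned address
is head padding, while the zero'ing only covers the tail
between size and the granule-rounded size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MIN_ALIGN_MASK ((4 * 1024UL) - 1)   /* assumed DMA min_align_mask */

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Head padding: swiotlb places the mapping at slot_base + offset,
 * where offset is orig_addr modulo min_align_mask.  The bytes in
 * [slot_base, slot_base + offset) belong to the allocation but are
 * not zero'ed by iommu_dma_map_page(). */
static size_t head_padding(uint64_t orig_addr)
{
	return orig_addr & MIN_ALIGN_MASK;
}

/* Tail padding: the region [start + size, start + aligned_size)
 * that iommu_dma_map_page() does zero today. */
static size_t tail_zeroed(size_t size, size_t granule)
{
	return ALIGN_UP(size, granule) - size;
}
```

So for a 252K mapping at a 3K offset with a 64K granule, 4K of
tail padding gets zero'ed but 3K of head padding does not.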

2. With commit 901c7280ca0d, swiotlb_tbl_map_single()
always calls swiotlb_bounce(), which initializes the
mapped portion regardless of the DMA xfer direction. But
then in the case of DMA_FROM_DEVICE (i.e., read I/Os)
iommu_dma_map_page() will zero what swiotlb_bounce()
already initialized. The discussion that led to commit
901c7280ca0d concluded that the buffer must be
initialized to the ultimate target buffer contents so that
partial reads don't change data that wasn't read. This
zero'ing violates that and should be removed (unless
there's some unique handling of untrusted devices I'm
not aware of).

3. If the swiotlb path is taken for reasons other than an
untrusted device, iommu_dma_map_page() still does the
zero'ing. That doesn't break anything, but it seems wasteful
as there could be a lot of padding area that is unnecessarily
zero'ed in the dma_kmalloc_needs_bounce() case, especially
if the IOMMU granule is 64K.

Net, it seems like some additional work beyond Patch 5
of Will's series is needed to get Nicolin's path working
robustly. I could help with coding, but I can't test IOMMU
paths as I only have access to VMs (x64 and arm64).

Michael

[1] https://lore.kernel.org/all/20210929023300.335969-1-stevensd@xxxxxxxxxx/