Re: [PATCH v3 0/6] add non-strict mode support for arm-smmu-v3

From: Leizhen (ThunderTown)
Date: Thu Jul 26 2018 - 22:50:38 EST

On 2018/7/26 22:16, Robin Murphy wrote:
> On 2018-07-26 4:44 AM, Leizhen (ThunderTown) wrote:
>> On 2018/7/25 5:51, Robin Murphy wrote:
>>> On 2018-07-12 7:18 AM, Zhen Lei wrote:
>>>> v2 -> v3: Add a bootup option "iommu_strict_mode" to make the
>>>> manager can choose which mode to be used. The first 5 patches
>>>> have not changed. + iommu_strict_mode= [arm-smmu-v3] +
>>>> 0 - strict mode (default) + 1 - non-strict mode
>>>> v1 -> v2: Use the lowest bit of the io_pgtable_ops.unmap's iova
>>>> parameter to pass the strict mode: 0, IOMMU_STRICT; 1,
>>>> IOMMU_NON_STRICT; Treat 0 as IOMMU_STRICT, so that the unmap
>>>> operation can compatible with other IOMMUs which still use strict
>>>> mode. In other words, this patch series will not impact other
>>>> IOMMU drivers. I tried add a new quirk
>>>> IO_PGTABLE_QUIRK_NON_STRICT in io_pgtable_cfg.quirks, but it can
>>>> not pass the strict mode of the domain from SMMUv3 driver to
>>>> io-pgtable module.
>>> What exactly is the issue there? We don't have any problem with
>>> other quirks like NO_DMA, and as I said before, by the time we're
>>> allocating the io-pgtable in arm_smmu_domain_finalise() we already
>>> know everything there is to know about the domain.
>> Because userspace can map/unamp and start devices to access memory
>> through VFIO. So that, the attacker can: 1. alloc memory 2. map 3.
>> unmap 4. free memory 5. repeatedly accesssing the just freed memory
>> base on the just unmapped iova, this attack may success if the freed
>> memory is reused by others and the mapping still staying in TLB
> Right, but that's why we don't set non-strict mode on unmanaged domains; what I was asking about was specifically why "it can not pass the strict mode of the domain from SMMUv3 driver to io-pgtable module", because we don't get anywhere near io-pgtable until we already know whether the domain in question can allow lazy unmaps or not.

Sorry, it's my fault. I've all through mistaken that "data->iop.cfg" is shared by all domains until you mentioned me again.

>> But if only root user can use VFIO, this is an unnecessary worry.
>> Then both normal and VFIO will use the same strict mode, so that the
>> new quirk IO_PGTABLE_QUIRK_NON_STRICT can easily be applied.
>>>> Add a new member domain_non_strict in struct iommu_dma_cookie,
>>>> this member will only be initialized when the related domain and
>>>> IOMMU driver support non-strict mode.
>>>> v1: In common, a IOMMU unmap operation follow the below steps: 1.
>>>> remove the mapping in page table of the specified iova range 2.
>>>> execute tlbi command to invalid the mapping which is cached in
>>>> TLB 3. wait for the above tlbi operation to be finished 4. free
>>>> the IOVA resource 5. free the physical memory resource
>>>> This maybe a problem when unmap is very frequently, the
>>>> combination of tlbi and wait operation will consume a lot of
>>>> time. A feasible method is put off tlbi and iova-free operation,
>>>> when accumulating to a certain number or reaching a specified
>>>> time, execute only one tlbi_all command to clean up TLB, then
>>>> free the backup IOVAs. Mark as non-strict mode.
>>>> But it must be noted that, although the mapping has already been
>>>> removed in the page table, it maybe still exist in TLB. And the
>>>> freed physical memory may also be reused for others. So a
>>>> attacker can persistent access to memory based on the just freed
>>>> IOVA, to obtain sensible data or corrupt memory. So the VFIO
>>>> should always choose the strict mode.
>>>> Some may consider put off physical memory free also, that will
>>>> still follow strict mode. But for the map_sg cases, the memory
>>>> allocation is not controlled by IOMMU APIs, so it is not
>>>> enforceable.
>>>> Fortunately, Intel and AMD have already applied the non-strict
>>>> mode, and put queue_iova() operation into the common file
>>>> dma-iommu.c., and my work is based on it. The difference is that
>>>> arm-smmu-v3 driver will call IOMMU common APIs to unmap, but
>>>> Intel and AMD IOMMU drivers are not.
>>>> Below is the performance data of strict vs non-strict for NVMe
>>>> device: Randomly Read IOPS: 146K(strict) vs 573K(non-strict) Randomly Write IOPS: 143K(strict) vs 513K(non-strict)
>>> How does that compare to passthrough performance? One thing I'm not
>>> entirely clear about is what the realistic use-case for this is -
>>> even if invalidation were infinitely fast, enabling translation
>>> still typically has a fair impact on overall system performance in
>>> terms of latency, power, memory bandwidth, etc., so I can't help
>>> wonder what devices exist today for which performance is critical
>>> and robustness* is unimportant, yet have crippled addressing
>>> capabilities such that they can't just use passthrough.
>> I have no passthrough performance data yet, I will ask my team to do
>> it. But we have tested the Global bypass: Randomly Read IOPS: 744K,
>> and Randomly Write IOPS: is the same to non-strict.
>> I'm also not clear. But I think in most cases, the system does not
>> need to run at full capacity, but the system should have that
>> ability. For example, a system's daily load may only 30-50%, but the
>> load may increase to 80%+ on festival day.
>> Passthrough is not enough to support VFIO, and virtualization need
>> the later.
> Huh? The whole point of passthrough mode is that the IOMMU can still be used for VFIO, but without imposing the overhead of dynamic mapping on host DMA.

I said that from my experience. Userspace do not known the PA, so I think the user can not fill dma_map.iova correctly.

/* Allocate some space and setup a DMA mapping */
dma_map.vaddr = (__u64)mmap(0, 0x1000, PROT_READ | PROT_WRITE,
dma_map.size = 0x1000;
dma_map.iova = 0x2f00000000UL; /* user specified */

ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

Further more, dma_map is also suitable for iommu_map_sg usage scenario.

> Robin.
>>> * I don't want to say "security" here, since I'm actually a lot
>>> less concerned about the theoretical malicious endpoint/wild write
>>> scenarios than the the much more straightforward malfunctioning
>>> device and/or buggy driver causing use-after-free style memory
>>> corruption. Also, I'm sick of the word "security"...
>> OKïWe really have no need to consider buggy devices.
>>>> Zhen Lei (6): iommu/arm-smmu-v3: fix the implementation of
>>>> flush_iotlb_all hook iommu/dma: add support for non-strict mode iommu/amd: use default branch to deal with all non-supported capabilities iommu/io-pgtable-arm: add support for non-strict
>>>> mode iommu/arm-smmu-v3: add support for non-strict mode iommu/arm-smmu-v3: add bootup option "iommu_strict_mode"
>>>> Documentation/admin-guide/kernel-parameters.txt | 12 +++++++ drivers/iommu/amd_iommu.c | 4 +-- drivers/iommu/arm-smmu-v3.c | 42
>>>> +++++++++++++++++++++++-- drivers/iommu/dma-iommu.c
>>>> | 25 +++++++++++++++ drivers/iommu/io-pgtable-arm.c
>>>> | 23 ++++++++------ include/linux/iommu.h
>>>> | 7 +++++ 6 files changed, 98 insertions(+), 15 deletions(-)
>>> .
> .