Re: [RFC 1/3] mm: iommu: An API to unify IOMMU, CPU and devicememory management
From: Randy Dunlap
Date: Wed Jun 30 2010 - 19:42:00 EST
On Tue, 29 Jun 2010 22:55:48 -0700 Zach Pfeffer wrote:
> This patch contains the documentation for the API, termed the Virtual
> Contiguous Memory Manager. Its use would allow all of the IOMMU to VM,
> VM to device and device to IOMMU interoperation code to be refactored
> into platform independent code.
> Comments, suggestions and criticisms are welcome and wanted.
> Signed-off-by: Zach Pfeffer <zpfeffer@xxxxxxxxxxxxxx>
> Documentation/vcm.txt | 583 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 583 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/vcm.txt
> diff --git a/Documentation/vcm.txt b/Documentation/vcm.txt
> new file mode 100644
> index 0000000..d29c757
> --- /dev/null
> +++ b/Documentation/vcm.txt
> @@ -0,0 +1,583 @@
> +What is this document about?
> +This document covers how to use the Virtual Contiguous Memory Manager
> +(VCMM), how the first implmentation works with a specific low-level
> +Input/Output Memory Management Unit (IOMMU) and the way the VCMM is used
> +from user-space. It also contains a section that describes why something
> +like the VCMM is needed in the kernel.
> +If anything in this document is wrong please send patches to the
> +maintainer of this file, listed at the bottom of the document.
> +The Virtual Contiguous Memory Manager
> +The VCMM was built to solve the system-wide memory mapping issues that
> +occur when many bus-masters have IOMMUs.
> +An IOMMU maps device addresses to physical addresses. It also insulates
> +the system from spurious or malicious device bus transactions and allows
> +fine-grained mapping attribute control. The Linux kernel core does not
> +contain a generic API to handle IOMMU mapped memory; device driver writers
> +must implement device specific code to interoperate with the Linux kernel
> +core. As the number of IOMMUs increases, coordinating the many address
> +spaces mapped by all discrete IOMMUs becomes difficult without in-kernel
> +The VCMM API enables device independent IOMMU control, virtual memory
> +manager (VMM) interoperation and non-IOMMU enabled device interoperation
> +by treating devices with or without IOMMUs and all CPUs with or without
> +MMUs, their mapping contexts and their mappings using common
> +abstractions. Physical hardware is given a generic device type and mapping
> +contexts are abstracted into Virtual Contiguous Memory (VCM)
> +regions. Users "reserve" memory from VCMs and "back" their reservations
> +with physical memory.
> +Why the VCMM is Needed
> +Driver writers who control devices with IOMMUs must contend with device
> +control and memory management. Driver writers have a large device driver
> +API that they can leverage to control their devices, but they are lacking
> +a unified API to help them program mappings into IOMMUs and share those
> +mappings with other devices and CPUs in the system.
> +Sharing is complicated by Linux's CPU centric VMM. The CPU centric model
> +generally makes sense because average hardware only contains a MMU for the
> +CPU and possibly a graphics MMU. If every device in the system has one or
> +more MMUs the CPU centric memory management (MM) programming model breaks
> +Abstracting IOMMU device programming into a common API has already begun
> +in the Linux kernel. It was built to abstract the difference between AMDs
(or just "AMD")
> +and Intels IOMMUs to support x86 virtualization on both platforms. The
Intel's (or just "Intel")
> +interface is listed in kernel/include/linux/iommu.h. It contains
> +interfaces for mapping and unmapping as well as domain management. This
> +interface has not gained widespread use outside the x86; PA-RISC, Alpha
> +and SPARC architectures and ARM and PowerPC platforms all use their own
> +mapping modules to control their IOMMUs. The VCMM contains an IOMMU
> +programming layer, but since its abstraction supports map management
> +independent of device control, the layer is not used directly. This
> +higher-level view enables a new kernel service, not just an IOMMU
> +interoperation layer.
> +The General Idea: Map Management using Graphs
> +Looking at mapping from a system-wide perspective reveals a general graph
> +problem. The VCMMs API is built to manage the general mapping graph. Each
> +node that talks to memory, either through an MMU or directly (physically
> +mapped) can be thought of as the device-end of a mapping edge. The other
> +edge is the physical memory (or intermediate virtual space) that is
> +In the direct mapped case the device is assigned a one-to-one MMU. This
> +scheme allows direct mapped devices to participate in general graph
> +The CPU nodes can also be brought under the same mapping abstraction with
> +the use of a light overlay on the existing VMM. This light overlay allows
> +VMM managed mappings to interoperate with the common API. The light
> +overlay enables this without substantial modifications to the existing
> +In addition to CPU nodes that are running Linux (and the VMM), remote CPU
> +nodes that may be running other operating systems can be brought into the
> +general abstraction. Routing all memory management requests from a remote
> +node through the central memory management framework enables new features
> +like system-wide memory migration. This feature may only be feasible for
> +large buffers that are managed outside of the fast-path, but having remote
> +allocation in a system enables features that are impossible to build
> +without it.
> +The fundamental objects that support graph-based map management are:
> +1) Virtual Contiguous Memory Regions
> +2) Reservations
> +3) Associated Virtual Contiguous Memory Regions
> +4) Memory Targets
> +5) Physical Memory Allocations
> +Usage Overview
> +In a nut-shell, users allocate Virtual Contiguous Memory Regions and
> +associate those regions with one or more devices by creating an Associated
> +Virtual Contiguous Memory Region. Users then create Reservations from the
> +Virtual Contiguous Memory Region. At this point no physical memory has
> +been committed to the reservation. To associate physical memory with a
> +reservation a Physical Memory Allocation is created and the Reservation is
> +backed with this allocation.
> +include/linux/vcm.h includes comments documenting each API.
> +Virtual Contiguous Memory Regions
> +A Virtual Contiguous Memory Region (VCM) abstracts the memory space a
> +device sees. The addresses of the region are only used by the devices
> +which are associated with the region. This address space would normally be
> +implemented as a device page-table.
> +A VCM is created and destroyed with three functions:
> + struct vcm *vcm_create(size_t start_addr, size_t len);
Seems odd to use size_t for start_addr.
> + struct vcm *vcm_create_from_prebuilt(size_t ext_vcm_id);
> + int vcm_free(struct vcm *vcm);
> +start_addr is an offset into the address space where allocations will
> +start from. len is the length from start_addr of the VCM. Both functions
> +generate an instance of a VCM.
> +ext_vcm_id is used to pass a request to the VMM to generate a VCM
> +instance. In the current implementation the call simply makes a note that
> +the VCM instance is a VMM VCM instance for other interfaces usage. This
> +muxing is seen throughout the implementation.
> +vcm_create() and vcm_create_from_prebuilt() produce VCM instances for
> +virtually mapped devices (IOMMUs and CPUs). To create a one-to-one mapped
> +VCM users pass the start_addr and len of the physical region. The VCMM
> +matches this and records that the VCM instance is a one-to-one VCM.
> +The newly created VCM instance can be passed to any function that needs to
> +operate on or with a virtual contiguous memory region. Its main attributes
> +are a start_addr and a len as well as an internal setting that allows the
> +implementation to mux between true virtual spaces, one-to-one mapped
> +spaces and VMM managed spaces.
> +The current implementation uses the genalloc library to manage the VCM for
> +IOMMU devices. Return values and more in-depth per-function documentation
> +for these and the ones listed below are in include/linux/vcm.h.
> +A Reservation is a contiguous region allocated from a VCM. There is no
> +physical memory associated with it.
> +A Reservation is created and destroyed with:
> + struct res *vcm_reserve(struct vcm *vcm, size_t len, uint32_t attr);
> + int vcm_unreserve(struct res *res);
> +A vcm is a VCM created above. len is the length of the request. It can be
> +up to the length of the VCM region the reservation is being created
> +from. attr are mapping attributes: read, write, execute, user, supervisor,
> +secure, not-cached, write-back/write-allocate, write-back/no
> +write-allocate, write-through. These attrs are appropriate for ARM but can
> +be changed to match to any architecture.
> +The implementation calls gen_pool_alloc() for IOMMU devices,
> +alloc_vm_area() for VMM areas and is a pass through for one-to-one mapped
> +Associated Virtual Contiguous Memory Regions and Activation
> +An Associated Virtual Contiguous Memory Region (AVCM) is a mapping of a
> +VCM to a device. The mapping can be active or inactive.
> +An AVCM is managed with:
> +struct avcm *vcm_assoc(struct vcm *vcm, size_t dev, uint32_t attr);
> + int vcm_deassoc(struct avcm *avcm);
> + int vcm_activate(struct avcm *avcm);
> + int vcm_deactivate(struct avcm *avcm);
> +A VCM instance is a VCM created above. A dev is an opaque device handle
> +thats passed down to the device driver the VCMM muxes in to handle a
> +request. attr are association attributes: split, use-high or
size_t used for an opaque device handle seems odd.
> +use-low. split controls which transactions hit a high-address page-table
> +and which transactions hit a low-address page-table. For instance, all
> +transactions whose most significant address bit is one would use the
> +high-address page-table, any other transaction would use the low address
> +page-table. This scheme is ARM specific and could be changed in other
> +architectures. One VCM instance can be associated with many devices and
> +many VCM instances can be associated with one device.
> +An AVCM is only a link. To program and deprogram a device with a VCM the
> +user calls vcm_activate() and vcm_deactivate().For IOMMU devices,
> +activating a mapping programs the base address of a page-table into an
> +IOMMU. For VMM and one-to-one based devices, mappings are active
> +immediately but the API does require an activation call for them for
> +internal reference counting.
> +Memory Targets
> +A Memory Target is a platform independent way of specifying a physical
> +pool; it abstracts a pool of physical memory. The physical memory pool may
> +be physically discontiguous, need to be allocated from in a unique way or
> +have other user-defined attributes.
> +Physical Memory Allocation and Reservation Backing
> +Physical memory is allocated as a separate step from reserving
> +memory. This allows multiple reservations to back the same physical
> +A Physical Memory Allocation is managed using the following functions:
> + struct physmem *vcm_phys_alloc(enum memtype_t memtype, size_t len,
> + uint32_t attr);
> + int vcm_phys_free(struct physmem *physmem);
> + int vcm_back(struct res *res, struct physmem *physmem);
> + int vcm_unback(struct res *res);
> +attr can include an alignment request, a specification to map memory using
> +various block sizes and/or to use physically contiguous memory. memtype is
> +one of the memory types listed in Memory Targets.
> +The current implementation manages two pools of memory. One pool is a
> +contiguous block of memory and the other is a set of contiguous block
> +pools. In the current implementation the block pools contain 4K, 64K and
> +1M blocks. The physical allocator does not try to split blocks from the
> +contiguous block pools to satisfy requests.
> +The use of 4K, 64K and 1M blocks solves a problem with some IOMMU
> +hardware. IOMMUs are placed in front of multimedia engines to provide a
> +contiguous address space to the device. Multimedia devices need large
> +buffers and large buffers may map to a large number of physical
> +blocks. IOMMUs tend to have small translation lookaside buffers
> +(TLBs). Since the TLB is small the number of physical blocks that map a
> +given range needs to be small or else the IOMMU will continually fetch new
> +translations during a typical streamed multimedia flow. By using a 1 MB
> +mapping (or 64K mapping) instead of a 4K mapping the number of misses can
> +be minimized, allowing the multimedia block to meet its performance goals.
> +Low Level Control
> +It is necessary in some instances to access attributes and provide
> +higher-level control of the low-level hardware abstraction. The API
> +contains many functions for this task but the two that are typically used
> + size_t vcm_get_dev_addr(struct res *res);
> + int vcm_hook(size_t dev, vcm_handler handler, void *data);
> +The first function, vcm_get_dev_addr() returns a device address given a
> +reservation. This device address is a virtual IOMMU address for
> +reservations on IOMMU VCMs, a virtual VMM address for reservations on VMM
> +VCMs and a virtual (really physical since its one-to-one mapped) address
> +for one-to-one devices.
> +The second function, vcm_hook allows a caller in the kernel to register a
> +user_handler. The handler is passed the data member passed to vcm_hook
> +during a fault. The user can return 1 to indicate that the underlying
> +driver should handle the fault and retry the transaction or they can
> +return 0 to halt the transaction. If the user doesn't register a handler
> +the low-level driver will print a warning and terminate the transaction.
> +A Detailed Walk Through
> +The following call sequence walks through a typical allocation
> +sequence. In the first stage the memory for a device is reserved and
> +backed. This occurs without mapping the memory into a VMM VCM region. The
> +second stage maps the first VCM region into a VMM VCM region so the kernel
> +can read or write it. The second stage is not necessary if the VMM does
> +not need to read or modify the contents of the original mapping.
> + Stage 1: Map and Allocate Memory for a Device
> + The call sequence starts by creating a VCM region:
> + vcm = vcm_create(start_addr, len);
> + The next call associates a VCM region with a device:
> + avcm = vcm_assoc(vcm, dev, attr);
> + To activate the association users call vcm_activate() on the avcm from
> + the associate call. This programs the underlining device with the
> + mappings.
> + ret = vcm_activate(avcm);
> + Once a VCM region is created and associated it can be reserved from
> + with:
> + res = vcm_reserve(vcm, res_len, res_attr);
> + A user then allocates physical memory with:
> + physmem = vcm_phys_alloc(memtype, len, phys_attr);
> + To back the reservation with the physical memory allocation the user
> + calls:
> + vcm_back(res, physmem);
> + Stage 2: Map the Device's Memory into the VMM's VCM region
> + If the VMM needs to read and/or write the region that was just created
> + the following calls are made.
> + The first call creates a prebuilt VCM with:
> + vcm_vmm = vcm_from_prebuilt(ext_vcm_id);
> + The prebuilt VCM is associated with the CPU device and activated with:
> + avcm_vmm = vcm_assoc(vcm_vmm, dev_cpu, attr);
> + vcm_activate(avcm_vmm);
> + A reservation is made on the VMM VCM with:
> + res_vmm = vcm_reserve(vcm_vmm, res_len, attr);
> + Finally, once the topology has been set up a vcm_back() allows the VMM
> + to read the memory using the physmem generated in stage 1:
> + vcm_back(res_vmm, physmem);
> +Mapping IOMMU, one-to-one and VMM Reservations
> +The following example demonstrates mapping IOMMU, one-to-one and VMM
> +reservations to the same physical memory. It shows the use of phys_addr
> +and phys_size to create a contiguous VCM for one-to-one mapped devices.
> + The user allocates physical memory:
> + physmem = vcm_phys_alloc(memtype, SZ_2MB + SZ_4K, CONTIGUOUS);
> + Creates an IOMMU VCM:
> + vcm_iommu = vcm_create(SZ_1K, SZ_16M);
> + Creates an one-to-one VCM:
> + vcm_onetoone = vcm_create(phys_addr, phys_size);
> + Creates a Prebuit VCM:
> + vcm_vmm = vcm_from_prebuit(ext_vcm_id);
> + Associate and activate all three to their respective devices:
> + avcm_iommu = vcm_assoc(vcm_iommu, dev_iommu, attr0);
> + avcm_onetoone = vcm_assoc(vcm_onetoone, dev_onetoone, attr1);
> + avcm_vmm = vcm_assoc(vcm_vmm, dev_cpu, attr2);
error handling on vcm_assoc() failures?
> + vcm_activate(avcm_iommu);
> + vcm_activate(avcm_onetoone);
> + vcm_activate(avcm_vmm);
> + And finally, creates and backs reservations on all 3 such that they
> + all point to the same memory:
> + res_iommu = vcm_reserve(vcm_iommu, SZ_2MB + SZ_4K, attr);
> + res_onetoone = vcm_reserve(vcm_onetoone, SZ_2MB + SZ_4K, attr);
> + res_vmm = vcm_reserve(vcm_vmm, SZ_2MB + SZ_4K, attr);
> + vcm_back(res_iommu, physmem);
> + vcm_back(res_onetoone, physmem);
> + vcm_back(res_vmm, physmem);
> +VCM Summary
> +The VCMM is an attempt to abstract attributes of three distinct classes of
> +mappings into one API. The VCMM allows users to reason about mappings as
> +first class objects. It also allows memory mappings to flow from the
> +traditional 4K mappings prevalent on systems today to more efficient block
> +sizes. Finally, it allows users to manage mapping interoperation without
> +becoming VMM experts. These features will allow future systems with many
> +MMU mapped devices to interoperate simply and therefore correctly.
> +IOMMU Hardware Control
> +The VCM currently supports a single type of IOMMU, a Qualcomm System MMU
> +(SMMU). The SMMU interface contains functions to map and unmap virtual
> +addresses, perform address translations and initialize hardware. A
> +Qualcomm SMMU can contain multiple MMU contexts. Each context can
> +translate in parallel. All contexts in a SMMU share one global translation
> +look-aside buffer (TLB).
> +To support context muxing the SMMU module creates and manages device
> +independent virtual contexts. These context abstractions are bound to
> +actual contexts at run-time. Once bound, a context can be activated. This
> +activation programs the underlying context with the virtual context
> +affecting a context switch.
> +The following functions are all documented in:
> + arch/arm/mach-msm/include/mach/smmu_driver.h.
> +To map and unmap a virtual page into physical space the VCM calls:
> + int smmu_map(struct smmu_dev *dev, unsigned long pa,
> + unsigned long va, unsigned long len, unsigned int attr);
> + int smmu_unmap(struct smmu_dev *dev, unsigned long va,
> + unsigned long len);
> + int smmu_update_start(struct smmu_dev *dev);
> + int smmu_update_done(struct smmu_dev *dev);
> +The size given to map must be 4K, 64K, 1M or 16M and the VA and PA must be
> +aligned to the given size. smmu_update_start() and smmu_update_done()
> +should be called before and after each map or unmap.
> +To request a hardware VA to PA translation on a single address the VCM
> + unsigned long smmu_translate(struct smmu_dev *dev,
> + unsigned long va);
> +Fault Handling
> +To register an interrupt handler for a context the VCM calls:
> + int smmu_hook_irpt(struct smmu_dev *dev, vcm_handler handler,
> + void *data);
We try to spell out things like "interrupt" in function names
because I/we can't remember just how you decided to abbreviate it.
> +The registered interrupt handler should return 1 if it wants the SMMU
> +driver to retry the transaction again and 0 if it wants the SMMU driver to
> +terminate the transaction.
> +Managing SMMU Initialization and Contexts
> +SMMU hardware initialization and management happens in 2 steps. The first
> +step initializes global SMMU devices and abstract device contexts. The
> +second step binds contexts and devices.
> +A SMMU hardware instance is built with:
> + int smmu_drvdata_init(struct smmu_driver *drv, unsigned long base,
> + int irq);
> +A SMMU context is Initialized and deinitialized with:
An SMMU initialized
> + struct smmu_dev *smmu_ctx_init(int ctx);
> + int smmu_ctx_deinit(struct smmu_dev *dev);
> +An abstract SMMU context is bound to a particular SMMU with:
> + int smmu_ctx_bind(struct smmu_dev *ctx, struct smmu_driver *drv);
> +Activation affects a context switch.
> +Activation, deactivation and activation state testing are done with:
> + int smmu_activate(struct smmu_dev *dev);
> + int smmu_deactivate(struct smmu_dev *dev);
> + int smmu_is_active(struct smmu_dev *dev);
> +Userspace Access to Devices with IOMMUs
> +A device that issues transactions through an IOMMU must work with two
> +APIs. The first API is the VCM. The VCM API is device independent. Users
> +pass the VCM a dev_id and the VCM makes calls on the hardware device it
> +has been configured with using this dev_id. The second API is whatever
> +device topology has been created to organize the particular IOMMUs in a
> +system. The only constraint on this second API is that it must give the
> +user a single dev_id that it can pass through the VCM.
> +For the Qualcomm SMMUs the second API consists of a tree of platform
> +devices and two platform drivers as well as a context lookup function that
> +traverses the device tree and returns a dev_id given a context name.
> +Qualcomm SMMU Device Tree
> +The current tree organizes the devices into a tree that looks like the
> + smmu0/
> + ctx0
> + ctx1
> + ctx2
> + smmu1/
> + ctx3
> +Each context, ctx[n] and each smmu, smmu[n] is given a name. Since users
> +are interested in contexts not smmus, the contexts name is passed to a
> +function to find the dev_id associated with that name. The functions to
> +find, free and get the base address (since the device probe function calls
> +ioremap to map the SMMUs configuration registers into the kernel) are
> +listed here:
> + struct smmu_dev *smmu_get_ctx_instance(char *ctx_name);
> + int smmu_free_ctx_instance(struct smmu_dev *dev);
> + unsigned long smmu_get_base_addr(struct smmu_dev *dev);
> +Documentation for these functions is in:
> + arch/arm/mach-msm/include/mach/smmu_device.h
> +Each context is given a dev node named after the context. For example:
> + /dev/vcodec_a_mm1
> + /dev/vcodec_b_mm2
> + /dev/vcodec_stream
> + etc...
> +Users open, close and mmap these nodes to access VCM buffers from
> +userspace in the same way that they used to open, close and mmap /dev
> +nodes that represented large physically contiguous buffers (called PMEM
> +buffers on Android).
> +An abbreviated example is shown here:
> +Users get the dev_id associated with their target context, create a VCM
> +topology appropriate for their device and finally associate the VCMs of
> +the topology with the contexts that will take the VCMs:
> + dev_id = smmu_get_ctx_instance(vcodec_a_stream);
> +create vcm and needed topology
> + avcm = vcm_assoc(vcm, dev_id, attr);
> +Tying it all Together
> +VCMs, IOMMUs and the device tree all work to support system-wide memory
> +mappings. The use of each API in this system allows users to concentrate
> +on the relevant details without needing to worry about low-level
> +details. The APIs clear separation of memory spaces and the devices that
> +support those memory spaces continues Linuxs tradition of abstracting the
the Linux tradition
> +what from the how.
> +Maintainer: Zach Pfeffer <zpfeffer@xxxxxxxxxxxxxx>
*** Remember to use Documentation/SubmitChecklist when testing your code ***
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/