Re: [PATCH RFC 01/18] accel/qda: Add Qualcomm QDA DSP accelerator driver docs
From: Dmitry Baryshkov
Date: Wed Feb 25 2026 - 12:19:02 EST
On Wed, Feb 25, 2026 at 07:27:47PM +0530, Ekansh Gupta wrote:
>
>
> On 2/24/2026 2:47 AM, Dmitry Baryshkov wrote:
> > On Tue, Feb 24, 2026 at 12:38:55AM +0530, Ekansh Gupta wrote:
> >> Add initial documentation for the Qualcomm DSP Accelerator (QDA) driver
> >> integrated in the DRM accel subsystem.
> >>
> >> The new docs introduce QDA as a DRM/accel-based implementation of
> >> Hexagon DSP offload that is intended as a modern alternative to the
> >> legacy FastRPC driver in drivers/misc. The text describes the driver
> >> motivation, high-level architecture and interaction with IOMMU context
> >> banks, GEM-based buffer management and the RPMsg transport.
> >>
> >> The user-space facing section documents the main QDA IOCTLs used to
> >> establish DSP sessions, manage GEM buffer objects and invoke remote
> >> procedures using the FastRPC protocol, along with a typical lifecycle
> >> example for applications.
> >>
> >> Finally, the driver is wired into the Compute Accelerators
> >> documentation index under Documentation/accel, and a brief debugging
> >> section shows how to enable dynamic debug for the QDA implementation.
> >>
> >> Signed-off-by: Ekansh Gupta <ekansh.gupta@xxxxxxxxxxxxxxxx>
> >> ---
> >> Documentation/accel/index.rst | 1 +
> >> Documentation/accel/qda/index.rst | 14 +++++
> >> Documentation/accel/qda/qda.rst | 129 ++++++++++++++++++++++++++++++++++++++
> >> 3 files changed, 144 insertions(+)
> >>
> >> diff --git a/Documentation/accel/index.rst b/Documentation/accel/index.rst
> >> index cbc7d4c3876a..5901ea7f784c 100644
> >> --- a/Documentation/accel/index.rst
> >> +++ b/Documentation/accel/index.rst
> >> @@ -10,4 +10,5 @@ Compute Accelerators
> >> introduction
> >> amdxdna/index
> >> qaic/index
> >> + qda/index
> >> rocket/index
> >> diff --git a/Documentation/accel/qda/index.rst b/Documentation/accel/qda/index.rst
> >> new file mode 100644
> >> index 000000000000..bce188f21117
> >> --- /dev/null
> >> +++ b/Documentation/accel/qda/index.rst
> >> @@ -0,0 +1,14 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-only
> >> +
> >> +==============================
> >> + accel/qda Qualcomm DSP Driver
> >> +==============================
> >> +
> >> +The **accel/qda** driver provides support for Qualcomm Hexagon DSPs (Digital
> >> +Signal Processors) within the DRM accelerator framework. It serves as a modern
> >> +replacement for the legacy FastRPC driver, offering improved resource management
> >> +and standard subsystem integration.
> >> +
> >> +.. toctree::
> >> +
> >> + qda
> >> diff --git a/Documentation/accel/qda/qda.rst b/Documentation/accel/qda/qda.rst
> >> new file mode 100644
> >> index 000000000000..742159841b95
> >> --- /dev/null
> >> +++ b/Documentation/accel/qda/qda.rst
> >> @@ -0,0 +1,129 @@
> >> +.. SPDX-License-Identifier: GPL-2.0-only
> >> +
> >> +==================================
> >> +Qualcomm Hexagon DSP (QDA) Driver
> >> +==================================
> >> +
> >> +Introduction
> >> +============
> >> +
> >> +The **QDA** (Qualcomm DSP Accelerator) driver is a new DRM-based
> >> +accelerator driver for Qualcomm's Hexagon DSPs. It provides a standardized
> >> +interface for user-space applications to offload computational tasks ranging
> >> +from audio processing and sensor offload to computer vision and AI
> >> +inference to the Hexagon DSPs found on Qualcomm SoCs.
> >> +
> >> +This driver is designed to align with the Linux kernel's modern **Compute
> >> +Accelerators** subsystem (`drivers/accel/`), providing a robust and modular
> >> +alternative to the legacy FastRPC driver in `drivers/misc/`, offering
> >> +improved resource management and better integration with standard kernel
> >> +subsystems.
> >> +
> >> +Motivation
> >> +==========
> >> +
> >> +The existing FastRPC implementation in the kernel utilizes a custom character
> >> +device and lacks integration with modern kernel memory management frameworks.
> >> +The QDA driver addresses these limitations by:
> >> +
> >> +1. **Adopting the DRM accel Framework**: Leveraging standard uAPIs for device
> >> + management, job submission, and synchronization.
> >> +2. **Utilizing GEM for Memory**: Providing proper buffer object management,
> >> + including DMA-BUF import/export capabilities.
> >> +3. **Improving Isolation**: Using IOMMU context banks to enforce memory
> >> + isolation between different DSP user sessions.
> >> +
> >> +Key Features
> >> +============
> >> +
> >> +* **Standard Accelerator Interface**: Exposes a standard character device
> >> + node (e.g., `/dev/accel/accel0`) via the DRM subsystem.
> >> +* **Unified Offload Support**: Supports all DSP domains (ADSP, CDSP, SDSP,
> >> + GDSP) via a single driver architecture.
> >> +* **FastRPC Protocol**: Implements the reliable Remote Procedure Call
> >> + (FastRPC) protocol for communication between the application processor
> >> + and DSP.
> >> +* **DMA-BUF Interop**: Seamless sharing of memory buffers between the DSP
> >> + and other multimedia subsystems (GPU, Camera, Video) via standard DMA-BUFs.
> >> +* **Modular Design**: Clean separation between the core DRM logic, the memory
> >> + manager, and the RPMsg-based transport layer.
> >> +
> >> +Architecture
> >> +============
> >> +
> >> +The QDA driver is composed of several modular components:
> >> +
> >> +1. **Core Driver (`qda_drv`)**: Manages device registration, file operations,
> >> + and bridges the driver with the DRM accelerator subsystem.
> >> +2. **Memory Manager (`qda_memory_manager`)**: A flexible memory management
> >> + layer that handles IOMMU context banks. It supports pluggable backends
> >> + (such as DMA-coherent) to adapt to different SoC memory architectures.
> >> +3. **GEM Subsystem**: Implements the DRM GEM interface for buffer management:
> >> +
> >> + * **`qda_gem`**: Core GEM object management, including allocation, mmap
> >> + operations, and buffer lifecycle management.
> >> + * **`qda_prime`**: PRIME import functionality for DMA-BUF interoperability,
> >> + enabling seamless buffer sharing with other kernel subsystems.
> >> +
> >> +4. **Transport Layer (`qda_rpmsg`)**: Abstraction over the RPMsg framework
> >> + to handle low-level message passing with the DSP firmware.
> >> +5. **Compute Bus (`qda_compute_bus`)**: A custom virtual bus used to
> >> + enumerate and manage the specific compute context banks defined in the
> >> + device tree.
> > I'm really not sure if it's a bonus or not. I'm waiting for iommu-map
> > improvements to land to send patches reworking FastRPC CBs so that they
> > are created by the main driver instead of being probed: it would remove
> > some of the possible race conditions between the main driver finishing
> > probe and the CB devices probing in the background.
> >
> > What's the actual benefit of the CB bus?
> I tried following the Tegra host1x logic here, as was discussed in [1]. My understanding is that
> with this the CB will become more manageable, reducing the scope of the races that exist in the
> current fastrpc driver.
It's nice, but then it can also be used by the existing fastrpc driver.
Would you mind splitting it to a separate changeset and submitting it?
>
> That said, I'm not fully aware of the iommu-map improvements. Are they the ones
> being discussed for this patch [2]? If they help the main driver create CB devices directly, then I
> would be happy to adapt the design.
That would mostly mean a change to the way we describe CBs (using the
property instead of the in-tree subdevices). Anyway, as I wrote, please
submit it separately.
>
> [1] https://lore.kernel.org/all/245d602f-3037-4ae3-9af9-d98f37258aae@xxxxxxxxxxxxxxxx/
> [2] https://lore.kernel.org/all/20260126-kaanapali-iris-v1-3-e2646246bfc1@xxxxxxxxxxxxxxxx/
> >
> >> +6. **FastRPC Core (`qda_fastrpc`)**: Implements the protocol logic for
> >> + marshalling arguments and handling remote invocations.
> >> +
> >> +User-Space API
> >> +==============
> >> +
> >> +The driver exposes a set of DRM-compliant IOCTLs. Note that these are designed
> >> +to be familiar to existing FastRPC users while adhering to DRM standards.
> >> +
> >> +* `DRM_IOCTL_QDA_QUERY`: Query DSP type (e.g., "cdsp", "adsp")
> >> + and capabilities.
> >> +* `DRM_IOCTL_QDA_INIT_ATTACH`: Attach a user session to the DSP's protection
> >> + domain.
> >> +* `DRM_IOCTL_QDA_INIT_CREATE`: Initialize a new process context on the DSP.
> > You need to explain the difference between these two.
> Ack.
> >
> >> +* `DRM_IOCTL_QDA_INVOKE`: Submit a remote method invocation (the primary
> >> + execution unit).
> >> +* `DRM_IOCTL_QDA_GEM_CREATE`: Allocate a GEM buffer object for DSP usage.
> >> +* `DRM_IOCTL_QDA_GEM_MMAP_OFFSET`: Retrieve mmap offsets for memory mapping.
> >> +* `DRM_IOCTL_QDA_MAP` / `DRM_IOCTL_QDA_MUNMAP`: Map or unmap buffers into the
> >> + DSP's virtual address space.
> > Do we need to make this separate? Can we map/unmap buffers when they
> > are used? Or when they are created? I'm thinking about virtualization
> > here.
> The lib provides ways (fastrpc_mmap/remote_mmap64) for users to map/unmap
> buffers on the DSP as the process requires. The ioctls are added to support the same.
If the buffers are already mapped, then the library calls become empty
calls. Let's focus on the API first and adapt to the library later on.
> > An alternative approach would be to merge
> > GET_MMAP_OFFSET with _MAP: once you map it to the DSP memory, you will
> > get the offset.
> _MAP is not needed for all buffers. Most of the remote call buffers that are passed to the DSP
> are automatically mapped by the DSP before it invokes the DSP implementation, so user space
> does not need to call _MAP for these.
Is there a reason for that? I'd really prefer if we change it, making it
more effective and more controllable.
>
> Some buffers (e.g., shared persistent buffers) do require explicit mapping, which is why
> MAP/MUNMAP exist in FastRPC.
>
> Because of this behavioral difference, merging GET_MMAP_OFFSET with MAP would not be accurate.
> GET_MMAP_OFFSET is for CPU-side mmap via GEM, whereas MAP is specifically for DSP
> virtual address assignment.
> >
> >> +
> >> +Usage Example
> >> +=============
> >> +
> >> +A typical lifecycle for a user-space application:
> >> +
> >> +1. **Discovery**: Open `/dev/accel/accel*` and check
> >> + `DRM_IOCTL_QDA_QUERY` to find the desired DSP (e.g., CDSP for
> >> + compute workloads).
> >> +2. **Initialization**: Call `DRM_IOCTL_QDA_INIT_ATTACH` and
> >> + `DRM_IOCTL_QDA_INIT_CREATE` to establish a session.
> >> +3. **Memory**: Allocate buffers via `DRM_IOCTL_QDA_GEM_CREATE` or import
> >> + DMA-BUFs (PRIME fd) from other drivers using `DRM_IOCTL_PRIME_FD_TO_HANDLE`.
> >> +4. **Execution**: Use `DRM_IOCTL_QDA_INVOKE` to pass arguments and execute
> >> + functions on the DSP.
> >> +5. **Cleanup**: Close file descriptors to automatically release resources and
> >> + detach the session.
> >> +
> >> +Internal Implementation
> >> +=======================
> >> +
> >> +Memory Management
> >> +-----------------
> >> +The driver's memory manager creates virtual "IOMMU devices" that map to
> >> +hardware context banks. This allows the driver to manage multiple isolated
> >> +address spaces. The implementation currently uses a **DMA-coherent backend**
> >> +to ensure data consistency between the CPU and DSP without manual cache
> >> +maintenance in most cases.
> >> +
> >> +Debugging
> >> +=========
> >> +The driver includes extensive dynamic debug support. Enable it via the
> >> +kernel's dynamic debug control:
> >> +
> >> +.. code-block:: bash
> >> +
> >> + echo "file drivers/accel/qda/* +p" > /sys/kernel/debug/dynamic_debug/control
> > Please add documentation on how to build the test apps and how to load
> > them to the DSP.
> Ack.
> >
> >> --
> >> 2.34.1
> >>
>
--
With best wishes
Dmitry