Re: [PATCH 00/83] AMD HSA kernel driver

From: Daniel Vetter
Date: Sat Jul 12 2014 - 07:10:34 EST

On Sat, Jul 12, 2014 at 11:24:49AM +0200, Christian König wrote:
> Am 11.07.2014 23:18, schrieb Jerome Glisse:
> >On Thu, Jul 10, 2014 at 10:51:29PM +0000, Gabbay, Oded wrote:
> >>On Thu, 2014-07-10 at 18:24 -0400, Jerome Glisse wrote:
> >>>On Fri, Jul 11, 2014 at 12:45:27AM +0300, Oded Gabbay wrote:
> >>>> This patch set implements a Heterogeneous System Architecture
> >>>>(HSA) driver
> >>>> for radeon-family GPUs.
> >>>This is just quick comments on few things. Given size of this, people
> >>>will need to have time to review things.
> >>>> HSA allows different processor types (CPUs, DSPs, GPUs, etc..) to
> >>>>share
> >>>> system resources more effectively via HW features including
> >>>>shared pageable
> >>>> memory, userspace-accessible work queues, and platform-level
> >>>>atomics. In
> >>>> addition to the memory protection mechanisms in GPUVM and
> >>>>IOMMUv2, the Sea
> >>>> Islands family of GPUs also performs HW-level validation of
> >>>>commands passed
> >>>> in through the queues (aka rings).
> >>>> The code in this patch set is intended to serve both as a sample
> >>>>driver for
> >>>> other HSA-compatible hardware devices and as a production driver
> >>>>for
> >>>> radeon-family processors. The code is architected to support
> >>>>multiple CPUs
> >>>> each with connected GPUs, although the current implementation
> >>>>focuses on a
> >>>> single Kaveri/Berlin APU, and works alongside the existing radeon
> >>>>kernel
> >>>> graphics driver (kgd).
> >>>> AMD GPUs designed for use with HSA (Sea Islands and up) share
> >>>>some hardware
> >>>> functionality between HSA compute and regular gfx/compute (memory,
> >>>> interrupts, registers), while other functionality has been added
> >>>> specifically for HSA compute (hw scheduler for virtualized
> >>>>compute rings).
> >>>> All shared hardware is owned by the radeon graphics driver, and
> >>>>an interface
> >>>> between kfd and kgd allows the kfd to make use of those shared
> >>>>resources,
> >>>> while HSA-specific functionality is managed directly by kfd by
> >>>>submitting
> >>>> packets into an HSA-specific command queue (the "HIQ").
> >>>> During kfd module initialization a char device node (/dev/kfd) is
> >>>>created
> >>>> (surviving until module exit), with ioctls for queue creation &
> >>>>management,
> >>>> and data structures are initialized for managing HSA device
> >>>>topology.
> >>>> The rest of the initialization is driven by calls from the radeon
> >>>>kgd at
> >>>> the following points :
> >>>> - radeon_init (kfd_init)
> >>>> - radeon_exit (kfd_fini)
> >>>> - radeon_driver_load_kms (kfd_device_probe, kfd_device_init)
> >>>> - radeon_driver_unload_kms (kfd_device_fini)
> >>>> During the probe and init processing per-device data structures
> >>>>are
> >>>> established which connect to the associated graphics kernel
> >>>>driver. This
> >>>> information is exposed to userspace via sysfs, along with a
> >>>>version number
> >>>> allowing userspace to determine if a topology change has occurred
> >>>>while it
> >>>> was reading from sysfs.
> >>>> The interface between kfd and kgd also allows the kfd to request
> >>>>buffer
> >>>> management services from kgd, and allows kgd to route interrupt
> >>>>requests to
> >>>> kfd code since the interrupt block is shared between regular
> >>>> graphics/compute and HSA compute subsystems in the GPU.
> >>>> The kfd code works with an open source usermode library
> >>>>("libhsakmt") which
> >>>> is in the final stages of IP review and should be published in a
> >>>>separate
> >>>> repo over the next few days.
> >>>> The code operates in one of three modes, selectable via the
> >>>>sched_policy
> >>>> module parameter :
> >>>> - sched_policy=0 uses a hardware scheduler running in the MEC
> >>>>block within
> >>>> CP, and allows oversubscription (more queues than HW slots)
> >>>> - sched_policy=1 also uses HW scheduling but does not allow
> >>>> oversubscription, so create_queue requests fail when we run out
> >>>>of HW slots
> >>>> - sched_policy=2 does not use HW scheduling, so the driver
> >>>>manually assigns
> >>>> queues to HW slots by programming registers
> >>>> The "no HW scheduling" option is for debug & new hardware bringup
> >>>>only, so
> >>>> has less test coverage than the other options. Default in the
> >>>>current code
> >>>> is "HW scheduling without oversubscription" since that is where
> >>>>we have the
> >>>> most test coverage but we expect to change the default to "HW
> >>>>scheduling
> >>>> with oversubscription" after further testing. This effectively
> >>>>removes the
> >>>> HW limit on the number of work queues available to applications.
> >>>> Programs running on the GPU are associated with an address space
> >>>>through the
> >>>> VMID field, which is translated to a unique PASID at access time
> >>>>via a set
> >>>> of 16 VMID-to-PASID mapping registers. The available VMIDs
> >>>>(currently 16)
> >>>> are partitioned (under control of the radeon kgd) between current
> >>>> gfx/compute and HSA compute, with each getting 8 in the current
> >>>>code. The
> >>>> VMID-to-PASID mapping registers are updated by the HW scheduler
> >>>>when used,
> >>>> and by driver code if HW scheduling is not being used.
> >>>> The Sea Islands compute queues use a new "doorbell" mechanism
> >>>>instead of the
> >>>> earlier kernel-managed write pointer registers. Doorbells use a
> >>>>separate BAR
> >>>> dedicated for this purpose, and pages within the doorbell
> >>>>aperture are
> >>>> mapped to userspace (each page mapped to only one user address
> >>>>space).
> >>>> Writes to the doorbell aperture are intercepted by GPU hardware,
> >>>>allowing
> >>>> userspace code to safely manage work queues (rings) without
> >>>>requiring a
> >>>> kernel call for every ring update.
> >>>> First step for an application process is to open the kfd device.
> >>>>Calls to
> >>>> open create a kfd "process" structure only for the first thread
> >>>>of the
> >>>> process. Subsequent open calls are checked to see if they are
> >>>>from processes
> >>>> using the same mm_struct and, if so, don't do anything. The kfd
> >>>>per-process
> >>>> data lives as long as the mm_struct exists. Each mm_struct is
> >>>>associated
> >>>> with a unique PASID, allowing the IOMMUv2 to make userspace
> >>>>process memory
> >>>> accessible to the GPU.
> >>>> Next step is for the application to collect topology information
> >>>>via sysfs.
> >>>> This gives userspace enough information to be able to identify
> >>>>specific
> >>>> nodes (processors) in subsequent queue management calls.
> >>>>Application
> >>>> processes can create queues on multiple processors, and
> >>>>processors support
> >>>> queues from multiple processes.
> >>>I am not a fan to use sysfs to discover topoly.
> >>>> At this point the application can create work queues in userspace
> >>>>memory and
> >>>> pass them through the usermode library to kfd to have them mapped
> >>>>onto HW
> >>>> queue slots so that commands written to the queues can be
> >>>>executed by the
> >>>> GPU. Queue operations specify a processor node, and so the bulk
> >>>>of this code
> >>>> is device-specific.
> >>>> Written by John Bridgman <John.Bridgman@xxxxxxx>
> >>>So general comment is you need to collapse many patches things like
> >>>58 fixing
> >>>kernel style should be collapsed ie fix all previous patch that have
> >>>broken
> >>>style.
> >>>Even worse is thing like 71, removing code you just added few patch
> >>>earlier
> >>>in the patchset.
> >>Not quite, the code was added on patch 11.
> >>
> >>>This means that if i start reviewing following
> >>>patch order
> >>>i might spend time on code that is later deleted/modified/fixed ie
> >>>time i
> >>>spend understanding and reading some code might be just wasted.
> >>Quick answer is that you are of course right, but having said that, I
> >>think it would be not simple at all to do that squashing.
> >>I squashed what I could, and probably I can do a little more (like
> >>58), but the major problem is that one of the main modules of the
> >>driver - the scheduler (QCM) - was completely re-written (patches
> >>46-53). Therefore, from patch 1 to 53, we use the old scheduler code
> >>and from 54 we use the new QCM (and the old scheduler code was
> >>completely remove at 71). So I could maybe squash 71 into 54, but that
> >>won't help you much, I think.
> >>
> >>So, the current advice I can give is to completely ignore the
> >>following files because they do not exist in the final commit:
> >>- kfd_sched_cik_static.c (stopped using in 54)
> >>- kfd_registers.c (stopped using in 81 because we moved all register
> >>writes to radeon)
> >>- kfd_hw_pointer_store.c (alive from 46 to 67)
> >>- kfd_hw_pointer_store.h (alive from 46 to 67)
> >>
> >> Oded
> >Ok i try but this is driving me crazy, i am jungling btw full applied
> >patchset and individual patch going back and forth trying to find which
> >patch is the one i want to comment on. Not even mentioning that after
> >that people would need to ack bad patch just because they know a latter
> >good patch fix the badness.
> >
> >This patchset can be reduced to less then 10 big patches. I would first
> >do core patches that touch core things like iommu and are preparatory
> >to the patchset. Then kfd patches that do not touch anything in radeon
> >but implement the skeleton and core infrastructure. Then radeon specific
> >patches.
> >
> >This lead me to the design, all the core kfd helper should be in hsa
> >directory, and should be helper, while all specific bits to radeon
> >should most likely be part of the radeon gfx kernel module. I am not
> >against a radeon_kfd but realy i do not see the point. There should
> >be a hsa.ko like there is a drm.ko.
> Yeah totally agree with Jerome here.
> Maybe I can explain a bit further what the difference in the design is.
> First of all for other good examples see not only the DRM infrastructure,
> but also V4L or other Linux driver subsystems as well.
> In those subsystems it's not the helper functions that control the driver,
> but instead the driver that controls the helper functions. E.g. if HSA wants
> to be a new subsystem with a standardized IOCTL interface it should provide
> functions that assists drivers with implementing the interface instead of
> implementing it themselves and then trying to talk to the drivers for
> necessary resources/operations.
> Coming back to the structure of the patches one of the very first patches
> should be the IOCTL definition a driver needs to implement to be HSA
> conform.
> I think it would be a good idea to split out changes to core structures like
> IOMMU in it's own separately submitted patches, only with the notice that
> this functionality is needed by the following "core HSA" and "HSA for
> radeon" patchsets.

Hm, so the hsa part is a completely new driver/subsystem, not just an
additional ioctl tacked onto radoen? The history of drm is littered with
"generic" ioctls that turned out to be useful for exactly one driver.
Which is why _all_ the command submission is now done with driver-private

I'd be quite a bit surprised if that suddenly works differently, so before
we bless a generic hsa interface I really want to see some implementation
from a different vendor (i.e. nvdidia or intel) using the same ioctls.
Otherwise we just repeat history and I'm not terribly inclined to keep on
cleanup up cruft forever - one drm legacy is enough ;-)

Jesse is the guy from our side to talk to about this.
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 -
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at