Re: [PATCHv5,RESEND 3/8] gpu: host1x: Add channel support

From: Thierry Reding
Date: Mon Feb 25 2013 - 10:25:22 EST


On Tue, Jan 15, 2013 at 01:43:59PM +0200, Terje Bergstrom wrote:
[...]
> diff --git a/drivers/gpu/host1x/Kconfig b/drivers/gpu/host1x/Kconfig
> index e89fb2b..57680a6 100644
> --- a/drivers/gpu/host1x/Kconfig
> +++ b/drivers/gpu/host1x/Kconfig
> @@ -3,4 +3,27 @@ config TEGRA_HOST1X
> help
> Driver for the Tegra host1x hardware.
>
> - Required for enabling tegradrm.
> + Required for enabling tegradrm and 2D acceleration.

I don't think I commented on this in the other patches, but I think this
could use a bit more information about what host1x is. Also mentioning
that it is a requirement for tegra-drm and 2D acceleration isn't very
useful because it can equally well be expressed in Kconfig. If you add
some description about what host1x is, people will know that they want
to enable it.

> +if TEGRA_HOST1X
> +
> +config TEGRA_HOST1X_CMA
> + bool "Support DRM CMA buffers"
> + depends on DRM
> + default y
> + select DRM_GEM_CMA_HELPER
> + select DRM_KMS_CMA_HELPER
> + help
> + Say yes if you wish to use DRM CMA buffers.
> +
> + If unsure, choose Y.

Perhaps make this not user-selectable (for now)? If somebody disables
this explicitly they won't get a working driver, right?

> diff --git a/drivers/gpu/host1x/cdma.c b/drivers/gpu/host1x/cdma.c
[...]
> +#include "cdma.h"
> +#include "channel.h"
> +#include "dev.h"
> +#include "memmgr.h"
> +#include "job.h"
> +#include <asm/cacheflush.h>
> +
> +#include <linux/slab.h>
> +#include <linux/kfifo.h>
> +#include <linux/interrupt.h>
> +#include <trace/events/host1x.h>
> +
> +#define TRACE_MAX_LENGTH 128U

"" includes generally follow <> ones.

> +/*
> + * Add an entry to the sync queue.
> + */
> +static void add_to_sync_queue(struct host1x_cdma *cdma,
> + struct host1x_job *job,
> + u32 nr_slots,
> + u32 first_get)
> +{
> + if (job->syncpt_id == NVSYNCPT_INVALID) {
> + dev_warn(&job->ch->dev->dev, "%s: Invalid syncpt\n",
> + __func__);
> + return;
> + }
> +
> + job->first_get = first_get;
> + job->num_slots = nr_slots;
> + host1x_job_get(job);
> + list_add_tail(&job->list, &cdma->sync_queue);
> +}

It's a bit odd that you pass a job in here along with some parameters
that are then assigned to the job's fields. Couldn't you just assign
them to the job's fields before passing the job into this function?

I also see that you only use this function once, so maybe you could
open-code it instead.

> +/*
> + * Return the status of the cdma's sync queue or push buffer for the given event
> + * - sq empty: returns 1 for empty, 0 for not empty (as in "1 empty queue" :-)
> + * - pb space: returns the number of free slots in the channel's push buffer
> + * Must be called with the cdma lock held.
> + */
> +static unsigned int cdma_status_locked(struct host1x_cdma *cdma,
> + enum cdma_event event)
> +{
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + switch (event) {
> + case CDMA_EVENT_SYNC_QUEUE_EMPTY:
> + return list_empty(&cdma->sync_queue) ? 1 : 0;
> + case CDMA_EVENT_PUSH_BUFFER_SPACE: {
> + struct push_buffer *pb = &cdma->push_buffer;
> + return host1x->cdma_pb_op.space(pb);
> + }
> + default:
> + return 0;
> + }
> +}

Similarly this function is only used in one place and it requires a
whole lot of documentation to define the meaning of the return value. If
you implement this functionality directly in host1x_cdma_wait_locked()
you have much more context and don't require all this "protocol".

> +/*
> + * Start timer for a buffer submition that has completed yet.

"submission". And I don't understand the "that has completed yet" part.

> + * Must be called with the cdma lock held.
> + */
> +static void cdma_start_timer_locked(struct host1x_cdma *cdma,
> + struct host1x_job *job)

You use two different styles to indent the function parameters. You
might want to stick to one, preferably aligning them with the first
parameter on the first line.

> +{
> + struct host1x *host = cdma_to_host1x(cdma);
> +
> + if (cdma->timeout.clientid) {
> + /* timer already started */
> + return;
> + }
> +
> + cdma->timeout.clientid = job->clientid;
> + cdma->timeout.syncpt = host1x_syncpt_get(host, job->syncpt_id);
> + cdma->timeout.syncpt_val = job->syncpt_end;
> + cdma->timeout.start_ktime = ktime_get();
> +
> + schedule_delayed_work(&cdma->timeout.wq,
> + msecs_to_jiffies(job->timeout));
> +}
> +
> +/*
> + * Stop timer when a buffer submition completes.

"submission"

> +/*
> + * For all sync queue entries that have already finished according to the
> + * current sync point registers:
> + * - unpin & unref their mems
> + * - pop their push buffer slots
> + * - remove them from the sync queue
> + * This is normally called from the host code's worker thread, but can be
> + * called manually if necessary.
> + * Must be called with the cdma lock held.
> + */
> +static void update_cdma_locked(struct host1x_cdma *cdma)
> +{
> + bool signal = false;
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + struct host1x_job *job, *n;
> +
> + /* If CDMA is stopped, queue is cleared and we can return */
> + if (!cdma->running)
> + return;
> +
> + /*
> + * Walk the sync queue, reading the sync point registers as necessary,
> + * to consume as many sync queue entries as possible without blocking
> + */
> + list_for_each_entry_safe(job, n, &cdma->sync_queue, list) {
> + struct host1x_syncpt *sp = host1x->syncpt + job->syncpt_id;

host1x_syncpt_get()?

> +
> + /* Check whether this syncpt has completed, and bail if not */
> + if (!host1x_syncpt_is_expired(sp, job->syncpt_end)) {
> + /* Start timer on next pending syncpt */
> + if (job->timeout)
> + cdma_start_timer_locked(cdma, job);
> + break;
> + }
> +
> + /* Cancel timeout, when a buffer completes */
> + if (cdma->timeout.clientid)
> + stop_cdma_timer_locked(cdma);
> +
> + /* Unpin the memory */
> + host1x_job_unpin(job);
> +
> + /* Pop push buffer slots */
> + if (job->num_slots) {
> + struct push_buffer *pb = &cdma->push_buffer;
> + host1x->cdma_pb_op.pop_from(pb, job->num_slots);
> + if (cdma->event == CDMA_EVENT_PUSH_BUFFER_SPACE)
> + signal = true;
> + }
> +
> + list_del(&job->list);
> + host1x_job_put(job);
> + }
> +
> + if (list_empty(&cdma->sync_queue) &&
> + cdma->event == CDMA_EVENT_SYNC_QUEUE_EMPTY)
> + signal = true;

This looks funny, maybe:

if (cdma->event == CDMA_EVENT_SYNC_QUEUE_EMPTY &&
list_empty(&cdma->sync_queue))
signal = true;

?

> +
> + /* Wake up CdmaWait() if the requested event happened */

CdmaWait()? Where's that?

> + if (signal) {
> + cdma->event = CDMA_EVENT_NONE;
> + up(&cdma->sem);
> + }
> +}
> +
> +void host1x_cdma_update_sync_queue(struct host1x_cdma *cdma,
> + struct platform_device *dev)

There's nothing in this function that requires a platform_device, so
passing struct device should be enough. Or maybe host1x_cdma should get
a struct device * field?

> +{
> + u32 get_restart;

Maybe just call this "restart" or "restart_addr". get_restart sounds
like a function name.

> + u32 syncpt_incrs;
> + struct host1x_job *job = NULL;
> + u32 syncpt_val;
> + struct host1x *host1x = cdma_to_host1x(cdma);
> +
> + syncpt_val = host1x_syncpt_load_min(cdma->timeout.syncpt);
> +
> + dev_dbg(&dev->dev,
> + "%s: starting cleanup (thresh %d)\n",
> + __func__, syncpt_val);

This fits on two lines.

> +
> + /*
> + * Move the sync_queue read pointer to the first entry that hasn't
> + * completed based on the current HW syncpt value. It's likely there
> + * won't be any (i.e. we're still at the head), but covers the case
> + * where a syncpt incr happens just prior/during the teardown.
> + */
> +
> + dev_dbg(&dev->dev,
> + "%s: skip completed buffers still in sync_queue\n",
> + __func__);

This too.

> + list_for_each_entry(job, &cdma->sync_queue, list) {
> + if (syncpt_val < job->syncpt_end)
> + break;
> +
> + host1x_job_dump(&dev->dev, job);
> + }

That's potentially a lot of debug output. I wonder if it might make
sense to control parts of this via a module parameter. Then again, if
somebody really needs to debug this, maybe they really want *all* the
information.

> + /*
> + * Walk the sync_queue, first incrementing with the CPU syncpts that
> + * are partially executed (the first buffer) or fully skipped while
> + * still in the current context (slots are also NOP-ed).
> + *
> + * At the point contexts are interleaved, syncpt increments must be
> + * done inline with the pushbuffer from a GATHER buffer to maintain
> + * the order (slots are modified to be a GATHER of syncpt incrs).
> + *
> + * Note: save in get_restart the location where the timed out buffer
> + * started in the PB, so we can start the refetch from there (with the
> + * modified NOP-ed PB slots). This lets things appear to have completed
> + * properly for this buffer and resources are freed.
> + */
> +
> + dev_dbg(&dev->dev,
> + "%s: perform CPU incr on pending same ctx buffers\n",
> + __func__);

Can be collapsed to two lines.

> +
> + get_restart = cdma->last_put;
> + if (!list_empty(&cdma->sync_queue))
> + get_restart = job->first_get;

Perhaps:

if (list_empty(&cdma->sync_queue))
restart = cdma->last_put;
else
restart = job->first_get;

?

> + list_for_each_entry_from(job, &cdma->sync_queue, list)
> + if (job->clientid == cdma->timeout.clientid)
> + job->timeout = 500;

I think this warrants a comment.

> +/*
> + * Destroy a cdma
> + */
> +void host1x_cdma_deinit(struct host1x_cdma *cdma)
> +{
> + struct push_buffer *pb = &cdma->push_buffer;
> + struct host1x *host1x = cdma_to_host1x(cdma);
> +
> + if (cdma->running) {
> + pr_warn("%s: CDMA still running\n",
> + __func__);
> + } else {
> + host1x->cdma_pb_op.destroy(pb);
> + host1x->cdma_op.timeout_destroy(cdma);
> + }
> +}

There's no way to recover from the situation where a cdma is still
running. Can this not return an error code (-EBUSY?) if the cdma can't
be destroyed?

> +/*
> + * End a cdma submit
> + * Kick off DMA, add job to the sync queue, and a number of slots to be freed
> + * from the pushbuffer. The handles for a submit must all be pinned at the same
> + * time, but they can be unpinned in smaller chunks.
> + */
> +void host1x_cdma_end(struct host1x_cdma *cdma,
> + struct host1x_job *job)
> +{
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + bool was_idle = list_empty(&cdma->sync_queue);

Maybe just "idle"? It reflects the current state of the CDMA, not any
old state.

> +
> + host1x->cdma_op.kick(cdma);
> +
> + add_to_sync_queue(cdma,
> + job,
> + cdma->slots_used,
> + cdma->first_get);

No need to split this over so many lines. Also, shouldn't the order be
reversed here? I.e. first add to sync queue, then start DMA?

> + /* start timer on idle -> active transitions */
> + if (job->timeout && was_idle)
> + cdma_start_timer_locked(cdma, job);

This could be part of add_to_sync_queue(), but if you open-code that as
I suggest earlier it should obviously stay.

> diff --git a/drivers/gpu/host1x/cdma.h b/drivers/gpu/host1x/cdma.h
[...]
> +struct platform_device;

No need for this if you pass struct device * instead.

> +/*
> + * cdma
> + *
> + * This is in charge of a host command DMA channel.
> + * Sends ops to a push buffer, and takes responsibility for unpinning
> + * (& possibly freeing) of memory after those ops have completed.
> + * Producer:
> + * begin
> + * push - send ops to the push buffer
> + * end - start command DMA and enqueue handles to be unpinned
> + * Consumer:
> + * update - call to update sync queue and push buffer, unpin memory
> + */

I find the name to be a bit confusing. For some reason I automatically
think of GSM when I read CDMA. This really is more of a job queue, so
maybe calling it host1x_job_queue might be more appropriate. But I've
already requested a lot of things to be renamed, so I think I can live
with this being called CDMA if you don't want to change it.

Alternatively all of these could be moved to the struct host1x_channel
given that there's only one of each of the push_buffer, buffer_timeout
and host1x_cma objects per channel.

> diff --git a/drivers/gpu/host1x/channel.c b/drivers/gpu/host1x/channel.c
[...]
> +#include "channel.h"
> +#include "dev.h"
> +#include "job.h"
> +
> +#include <linux/slab.h>
> +#include <linux/module.h>

Again the include ordering is strange.

> +/*
> + * Iterator function for host1x device list
> + * It takes a fptr as an argument and calls that function for each
> + * device in the list
> + */
> +void host1x_channel_for_all(struct host1x *host1x, void *data,
> + int (*fptr)(struct host1x_channel *ch, void *fdata))
> +{
> + struct host1x_channel *ch;
> + int ret;
> +
> + list_for_each_entry(ch, &host1x->chlist.list, list) {
> + if (ch && fptr) {
> + ret = fptr(ch, data);
> + if (ret) {
> + pr_info("%s: iterator error\n", __func__);
> + break;
> + }
> + }
> + }
> +}

Couldn't you rewrite this as a macro, similar to list_for_each_entry()
so that users could do something like:

host1x_for_each_channel(channel, host1x) {
...
}

That's a bit friendlier than having each user write a separate function
to be called from this iterator.

> +int host1x_channel_submit(struct host1x_job *job)
> +{
> + return host1x_get_host(job->ch->dev)->channel_op.submit(job);
> +}

I'd expect a function named host1x_channel_submit() to take a struct
host1x_channel *. Should this perhaps be called host1x_job_submit()?

> +struct host1x_channel *host1x_channel_get(struct host1x_channel *ch)
> +{
> + int err = 0;
> +
> + mutex_lock(&ch->reflock);
> + if (ch->refcount == 0)
> + err = host1x_cdma_init(&ch->cdma);
> + if (!err)
> + ch->refcount++;
> +
> + mutex_unlock(&ch->reflock);
> +
> + return err ? NULL : ch;
> +}

Why don't you use any of the kernel's reference counting mechanisms?

> +void host1x_channel_put(struct host1x_channel *ch)
> +{
> + mutex_lock(&ch->reflock);
> + if (ch->refcount == 1) {
> + host1x_get_host(ch->dev)->cdma_op.stop(&ch->cdma);
> + host1x_cdma_deinit(&ch->cdma);
> + }
> + ch->refcount--;
> + mutex_unlock(&ch->reflock);
> +}

I think you can do all of this using a kref.

> +struct host1x_channel *host1x_channel_alloc(struct platform_device *pdev)
> +{
> + struct host1x_channel *ch = NULL;
> + struct host1x *host1x = host1x_get_host(pdev);
> + int chindex;
> + int max_channels = host1x->info.nb_channels;
> + int err;
> +
> + mutex_lock(&host1x->chlist_mutex);
> +
> + chindex = host1x->allocated_channels;
> + if (chindex > max_channels)
> + goto fail;
> +
> + ch = kzalloc(sizeof(*ch), GFP_KERNEL);
> + if (ch == NULL)
> + goto fail;
> +
> + /* Link platform_device to host1x_channel */
> + err = host1x->channel_op.init(ch, host1x, chindex);
> + if (err < 0)
> + goto fail;
> +
> + ch->dev = pdev;
> +
> + /* Add to channel list */
> + list_add_tail(&ch->list, &host1x->chlist.list);
> +
> + host1x->allocated_channels++;
> +
> + mutex_unlock(&host1x->chlist_mutex);
> + return ch;
> +
> +fail:
> + dev_err(&pdev->dev, "failed to init channel\n");
> + kfree(ch);
> + mutex_unlock(&host1x->chlist_mutex);
> + return NULL;
> +}

I think the critical section could be shorter here. It's probably not
worth the extra trouble, though, given that channels are not often
allocated.

> +void host1x_channel_free(struct host1x_channel *ch)
> +{
> + struct host1x *host1x = host1x_get_host(ch->dev);
> + struct host1x_channel *chiter, *tmp;
> + list_for_each_entry_safe(chiter, tmp, &host1x->chlist.list, list) {
> + if (chiter == ch) {
> + list_del(&chiter->list);
> + kfree(ch);
> + host1x->allocated_channels--;
> +
> + return;
> + }
> + }
> +}

This doesn't free the channel if it happens to not be part of the host1x
channel list. Perhaps an easier way to write it would be:

host1x = host1x_get_host(ch->dev);

list_del(&ch->list);
kfree(ch);

host1x->allocated_channels--;

Looking at the rest of the code, it seems like a channel will never not
be part of the host1x channel list, so I don't think there's a need to
to scan the list.

On a side-note: generally if you break out of the loop right after
freeing the memory of a removed node, there's no need to use the _safe
variant since you won't be accessing the .next field of the freed node
anyway.

Maybe these should also adopt a similar naming as what we discussed for
the syncpoints. That is:

struct host1x_channel *host1x_channel_request(struct device *dev);

?

> diff --git a/drivers/gpu/host1x/channel.h b/drivers/gpu/host1x/channel.h
[...]
> +
> +/*
> + * host1x device list in debug-fs dump of host1x and client device
> + * as well as channel state
> + */

I don't understand this comment.

> +struct host1x_channel {
> + struct list_head list;
> +
> + int refcount;
> + int chid;

This can probably just be id. It is a field of host1x_channel, so the ch
prefix is redundant.

> + struct mutex reflock;
> + struct mutex submitlock;
> + void __iomem *regs;
> + struct device *node;

This is never used.

> + struct platform_device *dev;

Can this be just struct device *?

> + struct cdev cdev;

This is never used.

> +/* channel list operations */
> +void host1x_channel_list_init(struct host1x *);
> +void host1x_channel_for_all(struct host1x *, void *data,
> + int (*fptr)(struct host1x_channel *ch, void *fdata));
> +
> +struct host1x_channel *host1x_channel_alloc(struct platform_device *pdev);
> +void host1x_channel_free(struct host1x_channel *ch);

Is it a good idea to make host1x_channel_free() publicly available?
Shouldn't the host1x_channel_alloc()/host1x_channel_request() return a
host1x_channel with a reference count of 1 and everybody release their
reference using host1x_channel_put() to make sure the channel is freed
only after the last reference disappears?

Otherwise whoever calls host1x_channel_free() will confuse everybody
else that's still keeping a reference.

> diff --git a/drivers/gpu/host1x/cma.c b/drivers/gpu/host1x/cma.c
[...]

Various spurious blank lines in this file, and the alignment of function
parameters is off.

> +struct mem_handle *host1x_cma_get(u32 id, struct platform_device *dev)

I don't think this needs platform_device either.

> +{
> + struct drm_gem_cma_object *obj = to_cma_obj((void *)id);
> + struct mutex *struct_mutex = &obj->base.dev->struct_mutex;
> +
> + mutex_lock(struct_mutex);
> + drm_gem_object_reference(&obj->base);
> + mutex_unlock(struct_mutex);

I think it's more customary to obtain a pointer to struct drm_device and
then use mutex_{lock,unlock}(&drm->struct_mutex). Or you could just use
drm_gem_object_reference_unlocked(&obj->base) instead. Which doesn't
exist yet, apparently. But it could be added.

> +int host1x_cma_pin_array_ids(struct platform_device *dev,
> + long unsigned *ids,
> + long unsigned id_type_mask,
> + long unsigned id_type,
> + u32 count,
> + struct host1x_job_unpin_data *unpin_data,
> + dma_addr_t *phys_addr)

struct device * and unsigned long please. count can also doesn't need to
be a sized type. unsigned int will do just fine. The return value can
also be unsigned int if you don't expect to return any error conditions.

> +{
> + int i;
> + int pin_count = 0;

Both should be unsigned as well, and can go on one line:

unsigned int pin_count = 0, i;

> diff --git a/drivers/gpu/host1x/dev.h b/drivers/gpu/host1x/dev.h
[...]
> struct host1x;
> +struct host1x_intr;
> struct host1x_syncpt;
> +struct host1x_channel;
> +struct host1x_cdma;
> +struct host1x_job;
> +struct push_buffer;
> +struct dentry;

I think this already belongs in a previous patch. The debugfs dentry
isn't added in this patch.

> +struct host1x_channel_ops {
> + int (*init)(struct host1x_channel *,
> + struct host1x *,
> + int chid);

Please add the parameter names as well (the same goes for all ops
declared in this file). And "id" will be enough. Also the channel ID can
surely be unsigned, right?

> +struct host1x_cdma_ops {
> + void (*start)(struct host1x_cdma *);
> + void (*stop)(struct host1x_cdma *);
> + void (*kick)(struct host1x_cdma *);
> + int (*timeout_init)(struct host1x_cdma *,
> + u32 syncpt_id);
> + void (*timeout_destroy)(struct host1x_cdma *);
> + void (*timeout_teardown_begin)(struct host1x_cdma *);
> + void (*timeout_teardown_end)(struct host1x_cdma *,
> + u32 getptr);
> + void (*timeout_cpu_incr)(struct host1x_cdma *,
> + u32 getptr,
> + u32 syncpt_incrs,
> + u32 syncval,
> + u32 nr_slots);
> +};

Can the timeout_ prefix not be dropped? The functions are generally
useful and not directly related to timeouts, even though they seem to
only be used during timeout handling.

Also, is it really necessary to abstract these into an ops structure? I
get that newer hardware revisions might require different ops for sync-
point handling because the register layout or number of syncpoints may
be different, but the CDMA and push buffer (below) concepts are pretty
much a software abstraction, and as such its implementation is unlikely
to change with some future hardware revision.

> +struct host1x_pushbuffer_ops {
> + void (*reset)(struct push_buffer *);
> + int (*init)(struct push_buffer *);
> + void (*destroy)(struct push_buffer *);
> + void (*push_to)(struct push_buffer *,
> + struct mem_handle *,
> + u32 op1, u32 op2);
> + void (*pop_from)(struct push_buffer *,
> + unsigned int slots);

Maybe just push() and pop()?

> + u32 (*space)(struct push_buffer *);
> + u32 (*putptr)(struct push_buffer *);
> +};
>
> struct host1x_syncpt_ops {
> void (*reset)(struct host1x_syncpt *);
> @@ -64,9 +111,19 @@ struct host1x {
> struct host1x_device_info info;
> struct clk *clk;
>
> + /* Sync point dedicated to replacing waits for expired fences */
> + struct host1x_syncpt *nop_sp;
> +
> + struct host1x_channel_ops channel_op;
> + struct host1x_cdma_ops cdma_op;
> + struct host1x_pushbuffer_ops cdma_pb_op;
> struct host1x_syncpt_ops syncpt_op;
> struct host1x_intr_ops intr_op;
>
> + struct mutex chlist_mutex;
> + struct host1x_channel chlist;

Shouldn't this just be struct list_head?

> + int allocated_channels;

unsigned int? And maybe just "num_channels"?

> diff --git a/drivers/gpu/host1x/host1x.h b/drivers/gpu/host1x/host1x.h
[...]
> +enum host1x_class {
> + NV_HOST1X_CLASS_ID = 0x1,
> + NV_GRAPHICS_2D_CLASS_ID = 0x51,

This entry belongs in a later patch, right? And I find it convenient if
enumeration constants start with the enum name as prefix. Furthermore
it'd be nice to reuse the hardware module names, like so:

enum host1x_class {
HOST1X_CLASS_HOST1X,
HOST1X_CLASS_GR2D,
HOST1X_CLASS_GR3D,
};

> diff --git a/drivers/gpu/host1x/hw/cdma_hw.c b/drivers/gpu/host1x/hw/cdma_hw.c
[...]
> +#include <linux/slab.h>
> +#include <linux/scatterlist.h>
> +#include <linux/dma-mapping.h>
> +#include "cdma.h"
> +#include "channel.h"
> +#include "dev.h"
> +#include "memmgr.h"
> +
> +#include "cdma_hw.h"
> +
> +static inline u32 host1x_channel_dmactrl(int stop, int get_rst, int init_get)
> +{
> + return HOST1X_CHANNEL_DMACTRL_DMASTOP_F(stop)
> + | HOST1X_CHANNEL_DMACTRL_DMAGETRST_F(get_rst)
> + | HOST1X_CHANNEL_DMACTRL_DMAINITGET_F(init_get);

I think it is more customary to put the | at the end of the preceding
line:

return HOST1X_CHANNEL_DMACTRL_DMASTOP_F(stop) |
HOST1X_CHANNEL_DMACTRL_DMAGETRST_F(get_rst) |
HOST1X_CHANNEL_DMACTRL_DMAINITGET_F(init_get);

Also since these are all single bits, I'd prefer if you could drop the
_F suffix and not make them take a parameter. I think it'd even be
better not to have this function at all, but make the intent explicit
where the register is written. That is, have each call site set the bits
explicitly instead of calling this helper. Having a parameter list such
as (true, false, false) or (true, true, true) is confusing since you
have to keep looking up the meaning of the parameters.

> +}
> +
> +static void cdma_timeout_handler(struct work_struct *work);

Can this prototype be avoided?

> +/**
> + * Reset to empty push buffer
> + */
> +static void push_buffer_reset(struct push_buffer *pb)
> +{
> + pb->fence = PUSH_BUFFER_SIZE - 8;
> + pb->cur = 0;

Maybe position is a better name than cur.

> +/**
> + * Init push buffer resources
> + */
> +static void push_buffer_destroy(struct push_buffer *pb);

You should be careful with these comment blocks. If you start them with
/**, then you should make them proper kerneldoc comments. But you don't
really need that for static functions, so you could just make them /*-
style.

Also this particular comment is confusingly place on top of the proto-
type of push_buffer_destroy().

> +/*
> + * Push two words to the push buffer
> + * Caller must ensure push buffer is not full
> + */
> +static void push_buffer_push_to(struct push_buffer *pb,
> + struct mem_handle *handle,
> + u32 op1, u32 op2)
> +{
> + u32 cur = pb->cur;
> + u32 *p = (u32 *)((u32)pb->mapped + cur);

You do all this extra casting to make sure to increment by bytes and not
32-bit words. How about you change pb->cur to contain the word index, so
that you don't have to go through hoops each time around.

Alternatively you could make it a pointer to u32 and not have to index
or cast at all. So you'd end up with something like:

struct push_buffer {
u32 *start;
u32 *end;
u32 *ptr;
};

> +/*
> + * Return the number of two word slots free in the push buffer
> + */
> +static u32 push_buffer_space(struct push_buffer *pb)
> +{
> + return ((pb->fence - pb->cur) & (PUSH_BUFFER_SIZE - 1)) / 8;
> +}

Why & (PUSH_BUFFER_SIZE - 1) here? fence - cur can never be larger than
PUSH_BUFFER_SIZE, can it?

> +/*
> + * Init timeout resources
> + */
> +static int cdma_timeout_init(struct host1x_cdma *cdma,
> + u32 syncpt_id)
> +{
> + if (syncpt_id == NVSYNCPT_INVALID)
> + return -EINVAL;

Do we really need the syncpt_id check here? It is the only reason why we
need to pass the parameter in the first place, and if we get to this
point we should already have made sure that the syncpoint is actually
valid.

> +/*
> + * Increment timedout buffer's syncpt via CPU.

Nit: "timed out buffer's"

> + */
> +static void cdma_timeout_cpu_incr(struct host1x_cdma *cdma, u32 getptr,
> + u32 syncpt_incrs, u32 syncval, u32 nr_slots)

The syncval parameter isn't used.

> +{
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + struct push_buffer *pb = &cdma->push_buffer;
> + u32 i, getidx;
> +
> + for (i = 0; i < syncpt_incrs; i++)
> + host1x_syncpt_cpu_incr(cdma->timeout.syncpt);
> +
> + /* after CPU incr, ensure shadow is up to date */
> + host1x_syncpt_load_min(cdma->timeout.syncpt);
> +
> + /* NOP all the PB slots */
> + getidx = getptr - pb->phys;
> + while (nr_slots--) {
> + u32 *p = (u32 *)((u32)pb->mapped + getidx);
> + *(p++) = HOST1X_OPCODE_NOOP;
> + *(p++) = HOST1X_OPCODE_NOOP;
> + dev_dbg(&host1x->dev->dev, "%s: NOP at 0x%x\n",
> + __func__, pb->phys + getidx);
> + getidx = (getidx + 8) & (PUSH_BUFFER_SIZE - 1);
> + }
> + wmb();

Why the memory barrier?

> +/*
> + * Similar to cdma_start(), but rather than starting from an idle
> + * state (where DMA GET is set to DMA PUT), on a timeout we restore
> + * DMA GET from an explicit value (so DMA may again be pending).
> + */
> +static void cdma_timeout_restart(struct host1x_cdma *cdma, u32 getptr)
> +{
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + struct host1x_channel *ch = cdma_to_channel(cdma);
> +
> + if (cdma->running)
> + return;
> +
> + cdma->last_put = host1x->cdma_pb_op.putptr(&cdma->push_buffer);
> +
> + host1x_ch_writel(ch, host1x_channel_dmactrl(true, false, false),
> + HOST1X_CHANNEL_DMACTRL);
> +
> + /* set base, end pointer (all of memory) */
> + host1x_ch_writel(ch, 0, HOST1X_CHANNEL_DMASTART);
> + host1x_ch_writel(ch, 0xFFFFFFFF, HOST1X_CHANNEL_DMAEND);

According to the TRM, writing to HOST1X_CHANNEL_DMASTART will start a
DMA transfer on the channel (if DMA_PUT != DMA_GET). Irrespective of
that, why set the valid range to all of physical memory? We know the
valid range of the push buffer, why not set the limits accordingly?

> +/*
> + * Kick channel DMA into action by writing its PUT offset (if it has changed)
> + */
> +static void cdma_kick(struct host1x_cdma *cdma)
> +{
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + struct host1x_channel *ch = cdma_to_channel(cdma);
> + u32 put;
> +
> + put = host1x->cdma_pb_op.putptr(&cdma->push_buffer);
> +
> + if (put != cdma->last_put) {
> + host1x_ch_writel(ch, put, HOST1X_CHANNEL_DMAPUT);
> + cdma->last_put = put;
> + }
> +}

kick() sounds unusual. Maybe flush or commit or something similar would
be more accurate.

> +static void cdma_stop(struct host1x_cdma *cdma)
> +{
> + struct host1x_channel *ch = cdma_to_channel(cdma);
> +
> + mutex_lock(&cdma->lock);
> + if (cdma->running) {
> + host1x_cdma_wait_locked(cdma, CDMA_EVENT_SYNC_QUEUE_EMPTY);
> + host1x_ch_writel(ch, host1x_channel_dmactrl(true, false, false),
> + HOST1X_CHANNEL_DMACTRL);
> + cdma->running = false;
> + }
> + mutex_unlock(&cdma->lock);
> +}

Perhaps this should be ranem cdma_stop_sync() or similar to make it
clear that it waits for the queue to run empty.

> +static void cdma_timeout_teardown_end(struct host1x_cdma *cdma, u32 getptr)

Maybe the last parameter should be called restart to match its purpose?

> +{
> + struct host1x *host1x = cdma_to_host1x(cdma);
> + struct host1x_channel *ch = cdma_to_channel(cdma);
> + u32 cmdproc_stop;
> +
> + dev_dbg(&host1x->dev->dev,
> + "end channel teardown (id %d, DMAGET restart = 0x%x)\n",
> + ch->chid, getptr);
> +
> + cmdproc_stop = host1x_sync_readl(host1x, HOST1X_SYNC_CMDPROC_STOP);
> + cmdproc_stop &= ~(BIT(ch->chid));

No need for the extra parentheses.

> + host1x_sync_writel(host1x, cmdproc_stop, HOST1X_SYNC_CMDPROC_STOP);
> +
> + cdma->torndown = false;
> + cdma_timeout_restart(cdma, getptr);
> +}

I find this a bit non-intuitive. We teardown a channel, and when we're
done tearing down, the torndown variable is set to false and the channel
is actually restarted. Maybe you could explain some more how this works
and what its purpose is.

> +/*
> + * If this timeout fires, it indicates the current sync_queue entry has
> + * exceeded its TTL and the userctx should be timed out and remaining
> + * submits already issued cleaned up (future submits return an error).
> + */

I can't seem to find what causes subsequent submits to return an error.
Also, how is the channel reset so that new jobs can be submitted?

> +static void cdma_timeout_handler(struct work_struct *work)
> +{
> + struct host1x_cdma *cdma;
> + struct host1x *host1x;
> + struct host1x_channel *ch;
> +
> + u32 syncpt_val;
> +
> + u32 prev_cmdproc, cmdproc_stop;
> +
> + cdma = container_of(to_delayed_work(work), struct host1x_cdma,
> + timeout.wq);
> + host1x = cdma_to_host1x(cdma);
> + ch = cdma_to_channel(cdma);
> +
> + mutex_lock(&cdma->lock);
> +
> + if (!cdma->timeout.clientid) {
> + dev_dbg(&host1x->dev->dev,
> + "cdma_timeout: expired, but has no clientid\n");
> + mutex_unlock(&cdma->lock);
> + return;
> + }

How can the CDMA not have a client?

> +
> + /* stop processing to get a clean snapshot */
> + prev_cmdproc = host1x_sync_readl(host1x, HOST1X_SYNC_CMDPROC_STOP);
> + cmdproc_stop = prev_cmdproc | BIT(ch->chid);
> + host1x_sync_writel(host1x, cmdproc_stop, HOST1X_SYNC_CMDPROC_STOP);
> +
> + dev_dbg(&host1x->dev->dev, "cdma_timeout: cmdproc was 0x%x is 0x%x\n",
> + prev_cmdproc, cmdproc_stop);
> +
> + syncpt_val = host1x_syncpt_load_min(host1x->syncpt);
> +
> + /* has buffer actually completed? */
> + if ((s32)(syncpt_val - cdma->timeout.syncpt_val) >= 0) {
> + dev_dbg(&host1x->dev->dev,
> + "cdma_timeout: expired, but buffer had completed\n");

Maybe this should really be a warning?

> + /* restore */
> + cmdproc_stop = prev_cmdproc & ~(BIT(ch->chid));

No need for the extra parentheses. Also, why not just use prev_cmdproc,
which shouldn't have the bit set anyway?

> diff --git a/drivers/gpu/host1x/hw/cdma_hw.h b/drivers/gpu/host1x/hw/cdma_hw.h
[...]
> +/*
> + * Size of the sync queue. If it is too small, we won't be able to queue up
> + * many command buffers. If it is too large, we waste memory.
> + */
> +#define HOST1X_SYNC_QUEUE_SIZE 512

I don't see this used anywhere.

> +/*
> + * Number of gathers we allow to be queued up per channel. Must be a
> + * power of two. Currently sized such that pushbuffer is 4KB (512*8B).
> + */
> +#define HOST1X_GATHER_QUEUE_SIZE 512

More pieces falling into place.

> diff --git a/drivers/gpu/host1x/hw/channel_hw.c b/drivers/gpu/host1x/hw/channel_hw.c
[...]
> +#include "host1x.h"
> +#include "channel.h"
> +#include "dev.h"
> +#include <linux/slab.h>
> +#include "intr.h"
> +#include "job.h"
> +#include <trace/events/host1x.h>

More include ordering issues.

> +static void submit_gathers(struct host1x_job *job)
> +{
> + /* push user gathers */
> + int i;

unsigned int?

> + for (i = 0 ; i < job->num_gathers; i++) {
> + struct host1x_job_gather *g = &job->gathers[i];
> + u32 op1 = host1x_opcode_gather(g->words);
> + u32 op2 = g->mem_base + g->offset;
> + host1x_cdma_push_gather(&job->ch->cdma,
> + job->gathers[i].ref,
> + job->gathers[i].offset,
> + op1, op2);
> + }
> +}

Perhaps inline this into channel_submit()? I'm not sure how useful it
really is to split off smallish functions such as this which aren't
reused anywhere else. I don't have any major objection though, so you
can keep it separate if you want.

> +static inline void __iomem *host1x_channel_regs(void __iomem *p, int ndx)
> +{
> + p += ndx * NV_HOST1X_CHANNEL_MAP_SIZE_BYTES;
> + return p;
> +}
> +
> +static int host1x_channel_init(struct host1x_channel *ch,
> + struct host1x *dev, int index)
> +{
> + ch->chid = index;
> + mutex_init(&ch->reflock);
> + mutex_init(&ch->submitlock);
> +
> + ch->regs = host1x_channel_regs(dev->regs, index);
> + return 0;
> +}

You only use host1x_channel_regs() once, so I really don't think it buys
you anything to split it off. Both host1x_channel_regs() and
host1x_channel_init() are short enough that they can be collapsed.

> diff --git a/drivers/gpu/host1x/hw/host1x01.c b/drivers/gpu/host1x/hw/host1x01.c
[...]
> #include "hw/host1x01.h"
> #include "dev.h"
> +#include "channel.h"
> #include "hw/host1x01_hardware.h"
>
> +#include "hw/channel_hw.c"
> +#include "hw/cdma_hw.c"
> #include "hw/syncpt_hw.c"
> #include "hw/intr_hw.c"
>
> int host1x01_init(struct host1x *host)
> {
> + host->channel_op = host1x_channel_ops;
> + host->cdma_op = host1x_cdma_ops;
> + host->cdma_pb_op = host1x_pushbuffer_ops;
> host->syncpt_op = host1x_syncpt_ops;
> host->intr_op = host1x_intr_ops;

I think I mentioned this before, but I'd prefer not to have the .c files
included here, but rather reference the ops structures externally. But I
still think that especially CDMA and push buffer ops don't need to be in
separate structures since they aren't likely to change with new hardware
revisions.

> diff --git a/drivers/gpu/host1x/hw/host1x01_hardware.h b/drivers/gpu/host1x/hw/host1x01_hardware.h
[...]
> index c1d5324..03873c0 100644
> --- a/drivers/gpu/host1x/hw/host1x01_hardware.h
> +++ b/drivers/gpu/host1x/hw/host1x01_hardware.h
> @@ -21,6 +21,130 @@
>
> #include <linux/types.h>
> #include <linux/bitops.h>
> +#include "hw_host1x01_channel.h"
> #include "hw_host1x01_sync.h"
> +#include "hw_host1x01_uclass.h"
> +
> +/* channel registers */
> +#define NV_HOST1X_CHANNEL_MAP_SIZE_BYTES 16384

The only user of this seems to be host1x_channel_regs(), so it could be
moved to that file. Also the name is overly long, why not something like
HOST1X_CHANNEL_SIZE?

> +#define HOST1X_OPCODE_NOOP host1x_opcode_nonincr(0, 0)

HOST1X_OPCODE_NOP would be more canonical in my opinion.


> +static inline u32 host1x_mask2(unsigned x, unsigned y)
> +{
> + return 1 | (1 << (y - x));
> +}

What's this? I don't see it used anywhere.

> diff --git a/drivers/gpu/host1x/hw/hw_host1x01_channel.h b/drivers/gpu/host1x/hw/hw_host1x01_channel.h
[...]
> +#define HOST1X_CHANNEL_DMACTRL_DMASTOP_F(v) \
> + host1x_channel_dmactrl_dmastop_f(v)

I mentioned this elsewhere already, but I think the _F suffix (and _f
for that matter) along with the v parameter should go away.

> diff --git a/drivers/gpu/host1x/hw/hw_host1x01_uclass.h b/drivers/gpu/host1x/hw/hw_host1x01_uclass.h
[...]

What does the "uclass" stand for? It seems a bit useless to me.

> diff --git a/drivers/gpu/host1x/hw/syncpt_hw.c b/drivers/gpu/host1x/hw/syncpt_hw.c
> index 16e3ada..ba48cee 100644
> --- a/drivers/gpu/host1x/hw/syncpt_hw.c
> +++ b/drivers/gpu/host1x/hw/syncpt_hw.c
> @@ -97,6 +97,15 @@ static void syncpt_cpu_incr(struct host1x_syncpt *sp)
> wmb();
> }
>
> +/* remove a wait pointed to by patch_addr */
> +static int syncpt_patch_wait(struct host1x_syncpt *sp, void *patch_addr)
> +{
> + u32 override = host1x_class_host_wait_syncpt(
> + NVSYNCPT_GRAPHICS_HOST, 0);
> + __raw_writel(override, patch_addr);

__raw_writel() isn't meant to be used for regular memory addresses, but
only for MMIO addresses. patch_addr will be a kernel virtual address to
an location in RAM, so you can just treat it as a normal pointer, so:

*(u32 *)patch_addr = override;

A small optimization might be to make override a static const, so that
it doesn't have to be composed every time.

> diff --git a/drivers/gpu/host1x/intr.c b/drivers/gpu/host1x/intr.c
[...]
> +static void action_submit_complete(struct host1x_waitlist *waiter)
> +{
> + struct host1x_channel *channel = waiter->data;
> + int nr_completed = waiter->count;

No need for this variable.

> diff --git a/drivers/gpu/host1x/job.c b/drivers/gpu/host1x/job.c
[...]
> +#ifdef CONFIG_TEGRA_HOST1X_FIREWALL
> +static int host1x_firewall = 1;
> +#else
> +static int host1x_firewall;
> +#endif

You could use IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL) in the code,
which will have the nice side-effect of compiling code out if the symbol
isn't selected.

> +struct host1x_job *host1x_job_alloc(struct host1x_channel *ch,
> + u32 num_cmdbufs, u32 num_relocs, u32 num_waitchks)

Maybe make the parameters unsigned int instead of u32?

> +{
> + struct host1x_job *job = NULL;
> + int num_unpins = num_cmdbufs + num_relocs;

unsigned int?

> + s64 total;

This doesn't need to be signed, u64 will be good enough. None of the
terms in the expression that assigns to total can be negative.

> + void *mem;
> +
> + /* Check that we're not going to overflow */
> + total = sizeof(struct host1x_job)
> + + num_relocs * sizeof(struct host1x_reloc)
> + + num_unpins * sizeof(struct host1x_job_unpin_data)
> + + num_waitchks * sizeof(struct host1x_waitchk)
> + + num_cmdbufs * sizeof(struct host1x_job_gather)
> + + num_unpins * sizeof(dma_addr_t)
> + + num_unpins * sizeof(u32 *);

"+"s at the end of the preceding lines.

> + if (total > ULONG_MAX)
> + return NULL;
> +
> + mem = job = kzalloc(total, GFP_KERNEL);
> + if (!job)
> + return NULL;
> +
> + kref_init(&job->ref);
> + job->ch = ch;
> +
> + /* First init state to zero */
> +
> + /*
> + * Redistribute memory to the structs.
> + * Overflows and negative conditions have
> + * already been checked in job_alloc().
> + */

The last two lines don't really apply here. The checks are in this same
function and they check only for overflow, not negative conditions,
which can't happen anyway since the counts are all unsigned.

> +void host1x_job_get(struct host1x_job *job)
> +{
> + kref_get(&job->ref);
> +}

I think it is common for *_get() functions to return a pointer to the
referenced object.

> +void host1x_job_add_gather(struct host1x_job *job,
> + u32 mem_id, u32 words, u32 offset)
> +{
> + struct host1x_job_gather *cur_gather =
> + &job->gathers[job->num_gathers];

Should this check for overflow?

> +/*
> + * Check driver supplied waitchk structs for syncpt thresholds
> + * that have already been satisfied and NULL the comparison (to
> + * avoid a wrap condition in the HW).
> + */
> +static int do_waitchks(struct host1x_job *job, struct host1x *host,
> + u32 patch_mem, struct mem_handle *h)
> +{
> + int i;
> +
> + /* compare syncpt vs wait threshold */
> + for (i = 0; i < job->num_waitchk; i++) {
> + struct host1x_waitchk *wait = &job->waitchk[i];
> + struct host1x_syncpt *sp =
> + host1x_syncpt_get(host, wait->syncpt_id);
> +
> + /* validate syncpt id */
> + if (wait->syncpt_id > host1x_syncpt_nb_pts(host))
> + continue;
> +
> + /* skip all other gathers */
> + if (patch_mem != wait->mem)
> + continue;
> +
> + trace_host1x_syncpt_wait_check(wait->mem, wait->offset,
> + wait->syncpt_id, wait->thresh,
> + host1x_syncpt_read_min(sp));
> + if (host1x_syncpt_is_expired(
> + host1x_syncpt_get(host, wait->syncpt_id),
> + wait->thresh)) {

You already have the sp variable that you could use here to make it more
readable.

> + struct host1x_syncpt *sp =
> + host1x_syncpt_get(host, wait->syncpt_id);

And you don't need this then, since you already have sp pointing to the
same syncpoint.

> + void *patch_addr = NULL;
> +
> + /*
> + * NULL an already satisfied WAIT_SYNCPT host method,
> + * by patching its args in the command stream. The
> + * method data is changed to reference a reserved
> + * (never given out or incr) NVSYNCPT_GRAPHICS_HOST
> + * syncpt with a matching threshold value of 0, so
> + * is guaranteed to be popped by the host HW.
> + */
> + dev_dbg(&host->dev->dev,
> + "drop WAIT id %d (%s) thresh 0x%x, min 0x%x\n",
> + wait->syncpt_id, sp->name, wait->thresh,
> + host1x_syncpt_read_min(sp));
> +
> + /* patch the wait */
> + patch_addr = host1x_memmgr_kmap(h,
> + wait->offset >> PAGE_SHIFT);
> + if (patch_addr) {
> + host1x_syncpt_patch_wait(sp,
> + (patch_addr +
> + (wait->offset & ~PAGE_MASK)));
> + host1x_memmgr_kunmap(h,
> + wait->offset >> PAGE_SHIFT,
> + patch_addr);
> + } else {
> + pr_err("Couldn't map cmdbuf for wait check\n");
> + }

This is a case where splitting out a small function would actually be
useful to make the code more readable since you can remove two levels of
indentation. You can just pass in the handle and the offset, let it do
the actual patching. Maybe

host1x_syncpt_patch_offset(sp, h, wait->offset);

?

> + }
> +
> + wait->mem = 0;
> + }
> + return 0;
> +}
> +
> +

There's a gratuitous blank line.

> +static int pin_job_mem(struct host1x_job *job)
> +{
> + int i;
> + int count = 0;
> + int result;

These (and the return value) can all be unsigned int.

> +static int do_relocs(struct host1x_job *job,
> + u32 cmdbuf_mem, struct mem_handle *h)
> +{
> + int i = 0;

This can also be unsigned int.

> + int last_page = -1;

And this should match the type of cmdbuf_offset (u32). You can initially
set it to something like ~0 to make sure it doesn't match any valid
offset.

> + void *cmdbuf_page_addr = NULL;
> +
> + /* pin & patch the relocs for one gather */
> + while (i < job->num_relocs) {
> + struct host1x_reloc *reloc = &job->relocarray[i];
> +
> + /* skip all other gathers */
> + if (cmdbuf_mem != reloc->cmdbuf_mem) {
> + i++;
> + continue;
> + }
> +
> + if (last_page != reloc->cmdbuf_offset >> PAGE_SHIFT) {
> + if (cmdbuf_page_addr)
> + host1x_memmgr_kunmap(h,
> + last_page, cmdbuf_page_addr);
> +
> + cmdbuf_page_addr = host1x_memmgr_kmap(h,
> + reloc->cmdbuf_offset >> PAGE_SHIFT);
> + last_page = reloc->cmdbuf_offset >> PAGE_SHIFT;
> +
> + if (unlikely(!cmdbuf_page_addr)) {
> + pr_err("Couldn't map cmdbuf for relocation\n");
> + return -ENOMEM;
> + }
> + }
> +
> + __raw_writel(
> + (job->reloc_addr_phys[i] +
> + reloc->target_offset) >> reloc->shift,
> + (cmdbuf_page_addr +
> + (reloc->cmdbuf_offset & ~PAGE_MASK)));

Again, wrong __raw_writel() usage.

> +
> + /* remove completed reloc from the job */
> + if (i != job->num_relocs - 1) {
> + struct host1x_reloc *reloc_last =
> + &job->relocarray[job->num_relocs - 1];
> + reloc->cmdbuf_mem = reloc_last->cmdbuf_mem;
> + reloc->cmdbuf_offset = reloc_last->cmdbuf_offset;
> + reloc->target = reloc_last->target;
> + reloc->target_offset = reloc_last->target_offset;
> + reloc->shift = reloc_last->shift;
> + job->reloc_addr_phys[i] =
> + job->reloc_addr_phys[job->num_relocs - 1];
> + job->num_relocs--;
> + } else {
> + break;
> + }
> + }
> +
> + if (cmdbuf_page_addr)
> + host1x_memmgr_kunmap(h, last_page, cmdbuf_page_addr);
> +
> + return 0;
> +}

Also the algorithm seems a bit strange and hard to follow. Instead of
removing relocs from the job, replacing them with the last entry and
decrementing job->num_relocs, how much is the penalty for always
iterating over all relocs? This is one of the other cases where I'd
argue that simplicity is key. Furthermore you need to copy quite a bit
of data to replace the completed relocs, so I'm not sure it buys you
much.

It could always be optimized later on by just setting a bit in the reloc
to mark it as completed, or keep a bitmask of completed relocations or
whatever.

> +static int check_reloc(struct host1x_reloc *reloc,
> + u32 cmdbuf_id, int offset)

offset can be unsigned int.

> +{
> + int err = 0;
> + if (reloc->cmdbuf_mem != cmdbuf_id
> + || reloc->cmdbuf_offset != offset * sizeof(u32))
> + err = -EINVAL;
> +
> + return err;
> +}

More canonically:

offset *= sizeof(u32);

if (reloc->cmdbuf_mem != cmdbuf_id || reloc->cmdbuf_offset != offset)
return -EINVAL;

return 0;

> +
> +static int check_mask(struct host1x_job *job,
> + struct platform_device *pdev,
> + struct host1x_reloc **reloc, int *num_relocs,
> + u32 cmdbuf_id, int *offset,
> + u32 *words, u32 class, u32 reg, u32 mask)

num_relocs and offset can be unsigned int *.

Same comment for the other check_*() functions. That said I think the
code would become a lot more readable if you were to wrap all of these
parameters into a structure, say host1x_firewall, and just pass that
into the functions.

> +static inline int copy_gathers(struct host1x_job *job,
> + struct platform_device *pdev)

struct device *

> +{
> + size_t size = 0;
> + size_t offset = 0;
> + int i;
> +
> + for (i = 0; i < job->num_gathers; i++) {
> + struct host1x_job_gather *g = &job->gathers[i];
> + size += g->words * sizeof(u32);
> + }
> +
> + job->gather_copy_mapped = dma_alloc_writecombine(&pdev->dev,
> + size, &job->gather_copy, GFP_KERNEL);
> + if (IS_ERR(job->gather_copy_mapped)) {

dma_alloc_writecombine() returns NULL on failure, so this check is
wrong.

> + int err = PTR_ERR(job->gather_copy_mapped);
> + job->gather_copy_mapped = NULL;
> + return err;
> + }
> +
> + job->gather_copy_size = size;
> +
> + for (i = 0; i < job->num_gathers; i++) {
> + struct host1x_job_gather *g = &job->gathers[i];
> + void *gather = host1x_memmgr_mmap(g->ref);
> + memcpy(job->gather_copy_mapped + offset,
> + gather + g->offset,
> + g->words * sizeof(u32));
> +
> + g->mem_base = job->gather_copy;
> + g->offset = offset;
> + g->mem_id = 0;
> + g->ref = 0;
> +
> + host1x_memmgr_munmap(g->ref, gather);
> + offset += g->words * sizeof(u32);
> + }
> +
> + return 0;
> +}

I wonder, where's this DMA buffer actually used? I can't find any use
between this copy and the corresponding dma_free_writecombine() call.

> +int host1x_job_pin(struct host1x_job *job, struct platform_device *pdev)
> +{
> + int err = 0, i = 0, j = 0;

No need to initialize these here. i and j can also be unsigned.

> + struct host1x *host = host1x_get_host(pdev);
> + DECLARE_BITMAP(waitchk_mask, host1x_syncpt_nb_pts(host));
> +
> + bitmap_zero(waitchk_mask, host1x_syncpt_nb_pts(host));
> + for (i = 0; i < job->num_waitchk; i++) {
> + u32 syncpt_id = job->waitchk[i].syncpt_id;
> + if (syncpt_id < host1x_syncpt_nb_pts(host))
> + set_bit(syncpt_id, waitchk_mask);
> + }
> +
> + /* get current syncpt values for waitchk */
> + for_each_set_bit(i, &waitchk_mask[0], sizeof(waitchk_mask))
> + host1x_syncpt_load_min(host->syncpt + i);
> +
> + /* pin memory */
> + err = pin_job_mem(job);
> + if (err <= 0)
> + goto out;

pin_job_mem() never returns negative.

> + /* patch gathers */
> + for (i = 0; i < job->num_gathers; i++) {
> + struct host1x_job_gather *g = &job->gathers[i];
> +
> + /* process each gather mem only once */
> + if (!g->ref) {
> + g->ref = host1x_memmgr_get(g->mem_id, job->ch->dev);
> + if (IS_ERR(g->ref)) {

host1x_memmgr_get() also seems to return NULL on error.

> + err = PTR_ERR(g->ref);
> + g->ref = NULL;
> + break;
> + }
> +
> + g->mem_base = job->gather_addr_phys[i];
> +
> + for (j = 0; j < job->num_gathers; j++) {
> + struct host1x_job_gather *tmp =
> + &job->gathers[j];
> + if (!tmp->ref && tmp->mem_id == g->mem_id) {
> + tmp->ref = g->ref;
> + tmp->mem_base = g->mem_base;
> + }
> + }
> + err = 0;
> + if (host1x_firewall)

if (IS_ENABLED(CONFIG_TEGRA_HOST1X_FIREWALL))

> + err = validate(job, pdev, g);
> + if (err)
> + dev_err(&pdev->dev,
> + "Job validate returned %d\n", err);
> + if (!err)
> + err = do_relocs(job, g->mem_id, g->ref);
> + if (!err)
> + err = do_waitchks(job, host,
> + g->mem_id, g->ref);
> + host1x_memmgr_put(g->ref);
> + if (err)
> + break;
> + }
> + }
> +
> + if (host1x_firewall && !err) {

And here.

> +/*
> + * Debug routine used to dump job entries
> + */
> +void host1x_job_dump(struct device *dev, struct host1x_job *job)
> +{
> + dev_dbg(dev, " SYNCPT_ID %d\n",
> + job->syncpt_id);
> + dev_dbg(dev, " SYNCPT_VAL %d\n",
> + job->syncpt_end);
> + dev_dbg(dev, " FIRST_GET 0x%x\n",
> + job->first_get);
> + dev_dbg(dev, " TIMEOUT %d\n",
> + job->timeout);
> + dev_dbg(dev, " NUM_SLOTS %d\n",
> + job->num_slots);
> + dev_dbg(dev, " NUM_HANDLES %d\n",
> + job->num_unpins);
> +}

These don't need to be wrapped.

> diff --git a/drivers/gpu/host1x/job.h b/drivers/gpu/host1x/job.h
[...]
> +struct host1x_job_gather {
> + u32 words;
> + dma_addr_t mem_base;
> + u32 mem_id;
> + int offset;
> + struct mem_handle *ref;
> +};
> +
> +struct host1x_cmdbuf {
> + __u32 mem;
> + __u32 offset;
> + __u32 words;
> + __u32 pad;
> +};
> +
> +struct host1x_reloc {
> + __u32 cmdbuf_mem;
> + __u32 cmdbuf_offset;
> + __u32 target;
> + __u32 target_offset;
> + __u32 shift;
> + __u32 pad;
> +};
> +
> +struct host1x_waitchk {
> + __u32 mem;
> + __u32 offset;
> + __u32 syncpt_id;
> + __u32 thresh;
> +};

None of these are shared with userspace, so they shouldn't take the
__u32 types, but the regular u32 ones.

> +/*
> + * Each submit is tracked as a host1x_job.
> + */
> +struct host1x_job {
> + /* When refcount goes to zero, job can be freed */
> + struct kref ref;
> +
> + /* List entry */
> + struct list_head list;
> +
> + /* Channel where job is submitted to */
> + struct host1x_channel *ch;

Maybe write it out as "channel"?

> +
> + int clientid;

Subsequent patches assign u32 to this field, so maybe the type should be
changed here. And maybe leave out the id suffix. It doesn't really add
any information.

> + /* Gathers and their memory */
> + struct host1x_job_gather *gathers;
> + int num_gathers;

unsigned int

> + /* Wait checks to be processed at submit time */
> + struct host1x_waitchk *waitchk;
> + int num_waitchk;

unsigned int

> + u32 waitchk_mask;

This might need to be changed to a bitfield once future Tegra versions
start supporting more than 32 syncpoints.

> + /* Array of handles to be pinned & unpinned */
> + struct host1x_reloc *relocarray;
> + int num_relocs;

unsigned int

> + struct host1x_job_unpin_data *unpins;
> + int num_unpins;

unsigned int

> + dma_addr_t *addr_phys;
> + dma_addr_t *gather_addr_phys;
> + dma_addr_t *reloc_addr_phys;
> +
> + /* Sync point id, number of increments and end related to the submit */
> + u32 syncpt_id;
> + u32 syncpt_incrs;
> + u32 syncpt_end;
> +
> + /* Maximum time to wait for this job */
> + int timeout;

unsigned int. I think we discussed this already in a slightly different
context in patch 2.

> + /* Null kickoff prevents submit from being sent to hardware */
> + bool null_kickoff;

I don't think this is used anywhere.

> + /* Index and number of slots used in the push buffer */
> + int first_get;
> + int num_slots;

unsigned int

> +
> + /* Copy of gathers */
> + size_t gather_copy_size;
> + dma_addr_t gather_copy;
> + u8 *gather_copy_mapped;

Are these really needed? They don't seem to be used anywhere except to
store a copy and free that copy sometime later.

> +
> + /* Temporary space for unpin ids */
> + long unsigned int *pin_ids;

unsigned long

> + /* Check if register is marked as an address reg */
> + int (*is_addr_reg)(struct platform_device *dev, u32 reg, u32 class);

is_addr_reg() sounds a bit unusual. Maybe match this to the name of the
main firewall routine, validate()?

> + /* Request a SETCLASS to this class */
> + u32 class;
> +
> + /* Add a channel wait for previous ops to complete */
> + u32 serialize;

This is used in code as a boolean. Why does it need to be 32 bits?

> diff --git a/drivers/gpu/host1x/memmgr.h b/drivers/gpu/host1x/memmgr.h
[...]
> +struct mem_handle;
> +struct platform_device;
> +
> +struct host1x_job_unpin_data {
> + struct mem_handle *h;
> + struct sg_table *mem;
> +};
> +
> +enum mem_mgr_flag {
> + mem_mgr_flag_uncacheable = 0,
> + mem_mgr_flag_write_combine = 1,
> +};

I'd like to see this use a more object-oriented approach and more common
terminology. All of these handles are essentially buffer objects, so
maybe something like host1x_bo would be a nice and short name.

To make this more object-oriented, I propose something like:

struct host1x_bo_ops {
int (*alloc)(struct host1x_bo *bo, size_t size, unsigned long align,
unsigned long flags);
int (*free)(struct host1x_bo *bo);
...
};

struct host1x_bo {
const struct host1x_bo_ops *ops;
};

struct host1x_cma_bo {
struct host1x_bo base;
struct drm_gem_cma_object *obj;
};

static inline struct host1x_cma_bo *to_host1x_cma_bo(struct host1x_bo *bo)
{
return container_of(bo, struct host1x_cma_bo, base);
}

static inline int host1x_bo_alloc(struct host1x_bo *bo, size_t size,
unsigned long align, unsigned long flags)
{
return bo->ops->alloc(bo, size, align, flags);
}

...

That should be easy to extend with a new type of BO once the IOMMU-based
allocator is ready. And as I said it is much closer in terminology to
what other drivers do.

> diff --git a/drivers/gpu/host1x/syncpt.h b/drivers/gpu/host1x/syncpt.h
> index b46d044..255a3a3 100644
> --- a/drivers/gpu/host1x/syncpt.h
> +++ b/drivers/gpu/host1x/syncpt.h
> @@ -26,6 +26,7 @@
> struct host1x;
>
> #define NVSYNCPT_INVALID (-1)
> +#define NVSYNCPT_GRAPHICS_HOST 0

I think these should match other naming, so:

#define HOST1X_SYNCPT_INVALID -1
#define HOST1X_SYNCPT_HOST1X 0

There are a few more occurrences where platform_device is used but I
haven't commented on them. I don't think any of them won't work with
just a struct device instead. Also I may not have caught all of the
places where you should rather be using unsigned int instead of int,
so you might want to look out for some of those.

Generally I very much like where this is going. Are there any plans to
move the userspace binary driver to this interface at some point so we
can more actively test it? Also, is anything else blocking adding a
gr3d device similar to gr2d from this patch series?

Thierry

Attachment: pgp00000.pgp
Description: PGP signature