Re: [PATCH v2 00/25] AMDKFD kernel driver

From: Oded Gabbay
Date: Tue Jul 22 2014 - 04:20:13 EST


On 22/07/14 10:28, Daniel Vetter wrote:
On Mon, Jul 21, 2014 at 03:03:07PM -0400, Jerome Glisse wrote:
On Mon, Jul 21, 2014 at 09:41:29PM +0300, Oded Gabbay wrote:
On 21/07/14 21:22, Daniel Vetter wrote:
On Mon, Jul 21, 2014 at 7:28 PM, Oded Gabbay <oded.gabbay@xxxxxxx> wrote:
I'm not sure whether we can do the same trick with the hw scheduler. But
then unpinning hw contexts will drain the pipeline anyway, so I guess we
can just stop feeding the hw scheduler until it runs dry. And then unpin
and evict.
So, I'm afraid we can't do this for AMD Kaveri because:

Well as long as you can drain the hw scheduler queue (and you can do
that, worst case you have to unmap all the doorbells and other stuff
to intercept further submission from userspace) you can evict stuff.

I can't drain the hw scheduler queue, as I can't do mid-wave preemption.
Moreover, if I use the dequeue request register to preempt a queue
during a dispatch it may be that some waves (wave groups actually) of
the dispatch have not yet been created, and when I reactivate the mqd,
they should be created but are not. However, this works fine if you use
the HIQ. The CP ucode correctly saves and restores the state of an
outstanding dispatch. I don't think we have access to the state from
software at all, so it's not a bug, it is "as designed".


I think Daniel is suggesting here that we unmap the doorbell page and track
each write userspace makes to it. While it is unmapped, we wait for the GPU
to drain, or use some kind of fence on a special queue. Once the GPU is
drained we can move the pinned buffer, then remap the doorbell and update it
to the last value written by userspace, which resumes execution at the next
job.
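
To make the sequence above concrete, here is a minimal userspace simulation of the doorbell-intercept idea. It is only a sketch under stated assumptions: the struct and every sim_* name are hypothetical and are not amdkfd or radeon code, and it assumes the hardware can always run its queue dry, which is exactly what is disputed for Kaveri earlier in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one user queue; none of these names exist in amdkfd. */
struct sim_queue {
    bool     doorbell_mapped; /* true: doorbell writes reach the hardware */
    uint32_t hw_wptr;         /* write pointer the hardware currently sees */
    uint32_t hw_rptr;         /* how far the hardware has consumed */
    uint32_t shadow_wptr;     /* last value userspace tried to write */
};

/* Userspace rings the doorbell. While the page is unmapped, the write is
 * only recorded in the shadow copy and never reaches the hardware. */
static void sim_doorbell_write(struct sim_queue *q, uint32_t wptr)
{
    q->shadow_wptr = wptr;
    if (q->doorbell_mapped)
        q->hw_wptr = wptr;
}

/* Stand-in for the GPU consuming one packet; returns false once idle. */
static bool sim_hw_step(struct sim_queue *q)
{
    if (q->hw_rptr == q->hw_wptr)
        return false;
    q->hw_rptr++;
    return true;
}

/* Step 1: stop feeding the hardware, then wait for it to run dry.
 * After this returns, pinned buffers could be moved. */
static void sim_evict_begin(struct sim_queue *q)
{
    q->doorbell_mapped = false;
    while (sim_hw_step(q))
        ;
    assert(q->hw_rptr == q->hw_wptr);
}

/* Step 2: remap the doorbell and replay the last value userspace wrote,
 * so execution resumes with the next job. */
static void sim_evict_end(struct sim_queue *q)
{
    q->doorbell_mapped = true;
    q->hw_wptr = q->shadow_wptr;
}
```

A caller would run sim_evict_begin(), move the buffers, then run sim_evict_end(). Any doorbell write made in between is held in shadow_wptr and only becomes visible to the simulated hardware after the remap.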

Exactly, just prevent userspace from submitting more. And if you have
misbehaving userspace that submits too much, reset the gpu and tell it
that you're sorry but won't schedule any more work.

I'm not sure how you intend to know whether userspace is misbehaving. Can you elaborate?

Oded

We have this already in i915 (since like all other gpus we're not
preempting right now) and it works. There's some code floating around to
even restrict the reset to _just_ the offending submission context, with
nothing else getting corrupted.
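
One way to picture the "reset and refuse further work" policy is a drain with a time budget. The sketch below is a toy model, not i915 code: the sim_* names, the tick budget, and the one-tick-per-work-unit accounting are all assumptions made for illustration. A context that fails to go idle within its budget is treated as misbehaving, its queued work is discarded (standing in for a reset limited to that context), and later submissions from it are rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context state; not taken from any real driver. */
struct sim_ctx {
    uint32_t pending; /* work units still queued on the hardware */
    bool     banned;  /* set once the context was reset for overrunning */
};

/* Try to drain the context within budget_ticks. Each tick retires one
 * work unit in this toy model. Returns true if the context went idle in
 * time; otherwise drops its work and bans it. */
static bool sim_drain_or_ban(struct sim_ctx *c, uint32_t budget_ticks)
{
    assert(c);
    while (c->pending > 0 && budget_ticks > 0) {
        c->pending--;
        budget_ticks--;
    }
    if (c->pending == 0)
        return true;

    c->pending = 0;   /* models a reset restricted to this context */
    c->banned = true; /* no more work will be scheduled for it */
    return false;
}

/* Submission path: banned contexts are turned away. */
static bool sim_submit(struct sim_ctx *c, uint32_t units)
{
    assert(c);
    if (c->banned)
        return false;
    c->pending += units;
    return true;
}
```

The budget is the only signal used here to decide that a context is misbehaving; a real driver would need a more careful criterion, and other contexts are untouched by the ban.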

You can do all this with the doorbells and unmapping them, but it's a
pain. Much easier if you have a real ioctl, and I haven't seen anyone with
perf data indicating that an ioctl would be too much overhead on linux.
Neither in this thread nor internally here at intel.
-Daniel

