Re: [PATCH 2/6] gpu: host1x: Fix syncpoint wait return value

From: Thierry Reding
Date: Tue Jun 11 2013 - 14:19:43 EST


On Tue, Jun 11, 2013 at 02:09:31PM +0200, Daniel Vetter wrote:
> On Tue, Jun 11, 2013 at 1:43 PM, Terje BergstrÃm <tbergstrom@xxxxxxxxxx> wrote:
> > On 11.06.2013 14:00, Daniel Vetter wrote:
> >> We don't use the EAGAIN ioctl restarting to resubmit the batchbuffer
> >> which blew up the gpu (that one has been submitted already in a
> >> different ioctl call), but to be able to restart the ioctl after the
> >> reset has completed: We need to kick every thread which is potentially
> >> holding GEM locks and make sure that we block them (at a point where
> >> they don't hold any locks) until the reset handler completed. To avoid
> >> a validation nightmare we use the same codepaths as we use for signal
> >> interrupts, so ioctl restarting is a very natural fit for this.
> >>
> >> Resubmitting victim workloads when a gpu crash happened is something
> >> the reset handler would do (kernel work item currently), not any
> >> userspace process doing an ioctl. But atm we don't resubmit victimized
> >> workloads.
> >
> > I don't understand the end-to-end of how resubmit is supposed to work.
> > User space is not supposed to resubmit, but still EAGAIN is returned to
> > user space, and drmIoctl() in user space just calls the came ioctl
> > again. Sounds like drmIoctl() is completely wrong.
>
> Maybe it wasn't clear, but -EAGAIN does _not_ resubmit work. -EAGAIN
> is used to restart the ioctl if we had to kick a thread (to make sure
> it doesn't hold any locks), e.g. for a blocking wait on oustanding
> rendering. The codepaths taken work exactly as if the thread is
> interrupt with a signal.

Okay. So if I understand correctly that reserves EAGAIN for a very
specific purpose DRM-wide and therefore we can't really use it for
something Tegra-specific. Even if we wanted to side-step the issue
by not using drmIoctl(), it might be a bad idea (or even break in
some other path) if we use EAGAIN. I guess we'll have to find some
other error-code for the case where a zero timeout is given to the
syncpoint wait ioctl.

Maybe it's best to use ETIMEDOUT after, just like for the case of a
non-zero timeout and the syncpoint not being incremented in time.
Userspace should be able to differentiate between the two because it
knows what timeout it did pass to the ioctl.

Thierry

Attachment: pgp00000.pgp
Description: PGP signature