Re: Interacting with coherent memory on external devices

From: Jerome Glisse
Date: Thu Apr 23 2015 - 11:42:48 EST


On Thu, Apr 23, 2015 at 09:10:13AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > > Anyone
> > > wanting performance (and that is the prime reason to use a GPU) would
> > > switch this off because the latencies are otherwise not controllable and
> > > those may impact performance severely. There are typically multiple
> > > parallel strands of execution that must execute with similar performance
> > > in order to allow a data exchange at defined intervals. That is no longer
> > > possible if you add variances that come with the "transparency" here.
> >
> > Stop trying to apply your unique usage model to the entire world :-)
>
> Many of the HPC apps the world is using are severely impacted by what
> you are proposing. It's the industry's usage model, not mine. That is why I
> was asking about the use case. It does not seem to fit the industry you are
> targeting. This is also the basic design principle that got GPUs to work
> as fast as they do today. Introducing random memory latencies will
> kill much of the benefit of GPUs there too.

We obviously have different experience, and I fear yours is restricted to
a specific, uncommon application. You care about latency; all my previous
experience (I developed applications for HPC platforms in the past) is
that latency is not the issue, throughput is. For instance, I developed
on an HPC system where the data came from magnetic tape; latency there was
several minutes before the data started streaming (yes, a robot arm had
to pick the tape and load it into one of the available readers). The
people I interacted with across various fields (physics, biology, data
mining) were not worried one bit about latency. They could not have cared
less about it, actually. What they cared about was overall throughput and
ease of use.

You need to stop thinking HPC == low latency. Low latency is only useful
in time-critical applications such as the high frequency trading you seem
to care about. People working on physics, biology, data mining, CAD,
... care more about throughput than latency. I strongly believe that
this covers a far greater number of HPC users than yours (maybe not in
terms of money ... alas).

On the GPU front I have a lot of experience, more than 15 years working on
open source drivers for them. I would like to think that I have a clue or
two about how they work. So when I say latency is not the primary concern
in most cases, I do mean it. A GPU is about having many threads in flight
and hiding memory latency behind those many threads. If you have
1000 "cores" on a GPU and 5000 threads in flight, then there is a good
chance that, no matter the memory latency, on each clock cycle you will
still have 1000 threads ready to compute something.
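
To make the arithmetic concrete, here is a back-of-the-envelope sketch of
that latency hiding. The latency and cycle counts are illustrative
assumptions picked to match the numbers above, not measurements of any
real GPU:

/* Latency hiding through thread oversubscription: with enough threads
 * in flight, some are always ready even while most wait on memory.
 * All numbers below are made-up assumptions for illustration.
 */
#include <stdio.h>

int main(void)
{
	const int cores = 1000;              /* execution lanes per cycle */
	const int threads_in_flight = 5000;
	const int mem_latency_cycles = 400;  /* assumed memory latency */
	const int compute_cycles = 100;      /* assumed work between loads */

	/* Fraction of its lifetime a thread spends stalled on memory. */
	double stalled = (double)mem_latency_cycles /
			 (mem_latency_cycles + compute_cycles);

	/* Expected number of threads ready to issue on a given cycle. */
	double ready = threads_in_flight * (1.0 - stalled);

	printf("threads ready per cycle: %.0f (need %d to saturate)\n",
	       ready, cores);
	return 0;
}

With these numbers 80% of the threads are stalled at any moment, yet the
remaining 20% of 5000 is still 1000 ready threads, enough to keep every
"core" busy every cycle.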

I am not saying latency never matters; it is all about the kind of app
that is running, how much data it needs to consume, and how many threads
the hardware can keep in flight at the same time.

So yes, autonuma-like solutions are worth investigating; as a matter of
fact, even today drivers use heuristics (taking into account hints
provided by userspace) to decide what to put into video memory and what
not to.
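
As a rough illustration of what such a heuristic looks like, here is a
hypothetical sketch; the struct, field names, and threshold are invented
for this example and do not correspond to any real driver's interface:

#include <stdbool.h>
#include <stddef.h>

#define HOT_THRESHOLD 1000	/* made-up cutoff for "frequently accessed" */

enum placement { PLACE_VRAM, PLACE_SYSTEM };

struct buffer {
	size_t size;
	unsigned long access_count;	/* accesses observed by the driver */
	bool userspace_prefers_vram;	/* placement hint from userspace */
};

static enum placement pick_placement(const struct buffer *buf,
				     size_t vram_free)
{
	/* Honor the userspace hint when there is room in video memory... */
	if (buf->userspace_prefers_vram && buf->size <= vram_free)
		return PLACE_VRAM;

	/* ...otherwise fall back on observed access frequency. */
	if (buf->access_count > HOT_THRESHOLD && buf->size <= vram_free)
		return PLACE_VRAM;

	return PLACE_SYSTEM;
}

The point is only that the decision can combine userspace hints with
runtime observation; the kernel can apply the same kind of policy across
processes when the resource is shared.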

For many applications the driver stack will be able to provide good
hints on what to migrate or not, but you still need to think about
multiple processes, and so you need to share resources. It is the role
of the kernel to share resources among processes; it always has been.


Now, for your use case, you know beforehand how many processes there
are going to be, so you can partition the resources accordingly and make
better tailored decisions on where things should reside. But again,
this is not the common case. No HPC system I know of can predict the
number of processes or partition resources for them in advance. Programs
that run on those systems are updated frequently, and you need to share
resources with others. For all those people, and for people just working
on a workstation, the autonuma solution is most likely the best. It
might not lead to 100% saturation of the GPU, but it will be good enough
to make a difference.

The NUMA code we have today for the CPU case exists because it makes
a difference, but you keep trying to restrict GPU users to one specific
workload. Go talk to people doing physics, biology, data mining, or CAD:
most of them do not care about latency. They have no hard deadline to
meet with their computation. They just want things to compute as fast as
possible, and programming to be as easy as it can get.

Jérôme