Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

From: Jason Gunthorpe
Date: Mon Sep 14 2015 - 16:19:38 EST


On Tue, Sep 15, 2015 at 12:24:41AM +0530, Parav Pandit wrote:
> On Mon, Sep 14, 2015 at 10:58 PM, Jason Gunthorpe
> <jgunthorpe@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Sep 14, 2015 at 04:39:33PM +0530, Parav Pandit wrote:
> >
> >> 1. How is a % of resource different from an absolute number? With the
> >> rest of the cgroup subsystems we define absolute numbers in most
> >> places, to my knowledge.
> >
> > There isn't really much choice if the abstraction is a bundle of all
> > resources. You can't use an absolute number unless every possible
> > hardware limited resource is defined, which doesn't seem smart to me
> > either.
>
> An absolute number or a percentage is a representation of a given property.
> That property needs a definition, doesn't it?
> How do we say that we give a certain amount of "some undefined" resource,
> which the user doesn't know how to administer or configure?
> It has to be a quantifiable entity.

Each vendor can quantify exactly what HW resources their
implementation has and how the above limit impacts their card. There
will be many variations, and IIRC, some vendors have resource pools
not directly related to the standard PD/QP/MR/CQ/AH verbs resources.

> > It is not abstract enough, and doesn't match our universe of
> > hardware very well.

> Why does the user need to know the actual hardware resource limits or
> define hardware-based resources?

Because actual hardware resources *ARE* the limit. We cannot abstract
it away. The hardware/driver has real, fixed, immutable limits. No API
abstraction can possibly change that.

The limits are such that there *IS NO* API boundary that can bundle them
into something simpler. There will always be apps that require wildly
different ratios of the basic verbs resources (PD/QP/CQ/AH/MR).

Either we control each and every vendor's limited resource directly
(which is where you started), or we just roll them up into an 'all
resource' bundle and control them indirectly. There just isn't a
mythical third 'better API' choice with the hardware we have today.

> (a) how many RDMA connections are allowed instead of QP, CQ or AH.
> (b) how many data transfer buffers to use.

None of that accurately reflects what the real HW limits actually are.

> > ie Presumably some fairly small limitation like 10MB is enough for
> > most non-MPI jobs.
>
> A container application can always write a simple for loop to take away
> the majority of QPs within a 10MB limit.

No, the HW and kmem limits must work together; the HW limit would
prevent exhaustion outside the container.
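
As a rough illustration of the two limits acting together, here is a minimal
sketch in Python. The per-QP kmem cost, the HW share, and the function names
are all hypothetical and not taken from any patch; the point is only that an
allocation has to pass both checks, so the tighter limit decides.

```python
# Hypothetical numbers, for illustration only.
QP_KMEM_BYTES = 4096            # assumed kernel memory pinned per QP
KMEM_LIMIT = 10 * 1024 * 1024   # the 10MB per-container kmem limit
HW_QP_LIMIT = 1000              # assumed per-container share of the HW QP pool


def max_qps(kmem_limit, qp_kmem_bytes, hw_qp_limit):
    """A QP allocation must pass both checks, so the tighter one wins."""
    by_kmem = kmem_limit // qp_kmem_bytes
    return min(by_kmem, hw_qp_limit)


def alloc_qps(requested, kmem_limit, qp_kmem_bytes, hw_qp_limit):
    """Simulate the 'simple for loop'; return how many QPs were granted."""
    granted = 0
    kmem_used = 0
    for _ in range(requested):
        if kmem_used + qp_kmem_bytes > kmem_limit:
            break  # kmem limit reached first
        if granted + 1 > hw_qp_limit:
            break  # HW limit reached first
        kmem_used += qp_kmem_bytes
        granted += 1
    return granted
```

With these made-up values the kmem limit alone would admit 2560 QPs, but the
loop stops at the HW limit of 1000.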

> Imagine instead of tcp_bytes or kmem bytes, it's "some memory
> resource", how would someone debug/tune a system with abstract knobs?

Well, we have the memcg controller that does track kmem. The subsystem
specific kmem limit is to force fair sharing of the limited kmem
resource within the overall memcg limit.

They are complementary.

A fictional rdma_kmem and tcp_kmem would serve very similar purposes.

> > UAPI wise, nobody has to care if the limit is actually # of QPs or
> > something else.

> If we don't care about the resource, we cannot tune or limit it. The number
> of MRs used by MPI vs rsocket vs accelio is way different.

So? I don't think it is really important to have an exact, precise
limit. The HW pools are pretty big; unless you plan to run tens of
thousands of containers each with tiny RDMA limits, it is fine to talk
in broader terms (i.e. 10% of all HW limited resources), which is totally
adequate to hard-prevent runaway or exhaustion scenarios.
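
To make the 'percent of everything' idea concrete, a minimal sketch in Python.
The pool names and sizes below are placeholders, and `bundle_limits` and
`try_charge` are hypothetical helpers, not an existing kernel or verbs
interface. A single knob scales every HW limited pool, and the driver would
supply the real maximums.

```python
# Placeholder pool sizes; real values would come from the driver/HW.
HW_MAX = {"pd": 32768, "qp": 262144, "cq": 65536, "ah": 1048576, "mr": 524288}


def bundle_limits(hw_max, percent):
    """Scale every HW limited pool by the same percentage (rounding down)."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return {name: (maximum * percent) // 100 for name, maximum in hw_max.items()}


def try_charge(usage, limits, name, count=1):
    """Charge `count` objects of pool `name`; refuse if the cap is exceeded."""
    if usage.get(name, 0) + count > limits[name]:
        return False
    usage[name] = usage.get(name, 0) + count
    return True


limits = bundle_limits(HW_MAX, 10)  # one container gets 10% of each pool
```

The user only ever sets the one percentage; which pools exist, and how large
they are, stays a detail of the vendor's driver.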

Jason