Re: [PATCH 0/7] devcg: device cgroup extension for rdma resource

From: Parav Pandit
Date: Mon Sep 14 2015 - 10:04:17 EST


Hi Tejun,

I missed acknowledging your point that we need both - a hard limit and
a soft limit/weight. The current patchset is based only on hard limits.
I see weight as another helpful layer in the chain; can we implement it
after this as an incremental change, so that review and debugging stay
manageable?
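
To make the distinction concrete, here is a rough sketch of the two
semantics (illustrative only - this is not code or an ABI from this
series): a hard limit is an absolute per-cgroup cap, while a weight
only shapes a cgroup's share of the device pool under contention.

#include <stdio.h>

/* hard limit: a new allocation fails once the cgroup hits its cap */
static int hard_limit_allows(int in_use, int limit)
{
        return in_use < limit;
}

/* weight: the cgroup's share of the contended pool is proportional
 * to its weight relative to all active cgroups */
static int weight_share(int pool, int my_weight, int total_weight)
{
        return pool * my_weight / total_weight;
}

int main(void)
{
        printf("hard limit 100, 100 QPs in use -> allowed: %d\n",
               hard_limit_allows(100, 100));
        printf("weight 500 of 1500, pool of 300 QPs -> share: %d QPs\n",
               weight_share(300, 500, 1500));
        return 0;
}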

Parav



On Mon, Sep 14, 2015 at 4:39 PM, Parav Pandit <pandit.parav@xxxxxxxxx> wrote:
> On Sat, Sep 12, 2015 at 1:36 AM, Hefty, Sean <sean.hefty@xxxxxxxxx> wrote:
>>> > Trying to limit the number of QPs that an app can allocate,
>>> > therefore, just limits how much of the address space an app can use.
>>> > There's no clear link between QP limits and HW resource limits,
>>> > unless you assume a very specific underlying implementation.
>>>
>>> Isn't that the point though? We have several vendors with hardware
>>> that does impose hard limits on specific resources. There is no way to
>>> avoid that, and ultimately, those exact HW resources need to be
>>> limited.
>>
>> My point is that limiting the number of QPs that an app can allocate doesn't necessarily mean anything. Is allocating 1000 QPs with 1 entry each better or worse than 1 QP with 10,000 entries? Who knows?
>
> I think it does mean something: if it is an RDMA RC QP, it determines
> whether you can talk to 1000 nodes or to 1 node in the network.
> When we deploy an MPI application, it knows its rank, we know the
> cluster size of the deployment, and resource allocation can be done
> based on that.
> If you meant it from a performance point of view, then resource count
> is possibly not the right measure.
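>
> A rough sketch of that link (it assumes a fully connected RC pattern,
> one QP per remote rank; the job size is only an example):
>
> #include <stdio.h>
>
> int main(void)
> {
>         int ranks = 1000;             /* cluster-wide job size */
>         int qps_per_rank = ranks - 1; /* one RC QP per remote peer */
>
>         printf("each rank needs %d QPs to reach all peers\n",
>                qps_per_rank);
>         printf("a hard limit of 100 QPs caps reachable peers at 100\n");
>         return 0;
> }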
>
> Just because we have not defined those performance interfaces today
> in this patch set doesn't mean that we won't do it.
> I could easily see number_of_messages/sec as one interface to be
> added in the future.
> But that alone won't stop process hoarders from taking away all the
> QPs, just as we needed the PID controller.
>
> Now when it comes to the Intel implementation: with future new APIs,
> the driver layer could know whether 10 or 100 user QPs should map to
> a few hw QPs or to more hw QPs (uSNIC),
> so that the hw QPs exposed to one cgroup are isolated from the hw QPs
> exposed to another cgroup.
> If the hw implementation doesn't require isolation, it could keep
> allocating from a single pool; it is left to the vendor implementation
> how to use this information (such an API is not present in the patch).
>
> So the cgroup also provides a control point for the vendor layer to
> tune internal resource allocation based on the provided metrics, which
> cannot be done by just providing "memory usage by RDMA structures".
>
> If I compare this with other cgroup knobs, low-level individual knobs
> by themselves don't serve any meaningful purpose either.
> Just defining how much CPU or how much memory to use does not define
> the application's performance either.
> I am not sure the io controller can achieve 10 million IOPS when given
> a single CPU and 64KB of memory;
> all the knobs need to be set in the right way to reach the desired
> number.
>
> Along similar lines, RDMA resource knobs taken individually are not a
> definition of performance; each is just another knob.
>
>>
>>> If we want to talk about abstraction, then I'd suggest something very
>>> general and simple - two limits:
>>> '% of the RDMA hardware resource pool' (per device or per ep?)
>>> 'bytes of kernel memory for RDMA structures' (all devices)
>>
>> Yes - this makes more sense to me.
>>
>
> Sean, Jason,
> Help me to understand this scheme.
>
> 1. How is a % of a resource different from an absolute number? In the
> rest of the cgroup subsystems we define absolute numbers in most
> places, to my knowledge,
> such as (a) number of TCP bytes, (b) IOPS of a block device, (c) CPU
> cycles, etc.
> 20% of QPs = 20 QPs when the hw has 100 QPs.
> I prefer to keep the resource scheme consistent with the other
> resource control points - i.e. an absolute number.
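>
> A quick illustration of why the percentage looks ambiguous to me (the
> device capacities below are made up):
>
> #include <stdio.h>
>
> int main(void)
> {
>         int pct = 20;
>         int small_hca_qps = 100;         /* a constrained device     */
>         int big_hca_qps = 256 * 1024;    /* a large HCA, for example */
>
>         /* the same knob value means very different absolute counts */
>         printf("20%% on the small HCA = %d QPs\n",
>                small_hca_qps * pct / 100);
>         printf("20%% on the big HCA   = %d QPs\n",
>                big_hca_qps * pct / 100);
>         return 0;
> }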
>
> 2. Bytes of kernel memory for RDMA structures:
> One vendor's QP might consume X bytes and another vendor's Y bytes.
> How does the application know how much memory to ask for?
> An application can allocate 100 QPs of 1 entry each, or 1 QP that is
> 100 entries deep, as in Sean's example;
> both might consume almost the same memory.
> An application allocating 100 QPs, while still within the cgroup's
> memory limit, leaves other applications without any QP.
> I don't see the point of a memory-footprint-based scheme, as memory
> limits are already well addressed by the much smarter memory
> controller anyway.
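>
> A back-of-the-envelope sketch of that problem (the structure sizes
> below are made up and are vendor specific in practice):
>
> #include <stdio.h>
>
> int main(void)
> {
>         int budget = 64 * 1024;  /* hypothetical per-cgroup bytes limit */
>         int qp_ctx = 256;        /* per-QP context size, illustrative   */
>         int wqe = 64;            /* per work-queue entry, illustrative  */
>
>         int shallow = 100 * (qp_ctx + 1 * wqe);  /* 100 QPs, 1 entry each  */
>         int deep = 1 * (qp_ctx + 100 * wqe);     /* 1 QP, 100 entries deep */
>
>         /* both fit in the same bytes budget, yet one takes 100 QPs
>          * out of the hw pool and the other only 1 */
>         printf("100 shallow QPs: %d bytes (within budget: %s)\n",
>                shallow, shallow <= budget ? "yes" : "no");
>         printf("1 deep QP:       %d bytes (within budget: %s)\n",
>                deep, deep <= budget ? "yes" : "no");
>         return 0;
> }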
>
> I do agree with Tejun and Sean on the point that the abstraction
> level for using RDMA has to be different, and that is why libfabrics
> and other interfaces are emerging, which will take their own time to
> stabilize and get integrated.
>
> As long as the pure IB-style, resource-based RDMA programming model
> is what exists, I think the control point also has to be on
> resources.
> Once a stable abstraction level is on the table (possibly across
> fabrics, not just RDMA), then the right resource controller can be
> implemented.
> Even when an RDMA abstraction layer arrives, as Jason mentioned, in
> the end it will consume some hw resources anyway, and those need to
> be controlled too.
>
> Jason,
> If the hardware vendor defines the resource pool without saying
> whether its resources are QPs or MRs, how would the management/control
> point actually decide what should be controlled, and to what limit?
> We would then need an additional user-space library component to
> decode it, after which it would need to be abstracted out as QPs or
> MRs so that the application layer can deal with it in a
> vendor-agnostic way.
> And then it would look similar to what is being proposed here?
--