Re: RFC rdma cgroup

From: Parav Pandit
Date: Thu Oct 29 2015 - 14:46:17 EST


Hi Haggai,

On Thu, Oct 29, 2015 at 8:27 PM, Haggai Eran <haggaie@xxxxxxxxxxxx> wrote:
> On 28/10/2015 10:29, Parav Pandit wrote:
>> 3. Resources are not defined by the RDMA cgroup. Resources are defined
>> by RDMA/IB subsystem and optionally by HCA vendor device drivers.
>> Rationale: This allows rdma cgroup to remain constant while RDMA/IB
>> subsystem can evolve without the need of rdma cgroup update. A new
>> resource can be easily added by the RDMA/IB subsystem without touching
>> rdma cgroup.
> Resources exposed by the cgroup are basically a UAPI, so we have to be
> careful to make it stable when it evolves. I understand the need for
> vendor specific resources, following the discussion on the previous
> proposal, but could you write on how you plan to allow these set of
> resources to evolve?

It's fairly simple.
Here is a code snippet showing how resources are defined in my tree.
It doesn't have the RSS work queues yet, but they can be added right
after this patch.

Resources are defined as an enum index and as a match_table_t.

enum rdma_resource_type {
	RDMA_VERB_RESOURCE_UCTX,
	RDMA_VERB_RESOURCE_AH,
	RDMA_VERB_RESOURCE_PD,
	RDMA_VERB_RESOURCE_CQ,
	RDMA_VERB_RESOURCE_MR,
	RDMA_VERB_RESOURCE_MW,
	RDMA_VERB_RESOURCE_SRQ,
	RDMA_VERB_RESOURCE_QP,
	RDMA_VERB_RESOURCE_FLOW,
	RDMA_VERB_RESOURCE_MAX,
};
So UAPI RDMA resources can evolve by just adding more entries here.
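As a rough sketch of how the match_table_t side could pair with the
enum above (the table name and token strings below are placeholders of
mine, not necessarily what the final patch uses), the cgroup file
parser would map user-visible tokens to resource indices:

static match_table_t rdmacg_resource_tokens = {
	{RDMA_VERB_RESOURCE_UCTX,	"uctx=%d"},
	{RDMA_VERB_RESOURCE_AH,		"ah=%d"},
	{RDMA_VERB_RESOURCE_PD,		"pd=%d"},
	{RDMA_VERB_RESOURCE_CQ,		"cq=%d"},
	{RDMA_VERB_RESOURCE_MR,		"mr=%d"},
	{RDMA_VERB_RESOURCE_MW,		"mw=%d"},
	{RDMA_VERB_RESOURCE_SRQ,	"srq=%d"},
	{RDMA_VERB_RESOURCE_QP,		"qp=%d"},
	{RDMA_VERB_RESOURCE_FLOW,	"flow=%d"},
	{RDMA_VERB_RESOURCE_MAX,	NULL},
};

The "%d" in each pattern lets the same linux/parser.h helpers read the
limit value the administrator configured for that token.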

>
>> 8. Typically each RDMA cgroup will have 0 to 4 RDMA devices. Therefore
>> each cgroup will have 0 to 4 verbs resource pool and optionally 0 to 4
>> hw resource pool per such device.
>> (Nothing prevents having more devices and pools, but the design is
>> geared around this use case).
> In what way does the design depend on this assumption?

When the current code performs resource charging/uncharging, it needs
to identify which resource pool to charge.
The resource pools are maintained on a list_head, so it is a linear
search per device.
If we are thinking of hundreds of RDMA devices per container, then a
linear search will not be a good fit and a different data structure
needs to be deployed.
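To illustrate what that lookup looks like, here is a rough sketch
(structure and function names are my own for illustration, not
necessarily the patch's) of the per-cgroup pool list and its linear
walk:

struct rdmacg_resource_pool {
	struct list_head cg_list;	/* linked on the cgroup's pool list */
	struct ib_device *device;	/* device this pool accounts for */
	/* per-resource usage/limit counters would follow here */
};

static struct rdmacg_resource_pool *
find_pool(struct list_head *pool_head, struct ib_device *device)
{
	struct rdmacg_resource_pool *pool;

	list_for_each_entry(pool, pool_head, cg_list)
		if (pool->device == device)
			return pool;
	return NULL;
}

With the expected 0 to 4 devices per cgroup this walk is a few pointer
comparisons; only with hundreds of devices would a hash table or
similar become worthwhile.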


>
>> 9. Resource pool object is created in following situations.
>> (a) administrative operation is done to set the limit and no previous
>> resource pool exist for the device of interest for the cgroup.
>> (b) no resource limits were configured, but IB/RDMA subsystem tries to
>> charge the resource. so that when applications are running without
>> limits and later on when limits are enforced, during uncharging, it
>> correctly uncharges them, otherwise usage count will drop to negative.
>> This is done using default resource pool.
>> Instead of implementing any sort of time markers, default pool
>> simplifies the design.
> Having a default resource pool kind of implies there is a non-default
> one. Is the only difference between the default and non-default the fact
> that the second was created with an administrative operation and has
> specified limits or is there some other difference?
>
You described it correctly.
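For illustration, the default-pool path could look roughly like the
sketch below (it reuses find_pool() from the earlier sketch and adds a
hypothetical is_default flag; none of these names are from the actual
patch): when a charge arrives and no admin-configured pool exists, a
pool with unlimited defaults is created so the later uncharge has
something to decrement.

static struct rdmacg_resource_pool *
get_or_create_pool(struct rdma_cgroup *cg, struct ib_device *device)
{
	struct rdmacg_resource_pool *pool;

	pool = find_pool(&cg->pool_head, device);
	if (pool)
		return pool;

	/* No pool configured by the administrator yet: create a default
	 * pool with unlimited counters so usage never goes negative on
	 * uncharge. */
	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
	if (!pool)
		return NULL;

	pool->device = device;
	pool->is_default = true;	/* hypothetical flag: not admin-created */
	list_add_tail(&pool->cg_list, &cg->pool_head);
	return pool;
}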

>> (c) When a process migrates from one cgroup to another, the resource
>> continues to be owned by the creator cgroup (rather, its css).
>> After process migration, whenever a new resource is created in the
>> new cgroup, it will be owned by the new cgroup.
> It sounds a little different from how other cgroups behave. I agree that
> mostly processes will create the resources in their cgroup and won't
> migrate, but why not move the charge during migration?
>
With fork(), a process doesn't really own the resource (unlike other
file and socket descriptors).
The parent process might have died as well.
There is possibly no clear way to transfer the resource to the right
child.
The child that the cgroup picks might not even want to own RDMA
resources.
RDMA resources might be allocated by one process and freed by another
(though this might not be the way they are typically used).
It's pretty similar to other cgroups, with an exception in the
migration area; that exception comes from the different way RDMA
resources are owned, created and used.
Tejun's recent unified hierarchy patches equally recommend not
migrating processes among cgroups frequently.

So in the current implementation (like other cgroups):
a process creates an RDMA resource and forks a child;
child and parent can both allocate and free more resources;
the child moves to a different cgroup, but the resource is shared
between them;
the child can also free the resource. All crazy combinations are
possible in theory (without many real use cases).
So at best the resources are charged to the first cgroup css in which
parent/child created them, and a reference is held to that css.
The cgroup and the process can die, but the css remains until the RDMA
resources are freed.
This is similar to process behavior, where the task struct is released
but its id is held for a while.
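To make that ownership rule concrete, a minimal sketch (all names
below are hypothetical, not from the patch) of charging the creator
css and pinning it until the resource is destroyed, independent of any
later migration:

struct rdma_res_owner {
	struct cgroup_subsys_state *owner_css;	/* css charged at creation */
};

static void rdmacg_remember_owner(struct rdma_res_owner *res,
				  struct cgroup_subsys_state *css)
{
	css_get(css);		/* pin the creator css */
	res->owner_css = css;
}

static void rdmacg_forget_owner(struct rdma_res_owner *res)
{
	/* Runs when the resource itself is destroyed, possibly long
	 * after the task migrated to another cgroup or exited. */
	css_put(res->owner_css);
	res->owner_css = NULL;
}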


> I finally wanted to ask about other limitations an RDMA cgroup could
> handle. It would be great to be able to limit a container to be allowed
> to use only a subset of the MAC/VLAN pairs programmed to a device,

Truly, I agree. That was one of the prime reasons I originally had it
as part of the device cgroup, where RDMA was just one category.
But Tejun's opinion was to have rdma's own cgroup.
The current internal data structures and the interface between the
rdma cgroup and uverbs are tied to the ib_device structure,
which I think is easy to overcome by abstracting it out as a new
resource_device that can be used beyond RDMA as well.

However, my bigger concern is the interface to user land.
We already have two use cases, and I am inclined to make it a
"device resource cgroup" instead of an "rdma cgroup".
I seek Tejun's input here.
The initial implementation can expose rdma resources under the device
resource cgroup; as it evolves we can add other net resources such as
mac and vlan, as you described.

> or
> only a subset of P_Keys and GIDs it has. Do you see such limitations
> also as part of this cgroup?
>
At present, no. GID and P_Key resources are created from the bottom
up, either by the stack or by the network. They are kind of not tied
to user processes, unlike mac, vlan and qp, which are more application
driven or administratively driven.

For applications that don't use RDMA-CM, query_device and query_port
will filter out the GID entries based on the network namespace in
which the caller process is running.
It was on my TODO list while we were working on the RoCEv2 and GID
movement changes, but I never got a chance to chase that fix.

One of the ideas I was considering is to create a virtual RDMA device
mapped to a physical device,
and to configure a GID count limit via configfs for each such device.

> Thanks,
> Haggai