On 1/4/19 12:03 PM, Greg Thelen wrote:
> Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx> wrote:
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it does not have to be strictly sequential. The above script
>>>>>> is just an example to illustrate the pattern. But sometimes it may hit
>>>>>> such a pattern due to the complicated cluster scheduling and container
>>>>>> scheduling in the production environment, for example the creation
>>>>>> process might be scheduled to the same CPU which is doing force_empty.
>>>>>> I have to say I don't know too much about the internals of the
>>>>>> container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloading
>>>>> into the kernel. It is additional code and semantics to maintain.
>>>> Yes, it does introduce some additional code and semantics, but IMHO, it is
>>>> quite simple and very straightforward, isn't it? Just utilize the existing
>>>> css offline worker. And those couple of lines of code do improve
>>>> throughput for some real use cases.
>>> I do not really care that it is a few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not have been clear in the earlier email: force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous
>> force_empty, which can't be figured out by userspace before it actually
>> happens. The scheduler doesn't know how long force_empty would take. If
>> the force_empty could be offloaded by kernel, it would make scheduler's
>> life much easier. This is not something userspace could do.
> If kernel workqueues are doing more work (i.e. force_empty processing),
> then it seems like the time to offline could grow. I'm not sure if
> that's important.

One thing I can think of is that this may slow down the recycling of memcg
IDs, which could exhaust the memcg ID space under some extreme workloads.
But I don't see this as a problem for our workload.

Thanks,
Yang

> I assume that if we make force_empty an async side effect of rmdir then
> user space scheduler would be unable to immediately assume the
> rmdir'd container memory is available without subjecting a new container
> to direct reclaim. So it seems like user space would use a mechanism to
> wait for reclaim: either the existing sync force_empty or polling
> meminfo/etc waiting for free memory to appear.
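
For illustration, the "polling meminfo" option mentioned above might look
roughly like the sketch below. It is only a sketch under stated assumptions:
the helper names (parse_meminfo, wait_for_free_memory), the choice of the
MemFree field, and the timeout and interval values are all hypothetical and
not taken from this thread or from any patch.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path="/proc/meminfo"):
    """Read the raw meminfo text (Linux only)."""
    with open(path) as f:
        return f.read()


def wait_for_free_memory(needed_kb, timeout_s=30.0, interval_s=0.5,
                         read=read_meminfo, field="MemFree"):
    """Poll until `field` is at least needed_kb.

    Returns True once enough memory is reported, False on timeout.
    `read` is injectable so the loop can be exercised without /proc.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if parse_meminfo(read()).get(field, 0) >= needed_kb:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A scheduler could call wait_for_free_memory() after the rmdir and before
placing the next container. Note that this observes system-wide free memory
only, so it cannot tell how much of it came from the removed memcg.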
>
>>>>> I think it is more important to discuss whether we want to introduce
>>>>> force_empty in cgroup v2.
>>>> We would prefer to have it in v2 as well.
>>> Then bring this up in a separate email thread please.
>> Sure. Will prepare the patches later.
>>
>> Thanks,
>> Yang
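
For reference, the userspace offload suggested earlier in the thread (doing
the force_empty in a detached context rather than inline) could be sketched
as below. This is a minimal illustration, not anyone's actual tooling: the
function names are hypothetical, and it assumes the cgroup v1 memory
controller, where writing to memory.force_empty in the memcg directory
triggers reclaim. As noted in the discussion, the caller still does not
know how long the reclaim will take.

```python
import os
import threading


def force_empty(cgroup_dir):
    """Trigger reclaim of the memcg's pages (cgroup v1 interface).

    The write blocks until the kernel has finished its reclaim attempt.
    """
    with open(os.path.join(cgroup_dir, "memory.force_empty"), "w") as f:
        f.write("0")


def retire_cgroup(cgroup_dir, remove=True):
    """Synchronously force_empty the memcg, then optionally rmdir it."""
    force_empty(cgroup_dir)
    if remove:
        os.rmdir(cgroup_dir)


def retire_cgroup_detached(cgroup_dir, remove=True):
    """Run retire_cgroup() in a background thread and return the thread.

    The caller can continue creating the next container immediately and
    join() the returned thread only if it needs to know the work is done.
    """
    worker = threading.Thread(target=retire_cgroup, args=(cgroup_dir, remove))
    worker.start()
    return worker
```

On a real system cgroup_dir would be a memcg directory such as one under
/sys/fs/cgroup/memory (the mount point is an assumption here), and the
process needs permission to write to it.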