Re: [PATCH 2/4] staging/lustre/obdclass: read jobid from proc

From: Oleg Drokin
Date: Tue Feb 04 2014 - 12:28:10 EST

Next message: Mark Brown: "Re: [PATCH 4/6] regulator: add bcm59056 regulator driver"
Previous message: Grant Likely: "RE: [PATCH] of: fix of_update_property()"
In reply to: Greg Kroah-Hartman: "Re: [PATCH 2/4] staging/lustre/obdclass: read jobid from proc"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hello!

On Feb 4, 2014, at 11:57 AM, Greg Kroah-Hartman wrote:

> What exactly are you doing here? Calling out to userspace for what
> information? And how are you going to handle namespaces and containers
> by doing that? Are you going to block in the kernel for this
> information?
>
> What are you trying to solve with this code in the first place?

So, as an overview of a feature:

When you have tens (or even hundreds) of thousands of nodes doing IO,
it's no longer practical to tell them apart individually, or in
network-related groups based on their address, on the server side
(for purposes like monitoring and load control).
Since such systems are usually managed by some sort of a job/batch scheduler,
it's much more natural to organize them into "jobs" as known to the
job scheduler instead. This job id information is harvested and added to
all RPCs so that the server side can do useful things with it
(like aggregating statistics, identifying pathologically bad workloads,
QoS and so on).
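
To make the server-side use concrete, here is a minimal sketch of per-job
aggregation. It is only an illustration, not the Lustre code: the function
name aggregate_by_jobid and the "jobid"/"bytes" record fields are
hypothetical, and real RPCs are not Python dictionaries.

```python
from collections import defaultdict


def aggregate_by_jobid(rpcs):
    """Sum RPC count and bytes per job id (hypothetical record layout)."""
    stats = defaultdict(lambda: {"rpcs": 0, "bytes": 0})
    for rpc in rpcs:
        entry = stats[rpc["jobid"]]  # the job id tag carried by each RPC
        entry["rpcs"] += 1
        entry["bytes"] += rpc["bytes"]
    return dict(stats)
```

The point is just that the server keys its statistics on the job id rather
than on the address of the client node.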

Most batch schedulers out there let userspace know the JOBID via an
environment variable. So the original implementation just harvested this
info directly from the process environment. I can certainly see why this is not
really desirable.

So, the patch does away with the environment parsing; instead it adds two
new avenues for getting this information:

1. In the vast majority of cases an entire node is dedicated to a single job, so
we just create a /proc variable where you can input the job id from the job scheduler
prologue (and then clear it from an epilogue). Getting the jobid in this case
does not dip into userspace anymore. This also does not block anywhere.

2. Some rarer systems with lots of cores actually seem to subdivide
individual nodes across jobs. Additionally, all systems usually have login/interactive
nodes. While these sorts of interactive nodes do not have jobs scheduled on them, it
still might be useful to distinguish between different user sessions happening there.
This is where the upcalls come into play. The first time a process does IO, an upcall
is made that provides the kernel with a jobid identifier (however it might
want to obtain it, we don't really care at this point). This blocks (with a timeout)
waiting for the reply. The reply is then cached and reused for subsequent IO from the same process.
I did not really think about containers before, but I think it would work out anyway:
I think namespaces and containers still have non-intersecting pid spaces, so we should be
fine in this regard. There is a "master container"/namespace of some sort, I think, that is able to see
the entire pid space, and that's where the upcall would be run (do I need to somehow force that?)
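
The two lookup paths above can be sketched roughly as follows. This is a
userspace illustration under stated assumptions, not the patch itself: the
class JobIdResolver, its method names and the upcall callable are all
hypothetical, and a thread pool stands in for the kernel waiting on a
userspace helper.

```python
import concurrent.futures


class JobIdResolver:
    """Resolve a job id from a node-wide value or a cached per-pid upcall."""

    def __init__(self, upcall, timeout=1.0):
        self.node_jobid = None   # stands in for the /proc variable (case 1)
        self._upcall = upcall    # hypothetical callable: pid -> jobid
        self._timeout = timeout  # how long to wait for the upcall reply
        self._cache = {}         # pid -> jobid, filled on first IO (case 2)
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def set_node_jobid(self, jobid):
        # Scheduler prologue sets this; the epilogue clears it with "".
        self.node_jobid = jobid or None

    def get(self, pid):
        if self.node_jobid:
            return self.node_jobid   # case 1: no upcall, no blocking
        if pid in self._cache:
            return self._cache[pid]  # reuse the earlier upcall reply
        future = self._pool.submit(self._upcall, pid)
        try:
            jobid = future.result(timeout=self._timeout)
        except concurrent.futures.TimeoutError:
            return None              # no reply in time: carry no jobid
        self._cache[pid] = jobid
        return jobid
```

Note that the sketch keys the cache on the pid alone; it does not deal with
pid reuse or with which namespace the pid belongs to, which is exactly the
open question above.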

Bye,
Oleg

