Re: [Patch][RFC] Disabling per-tgid stats on task exit in taskstats

From: Shailabh Nagar
Date: Mon Jul 03 2006 - 11:01:35 EST


Paul Jackson wrote:

Shailabh wrote:


Sends a separate "registration" message with cpumask to listen to. Kernel stores (real) pid and cpumask.



Question:
=========

Ah - good.

So this means that I could configure a system with a fork/exit
intensive, performance critical job on some dedicated CPUs, and be able
to collect taskstat data from tasks exiting on the -other- CPUS, while
avoiding collecting data from this special job, thus avoiding any
taskstat collection performance impact on said job.

If I'm understanding this correctly, excellent.


Yes. If no one registers to listen on a particular CPU, data from tasks
exiting on that CPU is not sent out at all.
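
To make the filtering rule concrete, here is a toy user-space model of it,
not the kernel data structure from the patch. The class name
ListenerRegistry and its methods are hypothetical, invented for this
illustration; the only behavior taken from the discussion is that a
listener is stored as a pid plus a set of CPUs, and that an exit on a CPU
with no registered listener produces no recipients.

```python
class ListenerRegistry:
    """Toy model: map each listener pid to the set of CPUs it listens on."""

    def __init__(self):
        # pid -> set of CPU numbers (hypothetical layout, for illustration)
        self._listeners = {}

    def register(self, pid, cpus):
        """Add the given CPUs to the set this pid listens on."""
        self._listeners.setdefault(pid, set()).update(cpus)

    def deregister(self, pid, cpus):
        """Remove the given CPUs; forget the pid once nothing is left."""
        remaining = self._listeners.get(pid, set()) - set(cpus)
        if remaining:
            self._listeners[pid] = remaining
        else:
            self._listeners.pop(pid, None)

    def recipients(self, exit_cpu):
        """Return the pids that should receive stats for an exit on exit_cpu."""
        return sorted(pid for pid, cpus in self._listeners.items()
                      if exit_cpu in cpus)


registry = ListenerRegistry()
registry.register(1234, {0, 1})      # listener interested in CPUs 0 and 1
print(registry.recipients(0))        # pid 1234 is listening here
print(registry.recipients(2))        # nobody registered: nothing is sent
```

In this model, the dedicated CPUs of a performance-critical job are simply
left out of every registration, so exits there yield an empty recipient
list.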

Caveat:
=======

Passing cpumasks across the kernel-user boundary can be tricky.

Historically, Unix has a long tradition of bollixing up the passing
of variable length data types across the kernel-user boundary.

We've got perhaps a half dozen ways of getting these masks out of the
kernel, and three ways of getting them (or the similar nodemasks) back
into the kernel. The three ways being used in the sched_setaffinity
system call, the mbind and set_mempolicy system calls, and the cpuset
file system.

All three of these ways have their controversial details:
* The kernel cpumask mask size needed for sched_setaffinity calls is
not trivially available to userland.
* The nodemask bit size is off by one in the mbind and set_mempolicy
calls.
* The CPU and Node masks are ascii, not binary, in the cpuset calls.

One option that might make sense for these task stat registrations
would be to:
1) make the kernel/sched.c get_user_cpu_mask() routine generic,
moving it to non-static lib/*.c code, and
2) provide a sensible way for user space to query the size of
the kernel cpumask (and perhaps nodemask while you're at it.)

Currently, the best way I know for user space to query the kernel's
cpumask and nodemask size is to examine the length of the ascii
string values labeled "Cpus_allowed:" and "Mems_allowed:" in the file
/proc/self/status. These ascii strings always require exactly nine
ascii chars to express each 32 bits of kernel mask code, if you include
in the count the trailing ',' comma or '\n' newline after each eight
ascii character word.
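
The nine-characters-per-32-bits rule can be sketched in user space as
follows. This is only an illustration, assuming the format described
above, in which every word is printed as exactly eight hex digits; the
function names are made up for this example, and a kernel that prints a
shorter leading word would make the result an overestimate.

```python
def mask_bits_from_status_value(value):
    """Infer the mask width in bits from a Cpus_allowed-style value.

    Each 32-bit word takes eight hex digits plus one trailing ',' or
    newline, so nine characters stand for 32 bits of mask.
    """
    text = value.strip()
    # Add one for the trailing newline that strip() removed.
    words = (len(text) + 1) // 9
    return 32 * words


def kernel_mask_bits(field="Cpus_allowed", path="/proc/self/status"):
    """Look up one mask field in /proc/self/status and return its width."""
    with open(path) as status:
        for line in status:
            name, _, value = line.partition(":")
            if name == field:
                return mask_bits_from_status_value(value)
    return None  # field not present on this system


print(mask_bits_from_status_value("ffffffff,ffffffff\n"))  # two words
```

As noted above, a caller that cares about performance would cache the
result rather than re-read the file on every use.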

Probing /proc/self/status fields for these mask sizes is rather
unobvious and indirect, and requires caching the result if you care at
all about performance. Userland code in support of your taskstat
facility might be better served by a more obvious way to size cpumasks.

... unless of course you're inclined to pass cpumasks formatted as
ascii strings, in which case speak up, as I'd be delighted to
throw in my 2 cents on how to do that ;).


Thanks for the size info. I did hit it while coding this up.

So I chose to use the "cpulist" ascii format that has been helpfully provided in include/linux/cpumask.h (by whom I wonder :-)

User space specifies the cpumask as an ascii string containing comma-separated
cpu ranges. The kernel parses it and stores it as a cpumask_t, after which we
can iterate over the mask using standard helpers.
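
For readers unfamiliar with the format, a string of comma-separated cpu
ranges looks like "0-3,8,10-11". The sketch below shows one way to parse
such a string in user space; it is not the kernel's parser, the function
name parse_cpulist is chosen for this example, and its handling of empty
input and malformed ranges is an assumption rather than documented kernel
behavior.

```python
def parse_cpulist(text):
    """Parse a string such as "0-3,8,10-11" into a set of CPU numbers."""
    text = text.strip()
    cpus = set()
    if not text:
        return cpus  # assumption: treat an empty string as an empty mask
    for part in text.split(","):
        first_str, dash, last_str = part.partition("-")
        first = int(first_str)
        # A lone number such as "8" is a range of one CPU.
        last = int(last_str) if dash else first
        if first < 0 or last < first:
            raise ValueError("bad cpu range: %r" % part)
        cpus.update(range(first, last + 1))
    return cpus


print(sorted(parse_cpulist("0-3,8,10-11")))
```

Iterating over the resulting set plays the role that the standard cpumask
helpers play on the kernel side.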

Since registration/deregistration is not a common operation, the overhead of
parsing ascii strings should be acceptable, and it avoids the hassle of trying
to determine the kernel cpumask size. I don't know if there are buffer overflow
issues in passing a string (though I'm using the standard netlink way of
passing it, NLA_STRING).

Will post the patch shortly.

--Shailabh