Re: [RFC PATCH] poll(): add poll_wait_set_exclusive()

From: Mathieu Desnoyers
Date: Thu Oct 07 2010 - 14:07:45 EST


* Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
> On Thu, 2010-10-07 at 13:07 -0400, Mathieu Desnoyers wrote:
> > * Steven Rostedt (rostedt@xxxxxxxxxxx) wrote:
> > > On Wed, 2010-10-06 at 15:04 -0400, Mathieu Desnoyers wrote:
> > >
> > > > For reference, here is the use-case: The user-space daemon typically runs one
> > > > thread per cpu, each with a handle on many file descriptors. Each thread waits
> > > > for data to be available using poll(). In order to follow the poll semantic,
> > > > when data becomes available on a file descriptor, the kernel wakes up all
> > > > threads at once, but in my case only one of them will successfully consume the
> > > > data (all other threads' splice or read calls will fail with -ENODATA). With many
> > > > threads, these useless wakeups add an unwanted overhead and scalability
> > > > limitation.
> > >
> > > Mathieu, I'm curious to why you have multiple threads reading the same
> > > fd. Since the threads are per cpu, does the fd handle all CPUs?
> >
> > The fd is local to a single ring buffer (which is per-cpu, transporting a group
> > of events). The threads consuming the file descriptors are approximately per
> > cpu, modulo cpu hotplug events, user preferences, etc. I would prefer not to
> > make that a strong 1-1 mapping (with affinity and all), because a typical
> > tracing scenario is that a single CPU is heavily used by the OS (thus producing
> > trace data), while other CPUs are idle, available to pull the data from the
> > buffers. Therefore, I strongly prefer not to affine reader threads to their
> > "local" buffers in the general case. That being said, it could be kept as an
> > option, since it might make sense in some other use-cases, especially with tiny
> > buffers, where it makes sense to keep locality of reference in the L2 cache.
>
> I never mentioned affinity. As with trace-cmd, it assigns a process per
> CPU, but those processes can be on any CPU that the scheduler chooses. I
> could probably do it with a single process reading all the CPU fds too.
> I might add that as an option.

Your scheme works fine because you have only one stream (and thus one fd) per
cpu. How would you map that with many streams per cpu?

Also, you might want to consider using threads rather than processes, to avoid
the unnecessary address-space (mm) switches between readers.
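
To make the "many streams per cpu" case above concrete, here is a rough
userspace sketch (nr_cpus, nr_streams and open_stream_fd() are made-up names
standing in for whatever opens the per-cpu buffers, not actual daemon code):
every (cpu, stream) pair gets its own fd, and all of them end up in a single
pollfd array shared by the consumer threads.

/*
 * Illustrative only: one fd per (cpu, stream) pair, all of them placed
 * in a single pollfd array that the consumer threads poll.
 * open_stream_fd() is a hypothetical helper standing in for whatever
 * opens the per-cpu buffer in the real daemon.
 */
#include <poll.h>
#include <stdlib.h>

int open_stream_fd(int cpu, int stream);	/* hypothetical helper */

static struct pollfd *build_poll_set(int nr_cpus, int nr_streams)
{
	struct pollfd *fds;
	int cpu, stream, i = 0;

	fds = calloc((size_t)nr_cpus * nr_streams, sizeof(*fds));
	if (!fds)
		return NULL;
	for (cpu = 0; cpu < nr_cpus; cpu++) {
		for (stream = 0; stream < nr_streams; stream++) {
			fds[i].fd = open_stream_fd(cpu, stream);
			fds[i].events = POLLIN;
			i++;
		}
	}
	return fds;
}

With that layout there is no fixed thread-to-fd assignment; whichever thread
wins the wakeup race drains the buffer.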

>
> >
> > > Or do you have an fd per event per CPU, in which case the threads should just
> > > poll off of their own fds.
> >
> > I have one fd per per-cpu buffer, but there can be many per-cpu buffers, each
> > transporting a group of events. Therefore, I don't want to associate one thread
> > per event group, because this would be a resource waste. Typically, only a few
> > per-cpu buffers will be very active, and others will be very quiet.
>
> Let's not talk about threads, what about fds? I'm wondering why you have
> many threads on the same fd?

That's because I have fewer threads than file descriptors. So I can choose to
either:

1) somehow assign each thread to many fds statically or
2) make each thread wait for data on all fds

Option (2) adapts much better to workloads where a lot of data comes from many
file descriptors on a single CPU: all threads can work collaboratively to
extract the data.
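
For illustration, a rough sketch of what option (2) looks like on the consumer
side (not actual daemon code; the fds are assumed to have been opened and
filled into the array elsewhere). A thread that loses the race for a given
buffer simply sees a failed read and moves on; those are exactly the redundant
wakeups poll_wait_set_exclusive() is meant to avoid.

/*
 * Illustrative sketch of option (2): every consumer thread polls the
 * complete set of buffer fds.  With the current wake-all semantics of
 * poll(), several threads can be woken up for the same fd, but only one
 * of them finds data; the others see an empty or failed read (e.g.
 * -ENODATA from splice() on the ring buffer) and go back to poll().
 */
#include <errno.h>
#include <poll.h>
#include <unistd.h>

static void consume_loop(struct pollfd *fds, int nr_fds)
{
	char buf[4096];
	int i;

	for (;;) {
		int ready = poll(fds, nr_fds, -1);

		if (ready < 0) {
			if (errno == EINTR)
				continue;
			return;
		}
		for (i = 0; i < nr_fds; i++) {
			if (!(fds[i].revents & POLLIN))
				continue;
			/*
			 * Another woken thread may already have drained
			 * this buffer; a read that returns no data is
			 * harmless here, just wasted work.
			 */
			if (read(fds[i].fd, buf, sizeof(buf)) > 0) {
				/* hand the data off for processing */
			}
		}
	}
}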

Thanks,

Mathieu

>
> -- Steve
>
>
>

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com