Re: [take14 0/3] kevent: Generic event handling mechanism.

From: Evgeniy Polyakov
Date: Thu Aug 31 2006 - 04:00:24 EST


Hello.

Sorry ofr long delay - I was on small vacations.

On Sun, Aug 27, 2006 at 02:03:33PM -0700, Ulrich Drepper (drepper@xxxxxxxxxx) wrote:
> [Sorry for the length, but I want to be clear.]
>
> As promised, before Monday (at least my time), here are my thoughts on
> the proposed kevent interfaces. Not necessarily well ordered:
>
>
> - one point of critique which applied to many proposals over the years:
> multiplexer syscalls a bad, really bad. They are more complicated
> to use at userlevel and in the kernel. We've seen more than once that
> unimplemented functions are not reported correctly with ENOSYS. Just
> use individual syscalls. Adding them is cheap and probably overall
> less expensive than the multiplexer.

Can you convince Christoph?
I do not care about interfaces, but until several people agree on it, I
will not change anything.

> Events to wait for are basically all those with syscalls which can
> potentially block indefinitely:
>
> - file descriptor
> - POSIX message queues (these are in fact file descriptors but
> let's make it legitimate)
> - timer expiration
> - signals (just as sigwait, not normal delivery instead of a handler)
> - futexes (needs a lot more investigation)
> - SysV message queues
> - SysV semaphores
> - bind socket operations (Alan brought this up in a different context)
> - delays (nanosleep/clock_nanosleep, could be done using timers but the
> overhead would likely be too high)
> - process state change (waitpid, wait4, waitid etc)
> - file locking (flock, lockf)
> -

You completely miss AIO here (I talk not about POSIX AIO).

> We might also want to think about
>
> - msync/fsync: Today's wait/no-wait option doesn't allow us to work on
> other things if the sync takes time and we need a real notification
> (i.e., if no-wait cannot be used)
>
>
> The reporting must of course provide the userlevel code with enough
> information to identify the request. For submitting requests we need
> such identification, too, so having unique identifiers for all the
> different event types is necessary. So some extend this is what the
> KEVENT_TIMER_FIRED, KEVENT_SOCKET_RECV, etc constants do. But they
> should be more generic in their names since we need to use them also
> when registering the even. I.e., KEVENT_EVENT_TIMER or so is more
> appropriate.

There are such identifiers.
We have _two_ levels of id
- event type (KEVENT_EVENT_TIMER,
KEVENT_EVENT_POLL, KEVENT_EVENT_AIO and so one, but they are called
without _EVENT_ inside), which is type of origin for given events
- events itself - timer fired, data received, client accepted and so
on.

> Often (most of the time) this ID and the actual descriptor (file
> descriptor, message queue descriptor, signal number, etc) is not
> sufficient. In the POSIX API we therefore usually have a cookie value
> which the userlevel code can provide and which is returned unchanged as
> part of the notification. See the sigev_value member of struct
> sigevent. I think this is the best approach: is compact and it gives
> all the flexibility needed. Userlevel code will store a value or more
> often a pointer in the cookie and can then access additional information
> based of the cookie.

kevents have such "cookies".

> I know there is a controversy around using pointer-sized values in
> kernel structures which are exposed to userlevel. It should be possible
> to work around this. We can simply always use 64-bit values and when
> the data structure is exposed to 32-bit userland code only the first or
> second 32-bit word of the structure is exposed with the name. The other
> word is padding. If planned in from the beginning this should not cause
> any problems at all.

I use union of two 32bit values and pointer to simplify userspace.
It was planned and implemented already.

> Looking at the current struct mukevent, I don't think it is sufficient.
> We need more room for the various types of events. And we shouldn't
> prevent future innovative uses. I suggest to create records of a fixed
> size with sufficient room. Maybe 32 bytes are sufficient but I'd leave
> this open for can until the very end. Members of the structure must be
> - ID if the type of event; type int
> - descriptor (file descriptor, SysV msg descriptors etc); type int
> - user-provided cookie; type uint64_t
> That's only 16 bytes so far but we'll likely need more for some uses.

I use there only id provided by user, it is not his cookie, but it was
done to make strucutre as small as possible.
Think about size of the mapped buffer when there are several kevent
queues - it is all mapped and thus pinned memory.
It of course can be extended.

> Next, the current interfaces once again fail to learn from a mistake we
> made and which got corrected for the other interfaces. We need to be
> able to change the signal mask around the delay atomically. Just like
> we have ppoll for poll, pselect for select (and hopefully soon also
> epoll_pwait for epoll_wait) we need to have this feature in the new
> interfaces.

We able to change kevents atomically.

> I read the description Nicholas Miell produced (the example programs
> aren't available, accessing the URL fails for me) and looked over the
> last patch (take 14).
>
> The biggest problem I see so far is the integration into the existing
> interfaces. kevent notification *really* should be usable as a new
> sigevent type. Whether the POSIX interfaces are liked by kernel folks
> or not, they are what the majority of the userlevel programmers use.
> The mechanism is easily extensible. I've described this in my paper. I
> cannot comment on the complexity of the kernel side but I'd imagine it's
> not much more difficult, just different from what is implemented now.
> Let's learn for a change from the mistakes of the past. The new and
> innovative AIO interfaces never took off because their implementation
> differs so much from the POSIX interfaces. People are interested in
> portable code. So, please, let's introduce SIGEV_KEVENT. Then we
> magically get timer notification etc for free.

Well, I rarely talk about what other people want, but if you strongly
feel, that all posix crap is better than epoll interface, then I can not
agree with you.

It is possible to create additional one using any POSIX API you like,
but I strongly insist on having possibility to use lightweight syscall
interface too.

> The ring buffer interface is not described in Nicholas' description.
> I'm looking at the sources and am a bit baffled. For instance, the
> kevent_user_ring_add_event function simply adds an event without
> determining whether this overwrites an undelivered entry. One single
> index into the buffer isn't sufficient for this anyway. So let me ask
> some questions:
>
> - how is userlevel code supposed to locate events in the buffer? We
> can maintain a separate pointer for the ring buffer (in a separate
> location, which might actually be good for CPU cache reasons). But
> this cannot solve all problems. E.g., if the read pointer is
> initialized to zero (as is the write pointer), the ring buffer fits N
> entries, if now N+1 entries arrive before the first event is handled
> by the userlevel code, how does the userland code know that all ring
> buffer entries are valid? If the code supposed to always scan the
> entire buffer?

Ring buffer _always_ has space for new events until queue is not filled.
So if userspace do not read for too much time it's events and eventually
tries to add new one, it will fail early.

> - we need to signal the ring buffer overflow in some form to the
> userlevel code. What proposals have been made for this? Signals
> are the old and tried mechanism. I.e., one would be allowed to
> associate a signal with each kevent descriptor and receive overflow
> notifications this way. When rt signals are used we event can get
> the kevent descriptor and the possible a user cookie delivered.
> Something like this is needed in case such a kevent queue is used
> in library code where we cannot rely on being the only user for an
> event.

There is no overflow - I do not want to introduce another signal queue
overflow crap here.
And once again - no signals.

> I must admit I haven't spent too much time thinking about the ideal ring
> buffer interface. At OLS there were quite a few people (like Zach) who
> said they did. So, let's solicit advice. I think the kernel AIO
> interface can also provide some info on what not to do.

Sure, I would like to see different design if it is ready.

> One aspect of the interface I did think about: the delay syscall. I
> already mentioned the signal mask issue above. The interface already
> has a timeout value (good!). But we need to specify the semantics quite
> detailed to avoid problems.
>
> What I mean by that is the problem we are facing if there is more than
> one thread waiting for events. If no event is available all threads use
> the delay syscall. If now an event becomes available, what do we do?
> Do we want exactly one thread? This is a problem. The thread might not
> be working on the event after it gets woken (e.g., because the thread
> gets canceled). The result is that there is an event available and no
> other thread gets woken. This can be avoided by requiring that if a
> thread, which got woken from a delay syscall, doesn't use the event, it
> has to wake another thread. But how do we do this?

I can reformulate your words in a different manner. PLease correct me if
I'm wrong.

You basically want to deliver the same event to several users.
But how do you want to achive it with network buffers for example.
When several threads reads from the same socket, they do not obtain the
same data.
So I disagree that we need to deliver some events to several threads. If
you need to wake up several ones when one network socket is ready (I
seriously doubt you want), create per-thread kevent queue and put that
socket into them.
You want to wake several threads on timeout - create per-thread queue
and put there an event.

The more simple interface is, the less problems we will catch when some
tricky configuration is used.

> One possibility I could see is that the delay syscall returns the event
> which caused the thread to be woken. This event is _not_ also reported
> in the ring buffer. Then, if the thread does not use the event, it
> simply requeues it. This will then implicitly wake another delayed thread.
>
> Which brings me to the second point about the current kevent_get_events
> syscall. I don't think the min_nr parameter is useful. Probably we
> should not even allow the kevent queue to be used with different max_nr
> parameters in different thread. If you'd allow this, how would the
> event notification be handled? A waiter with a smaller required number
> of events would always be woken first. I think the number of required
> events should be a property of the kevent object. Then the code would
> create different kevent object if the requirement is different. At the
> very least I'd declare it an error if at any time there are two or more
> threads delayed which have different requirements on the number of
> events. This could provide all the flexibility needed while preventing
> some of the mistakes one can make.

min_nr is used to specify special case "wake up when at least one event
is ready and get all ready ones".

> In summary, I don't think we're at the point where the current
> interfaces are usable. I'd like to see them redesigned and
> reimplemented. The bad news is that I'll not be able to help with the
> coding. The somewhat good news is that I can given some more
> recommendations. In general I still think the text from my OLS paper
> applies:

I can do it.
But I will not, until other core developers acks your proposals.
As I described above I disagree with most of them.

> - one syscall to create a kevent queue. Using a special filesystem like
> take 14 does is OK. But who do you pass parameters like the maximum
> number of expected outstanding events? I think a dedicated syscall is
> better. It also works more reliably since /proc might not be yet

There are no "expected outstanding events", I think it can be a problem.
Currently there is absolute maximum of events, which can not be
increased in real-time.

> mounted when the first user of the interface is started. The result
> should be a file descriptor. At least an object which can be handled
> like a file descriptor when it comes to transmitting it over Unix
> domain sockets. Questions to answer: what happens if you use the
> descriptor with any other interface but the kevent interfaces (I think
> all such calls like dup, read, write, ... should fail).
>
> int kevent_init (int num);

Kevent always provides file descriptor (which is "poll"able) as result
of either opening of special file (like in the latest patchset), or
using special filesystem (which was removed by Christoph).

> - one system call to create the userlevel ring buffer. Simply
> overloading the mmap operation for the special kevent filesystem can
> work so no separate syscall is needed in that case. We need to
> nail down the semantics, though. What happens if more than one mmap
> call is made? Does only the last one count? Does the second one
> fail? Will mremap() work to increase/descrease the size? Will
> mremap() be allowed to be called with MREMAP_MAYMOVE? What if mmap()
> is called from different processes (in the POSIX sense, i.e., from
> different address spaces)?
>
> Either
>
> mmap(...)
>
> Or
>
> int kevent_map_ringbuf (int kfd, size_t num)

Each subsequent mmap will mmap existing buffers, first one mmap can
create that buffer.

> - one interface to set additional parameters. This is likely mostly to
> make the interfaces safe for the future. Perhaps the number of events
> needed per delay call should be set this way.
>
> int kevent_ctl (int kfd, int cmd, ...)
>
>
> - one interface to shut the kevent down. This might be overkill. We
> should be able to use munmap() and close(). If a real interface for
> this would be created it should look like this
>
> int kevent_destroy (int kfd, void *ringbuf, size_t num)
>
> I find this rather more cumbersome. Just use close and munmap.
>
>
> - one interface to submit requests.
>
> int kevent_submit (int kfd, struct kevent_event *ev, int flags,
> struct timespec *timeout)
>
> Maybe the flags parameter isn't needed, it's just another way to make
> sure we won't regret the design later. If the ring buffer can fill up
> and this is detected by the kernel (unlike what happens in take 14)

Just a repeat - with current buffer implementation it can not happen -
maximum queue length is a limit for buffer size.

> then the calling thread could be delayed undefinitely. Maybe we even
> have a deadlock if there is only one thread. If only a wait/no-wait
> mode is needed, then use only a flags parameter and no timeout
> parameter.
>
> A special variant should be if ev == NULL the call is taken as a
> request to wake one or more delayed threads.

Well, you propose three different syscalls for threee operations. I use
one with multiplexer. I do not have strong opinion on how it must be
done, but I created a policy for such changes - until other developers
ack such changes, nothing will be done.

> - one interface to delay threads until the next event becomes available.
> No data is transfered along with the call. The event data must be
> read from the ring buffer:
>
> int kevent_wait (int kfd, unsigned ringstate,
> const struct timespec *timeout,
> const sigset_t *sigmask)

Yes, I agree, this is good syscall.
Except signals (no signals, that's the rule) and variable sized timespec
structure. What about putting there u64 number of nanoseconds?

> Wait-mode can be implemented by recognizing timeout==NULL. no-wait
> mode is implemented using timeout->tv_sec==timeout->tv_nsec==0. If
> sigset_t is NULL the signal mask is not changed.
>
> The ringstate parameter is also not present in the take 14 proposal.
> Something like it is necessary to prevent the thread from going to
> sleep while there are events in the ring buffer. It would be very
> wasteful if the kernel would have to keep track of outstanding
> events. This would also mean then handling events would require
> a system call, exactly what the ring buffer approach should prevent.

It is possible to put there a number of the last "acked" kevent, so
kernel will remove all events which were placed into the buffer before
and including this one.

> I think the sequence for waiting for an event should be like this:
>
> + get current ring state
> + check whether any outstanding event in ring buffer
> + if yes, copy data out of ring buffer, mark ring buffer record
> as unused (atomically).
> + if no, call kevent_wait with ring state value
>
> When the kernel delivers a new event it does:
>
> + find place to store event
> + change ring state (might be a simple counter)

What about following:
userspace:
- check ring index, if it differs from stored in userspace, then there
are events between old stored index and new one just read.
- copy events
- call kevent_wait() or other method to show kernel that all events
upto provided in syscall numbers are processed, and thus kernel can
remove them and put there new ones.

kernelspace:
- when new kevent is added, it guarantees that there is a place for it
in kernel ring buffer
- when event is ready it is copied into mapped buffer and index of the
"last ready" is increased (it is fully atomic operation)
- when userspace calls kevent_wait() kernel get ring index from
syscall, searches for all events upto provided number and free them
(or rearm)


Except kevent_wait() imeplementation it is how it is implemented right
now.

> The kevent_wait implementation in the kernel would then as the first
> thing determine whether the ring state changed. If yes, the syscall
> returns immediate with -ENWOULDBLOCK. Otherwise it is queued for
> waiting.
>
> With these steps and the requirement that all ring buffer entries are
> processed FIFO we can
> a) avoid syscalls to avoid freeing ring buffer entries
> b) detect overflows in the ring buffer
> c) can maintain the read pointer at userlevel while the kernel can
> maintain the write pointer into the buffer

As shown above it is already implemented.

> --
> â Ulrich Drepper â Red Hat, Inc. â 444 Castro St â Mountain View, CA â
>



--
Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/