Re: [1/1] CBUS: new very fast (for insert operations) message bus based on kernel connector.

From: Evgeniy Polyakov
Date: Fri Apr 01 2005 - 05:51:53 EST


On Fri, 2005-04-01 at 02:30 -0800, Andrew Morton wrote:
> Evgeniy Polyakov <johnpol@xxxxxxxxxxx> wrote:
> >
> > > > keventd does very hard jobs on some of my test machines which
> > > > for example route big amount of traffic.
> > >
> > > As I said - that's going to cause _your_ kernel thread to be slowed down as
> > > well.
> >
> > Yes, but it does not solve peak performance issues - all scheduled
> > jobs can run one after another which will decrease insert performance.
>
> Well the keventd handler would simply process all currently-queued
> messages. It's not as if you'd only process one event per keventd callout.

But that will hurt insert performance.
Processing all messages at once, without splitting them into pieces, noticeably
slows the insert operation down.

> > > I mean, it's just a choice between two ways of multiplexing the CPU. One
> > > is via a context switch in schedule() and the other is via list traversal
> > > in run_workqueue(). The latter will be faster.
> >
> > But with a separate thread one can control the execution process.
> > If it is called from a work queue, insert requests
> > can appear one after another in a very short interval,
> > so their processing will hurt insert performance.
>
> Why does insert performance matter so much? These things still have to be
> sent up to userspace.
>
> (please remind me why cbus exists, btw. What functionality does it offer
> which the connector code doesn't?)

The main goal of CBUS is insert operation performance.
Anyone who wants maximum delivery speed should use the connector's
methods directly instead.

> > > > > Plus keventd is thread-per-cpu and quite possibly would be faster.
> > > >
> > > > I experimented with several usage cases for CBUS and it was proven
> > > > to be the fastest case when only one sending thread exists which manages
> > > > only very limited amount of messages at a time [like 10 in CBUS
> > > > currently]
> > >
> > > Maybe that's because the cbus data structures are insufficiently
> > > percpuified. On really big machines that single kernel thread will be a
> > > big bottleneck.
> >
> > It is not because of the messages themselves, but because of their peaks.
> > If there is a peak, the above mechanism will smooth it into
> > several pieces [for example 10 in each bundle; that value should be
> > changeable at run time, I will put it into the TODO].
> > With keventd there is no guarantee that the next peak will be processed
> > after some timeout rather than right after the current one.
>
> keventd should process all the currently-queued messages. If messages are
> being queued quickly then that will be a lot of messages on each keventd
> callout.

But for maximum _insert_ performance, no one should process _all_ messages
at once, even if there are many of them.
One needs to split a peak of messages into pieces and process each
piece after the previous one, with some timeout in between.
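
To illustrate the budget idea, here is a minimal userspace sketch. It is
not the CBUS code: the names (msg_queue, queue_insert,
queue_process_bundle, drain_in_bundles) and the ring-buffer layout are
assumptions made only for this example. Insert just stores the message
and returns; the sender delivers at most `budget` messages per bundle.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 1024  /* illustrative capacity, not taken from CBUS */

/* Hypothetical stand-in for a CBUS queue: a fixed ring buffer of message ids. */
struct msg_queue {
    int    buf[QUEUE_CAP];
    size_t head;  /* next slot to read */
    size_t tail;  /* next slot to write */
    size_t len;   /* messages currently queued */
};

/* Insert is O(1): store the message and return; delivery happens later. */
static int queue_insert(struct msg_queue *q, int msg)
{
    if (q->len == QUEUE_CAP)
        return -1;  /* queue full, the caller decides what to do */
    q->buf[q->tail] = msg;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->len++;
    return 0;
}

/*
 * Deliver at most `budget` messages, then stop even if more are queued.
 * Returns the number of messages delivered in this bundle.
 */
static size_t queue_process_bundle(struct msg_queue *q, size_t budget,
                                   void (*deliver)(int msg))
{
    size_t done = 0;

    assert(budget > 0);
    while (done < budget && q->len > 0) {
        deliver(q->buf[q->head]);
        q->head = (q->head + 1) % QUEUE_CAP;
        q->len--;
        done++;
    }
    return done;
}

/* Trivial delivery callback that only counts what it was given. */
static size_t delivered_total;

static void count_delivery(int msg)
{
    (void)msg;
    delivered_total++;
}

/* Drain the queue bundle by bundle and return how many bundles it took. */
static size_t drain_in_bundles(struct msg_queue *q, size_t budget)
{
    size_t bundles = 0;

    while (q->len > 0) {
        queue_process_bundle(q, budget, count_delivery);
        bundles++;
        /* a real sender thread would sleep for its timeout here */
    }
    return bundles;
}
```

With a peak of 35 queued messages and a budget of 10, the sender needs
four bundles (10, 10, 10, 5), and inserts arriving between bundles never
wait behind more than one bundle of delivery work.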

> > > Introducing an up-to-ten millisecond latency seems a lot worse than some
> > > reduction in peak bandwidth - it's not as if pumping 100000 events/sec is a
> > > likely use case. Using prepare_to_wait/finish_wait will provide some
> > > speedup in SMP environments due to reduced cacheline transfers.
> >
> > It is a question actually...
> > If we allow peak processing, then we _definitely_ will have insert
> > performance degradation, it was observed in my tests.
>
> I don't understand your terminology. What is "peak processing" and how
> does it differ from maximum insertion rate?
>
> > The main goal of CBUS was exactly insert speed
>
> Why?

To allow connector usage in really fast paths.
If one cares about _delivery_ speed then one should use cn_netlink_send().

> > - so
> > it somehow must smooth performance peaks, and thus
> > the above budget was introduced.
>
> An up-to-ten-millisecond latency between the kernel-side queueing of a
> message and the delivery of that message to userspace sounds like an
> awfully bad restriction to me. Even one millisecond will have impact in
> some scenarios.

If you care about delivery speed more than insertion speed,
then you do not need CBUS; use cn_netlink_send() instead.

I will test smaller values, but doubt it will have better insert
performance.

> > It is similar to NAPI in some abstract way, but with different aims -
> > NAPI is for speed improvement, but here we have peak smoothing.
> >
> > > > If too many deferred insert works will be called simultaneously
> > > > [which may happen with keventd] it will slow down insert operations
> > > > noticeably.
> > >
> > > What is a "deferred insert work"? Insertion is always synchronous?
> >
> > Insert is synchronous on one CPU, but the actual message delivery is
> > deferred.
>
> OK, so why does it matter that "If too many deferred insert works will be
> called [which may happen with keventd] it will slow down insert operations
> noticeably"?
>
> There's no point in being able to internally queue messages at a higher
> frequency than we can deliver them to userspace. Confused.

Consider the fork() connector - it is better for userspace
to return from the system call as soon as possible, while delivery of
information about that event is delayed.
No one says that queueing is done at a much higher rate than delivery;
it only shapes sharp peaks when it is unacceptable to wait until
delivery is finished.

> > > > I did not try that case with the keventd but with one kernel thread
> > > > it was tested and showed worse performance.
> > >
> > > But your current implementation has only one kernel thread?
> >
> > It has a budget and timeout between each bundle processing.
> > keventd does not allow to create such a timeout between each bundle
> > processing.
>
> Yes, there's batching there. But I don't understand why the ability to
> internally queue events at a high rate is so much more important than the
> latency which that batching will introduce.

With such a mechanism we may use event notification in really fast
paths.
We always defer the actual work out of HW interrupts,
although for the task running from that interrupt it would be better
to steal the whole CPU and run there [in HW irq] until it finishes.

But if someone does not care about latencies it still may use
direct connector's method cn_netlink_send().

> (And keventd _does_ allow such batching. schedule_delayed_work()).

And it is still unacceptable, since there may be too many events
delayed to the same time, so they will all be processed at the same
time, which will hurt insert operation performance.
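
A rough way to see the difference, as a simplified model rather than a
measurement: assume time is divided into ticks, and compare the peak
amount of delivery work done in one tick. The function names and the
tick model are assumptions for this sketch only. If every event is
delayed by the same fixed amount, a burst inserted in one tick expires
in one tick, so the peak is simply the largest burst. With a budget, the
peak per tick is capped and the remainder is carried over.

```c
#include <stddef.h>

/*
 * Model of a fixed per-event delay: the arrival pattern is only shifted
 * in time, so the busiest tick handles the largest burst in full.
 */
static size_t peak_fixed_delay(const size_t *arrivals, size_t nticks)
{
    size_t peak = 0;

    for (size_t t = 0; t < nticks; t++)
        if (arrivals[t] > peak)
            peak = arrivals[t];
    return peak;
}

/*
 * Model of budgeted delivery: each tick handles at most `budget` events
 * and carries the rest over. Stores the events still queued after the
 * last tick in *backlog_out and returns the busiest tick's work.
 */
static size_t peak_budgeted(const size_t *arrivals, size_t nticks,
                            size_t budget, size_t *backlog_out)
{
    size_t backlog = 0;
    size_t peak = 0;

    for (size_t t = 0; t < nticks; t++) {
        size_t work;

        backlog += arrivals[t];
        work = backlog < budget ? backlog : budget;
        backlog -= work;
        if (work > peak)
            peak = work;
    }
    *backlog_out = backlog;
    return peak;
}
```

For a single burst of 35 events and a budget of 10, the fixed-delay
model does all 35 in one tick, while the budgeted model never does more
than 10 per tick and finishes the burst over four ticks. The price is
the extra delivery latency discussed above.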

--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski
