Re: [1/1] CBUS: new very fast (for insert operations) message busbased on kenel connector.

From: Evgeniy Polyakov
Date: Fri Apr 01 2005 - 08:12:23 EST


On Fri, 2005-04-01 at 03:20 -0800, Andrew Morton wrote:
> Evgeniy Polyakov <johnpol@xxxxxxxxxxx> wrote:
> >
> > On Fri, 2005-04-01 at 02:30 -0800, Andrew Morton wrote:
> > > Evgeniy Polyakov <johnpol@xxxxxxxxxxx> wrote:
> > > >
> > > > > > keventd does very hard jobs on some of my test machines which
> > > > > > for example route big amount of traffic.
> > > > >
> > > > > As I said - that's going to cause _your_ kernel thread to be slowed down as
> > > > > well.
> > > >
> > > > Yes, but it does not solve peak performance issues - all scheduled
> > > > jobs can run one after another which will decrease insert performance.
> > >
> > > Well the keventd handler would simply process all currently-queued
> > > messages. It's not as if you'd only process one event per keventd callout.
> >
> > But that will hurt insert performance.
>
> Why? All it involves is one schedule_work() on the insert side. And that
> will involve just a single test_bit() in the great majority of cases
> because the work will already be pending.

Here is example:
schedule_work();
keventd->cbus_process(), which has 2 variants:
1. process all pending events.
2. process only number of them.

In the first case we will hurt very noticeble insert performance,
because actual delivery takes some time, so one process will
take time_for_one_delivery * number_of_events_to_be_delivered,
since in a peak number_of_events_to_be_delivered may be very high,
it will take too much time to flush the event queue and deliver all
messages.

In the second case we finish our work in predictible time,
but it can not help us with the keventd, which may [and is] cought
new schedule_work(), and thus will run cbus_process() again
without time window after previous delivering.

That time window is _very_ helpfull for the insert performance
and thus low latencies.

> > Processing all messages without splitting them up into pieces noticebly
> > slows insert operation down.
>
> What does "splitting them up into pieces" mean? They're single messages
> end-to-end. You've been discussing batching of messages, which is the
> opposite?

There is a queue of single event messages, if we process them all in one
shot,
then there is no time to insert new event [each CPU is running keventd
thread]
untill all previous are sent.
So we split that queue into pieces of [currently] 10 messages in each,
and send only them, then we sleep for some time, in which new inserts
can be completed, then process next 10...

> > > (please remind me why cbus exists, btw. What functionality does it offer
> > > which the connector code doesn't?)
> >
> > The main goal of CBUS is insert operation performance.
> > Anyone who wants to have maximum delivery speed should use direct
> > connector's
> > methods instead.
>
> Delivery speed is not the same thing as insertion speed. All the insertion
> does is to place the event onto an internal queue. We still need to do a
> context switch and send the thing. Provided there's a reasonable amount of
> batching, the CPU consumption will be the same in both cases.

Sending is slow [in comparison to insertion], so it can be deferred
for better latency.
Context switch and actuall sending will happen after main function(like
fork(),
or any other in fast path) is already finished, so we do not hurt it's
performance.

> > > > > Introducing an up-to-ten millisecond latency seems a lot worse than some
> > > > > reduction in peak bandwidth - it's not as if pumping 100000 events/sec is a
> > > > > likely use case. Using prepare_to_wait/finish_wait will provide some
> > > > > speedup in SMP environments due to reduced cacheline transfers.
> > > >
> > > > It is a question actually...
> > > > If we allow peak processing, then we _definitely_ will have insert
> > > > performance degradation, it was observed in my tests.
> > >
> > > I don't understand your terminology. What is "peak processing" and how
> > > does it differ from maximum insertion rate?
> > >
> > > > The main goal of CBUS was exactly insert speed
> > >
> > > Why?
> >
> > To allow connector usage in a really fast pathes.
> > If one cares about _delivery_ speed then one should use cn_netlink_send
> > ().
>
> We care about both insertion and delivery! Sure, simply sticking the event
> onto a queue is faster than delivering it. But we still need to deliver it
> sometime so there's no aggregate gain. Still confused.

If one needs low latency events in peaks - use CBUS, it will smooth
shapes, since actual delivering will be postponed.

There is no _aggregate_ gain - only immediate low latenciy with work
deferring.

> > > > - so
> > > > it somehow must smooth shape performance peaks, and thus
> > > > above budget was introdyced.
> > >
> > > An up-to-ten-millisecond latency between the kernel-side queueing of a
> > > message and the delivery of that message to userspace sounds like an
> > > awfully bad restriction to me. Even one millisecond will have impact in
> > > some scenarios.
> >
> > If you care about delivery speed stronger than insertion,
> > then you do not need to use CBUS, but cn_netlink_send() instead.
> >
> > I will test smaller values, but doubt it will have better insert
> > performance.
>
> I fail to see why it is desirable to defer the delivery. The delivery has
> to happen some time, and we've now incurred additional context switch
> overhead.

Yes, but if we care about low latencies, so we MUST defer work, even
if later it will cost additional context switch.

> > Concider fork() connector - it is better for userspace
> > to return from system call as soon as possible, while information
> > delivering
> > about that event will be delayed.
>
> Why? The amount of CPU required to deliver the information regarding a
> fork is the same in either case. Probably more, in the deferred case.

But latency is much smaller.

> > Noone says that queueing is done in much higher rate then delivering,
> > it only shapes shart peaks when it is unacceptably to wait untill
> > delivery
> > is finished.
>
> Maybe cbus gave better lmbench numbers because the forking was happening on
> one CPU and the event delivery was pushed across to the other one. OK for
> a microbenchmark, but dubious for the real world.
>
> I can see that there might be CPU consumption improvements due to the
> batching of work and hence more effective CPU cache utilisation. That
> would need to be carefully measured, and you'd get the same effect by
> simply doing a list_add+schedule_work for each insertion.

Andrew, CBUS is not intended to be faster than connector itself,
it is just not possible, since it calls connector's methods
with some preparation, which takes time.

CBUS was designed to provide very fast _insert_ operation.
It is needed not only for fork() accounting, but for any
fast path, when we do not want to slow process down just to
send notification about it, instead we can create such a notification,
and deliver it later.
Why do we defer all work from HW IRQ into BH context?
Because while we are in HW IRQ we can not perform other tasks,
so with connector and CBUS we have the same situation.
While we are sending a low priority event, we stops actuall work,
which is not acceptible in many situations.

> > > > > > I did not try that case with the keventd but with one kernel thread
> > > > > > it was tested and showed worse performance.
> > > > >
> > > > > But your current implementation has only one kernel thread?
> > > >
> > > > It has a budget and timeout between each bundle processing.
> > > > keventd does not allow to create such a timeout between each bundle
> > > > processing.
> > >
> > > Yes, there's batching there. But I don't understand why the ability to
> > > internally queue events at a high rate is so much more important than the
> > > latency which that batching will introduce.
> >
> > With such mechanism we may use event notification in a really fast
> > pathes.
> > We always defer actuall work processing out from HW interrupts,
> > although for the task running from that interrupt it would be better
> > to steal the whole CPU and run there [in HW irq] until it finishes.
>
> I agree that there are advantages to queueing the event and sending it
> later: it allows events to be sent from IRQ context and is more CPU-cache
> friendly.


The main idea is not to speed up message delivering,
but to not hurt existing fast pathes, when we insert some
notification delivering into it.

> But the up-to-ten millisecond latency probably makes this thing pretty
> useless for everything except the fork() application.
>
> And queueing up arbitrary numbers of objects in this manner is a design
> problem: if the generator is generating events faster than the kernel
> thread can deliver them, we'll kill the machine.

Yes, it is known issue, I introduced queue length in the CBUS for that too,
I think when it exceeds some threshold, CBUS will just directly call
cn_netlink_send(), and thus remove possible DOS.
Or do you have other ideas?

> > > (And keventd _does_ allow such batching. schedule_delayed_work()).
> >
> > And it is still unacceptible, since there may be too many events
> > delayed to the same time, so they all will be processed in the same
> > time,
> > which will hurt insert operation performance.
>
> Why? It's equivalent to the msleep(10).

But there will be several works, one will sleep,
but others will work without time window.

time [msec] work
0 schedule_delayed_work(w1, 10);
0 schedule_delayed_work(w2, 10);
0 schedule_delayed_work(w3, 10);
0 schedule_delayed_work(w4, 10);
0 schedule_delayed_work(w5, 10);
0 schedule_delayed_work(w6, 10);
...
10 cbus_process();
10+time*1 cbus_process();
10+time*2 cbus_process();
10+time*3 cbus_process();
10+time*4 cbus_process();
10+time*5 cbus_process();

There will be no sleep between cbus_process() and next call,
so there will not be any timewindow for others.


Or do you mean call schedule_delayed_process() only from cbus_process()?
It can be a case, but it will not allow convenient manipulation,
like removing CBUS - it will require flushing all keventd works
just to one cbus_process() call.
And it was prooven that having only one sending thread hurts
performance smaller.

I tested it with SMP kernel on 1-way and 2-way systems.

--
Evgeniy Polyakov

Crash is better than data corruption -- Arthur Grabowski

Attachment: signature.asc
Description: This is a digitally signed message part