Re: [PATCH tty-next 0/4] tty: Fix ^C echo

From: Peter Hurley
Date: Wed Dec 11 2013 - 23:00:00 EST


On 12/04/2013 07:13 PM, One Thousand Gnomes wrote:
>> Not so much confused as simply merged. Input processing is inherently
>> single-threaded; it makes sense to rely on that at the highest level
>> possible.

> I would disagree entirely. You want to minimise the areas affected by a
> given lock. You also want to lock data, not code. Correctness comes before
> speed. You optimise when it's right; otherwise you end up in a nasty
> mess when you discover you've optimised to assumptions that are flawed.

Sorry for the delayed reply, Alan; what little free time I had was spent
snuffing out regressions :/

Sure, I understand that ideally locks protect data, not operations.
But I think you're missing my point: almost every lock, even at
inception, is somewhat optimized; otherwise, every datum would have its
own lock. Eliminating overlapping locks is a common optimization in
stable code.

In this case, an already broken bit of code is merely still broken.
buf->lock is also fairly simple to break apart (although I don't want
to, because of the performance hit), which is not characteristic of
locks that protect operations.


>> Firewire, which is capable of sustained throughput in excess of 40MB/sec,
>> struggles to get over 5MB/sec through the tty layer. [And drm output
>> is orders-of-magnitude slower than that, which is just sad...]

> And for what protocols do you care about 5MB/second - n_tty - no ? For the
> high speed protocols you are trying to fix a lost cause. By the time
> we've gone piddling around with tty buffers and serialized tty queues,
> firing bytes through tasks and the like, you've already lost.
>
> For drm I assume you mean the framebuffer console logic ? Last time I
> benched that, except for the Poulsbo, it was bottlenecked on the GPU - not
> that I can type at 5MB/second anyway. Not that fixing the performance of
> the various bits wouldn't be a good thing too, especially on the output
> end.

For drm, I actually mean GEM object deletion, which is typically fenced
and thus appears to be GPU-bound. What's really needed there is deferred
deletion, like kfree_rcu(), with partial synchronization on allocation
failures only.
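
Something with roughly this shape (a sketch only; the struct and
helpers below are invented stand-ins, not actual drm code):

struct gem_obj {				/* hypothetical stand-in */
	struct kref refcount;
	struct rcu_head rcu;
	/* backing pages, mappings, ... */
};

static void gem_obj_release(struct kref *kref)
{
	struct gem_obj *obj = container_of(kref, struct gem_obj, refcount);

	/* defer the actual free via RCU instead of waiting
	 * on the fence inline
	 */
	kfree_rcu(obj, rcu);
}

/* allocation path: synchronize only when we actually fail */
static struct gem_obj *gem_obj_alloc(void)
{
	struct gem_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

	if (!obj) {
		rcu_barrier();	/* reclaim deferred frees, then retry */
		obj = kzalloc(sizeof(*obj), GFP_KERNEL);
	}
	return obj;
}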

I mostly care about output speed; unfortunately, that's the input side
at the other end :)

>> While that would work, it's expensive extra locking in a path that 99.999%
>> of the time doesn't need it. I'd rather explore other solutions.

> How about getting the high speed paths out of the whole tty buffer
> layer ? Almost every line discipline can be a fastpath directly to the
> network layer. If optimisation is the new obsession then we can cut the
> crap entirely by optimising for networking, not making it a slave of n_tty.
>
> Starting at the beginning:
>
> we have locks on rx because
> - we want serialized rx
> - we have buffer lifetimes
> - we have buffer queues
> - we have loads of flow control parameters
>
> Only n_tty needs the buffers (maybe some of irda, but irda hasn't worked
> for years afaik). IRQ receive paths are serialized (and as a bonus can be
> pinned to a CPU). Flow control is n_tty stuff; everyone else simply fires
> it at their network layer as fast as possible, and net already does the
> work.
>
> Keep a single tty_buf in the tty for batching at any given time, kept
> private so no locks at all.
>
> Have a wrapper via
> ld->receive(tty, buf)
>
> which fires the tty_buf at the ldisc and allocates a new empty one,
>
> tty_queue_bytes(tty, buf, flags, len)
>
> which adds to the buffer, and if full calls ld->queue and then carries on
> the copying cycle,
>
> and
>
> ld->receive_direct(tty, buf, flags, len)
>
> which allows block mode devices to blast bytes directly at the queue (i.e.
> all the USB 3G stuff, firewire, etc.) without going via any additional
> copies.
>
> For almost all ldiscs,
>
> ld->receive would be
>
> ld->receive_direct(tty, buf->buf, buf->flags, buf->len);
> free buffer
>
> For n_tty type stuff,
>
> ld->receive is basically much of tty_flip_buffer_push
>
> ld->receive_direct allocates tty_buffers and copies into them
>
> We may even be able to optimise some of the n_tty cases into the
> fastpath afterwards (notably raw, no echo).
>
> For anything receiving in blocks, that puts us close to (but not quite at)
> ethernet kinds of cleanness for network buffer delivery.
>
> Worth me looking into ?
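
If I follow, the receive split would look something like this (a rough
sketch; the names, signatures, and struct fields are invented for
illustration, not existing tty API):

struct rx_buf {				/* invented stand-in for tty_buf */
	unsigned char *data;
	char *flags;
	int used;
};

struct ldisc_rx_ops {
	/* batched path: ldisc takes ownership of the buffer */
	void (*receive)(struct tty_struct *tty, struct rx_buf *buf);
	/* direct path: block-mode drivers hand bytes over, no extra copy */
	void (*receive_direct)(struct tty_struct *tty,
			       const unsigned char *cp, const char *flags,
			       int count);
};

/* for almost every ldisc, ->receive is just unwrap-and-forward */
static void ldisc_receive_generic(struct tty_struct *tty, struct rx_buf *buf)
{
	const struct ldisc_rx_ops *ops = rx_ops(tty);	/* invented helper */

	ops->receive_direct(tty, buf->data, buf->flags, buf->used);
	rx_buf_free(tty, buf);				/* invented helper */
}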

I have to give this a lot more thought.

The universality of n_tty is important, and costs real cycles on servers and
such. It's not just about typing speed.

The clock/generation method seems like it might yield a lockless solution
for this problem, but it may create another one, because the driver side
would need to stamp the buffer (in essence, a flush could affect data
that has not yet been copied from the driver).

But that data has already arrived in the driver, so it might not matter.
This requires a little thought!
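
Roughly the shape I have in mind (an untested sketch; all names are
invented):

static atomic_t flush_gen = ATOMIC_INIT(0);

struct stamped_buf {
	unsigned int gen;	/* generation when the driver filled it */
	/* ... data ... */
};

/* driver side: stamp each buffer as it is filled */
static void driver_stamp(struct stamped_buf *buf)
{
	buf->gen = atomic_read(&flush_gen);
}

/* flush: bump the generation; everything stamped earlier is stale */
static void ldisc_flush(void)
{
	atomic_inc(&flush_gen);
}

/* ldisc side: discard data that predates the last flush */
static bool buf_is_stale(const struct stamped_buf *buf)
{
	return buf->gen != (unsigned int)atomic_read(&flush_gen);
}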

This is my next experiment.

Regards,
Peter Hurley