Re: [PATCH 2 of 4] Introduce i386 fibril scheduling

From: Ingo Molnar
Date: Fri Feb 02 2007 - 17:28:45 EST



* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Fri, 2 Feb 2007, Ingo Molnar wrote:
> >
> > My only suggestion was to have a couple of transparent kernel threads
> > (not fibrils) attached to a user context that does asynchronous
> > syscalls! Those kernel threads would be 'switched in' if the current
> > user-space thread blocks - so instead of having to 'create' any of them
> > - the fast path would be to /switch/ them to under the current
> > user-space, so that user-space processing can continue under that other
> > thread!
>
> But in that case, you really do end up with "fibrils" anyway.
>
> Because those fibrils are what would be the state for the blocked
> system calls when they aren't scheduled.
>
> We may have a few hundred thousand system calls a second (maybe that's
> not actually reasonable, but it should be what we *aim* for), and 99%
> of them will hopefully hit the cache and never need any separate IO,
> but even if it's just 1%, we're talking about thousands of threads.
>
> I do _not_ think that it's reasonable to have thousands of threads
> state around just "in case". Especially if all those threadlets are
> then involved in signals etc - something that they are totally
> uninterested in.
>
> I think it's a lot more reasonable to have just the kernel stack page
> for "this was where I was when I blocked". IOW, a fibril-like thing.

Ok, I think I've noticed another misunderstanding. The kernel-thread-based
scheme I'm suggesting would /not/ 'switch' to another kernel thread in
the cached case, by default. It would just execute in the original
context (as if it were a synchronous syscall), and the switch to a
kernel thread from the pool would only occur /if/ the context is about
to block. (This 'switch' would be done by the scheduler.) User-space
gets back an -EAIO error code immediately and transparently - but by
then it is already running under the new kernel thread.

I.e. in the fully cached case there would be no scheduling at all - in
fact no thread pool is needed at all.
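
Roughly, in illustrative pseudo-C (every identifier here - the cookie
type, the helpers, -EAIO - is made up purely to show the proposed flow,
none of it is existing kernel code):

        asmlinkage long sys_async(long syscall_nr, long __user *params)
        {
                struct async_cookie *cookie = new_async_cookie(current);
                long ret;

                /* run the syscall synchronously, in the original context */
                ret = do_async_syscall(syscall_nr, params);

                if (!context_blocked(cookie)) {
                        /*
                         * Fully cached case: we never slept, no pool
                         * thread was involved - return the result just
                         * like a plain synchronous syscall would.
                         */
                        return ret;
                }

                /*
                 * We blocked somewhere inside the syscall: at that point
                 * the scheduler flipped a pool thread under our user
                 * context and that thread has already returned -EAIO to
                 * user-space. We are an anonymous worker now - publish
                 * the result and go back to the pool.
                 */
                post_async_completion(cookie, ret);
                return_worker_to_pool();
                /* not reached */
        }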

Regarding memory cost:

The biggest memory cost of a kernel thread (assuming it has no real
user-space context) /is/ its kernel stack page, which is 4K or 8K. The
task struct takes ~1.5K. Once we have a ready kernel thread around, it's
quite cheap to 'flip' it to under any arbitrary user-space context:
change its thread_info->task pointer to the user-space context's task
struct, copy the mm and fs pointers over to the "worker thread", switch
the thread_info, update the pt_regs - done. Hm?

Note: such a 'flip' would only occur when the original context blocks,
/not/ on every async syscall.
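
In illustrative pseudo-C the flip could look something like this (the
helper, the exact sequence and -EAIO are assumptions, not working code;
only the field names mirror the real i386 structures):

        static void flip_worker_under(struct task_struct *task,
                                      struct task_struct *worker)
        {
                /* the worker takes over the user-space identity ... */
                task_thread_info(worker)->task = task;
                worker->mm = task->mm;          /* same address space */
                worker->fs = task->fs;          /* same fs context    */

                /*
                 * ... and the user-visible register state, with the
                 * syscall return value forced to -EAIO so that
                 * user-space knows the call went asynchronous:
                 */
                *task_pt_regs(worker) = *task_pt_regs(task);
                task_pt_regs(worker)->eax = -EAIO;

                /*
                 * 'task' keeps its kernel stack and keeps blocking
                 * inside the syscall; 'worker' returns to user-space
                 * in its place as if nothing had happened.
                 */
        }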

Regarding CPU resource costs, I don't think there should be significant
signal overhead, because the original task still exists only as a single
instance, and the kernel thread that ran off with the blocked kernel
stack is not part of the signal set. (Although it might make sense to
make such async syscalls interruptible, just like any syscall.)

The 'pool' of kernel threads doesn't even have to be per-task; it can be
a natural per-CPU thing - and its size will grow/shrink [with a low
update frequency] depending on how much AIO parallelism there is in the
workload. (But it could also be strictly per-user-context - to make sure
that a proper ->mm, ->fs, etc. is set up and that the async system calls
have all the right context info when they execute.)

And note the immediate scheduling benefits: if an app (say, OpenOffice)
is single-threaded but has certain common ops coded as async syscalls,
then whenever one of those syscalls blocks, the app could utilize /more
than one/ CPU. I.e. we could 'spread' a single-threaded app's processing
to multiple cores/hardware-threads /without/ having to multi-thread the
app in an intrusive way. I.e. this would be a fine-grained threading of
syscalls, executed in essence as coroutines. With fibrils, all sorts of
scheduling limitations occur and no such parallelism is possible.

In fact an app could also /trigger/ the execution of a syscall in a
different context - to create parallelism artificially - without any
blocking event. So we could do:

cookie1 = sys_async(sys_read, params);
cookie2 = sys_async(sys_write, params);

[ ... calculation loop ... ]

wait_on_async_syscall(cookie1);
wait_on_async_syscall(cookie2);

or something like that - without user-space having to create threads
itself, etc. So basically, we'd make kernel threads more useful, and
we'd make threading safer, by only letting syscalls thread.

> What I like about fibrils is that they should be able to handle the
> cached case well: the case where no "real" scheduling (just the fibril
> stack switches) takes place.

The cached case (when a system call would not block at all) would not
necessitate any switch to another kernel thread at all - the task just
executes its system call as if it were synchronous!

That's the nice thing: we can do this switch-to-another-kernel-thread
magic right in the scheduler when we block - and the switched-to thread
will magically return to user-space (with a -EAIO return code) as if
nothing had happened (while the original task blocks). I.e. under the
scheme I'm suggesting we have /zero/ setup cost in the cached case: the
optimistic case just falls through and switches to nothing else. Any
switching cost only occurs in the slowpath - and even that cost is very
low.

Once the kernel thread that ran off with the original stack finishes the
async syscall and wants to report the return code, that code can be
gathered via a special return-code ringbuffer that notifies user-space
of finished syscalls. (A magic cookie is associated with every async
syscall.)
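
For illustration, such a ring could look something like this (the layout
and all the names here are hypothetical, not an existing ABI; the ring
would be shared - e.g. mmap-ed - between the kernel and user-space):

        struct async_completion {
                unsigned long   cookie;         /* which async syscall  */
                long            ret;            /* its return code      */
        };

        struct async_completion_ring {
                unsigned int    head;           /* advanced by the kernel  */
                unsigned int    tail;           /* advanced by user-space  */
                struct async_completion entries[ASYNC_RING_SIZE];
        };

        /* user-space side, collecting whatever has finished so far: */
        while (ring->tail != ring->head) {
                struct async_completion *c =
                        &ring->entries[ring->tail % ASYNC_RING_SIZE];

                handle_completion(c->cookie, c->ret);
                ring->tail++;
        }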

> So the state machine argument is totally bogus - it results in a
> programming model that simply doesn't match the *normal* setup. You
> want the kernel programming model to appear "linear" even when it
> isn't, because it's too damn hard to think nonlinearly.
>
> Yes, we could do pathname lookup with that kind of insane setup too.
> But it would be HORRID!

Yeah, but I guess not nearly as horrid as writing a new OS from scratch
;-)

Seriously, I very much agree that programming state machines is hard and
not desirable in most of the kernel. But it can be done, and sometimes
(definitely not in the common case) it's /cleaner/ than functional
programming. I've written an HTTP server and an FTP server in-kernel as
state machines, and they worked better than I initially expected. It
needs a different kind of thinking, but there /are/ people around with
that kind of thinking, so we just cannot exclude the possibility. [ It's
just that such people usually dedicate their brains to mental
fantasies^H^H^Hexercises called 'Higher Mathematics' :-) ]

> [...] The fact that networking (and TCP in particular) has state
> machines is because it is a packetized environment.

Rough ballpark figures: for things like web serving, file serving or
mail serving, networking sockets are the reason for context-blocking
events in 90% of the cases (mostly due to networking latency); 9% of the
blocking happens due to plain block IO, and 1% due to VFS metadata
(inode, directory, etc.) blocking.

( In Tux I had to handle /all/ of these sources of blocking, because
even 1% kills your performance if you do a hundred thousand requests per
second - but in terms of design weight, networking is pretty damn
important. )

And interestingly, modern IO frameworks tend to gravitate towards a
packetized environment as well. I.e. I don't think state machines are
/that/ unimportant.

Ingo