[RFC] Raceless module interface

From: Daniel Phillips (phillips@arcor.de)
Date: Wed Sep 11 2002 - 13:35:05 EST


Hi Roman,

On Tuesday 10 September 2002 12:17, Roman Zippel wrote:
> I implemented something like this some time ago. If module->count isn't
> used by module.c anymore, why should it be in the module structure?
> Consequently I removed it from the module struct (what breaks of course
> unloading of all modules, so I'll probably reintroduce it with big a
> warning). If the count isn't in the module structure, the locking will
> become quite simpler. More info is here
> http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2

Ah, I remember your original post but I didn't fully understand what you were
going on about at the time. Now I do, and we're on the same page. Not only
that, but I'm getting to the point with this messy problem where I can
articulate the issues clearly, so let's try to do it now. Please ack/nack me
as I go.

First, let's identify the central issue that's causing conceptual problems
for some who've sniffed at these issues. It's simple: there are two broad
categories of modules:

  - Those for which we account exactly for all task entries and exits
  - Those for which we do not, or cannot, account for current usage exactly

Let's call the former "counting" and the latter "noncounting" modules. The
number of tasks executing in a given module is the count of task entries
minus the count of task exits. Obviously this is a big problem when we have
no good way of counting accurately, but it is not an insoluble problem, and a
solution is presented here, based on earlier discussions on lkml.

Preventing Unload Races in Counting Modules
-------------------------------------------

Module unload races in counting modules are easy to handle. Al Viro has
written code that handles those in a passably competent way and we will
improve Al's approach incrementally as described below.

I'll digress for a moment and ask the question "why are counting modules
countable?". Well, using filesystems as an example, the vfs initiates each
and every call into a filesystem module, and every vfs high level function is
structured in such a way that all calls into a filesystem module return
before the vfs goes on to do something else. So it's easy to write
accounting code that relies on this behavior, and in fact, the accounting job
reduces to simply counting the number of times a given module is mounted vs
unmounted.

Even in this simple situation there are possible races, and one of the races
is dealt with by Al's odd-looking try_inc_mod_count function, which protects
the incrementing of its counter in three ways: 1) by using atomic operations,
2) by protecting incrementing and testing with a spin lock, and 3) by
additionally protecting incrementing with a state flag. Decrementing the
counter is
protected only by the use of atomic decrement, and turns out not to need any
more protection than that (granted, this interesting fact is largely useless
in terms of cycles saved).
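
For concreteness, here is roughly what the decrement side looks like; the
helper name is made up for illustration, the point being only that an
atomic_dec with no lock is sufficient:

/* Sketch: the matching decrement, needing no spinlock.  The helper
 * name is invented for illustration. */
static inline void dec_mod_count(struct module *module)
{
        if (module)
                atomic_dec(&module->count);
}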

So let's look at the curious structure of the count increment operation,
try_inc_mod_count. This acts on a counter embedded in struct module. (As
you point out, it doesn't need to: it could act on a counter embedded in
struct file_system_type instead, and there is a very good reason for doing
that, which I'll get back to.)

I won't show Al's version, I'll show my improved version, which
substitutes a special state of the counter variable for Al's state flag, a
small but useful change:

int try_inc_mod_count(struct module *module)
{
        int result = 1;
        if (module) {
                spin_lock(&unload_lock);
                /* -1 means "unload in progress": refuse to take a new ref */
                if (atomic_read(&module->count) == -1)
                        result = 0;
                else
                        atomic_inc(&module->count);
                spin_unlock(&unload_lock);
        }
        return result;
}

If the counter has the value -1 then it can't be incremented and this
function returns 0. This corresponds to the state where unloading is in
progress, and for filesystems, means that an attempted mount will fail. If
the counter has any other value then that value is incremented, the function
returns true, and the mount proceeds. Getting rid of the state flag
simplifies the later step of removing the counter from struct module.

The only other thing worth noting about this function is that, to use it, you
have to be able to locate the module given a struct file_system_type, hence
the ->owner field in that structure. With the techniques you and I have
identified, we can dispense with the ->owner field, removing considerable
cruft in the process. However, for the moment I think we should produce a
patch that aims narrowly at getting rid of the unload races, and do the
spinoff cleanups after.
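
For now, to illustrate how the pieces fit together, the vfs side of a mount
could bracket the module reference like this; the wrapper names are made up,
only try_inc_mod_count and the ->owner field are real:

/* Sketch of how the vfs brackets a mount using the ->owner back
 * pointer in struct file_system_type.  Wrapper names are invented. */
static int get_fs_module(struct file_system_type *fs)
{
        /* returns 0 if the module is already being unloaded */
        return try_inc_mod_count(fs->owner);
}

static void put_fs_module(struct file_system_type *fs)
{
        dec_mod_count(fs->owner);       /* the helper sketched earlier */
}

Once the counter migrates into struct file_system_type itself, these wrappers
stop needing ->owner at all, which is exactly the cleanup alluded to above.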

The other main part of this mechanism is in sys_delete_module, in which the
test for zero module count and setting of the state flag are protected by the
same spinlock as above. Our improvement here will be to encapsulate this
functionality in a module helper function that looks something like this:

int try_mod_count(struct module *module)
{
        int count;
        spin_lock(&unload_lock);
        count = atomic_read(&module->count);
        /* no users: flag the module as unloading so that
         * try_inc_mod_count refuses from here on */
        if (!count)
                atomic_set(&module->count, -1);
        spin_unlock(&unload_lock);
        return count;
}

This function returns zero if the (counting) module has no users, and returns
the (approximate) number of users otherwise. If the function returns zero,
then additionally, the module count is set to -1, a state in which
try_inc_mod_count is prevented from incrementing it, so after the spinlock is
released, any pending mounts will fail as they should, because we have been
asked to unload the module and are in the process of doing so.

The remaining possible races here are currently closed by the BKL, and we
will change that to a semaphore in due course. The fact that we use a
sleeping lock for this purpose means that a module's cleanup_function is
unrestricted in terms of scheduling.

The (counting) module's cleanup_function will therefore look something like
this:

        int error = try_mod_count(module);
        if (!error) {
                unregister_filesystem(foo_filesystem);
                cleanup_foo(foo_filesystem);
        }
        return error;

And we will later design the 'module' parameter out of this, because it is no
longer needed by generic code and so the associated cruft can be eliminated.
As a programmer/guru friend once said: the code "won't have its thumb tied to
its nose any more".

Note that our returned error can also take on a negative value (i.e., a
standard error code) should module unloading fail for some other reason than
the module simply having users. Additionally, a return of -1 means "you
already killed me, stupid, I'm already dead, and I already unregistered and
cleaned up my filesystem". The state the module is in when its count is -1
is interesting and potentially useful: in this state the module can legally
be reinitialized/reactivated, instead of just being removed from the module
list and destroyed.

(Roman, is -1 an ok error code to return here, given that there could be
other negative error values returned as well?)

Now we can look at the revised version of sys_delete_module, which has the
code:

        BUG_ON(!mod->cleanup);
        if ((error = mod->cleanup()))
                goto out_error;
        free_module(mod);

and free_module will no longer call the cleanup_function. Nice and simple,
easy to understand, is it not? What's more, we don't have to change this
code at all to handle the more difficult case of noncounting modules.
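
Putting the counting pieces together, and substituting the sleeping lock
mentioned above for the BKL, the core of the unload path might look roughly
like this; do_delete_module and unload_sem are made-up names, and the name
lookup and permission checks are omitted:

static DECLARE_MUTEX(unload_sem);       /* hypothetical, stands in for the BKL */

long do_delete_module(struct module *mod)
{
        long error;

        down(&unload_sem);
        BUG_ON(!mod->cleanup);
        if ((error = mod->cleanup()))   /* >0: still busy, <0: other failure */
                goto out;
        free_module(mod);
out:
        up(&unload_sem);
        return error;
}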

Preventing Unload Races in Noncounting Modules
----------------------------------------------

Noncounting modules don't keep track of the number of task entries and
exits, typically because the cost of even a simple inc/dec wrapper in
generic kernel code is unacceptable, and because the wrappers just plain look
ugly. The proposed Linux Security Module is a good example of a noncounting
kind of module: many vfs functions that currently execute security checks
using inline code will henceforth call indirectly through a module to do
it. Linus will never buy it if these calls are inefficient. At present,
some of these security tests consist of just a few instructions, so wrapping
an accounting interface around those is going to show up on an execution
profile, never mind the extra source code.

This is far from the only reason why a module might be noncounting, but it's
sufficient to show why we can't ignore the problem. Fortunately, there is a
well-understood way to deal with such modules, consisting of two steps:

  - Stub out all the module call points
  - Wait until every running task on the system has scheduled at least once

To make this work, we rely on the rule that no module code may sleep/schedule
unless it first increments a counter. The counter incremented will, at
present, be the module->count, but that is entirely up to the module itself;
module.c will no longer care how it keeps track, as long as it does.
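
As an illustration of that rule, a hook in a noncounting module that needs to
block takes its own reference first; everything in this sketch except the
rule itself is hypothetical:

/* Sketch: a blocking hook in a noncounting module.  The hook and the
 * private counter are invented; the point is only that the reference
 * is taken before anything that can sleep. */
static atomic_t foo_active = ATOMIC_INIT(0);

static int foo_permission_hook(struct inode *inode, int mask)
{
        void *scratch;

        atomic_inc(&foo_active);                /* now we may block */
        scratch = kmalloc(64, GFP_KERNEL);      /* may sleep */
        if (!scratch) {
                atomic_dec(&foo_active);
                return -ENOMEM;
        }
        /* ... do the actual check using 'scratch' ... */
        kfree(scratch);
        atomic_dec(&foo_active);
        return 0;
}

The cleanup_function of such a module would of course also have to wait for
foo_active to drop to zero, in addition to the quiescence test described next.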

With CONFIG_PREEMPT enabled, we must additionally disable preemption until the
module quiescence test has completed. I'm not going to go further into this,
because this is deep scheduling-fu, and I haven't got working code to show
yet. Suffice to say that we have the technology to build a
magic_wait_for_quiescence, and we must now proceed to do that. (Robert, are
you reading?) A noncounting module uses the test as follows, in its
cleanup_function:

        unregister_callpoints(...);
        magic_wait_for_quiescence();
        return cleanup_foo(...);

Blessed relief, that was easy. Now it's just a matter of coding ;-)
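
To make the magic a little less magic, here is one possible shape for the
quiescence wait, along the lines already discussed on lkml: migrate the
caller onto each cpu in turn, so that every task which was executing in the
stubbed-out call points must have passed through schedule(). The exact
cpu-affinity interface and the CONFIG_PREEMPT interaction are precisely the
scheduling-fu mentioned above, so treat this as a sketch only:

/* Sketch only: assumes the current 2.5 set_cpus_allowed() interface
 * and ignores the CONFIG_PREEMPT complications mentioned above. */
static void magic_wait_for_quiescence(void)
{
        unsigned long saved_mask = current->cpus_allowed;
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
                if (!cpu_online(cpu))
                        continue;
                /* after this call returns we are running on 'cpu', so
                 * everything that was there before us has been through
                 * schedule() at least once */
                set_cpus_allowed(current, 1UL << cpu);
        }
        set_cpus_allowed(current, saved_mask);
}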

One can easily imagine a kind of module that needs both the try_mod_count
mechanism and the magic_wait_for_quiescence, or one that also has to wait for
some daemon to exit, or a module that supports more than a single subsystem,
possibly of very different kinds. All these variations are handled in the
obvious way: by customizing the cleanup_function, with the aid of helper
functions. This allows us to avoid the strong temptation to over-engineer
the module interface, trying to anticipate every possible kind of module that
someone might write in the future.
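
For example, a module that both counts mounts and exports stub-able call
points could combine the two helpers in its cleanup_function like this (the
foo_* names are made up):

        int error = try_mod_count(module);
        if (!error) {
                unregister_filesystem(&foo_fs_type);    /* counted users */
                unregister_callpoints(&foo_ops);        /* stubbed call points */
                magic_wait_for_quiescence();
                cleanup_foo();
        }
        return error;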

Now there's the question "if this is such a great approach, why not make all
modules work this way, not just filesystems?". Easy: the magic scheduling
approach impacts the scheduler, however lightly, and worse, we cannot put an
upper bound on the time needed for magic_wait_for_quiescence to complete. So
the try_count approach is preferable, where it works.

Implementation Issues
---------------------

OK, I think we agree on the general outline of the work that has to be done.
If not, please shout now, otherwise, feel free to niggle me to death in the
usual way ;-)

Let me touch on just one more issue: we need to do this work without breaking
every existing module in the world. In other words, we need a compatibility
wrapper to get us through the period, possibly as long as a year or two, when
there are still a lot of modules both in-kernel and outside, whose
cleanup_functions just return void. Module.c will detect this and do the
magic_wait_for_quiescence following the module's cleanup_function. Thus,
such modules will be no more broken than they ever were.

What we want is to define a new 'module_kill' macro, which will live alongside
a backwards-compatible 'module_exit' macro for now. The latter generates a
function call hook that module.c detects and wraps like this:

        if (mod->old_cleanup) {
                printk(KERN_WARNING "Please fix your old-style module\n");
                if ((error = try_mod_count(mod)))
                        goto out_error;
                mod->old_cleanup();
                magic_wait_for_quiescence();
                free_module(mod);
        } else {
                <nice new interface>
        }
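
One way to wire that up, purely as a sketch (the symbol names and mechanics
here are invented), is for the two macros to emit differently named hooks, so
the loader can fill in either mod->cleanup or mod->old_cleanup and module.c
can tell the styles apart:

#define module_kill(fn) \
        int __module_cleanup(void) __attribute__((alias(#fn)))

#define module_exit(fn) \
        void __module_old_cleanup(void) __attribute__((alias(#fn)))

A new-style module then just writes module_kill(foo_cleanup); and its
foo_cleanup() returns the try_mod_count style result described above, while
unconverted modules keep working through the wrapper.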

Miscellaneous other improvements we might want to make are:

  - Use a semaphore instead of lock_kernel
  - Clean up the generally substandard code in module.c
  - Move the semaphore into struct module
  - Move unload_lock into struct module
  - Other things I've forgotten for the moment

But these can all wait. We have races now, and we have proposals on the
table that smell like over-engineering. So we need to do the work I've
described above, promptly, and get it into circulation.

Comments/flames/races?

-- 
Daniel