Re: sys_write() racy for multi-threaded append?

From: Bodo Eggert
Date: Mon Mar 12 2007 - 14:49:34 EST


On Mon, 12 Mar 2007, Michael K. Edwards wrote:
> On 3/12/07, Bodo Eggert <7eggert@xxxxxx> wrote:
> > Michael K. Edwards <medwards.linux@xxxxxxxxx> wrote:

> > > Actually, I think it would make the kernel (negligibly) faster to bump
> > > f_pos before the vfs_write() call.
> >
> > This is a security risk.

<snip>

> I don't entirely follow this.

I got it wrong, and I didn't CC myself so I couldn't correct it in time.

If it was a security risk, it would exist right now, too, because of
other access patterns.

> > BTW: The best thing you can do to a program where two threads race for
> > writing one fd is to let it crash and burn in the most spectacular way
> > allowed without affecting the rest of the system, unless it happens to
> > be a pipe and the number of bytes written is less than PIPE_MAX.
>
> That's fine when you're doing integration test, and should probably be
> the default during development. But if the race is first exposed in
> the field, or if the developer is trying to concentrate on a different
> problem, "spectacular crash and burn" may do more harm than good.
> It's easy enough to refactor the f_pos handling in the kernel so that
> it all goes through three or four inline accessor functions, at which
> point you can choose your trade-off between speed and idiot-proofness
> -- at _kernel_ compile time, or (given future hardware that supports
> standardized optionally-atomic-based-on-runtime-flag operations) per
> process at run-time.

CONFIG_WOMBAT

Waste memory, brain and time in order to grant an atomic write which is
neither guaranteed by the standard nor expected by any sane programmer,
just in case some idiot tries to write to one file from multiple
processes.

Warning: Programs expecting this behaviour are buggy and non-portable.

> Frankly, I think that unless application programmers poke at some sort
> of magic "I promise to handle short writes correctly" bit, write()
> should always return either the full number of bytes requested or an
> error code.

If you asume that you won't have short writes, your programs may fail on
e.g. solaris. There may be reasons for linux to use the same semantics at
some time in the future, you never know.

If you asume you *may* have short writes, you have no problem.

> If they do poke at that bit, the (development) kernel
> should deliberately split writes a few percent of the time, just to
> exercise the short-write code paths. And in order to find out where
> that magic bit is, they should have to read the kernel code or ask on
> LKML (and get the "standard lecture").

I think it may be well-hidden in menuconfig.

> Really very IEEE754-like, actually. (Harp, harp.)

-EYOULOSTME
--
"Try to look unimportant; they may be low on ammo."
-Infantry Journal
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/