Re: JFS default behavior (was: UTF-8 in file systems? xfs/extfs/etc.)

From: John Bradford
Date: Mon Feb 16 2004 - 11:34:44 EST


Quote from viro@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx:
> On Mon, Feb 16, 2004 at 03:46:21PM +0000, John Bradford wrote:
> > The current situation is that so many applications simply treat
> > filenames as arbitrary sequences of bytes. With many encodings, this
> > simply happens to work, and an encoding mis-match will result in some
> > incorrect characters being displayed for byte values > 127. However,
> > some encodings, such as UTF-8, are simply _not_ compatible with the
> > 'you can also treat it like an arbitrary byte string model', and there
>
> Excuse me? Would you fscking mind explaining what, in your opinion,
> UTF-8 is

Read the UTF-8 manual page.

> and what makes "simply _not_ compatible" with aforementioned
> model?

Byte values > 127 in UTF-8 don't map to single characters, but instead
many of them form part of an escape to a larger set of values.

The net effect is that if you have filenames in an existing 8-bit
encoding, such as any of the ISO-8859- encodings, and treat them as
being in another, similar encoding, you may get some incorrect
characters. This is not ideal, of course, but it is not very
confusing for end users. You can more or less store arbitrary bytes
in the filename, and usually at least get a displayable, re-typeable,
somewhat usable result out.

Note that _many_ current applications expect to be able to do just
that.

However, with UTF-8, random bytes > 127 may not map to any valid
character sequence at all, or may map to a sequence that is not
permitted by the spec, but which is, for example, a 31-bit
representation of a value < 128. These are a potential source of
security vulnerabilities for badly written decoders.

Now, this problem is not limited to UTF-8 - many 16 bit encodings may
have similar issues with 'random' byte streams.

However, with my proposed solution, Unicode-aware applications can be
adapted to write their filenames as UCS-4, and existing applications
which continue to see 8-bit byte streams which they can interpret as
they like, will see 7-bit ascii for characters which can be
represented in it, and a random character for those which can't.
Assuming that those applications treat the byte sequence as an
ISO-8859- type character set, (not UTF-8, or a 16-bit character set),
this shouldn't be too much of a problem, except where the low byte of
the UCS-4 character is \0 or /. We can work around this by replacing
such bytes with another character in the 8-bit read routine, (which
isn't expected to deal with anything other than 7-bit ASCII 100%
correctly, (which is no worse than what we have at the moment, as far
as I can see)).

Applications which do treat the 8-bit byte stream as UTF-8 or an
existing 16-bit encoding should have only one additional thing to deal
with over what they have to deal with today, and that is the potential
for a filename created from truncated 32-bit UCS-4 values to contain
\0 or /. I suggested above that the kernel could deal with that by
substituting another value, but obviously UTF-8 and 16-bit encodings
are more sensitive to what that substitute value is, than ISO-8859-
type encodings are.

John.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/