Re: UTF-8 practically vs. theoretically in the VFS API

From: Jamie Lokier
Date: Wed Feb 18 2004 - 06:35:05 EST


Linus Torvalds wrote:
> Somebody correctly pointed out that you do not need any out-of-band
> encoding mechanism - the very fact that it's an invalid sequence is in
> itself a perfectly fine flag. No out-of-band signalling required.

Technically this is almost(*) correct, however a _lot_ of code exists
which assumes logical properties of UTF-8. (See, for example, the
"stty utf8" patch).

Perl, for example, allows you to pass around invalid sequences in
exactly the way you describe. It works, right up until you do
something like length() or substr() or a regex match. Then Perl
screws up the answer, because it sees something like 0xfd and just
assumes it can skip the next 5 bytes, without checking them.

hpa's suggestion that invalid bytes are treated as 0x800000xx works
very nicely, *iff* a program is absolutely consistent about its
treatment of bytes in that way. When there's a mixture of code which
interprets malformed UTF-8 in different ways, then it's messy and
sometimes a security hazard.

-- Jamie

(*) - It's fine until you concatenate two malformed strings. Then the
out-of-band signal is lost if the combination is valid UTF-8.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/