Re: UTF-8 practically vs. theoretically in the VFS API
From: H. Peter Anvin
Date: Tue Feb 17 2004 - 22:01:49 EST
Followup to: <20040216202142.GA5834@xxxxxxxxxxxxxxx>
By author: bert hubert <ahu@xxxxxxx>
In newsgroup: linux.dev.kernel
>
> Additional good news is that following octets in a utf-8 character sequence
> always have the highest order bit set, precluding / or \x0 from appearing,
> confusing the kernel.
>
Indeed. The original name for the encoding was, in fact, "FSS-UTF",
for "File System Safe UCS Transformation Format."
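As a quick sanity check of the property quoted above, here is a small
Python sketch (not anything from the kernel; the helper name
all_bytes_high is made up for illustration). It encodes every non-ASCII
Unicode scalar value and confirms that no byte of the result has the
high bit clear, so neither '/' (0x2F) nor NUL (0x00) can turn up inside
a multi-byte sequence.

```python
def all_bytes_high(code_point: int) -> bool:
    """True if every byte of the UTF-8 encoding has the top bit set."""
    return min(chr(code_point).encode("utf-8")) >= 0x80

# Walk every non-ASCII scalar value. Surrogates (U+D800..U+DFFF) are
# skipped because they are not encodable in UTF-8 at all.
violations = [
    cp for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF and not all_bytes_high(cp)
]
print(len(violations))  # prints 0
```

The only way to get a 0x2F or 0x00 byte out of a conforming encoder is
to encode U+002F or U+0000 itself, each as a single byte.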
> The remaining zit is that all these represent '..':
> 2E 2E
> C0 AE C0 AE
> E0 80 AE E0 80 AE
> F0 80 80 AE F0 80 80 AE
> F8 80 80 80 AE F8 80 80 80 AE
> FC 80 80 80 80 AE FC 80 80 80 80 AE
No, they don't.
The first represents "..", the remaining five are illegal encodings and
do not decode to anything.
Those of us who have been involved with the issue have fought
*extremely* hard against DWIM decoders which try to decode the latter
sequences into ".." -- it's incorrect, and a security hazard. The
only acceptable behaviors are to throw an error, or to use an out-of-band
encoding mechanism to denote "bad bytecode."
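To illustrate both acceptable behaviors, here is a minimal sketch using
Python's built-in UTF-8 codec (chosen only as a convenient example of a
strict decoder; strict_decode and CANDIDATES are names invented here).
The strict mode raises an error on all five overlong sequences, and the
"surrogateescape" error handler is one instance of an out-of-band
marking scheme: each undecodable byte is mapped to a lone surrogate
that cannot be confused with a real character.

```python
# The six byte sequences from the quoted message.
CANDIDATES = [
    bytes.fromhex("2E2E"),
    bytes.fromhex("C0AEC0AE"),
    bytes.fromhex("E080AEE080AE"),
    bytes.fromhex("F08080AEF08080AE"),
    bytes.fromhex("F8808080AEF8808080AE"),
    bytes.fromhex("FC80808080AEFC80808080AE"),
]

def strict_decode(raw: bytes):
    """Return the decoded string, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None

for raw in CANDIDATES:
    print(raw.hex(), "->", repr(strict_decode(raw)))

# Out-of-band marking: every bad byte becomes a lone surrogate in the
# range U+DC80..U+DCFF, so the result never contains a real ".".
marked = CANDIDATES[1].decode("utf-8", errors="surrogateescape")
assert "." not in marked
# The original bytes can be recovered exactly.
assert marked.encode("utf-8", errors="surrogateescape") == CANDIDATES[1]
```

Only the first sequence decodes to ".."; the other five come back as
None from strict_decode.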
> This in itself is not a problem, the kernel will only recognize 2E 2E as the
> real .., but it does show that 'document.doc' might be encoded in a myriad
> ways.
No, it doesn't.
> So some guidance about using only the simplest possible encoding might be
> sensible, if we don't want the kernel to know about utf-8.
UTF-8 requires the use of the shortest possible encoding. An
application which doesn't obey that and tries to be "smart" is a
security hazard.
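For anyone writing a decoder by hand, the shortest-form rule comes down
to one extra comparison after the payload bits are assembled. The
following Python sketch is illustrative only (decode_one,
MIN_FOR_LENGTH and the check_shortest flag are hypothetical names, and
it handles a single sequence of 1 to 4 bytes). The flag exists purely
to show what a "smart" decoder that skips the check would return.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}

def decode_one(seq: bytes, check_shortest: bool = True) -> int:
    """Decode one UTF-8 sequence of 1 to 4 bytes into a code point.

    With check_shortest=False this mimics a naive decoder that only
    extracts the payload bits (shown to illustrate the hazard).
    """
    first = seq[0]
    if first < 0x80:
        length, value = 1, first
    elif (first & 0xE0) == 0xC0:
        length, value = 2, first & 0x1F
    elif (first & 0xF0) == 0xE0:
        length, value = 3, first & 0x0F
    elif (first & 0xF8) == 0xF0:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    # The shortest-form rule: the value must actually need this length.
    if check_shortest and value < MIN_FOR_LENGTH[length]:
        raise ValueError("overlong encoding")
    if 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        raise ValueError("not a Unicode scalar value")
    return value

print(hex(decode_one(b"\xc0\xae", check_shortest=False)))  # prints 0x2e
```

With the check left on, decode_one(b"\xc0\xae") raises ValueError
instead of quietly producing '.'.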
It is a bit unfortunate that the encoding doesn't exclude these by
design, as opposed to by error checking; that makes the check a little
too easy for clueless programmers to skip :(
-hpa