Re: UTF-8 practically vs. theoretically in the VFS API
From: H. Peter Anvin
Date: Wed Feb 18 2004 - 00:31:53 EST
Linus Torvalds wrote:
> On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> > Well, the reason you'd want an out-of-band mechanism is to be able to
> > display it as some kind of escapes.
>
> I'd suggest just doing that when you convert the utf-8 format to printable
> format _anyway_. At that point you just make the "printable"
> representation be the binary escape sequence (which you have to have for
> other non-printable utf-8 characters anyway).
What does "printable" mean in this context? Typically you have to
convert it to UCS-4 first, so you can index into your font tables, then
you have to create the right composition, apply the bidirectional text
algorithm, and so forth.
Rendering general Unicode text is complex enough that you really want it
layered. What I described was the first step of that -- mostly trying
to show that "throwing an error" doesn't necessarily mean "produce no
output." What you shouldn't do, though, is alias it with legitimate input.
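
As a rough illustration of that first layer, here is a minimal Python sketch of a decoder that reports malformed bytes out-of-band instead of discarding them or mapping them onto real characters. The function name decode_tagged and the ("char", ...) / ("byte", ...) tagging are hypothetical, chosen only for this example; later layers (font lookup, composition, bidi) would consume the tagged stream.

```python
def decode_tagged(raw: bytes):
    """Decode UTF-8 into a list of tagged items.

    Valid characters come back as ("char", str); every byte that is not
    part of a well-formed UTF-8 sequence comes back as ("byte", int).
    The tag keeps malformed input separate from legitimate characters.
    """
    items = []
    pos = 0
    while pos < len(raw):
        try:
            text = raw[pos:].decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice we passed in.
            good = raw[pos:pos + err.start].decode("utf-8")
            items.extend(("char", ch) for ch in good)
            bad = raw[pos + err.start:pos + err.end]
            items.extend(("byte", b) for b in bad)
            pos += err.end
        else:
            items.extend(("char", ch) for ch in text)
            break
    return items
```

With this, the two-byte sequence C3 A9 decodes to the character U+00E9, while a stray byte E9 is reported as a byte, so the two are never confused.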
> And if you do things right (ie you allow user input in that same escaped
> output format), you can allow users to re-create the exact "broken utf-8".
> Which is actually important just so that the user can fix it up (ie
> imagine the user noticing that the filename is broken, and now needs to do
> a "mv broken-name fixed-name" - the user needs some way to re-create the
> brokenness).
Indeed. The C language has gone with \x77 for bytes and \u7777 or
\U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
think this is a good UI for shells to follow. The \x representation
then doesn't stand for characters but for bytes. It may be desirable to
disallow encoding of *valid* UTF-8 characters this way, though.
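
A minimal sketch of such a round-trippable escape format, in Python. The names escape_name and unescape_name are hypothetical, and this is only one possible reading of the scheme: \xNN is emitted solely for bytes that are not valid UTF-8, \uNNNN and \UNNNNNNNN for non-printable characters, and a literal backslash is doubled. It leans on Python's "surrogateescape" error handler as a convenient internal marker for undecodable bytes; that is an implementation shortcut, not part of the proposal. The parser rejects any run of \x escapes that contains a valid UTF-8 character.

```python
import re

# One token: \xNN, \uNNNN, \UNNNNNNNN, an escaped backslash, or a literal.
_TOKEN = re.compile(
    r"\\x([0-9a-fA-F]{2})|\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})"
    r"|\\(\\)|([^\\])"
)


def escape_name(raw: bytes) -> str:
    """Render a byte string as printable text that can be parsed back."""
    out = []
    # surrogateescape maps each undecodable byte 0xNN to the code point
    # U+DCNN, which the strict UTF-8 decoder never produces otherwise.
    for ch in raw.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\x%02x" % (cp - 0xDC00))   # raw byte
        elif ch == "\\":
            out.append("\\\\")
        elif ch.isprintable():
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)              # 4 hex digits
        else:
            out.append("\\U%08x" % cp)              # 8 hex digits
    return "".join(out)


def unescape_name(text: str) -> bytes:
    """Parse the escaped form back into the exact original bytes."""
    out = bytearray()
    run = bytearray()  # current run of consecutive \xNN bytes

    def flush_run():
        # \x is for bytes only: refuse a run that spells out valid UTF-8.
        decoded = run.decode("utf-8", "surrogateescape")
        if any(not 0xDC80 <= ord(c) <= 0xDCFF for c in decoded):
            raise ValueError("\\x escapes encode a valid UTF-8 character")
        out.extend(run)
        run.clear()

    pos = 0
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError("bad escape at offset %d" % pos)
        pos = m.end()
        hex_byte, u4, u8, backslash, literal = m.groups()
        if hex_byte is not None:
            run.append(int(hex_byte, 16))
            continue
        flush_run()
        if backslash is not None:
            out += b"\\"
        elif literal is not None:
            out += literal.encode("utf-8")
        else:
            out += chr(int(u4 or u8, 16)).encode("utf-8")
    flush_run()
    return bytes(out)
```

Under these assumptions, escape_name(b"caf\xe9") gives the text caf\xe9, and feeding that text to unescape_name returns the original four bytes, so a broken name can be typed back for something like "mv broken-name fixed-name". Input such as \xc3\xa9 is refused, because those two bytes form a valid character that should be written literally or as \u00e9.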
-hpa