Re: UTF-8 practically vs. theoretically in the VFS API

From: Robin Rosenberg
Date: Wed Feb 18 2004 - 05:30:51 EST


On Wednesday 18 February 2004 06.30, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> > On Tue, 17 Feb 2004, H. Peter Anvin wrote:
> >>Well, the reason you'd want an out-of-band mechanism is to be able to
> >>display it as some kind of escapes.
> > I'd suggest just doing that when you convert the utf-8 format to printable
> > format _anyway_. At that point you just make the "printable"
> > representation be the binary escape sequence (which you have to have for
> > other non-printable utf-8 characters anyway).
> What does "printable" mean in this context? Typically you have to
> convert it to UCS-4 first, so you can index into your font tables, then
> you have to create the right composition, apply the bidirectional text
> algorithm, and so forth.

> Rendering general Unicode text is complex enough that you really want it
> layered. What I described what the first step of that -- mostly trying
> to show that "throwing an error" doesn't necessarily mean "produce no
> output." What you shouldn't do, though, is alias it with legitimate input.

I think you can use libicu here. Conversion to UCS-4 doesn't for determining
character type doesn't mean you will every have actual strings of UCS-4. It could
be character by character just for looking it up, so you can have the out-of-band
error flags internally.

> > And if you do things right (ie you allow user input in that same escaped
> > output format), you can allow users to re-create the exact "broken utf-8".
> > Which is actually important just so that the user can fix it up (ie
> > imagine the user noticing that the filename is broken, and now needs to do
> > a "mv broken-name fixed-name" - the user needs some way to re-create the
> > brokenness).
>
> Indeed. The C language has gone with \x77 for bytes and \u7777 or
> \U77777777 for Unicode characters (4 vs 8 hex digits respectively); I
> think this is a good UI for shells to follow. The \x representation
> then doesn't stand for characters but for bytes. It may be desirable to
> disallow encoding of *valid* UTF-8 characters this way, though.

Agree. \u80808080 I would assume represents a valid character, while \x80\x80\x80\x80
does not. A problem with invalid sequences I just noted is that they break some of
the nice properties of UTF-8, that people will assume apply, i.e. that you can parse it
backwards. With UTF-8 (i.e. well-formed utf-8) you can point at a byte and figure "this is
not the first byte", lets skip backwards to find the start. If invalid sequences can ever occur
you must read from the start of the string.

-- robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/