Re: [PATCH] NLS: improve UTF8 -> UTF16 string conversion routine
From: Alan Stern
Date: Sat Nov 19 2011 - 10:28:58 EST
On Sat, 19 Nov 2011, NamJae Jeon wrote:
> >> > +int utf8s_to_utf16s(const u8 *s, int len, enum utf16_endian endian,
> >> > +               wchar_t *pwcs, int maxlen)
> >> >  {
> >> >         u16 *op;
> >> >         int size;
> >> >         unicode_t u;
> >> >
> >> >         op = pwcs;
> >> > -       while (*s && len > 0) {
> >> > +       while (len > 0 && maxlen > 0 && *s) {
> >> >                 if (*s & 0x80) {
> >> >                         size = utf8_to_utf32(s, len, &u);
> >> >                         if (size < 0)
> >> >                                 return -EINVAL;
> >> > +                       s += size;
> >> > +                       len -= size;
...
> >> >                         if (u >= PLANE_SIZE) {
> >> > +                               if (maxlen < 2)
> >> > +                                       break;
> >> >                                 u -= PLANE_SIZE;
> >> > -                               *op++ = (wchar_t) (SURROGATE_PAIR |
> >> > -                                               ((u >> 10) & SURROGATE_BITS));
> >> > -                               *op++ = (wchar_t) (SURROGATE_PAIR |
> >> > +                               put_utf16(op++, SURROGATE_PAIR |
> >> > +                                               ((u >> 10) & SURROGATE_BITS),
> >> > +                                               endian);
> >> > +                               put_utf16(op++, SURROGATE_PAIR |
> >> >                                                 SURROGATE_LOW |
> >> > -                                               (u & SURROGATE_BITS));
> >> > +                                               (u & SURROGATE_BITS),
> >> > +                                               endian);
> >> > +                               maxlen -= 2;
> >>
> >> Why did you use a constant value (2) instead of "maxlen -= size"?
> >
> > "maxlen -= size" would be completely wrong, because size is the length
> > of the utf8 input and maxlen is the number of 16-bit slots remaining
> > in the output buffer.  A surrogate pair uses two 16-bit values,
> > therefore maxlen has to be decreased by 2.
> If so, shouldn't len also be decreased by the constant 2, like maxlen?
You seem to be confused. "len" refers to the input string and "maxlen"
refers to the output string. They have no connection to one another.
Would it help if "maxlen" were named "maxout" instead?
> And why is this check (if (maxlen < 2)) needed? Is it for when len is smaller than 2?
If maxlen < 2 then there is room in the output buffer for only one more
data value -- but a surrogate pair occupies two data values. Hence
there isn't room to store the pair in the output buffer, so the loop
must terminate.
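
To make the two counters concrete, here is a minimal user-space sketch of
the output side only. It is not the kernel code: store_utf16() and its
"maxout" parameter are hypothetical names chosen for illustration, and the
constants are the standard UTF-16 values, which I am assuming match the
macros used in the patch. Note that nothing in it looks at the UTF-8 input
length; a 4-byte UTF-8 sequence still costs exactly two output slots.

```c
#include <stdint.h>

/* Standard UTF-16 surrogate constants (assumed to match the patch's macros). */
#define PLANE_SIZE      0x10000
#define SURROGATE_PAIR  0xd800
#define SURROGATE_LOW   0x0400
#define SURROGATE_BITS  0x03ff

/*
 * Hypothetical helper: store one code point at op[] as host-endian UTF-16.
 * maxout is the number of free 16-bit slots in the output buffer.
 * Returns the number of slots used (1 or 2), or 0 if the value does not fit.
 */
static int store_utf16(uint32_t u, uint16_t *op, int maxout)
{
	if (u >= PLANE_SIZE) {
		/* A surrogate pair needs two slots; one free slot is not enough. */
		if (maxout < 2)
			return 0;
		u -= PLANE_SIZE;
		op[0] = (uint16_t)(SURROGATE_PAIR | ((u >> 10) & SURROGATE_BITS));
		op[1] = (uint16_t)(SURROGATE_PAIR | SURROGATE_LOW |
				   (u & SURROGATE_BITS));
		return 2;
	}
	if (maxout < 1)
		return 0;
	op[0] = (uint16_t)u;
	return 1;
}
```

A caller would subtract the return value from its output counter and the
utf8_to_utf32() result from its input counter, keeping the two independent.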
> >> >                         } else {
> >> > -                               *op++ = (wchar_t) u;
> >> > +                               put_utf16(op++, u, endian);
> >> > +                               maxlen--;
> >> >                         }
> >> > -                       s += size;
> >> > -                       len -= size;
> >> >                 } else {
> >> > -                       *op++ = *s++;
> >> > +                       put_utf16(op++, *s++, endian);
> >> >                         len--;
> >> > +                       maxlen--;
> >> >                 }
> >> >         }
> >> >         return op - pwcs;
> >> > Index: usb-3.2/fs/fat/namei_vfat.c
> >> > ===================================================================
> >> > --- usb-3.2.orig/fs/fat/namei_vfat.c
> >> > +++ usb-3.2/fs/fat/namei_vfat.c
> >> > @@ -512,7 +512,8 @@ xlate_to_uni(const unsigned char *name,
> >> >         int charlen;
> >> >
> >> >         if (utf8) {
> >> > -               *outlen = utf8s_to_utf16s(name, len, (wchar_t *)outname);
> >> > +               *outlen = utf8s_to_utf16s(name, len, UTF16_HOST_ENDIAN,
> >> > +                               (wchar_t *) outname, FAT_LFN_LEN + 2);
> >> Is there a reason why you add 2 to FAT_LFN_LEN?
> >
> > So that the "(*outlen > FAT_LFN_LEN)" test below will work correctly.
> > If the maximum length were set to FAT_LFN_LEN then the test would
> > always fail.  If the maximum length were set to FAT_LFN_LEN + 1 then
> > the test would fail when the next character to be stored was a
> > surrogate pair.
> Although we are using maxlen, I don't understand why we still check
> whether outlen is bigger than FAT_LFN_LEN.
Probably because the filesystem code requires that the UTF16 string fit
into a certain amount of space. If *outlen > FAT_LFN_LEN then the
string doesn't fit.
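
The reasoning behind the "+ 2" can be checked with a small user-space
sketch. This is only a model of the loop's output accounting, not the
kernel code: units_stored() and name_too_long() are hypothetical names,
and FAT_LFN_LEN is assumed to be 255 as in the kernel's FAT headers.

```c
#include <stddef.h>
#include <stdint.h>

#define FAT_LFN_LEN 255	/* assumed value of the kernel constant */

/*
 * Hypothetical model of the conversion loop: count how many 16-bit units
 * get stored for the code points in cps[] when the buffer holds maxout.
 */
static int units_stored(const uint32_t *cps, size_t n, int maxout)
{
	int used = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Code points above the BMP need a surrogate pair. */
		int need = cps[i] >= 0x10000 ? 2 : 1;

		if (maxout - used < need)
			break;
		used += need;
	}
	return used;
}

/* The caller's test: did the converted name exceed the allowed length? */
static int name_too_long(const uint32_t *cps, size_t n, int maxout)
{
	return units_stored(cps, n, maxout) > FAT_LFN_LEN;
}
```

With 255 ordinary characters followed by one character that needs a
surrogate pair, a limit of FAT_LFN_LEN or FAT_LFN_LEN + 1 stops at 255
units and the overlong name goes unnoticed; only FAT_LFN_LEN + 2 lets the
count exceed FAT_LFN_LEN so the test can catch it.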
Alan Stern