Re: [PATCH] NLS: improve UTF8 -> UTF16 string conversion routine

From: NamJae Jeon
Date: Sat Nov 19 2011 - 09:13:59 EST


2011/11/19 Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>:
> On Fri, 18 Nov 2011, NamJae Jeon wrote:
>
>> 2011/11/18 Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>:
>> > The utf8s_to_utf16s conversion routine needs to be improved.  Unlike
>> > its utf16s_to_utf8s sibling, it doesn't accept arguments specifying
>> > the maximum length of the output buffer or the endianness of its
>> > 16-bit output.
>> >
>> > This patch (as1501) adds the two missing arguments, and adjusts the
>> > only two places in the kernel where the function is called.  A
>> > follow-on patch will add a third caller that does utilize the new
>> > capabilities.
>> >
>> > The two conversion routines are still annoyingly inconsistent in the
>> > way they handle invalid byte combinations.  But that's a subject for a
>> > different patch.
>
>> > Index: usb-3.2/fs/nls/nls_base.c
>> > ===================================================================
>> > --- usb-3.2.orig/fs/nls/nls_base.c
>> > +++ usb-3.2/fs/nls/nls_base.c
>> > @@ -114,34 +114,57 @@ int utf32_to_utf8(unicode_t u, u8 *s, in
>> >  }
>> >  EXPORT_SYMBOL(utf32_to_utf8);
>> >
>> > -int utf8s_to_utf16s(const u8 *s, int len, wchar_t *pwcs)
>> > +static inline void put_utf16(wchar_t *s, unsigned c, enum utf16_endian endian)
>> > +{
>> > +       switch (endian) {
>> > +       default:
>> > +               *s = (wchar_t) c;
>> > +               break;
>> > +       case UTF16_LITTLE_ENDIAN:
>> > +               *s = __cpu_to_le16(c);
>> > +               break;
>> > +       case UTF16_BIG_ENDIAN:
>> > +               *s = __cpu_to_be16(c);
>> > +               break;
>> > +       }
>> > +}
>> > +
>> > +int utf8s_to_utf16s(const u8 *s, int len, enum utf16_endian endian,
>> > +               wchar_t *pwcs, int maxlen)
>> >  {
>> >         u16 *op;
>> >         int size;
>> >         unicode_t u;
>> >
>> >         op = pwcs;
>> > -       while (*s && len > 0) {
>> > +       while (len > 0 && maxlen > 0 && *s) {
>> >                 if (*s & 0x80) {
>> >                         size = utf8_to_utf32(s, len, &u);
>> >                         if (size < 0)
>> >                                 return -EINVAL;
>> > +                       s += size;
>> > +                       len -= size;
>> Why did you move this code here?
>
> Mainly in order to keep the counter updates near the place where the
> character is read.  Also, in an earlier version of the patch, I used a
> "continue" instead of the "break" statement three lines below.  For
> that to work, the updates to s and len had to be moved up here.
>
>> >                         if (u >= PLANE_SIZE) {
>> > +                               if (maxlen < 2)
>> > +                                       break;
>> >                                 u -= PLANE_SIZE;
>> > -                               *op++ = (wchar_t) (SURROGATE_PAIR |
>> > -                                               ((u >> 10) & SURROGATE_BITS));
>> > -                               *op++ = (wchar_t) (SURROGATE_PAIR |
>> > +                               put_utf16(op++, SURROGATE_PAIR |
>> > +                                               ((u >> 10) & SURROGATE_BITS),
>> > +                                               endian);
>> > +                               put_utf16(op++, SURROGATE_PAIR |
>> >                                                 SURROGATE_LOW |
>> > -                                               (u & SURROGATE_BITS));
>> > +                                               (u & SURROGATE_BITS),
>> > +                                               endian);
>> > +                               maxlen -= 2;
>>
>> Why did you use a constant value (-2) instead of "maxlen -= size"?
>
> "maxlen -= size" would be completely wrong, because size is the length
> of the utf8 input and maxlen is the number of 16-bit slots remaining
> in the output buffer.  A surrogate pair uses two 16-bit values,
> therefore maxlen has to be decreased by 2.
If so, shouldn't len also be decreased by a constant 2, like maxlen?
And why is the "if (maxlen < 2)" check needed? Is it for when len is smaller than 2?
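
For reference, the bookkeeping under discussion can be sketched in userspace
as follows. This is not the kernel code: put_units() is a hypothetical
helper, and the constants are assumed to match the PLANE_SIZE and
SURROGATE_* definitions in fs/nls/nls_base.c. The point it illustrates is
that the output cost of a character (one or two 16-bit units) does not
depend on how many UTF-8 bytes were consumed to read it, which is also why
a surrogate pair needs at least two free slots before anything is stored:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the definitions in fs/nls/nls_base.c. */
#define PLANE_SIZE      0x00010000
#define SURROGATE_PAIR  0x0000d800
#define SURROGATE_LOW   0x00000400
#define SURROGATE_BITS  0x000003ff

/*
 * Hypothetical helper: store code point u as host-endian UTF-16 in out[],
 * which has room for maxlen 16-bit units.  Returns the number of units
 * written (1 or 2), or 0 if the character does not fit.
 */
int put_units(uint32_t u, uint16_t *out, int maxlen)
{
	assert(out != NULL);

	if (u >= PLANE_SIZE) {
		if (maxlen < 2)		/* no room for both halves of the pair */
			return 0;
		u -= PLANE_SIZE;
		out[0] = (uint16_t)(SURROGATE_PAIR | ((u >> 10) & SURROGATE_BITS));
		out[1] = (uint16_t)(SURROGATE_PAIR | SURROGATE_LOW |
				    (u & SURROGATE_BITS));
		return 2;
	}
	if (maxlen < 1)
		return 0;
	out[0] = (uint16_t)u;
	return 1;
}
```

U+1F600 is four bytes of UTF-8 (f0 9f 98 80), so the input side advances by
size = 4 while the output side uses 2 units. U+20AC is three bytes of UTF-8
but uses only 1 output unit.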

>> >                         } else {
>> > -                               *op++ = (wchar_t) u;
>> > +                               put_utf16(op++, u, endian);
>> > +                               maxlen--;
>> >                         }
>> > -                       s += size;
>> > -                       len -= size;
>> >                 } else {
>> > -                       *op++ = *s++;
>> > +                       put_utf16(op++, *s++, endian);
>> >                         len--;
>> > +                       maxlen--;
>> >                 }
>> >         }
>> >         return op - pwcs;
>> > Index: usb-3.2/fs/fat/namei_vfat.c
>> > ===================================================================
>> > --- usb-3.2.orig/fs/fat/namei_vfat.c
>> > +++ usb-3.2/fs/fat/namei_vfat.c
>> > @@ -512,7 +512,8 @@ xlate_to_uni(const unsigned char *name,
>> >         int charlen;
>> >
>> >         if (utf8) {
>> > -               *outlen = utf8s_to_utf16s(name, len, (wchar_t *)outname);
>> > +               *outlen = utf8s_to_utf16s(name, len, UTF16_HOST_ENDIAN,
>> > +                               (wchar_t *) outname, FAT_LFN_LEN + 2);
>> Is there a reason why you add 2 to FAT_LFN_LEN?
>
> So that the "(*outlen > FAT_LFN_LEN)" test below will work correctly.
> If the maximum length were set to FAT_LFN_LEN then the test would
> always fail.  If the maximum length were set to FAT_LFN_LEN + 1 then
> the test would fail when the next character to be stored was a
> surrogate pair.
Since we are now using maxlen, I don't understand why we still check
whether outlen is bigger than FAT_LFN_LEN.
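
The "+ 2" reasoning quoted above can be checked with a small userspace model
of the loop. This is only a sketch: count_units() is a hypothetical stand-in
for the converter, LIMIT is a tiny stand-in for FAT_LFN_LEN, and each input
character is reduced to the number of 16-bit units it needs.

```c
#include <assert.h>
#include <stddef.h>

#define LIMIT 4		/* hypothetical stand-in for FAT_LFN_LEN */

/*
 * Model of the conversion loop: widths[i] is the number of 16-bit units
 * character i needs (1, or 2 for a surrogate pair).  Returns how many
 * units were stored before the input or the output space ran out.
 */
int count_units(const int *widths, int n, int maxlen)
{
	int stored = 0;
	int i;

	assert(widths != NULL);

	for (i = 0; i < n && maxlen > 0; i++) {
		if (widths[i] == 2) {
			if (maxlen < 2)	/* pair does not fit: stop early */
				break;
			maxlen -= 2;
			stored += 2;
		} else {
			maxlen--;
			stored++;
		}
	}
	return stored;
}
```

Take a name of LIMIT single-unit characters followed by one surrogate pair,
i.e. a name that is too long. With maxlen = LIMIT the result is LIMIT, so a
"> LIMIT" test can never be true. With maxlen = LIMIT + 1 the result is
still LIMIT, because the pair does not fit in the one remaining slot. Only
with maxlen = LIMIT + 2 does the result exceed LIMIT, so the over-length
name is detected.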
>
>> >                 if (*outlen < 0)
>> >                         return *outlen;
>> >                 else if (*outlen > FAT_LFN_LEN)
>> >                         return -ENAMETOOLONG;
>> Is the "else if (*outlen > FAT_LFN_LEN)" code needed? Is there a case
>> where *outlen exceeds FAT_LFN_LEN with your patch?
>
> I have no idea.  That test was already there, I didn't add or change it.
>
>> Thanks.
>
> Alan Stern
>
>