RE: [PATCH] udf: rework name conversions to fix multi-bytes characters support

From: Andrew Gabbasov
Date: Fri Dec 18 2015 - 06:26:56 EST


Hello Jan,

> -----Original Message-----
> From: Jan Kara [mailto:jack@xxxxxxx]
> Sent: Wednesday, December 16, 2015 2:26 PM
> To: Gabbasov, Andrew
> Cc: Jan Kara; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH] udf: rework name conversions to fix multi-bytes
characters support
>
> On Fri 11-12-15 05:44:32, Andrew Gabbasov wrote:
> > Current implementation has several issues in unicode.c, mostly related
> > to handling multi-bytes characters in file names:
>
> Thanks for looking into these problems! Did you find them by code
> inspection or do you actually have some problematic fs images?

I got an image from our customer, that causes the kernel crash when trying
to make 'ls' on some of the folders. During fixing that particular issue
I found the other related problems.

That image contains file names with Greek and similar characters (2- and
3-bytes
UTF-8 codes) and of full lengths of 255 bytes for File Identifier fields.

> > - loop ending conditions in udf_CS0toUTF8 and udf_CS0toNLS functions do
not
> > properly catch the end of output buffer in case of multi-bytes
characters,
> > allowing out-of-bounds writing and memory corruption;
>
> Can you provide concrete example please? Do you mean the problem that ustr
> can hold only UDF_NAME_LEN-2 characters and the code seems to assume it
can
> hold UDF_NAME_LEN characters?

It's hard to say what the code assumes ;-) but here are some more details:
udf_CS0toUTF8 and udf_CS0toNLS functions have the following loop ending
condition:
(i < ocu_len) && (utf_o->u_len <= (UDF_NAME_LEN - 3));
It seems to assume that the output length is no more than UDF_NAME_LEN-2
(which is correct) but leaves the space for only 1 more byte to output.
However, the character to output can take 2 or 3 (or even more in NLS case)
bytes,
and writing them to the output buffer will corrupt utf_o->u_len field (which
is located just after the buffer) and can continue to write to incorrect
index, etc.

> > - udf_UTF8toCS0 and udf_NLStoCS0 do not check the right boundary of
output
> > buffer at all, also allowing out-of-bounds writing and memory
corruption;
>
> Ah, I guess you mean the case when maxval == 0xffff and we can possibly
> encode one character of input into two characters of output, right?

Yes, you are right. If, for example, the name contains at least one
character
translated to Unicode value longer than 1 byte (causing maxval to be 0xffff)
and many (say, almost full UDF_NAME_LEN) single bytes characters, the output
buffer will be almost two times longer than the input buffer and than
allowed.

> > - udf_translate_to_linux does not take into account multi-bytes
characters
> > at all (although it is called after converting to UTF8 or NLS): maximal
> > length of extension is counted as 5 bytes, that may be incorrect with
> > multi-bytes characters; when inserting CRC and extension for long names
> > (near the end of the buffer), they are inserted at fixed place at the
end,
> > that can break into the middle of the multi-bytes character;
> >
> > - when being converted from CS0 to UTF8 (or NLS), the name can be
truncated
> > (even if the sizes in bytes of input and output buffers are the same),
> > but the following translating function does not know about it and does
not
> > insert CRC, as it is assumed by the specs.
> >
> > Because of the last item above, it looks like all the checks and
> > conversions (re-coding and possible CRC insertions) should be done
> > simultaneously in the single function. This means that the listed
> > issues can not be fixed independently and separately. So, the whole
> > conversion and translation support should be reworked.
> >
> > The proposed implementation below fixes the listed issues, and also has
> > some additional features:
> >
> > - it gets rid of "struct ustr", since it actually just makes an unneeded
> > extra copying of the buffer and does not have any other significant
> > advantage;
> >
> > - it unifies UTF8 and NLS conversions support, since there is no much
> > sense to separate these cases;
> >
> > - UDF_NAME_LEN constant adjusted to better reflect actual restrictions.
>
> So I like the changes you do but this has to be split in several patches
> because in current form the patch is very difficult to review. I suggest
> something like following steps:
>
> 1) get rid of ustr
> 2) change definition of UDF_NAME_LEN, introduce UDF_NAME_LEN_CS0
> 3) join functions for NLS & UTF8 conversion
> 4) fix boundary checks
> 5) merge udf_translate_to_linux() into conversion from CS0 (frankly, I'm
> not very happy with the complexity of the resulting function and would
love
> to split it up but I don't see any natural split).

Yes, I was thinking about it too.
I'll try to split the change into several patches (may be in different
order)
and submit V2 of the commit set.

As for the resulting conversion function, I also think it's too long and
complicated. For the sake of efficiency I tried to make all necessary
calculations and conversions within a single pass, which made
understandability
and simplicity suffer. I'll try to think more about it, but so far I also
don't see any way to simplify the function.

> Thanks!
> Honza
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR

Thanks.

Best regards,
Andrew


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/