Re: [PATCH 09/17] Move unicode to ASCII conversion to sharedfunction.

From: Adam Borowski
Date: Thu Sep 19 2013 - 19:02:51 EST


On Wed, Sep 18, 2013 at 09:48:44PM -0700, Roy Franz wrote:
> On Wed, Sep 18, 2013 at 8:44 PM, Adam Borowski <kilobyte@xxxxxxxxxx> wrote:
> > [UCS2 truncation]
>
> I stuck to re-arranging the code that was there, as I don't know enough
> about character encodings to propose changes.

I on the other hand don't know the kernel (lurking because of my first
patch), but I'm on a crusade against mangled Unicode (so far in the
userland). Can't let such a blatant error slip through on my watch :)

> Also, this code is running as part of the kernel decompressor, rather than
> the kernel itself, so it doesn't have access to any kernel facilities, and
> it also needs to be position independent.

Ok, so it can't reuse common libraries. No problem, a simplified, sanitized
and optimized copy of utf16s_to_utf8s() can be done in quite less code than
the original.

> It's running in a quite limited environment - the decompressor has
> its own copy of strstr(), and other string functions.

I'd need nothing but a way to alloc the new string. And I see this is
already done (efi_{low,high_alloc()).

> I checked the UEFI specification, and it states that all 16 bit strings
> are UCS-2, unless otherwise noted.

... which means it will either get upgraded to UTF-16 in a subsequent
version, or some Unicode strings get mangled. I'd ignore this bit and
implement full UTF-16 from the start: every legal UCS-2 string can be
decoded as UTF-16 so it's a strict superset.

> The load options that the command line is provided through a void pointer
> specified as: [snip]

Either a null pointer or a 16-bit string, that sounds clear enough.

I see not a word about endianness (does anything do EFI on big endian?),
but "same as host" seems to be a reasonable assumption.

> Would it be acceptable to fix the naming/comments, and convert values
> above 126 to '?' in the current patchset, and address a more thorough fix
> in another patch set? The ARM and ARM64 EFI stub patchsets that are
> mostly complete depend on this one, so getting this merged soon would be
> helpful.

I don't want to hinder your work, so what about putting in your version
as-is and fixing it later?

> > There's just one problem: which encoding to use, but
> > these days, most distributions have either dropped non-UTF8 or hardly pay
> > lip service, so we could get away with hard-coding UTF-8: those few who
> > use ancient charsets can stick to ASCII.

Not being able to use regular kernel facilities makes supporting ancient
charsets a lost cause. I'm so weeping about them... not.

> I would certainly appreciate your help improving this

Are we on the same page so far? If so, I can make a patch atop yours.

--
ááááááááááááááááááááá
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/