[PATCH v2 00/13] vt: implement proper Unicode handling

From: Nicolas Pitre
Date: Tue Apr 15 2025 - 15:22:49 EST


The Linux VT console has many problems with regard to proper Unicode
handling:

- All new double-width Unicode code points which have been introduced since
Unicode 5.0 are not recognized as such (we're at Unicode 16.0 now).

- Zero-width code points are not recognized at all. If you try to edit files
containing a lot of emojis, you will see the rendering issues. When a line
contains many zero-width characters (like "variation selectors"), it
occupies more columns than expected and wraps, while any Unicode-aware
editor assumes that the content was rendered properly, so its rendering
logic breaks down. Combine this with tmux or screen, and the terminal
becomes a huge mess.

- Also, text that uses combining diacritics has the same effect as text
with zero-width characters, because programs expect the characters to take
fewer columns than they actually do.
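
To illustrate the mismatch described above, here is a minimal Python sketch
comparing the column count a Unicode-aware program would expect with the
count obtained when every code point advances the cursor by one cell. The
function names are hypothetical, the category-based zero-width test is an
approximation, and naive_columns() is only a simplified model of a console
that ignores zero-width code points, not the actual vt.c logic.

```python
import unicodedata

def expected_columns(text):
    """Columns a Unicode-aware program would expect `text` to occupy."""
    cols = 0
    for ch in text:
        # Assumption: marks (Mn, Me) and format characters (Cf) take no
        # column. This covers combining diacritics, variation selectors
        # and the zero-width joiner, but it is only an approximation.
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # East Asian Wide and Fullwidth characters take two columns.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

def naive_columns(text):
    """Simplified model: every code point advances the cursor by one cell."""
    return len(text)

# 'e' followed by U+0301 COMBINING ACUTE ACCENT
sample = "e\u0301"
print(expected_columns(sample), naive_columns(sample))
```

Any difference between the two counts is the number of columns by which the
program's idea of the cursor position drifts from what is on screen.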

Some may argue that the Linux VT console is unmaintained and/or not used
much any longer and that one should consider a user space terminal
alternative instead. But every such alternative that is at least as well
maintained as the Linux VT console requires a full, heavy graphical
environment, and that is the exact antithesis of what the Linux console is
meant to be.

Furthermore, there is a significant Linux console user base represented by
blind users (of which I am a member) for whom the alternatives are way more
cumbersome to use, reducing our productivity. So it has to stay and
be maintained to the best of our abilities.

That being said...

This patch series is about fixing all the above issues. This is accomplished
with some Python scripts leveraging Python's unicodedata module to generate
C code with lookup tables that is suitable for the kernel. In summary:

- The double-width code point table is updated to the latest Unicode version
and the table itself is optimized to reduce its size.

- A zero-width code point table is created and the console code is modified
to properly use it.

- A table with base character + combining mark pairs is created to convert
them into their precomposed equivalents when they're encountered.
By default, the generated table contains only the most commonly used Latin,
Greek, and Cyrillic recomposition pairs, but one can execute the provided
script with the --full argument to create a table that covers all
possibilities. Combining marks that are not listed in the table are simply
treated like zero-width code points and properly ignored.

- All those tables plus related lookup code require about 3500 additional
bytes of text, which is not very significant these days. Yet, one
can still set CONFIG_CONSOLE_TRANSLATIONS=n to configure this all out
if need be.
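
For readers unfamiliar with this approach, the following is a rough Python
sketch of how such tables could be derived from the unicodedata module. It
is not the generator included in this series: the function names, the
"struct ucs_range" type, and the category-based zero-width test are all
assumptions made for illustration, and no size optimization is attempted.

```python
import unicodedata

def is_zero_width(cp):
    # Assumption: marks and format characters are treated as zero-width.
    return unicodedata.category(chr(cp)) in ("Mn", "Me", "Cf")

def is_double_width(cp):
    # East Asian Wide and Fullwidth code points occupy two columns.
    return unicodedata.east_asian_width(chr(cp)) in ("W", "F")

def build_ranges(predicate, limit=0x110000):
    """Collapse matching code points into inclusive (first, last) ranges."""
    ranges = []
    start = None
    for cp in range(limit):
        if predicate(cp):
            if start is None:
                start = cp
        elif start is not None:
            ranges.append((start, cp - 1))
            start = None
    if start is not None:
        ranges.append((start, limit - 1))
    return ranges

def emit_c_table(name, ranges):
    """Render ranges as a C array; 'struct ucs_range' is hypothetical."""
    lines = [f"static const struct ucs_range {name}[] = {{"]
    lines += [f"\t{{ 0x{first:04X}, 0x{last:04X} }}," for first, last in ranges]
    lines.append("};")
    return "\n".join(lines)

def recomposition_pairs(limit=0x110000):
    """Map (base, combining mark) pairs to their precomposed code point."""
    pairs = {}
    for cp in range(limit):
        decomp = unicodedata.decomposition(chr(cp)).split()
        # Keep canonical two-element decompositions only (no "<tag>").
        if len(decomp) != 2 or decomp[0].startswith("<"):
            continue
        base, mark = (int(x, 16) for x in decomp)
        if not unicodedata.combining(chr(mark)):
            continue
        # Keep only pairs that NFC actually recomposes to this code point.
        if unicodedata.normalize("NFC", chr(base) + chr(mark)) == chr(cp):
            pairs[(base, mark)] = cp
    return pairs

if __name__ == "__main__":
    print(emit_c_table("ucs_zero_width", build_ranges(is_zero_width, 0x400)))
```

Sorted, non-overlapping ranges like these lend themselves to a binary
search at lookup time, and the result depends on the Unicode version
shipped with the Python interpreter that runs the script.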

Note: The generated C code makes scripts/checkpatch.pl complain about
"... exceeds 100 columns" because the inserted comments with code
point names, well, make some lines exceed 100 columns. Please make
an exception for those files and disregard those warnings. When
checkpatch.pl is used on those files directly with -f then it doesn't
complain.

This series was tested on top of v6.15-rc2.

Changes from v1 (https://lkml.org/lkml/2025/4/9/1952):

- Moved most of the C functions out of the Python generator, leaving only
lookup tables to be generated as C code

- Cleaned up the Python code

- Unicode processing in vt.c moved to a function of its own

- Folded bug fixes into the series, fixed style, typos, etc.

Thanks to Jiri Slaby for the review.

diffstat:
 drivers/tty/vt/Makefile                   |   3 +-
 drivers/tty/vt/consolemap.c               |   2 -
 drivers/tty/vt/gen_ucs_recompose_table.py | 255 +++++++++++++
 drivers/tty/vt/gen_ucs_width_table.py     | 299 ++++++++++++++++
 drivers/tty/vt/ucs.c                      | 156 ++++++++
 drivers/tty/vt/ucs_recompose_table.h      | 102 ++++++
 drivers/tty/vt/ucs_width_table.h          | 453 ++++++++++++++++++++++++
 drivers/tty/vt/vt.c                       | 138 +++++---
 include/linux/consolemap.h                |  18 +
 9 files changed, 1376 insertions(+), 50 deletions(-)