Re: [PATCH 06/53] docs: admin-guide: avoid using UTF-8 chars

From: Mauro Carvalho Chehab
Date: Wed May 12 2021 - 06:22:58 EST


On Wed, 12 May 2021 10:25:35 +0100
David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:

> On Wed, 2021-05-12 at 10:44 +0200, Mauro Carvalho Chehab wrote:
> > The main point here is that a large amount of those UTF-8 characters
> > appeared as result of document conversion from DocBook/LaTeX/Markdown.
> >
> > As the conversion ended, I don't expect the need of re-doing a series
> > like that in the near future.
> >
> > There are even some cases where the UTF-8 were doing wrong things, like
> > using an EN DASH instead of an hyphen in order to pass a command line
> > parameter, and the addition of non-printable BOM characters.
> >
> > So, IMO, this is a necessarily cleanup after the conversion.
>
> That part — fixing characters that are *wrong*, such as converting a
> UTF-8 U+2014 EM DASH to a UTF-8 U+002D HYPHEN-MINUS, is reasonable
> enough.
>
> But you're not "avoiding using UTF-8 chars" there, as it says in the
> title of this patch. HYPHEN-MINUS encoded as 0x2D *is* UTF-8.

Yeah, you're right: ASCII is a subset of UTF-8, just as it is a
subset of several other charsets[1].

[1] ASCII is a subset of all charsets mentioned at:
https://man7.org/linux/man-pages/man7/charsets.7.html

A more precise title would be something like:

	Use ASCII instead of non-ASCII UTF-8 alternate symbols
or
	Use ASCII subset instead of UTF-8 alternate symbols

See, the goal of this series is to address the cases where there are
multiple UTF-8 alternate symbols with the same meaning as the
original ASCII set. Most of them were introduced by tools like
DocBook/LaTeX/pandoc during document conversions[2], not by design,
but just because the UTF-8 non-ASCII symbols produce a nicer output
in html or pdf. In other words, it was a toolset decision to change
them, diverging from what the author originally typed.

[2] I suspect that a few of them could have been introduced as a result
of someone using a text editor like libreoffice (or equivalent),
that has a similar behavior.
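
To make the idea concrete, here is a minimal Python sketch of the kind
of substitution I mean. It is not the script used for this series: the
mapping table and the to_ascii_subset() helper are hypothetical and
cover only a handful of symbols, and each hit still needs manual review,
since some non-ASCII characters are there on purpose.

```python
# Hypothetical mapping, for illustration only: typographic symbols that
# conversion tools tend to emit, paired with their ASCII equivalents.
ASCII_EQUIVALENTS = {
    "\u2010": "-",   # HYPHEN
    "\u2013": "-",   # EN DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u00a0": " ",   # NO-BREAK SPACE
    "\ufeff": "",    # BOM (ZERO WIDTH NO-BREAK SPACE): dropped
}

_TABLE = str.maketrans(ASCII_EQUIVALENTS)


def to_ascii_subset(text: str) -> str:
    """Replace the mapped symbols; leave every other character untouched."""
    return text.translate(_TABLE)
```

For example, a command line option that was converted to use EN DASH
characters becomes usable again, while a legitimate accented letter in
an author's name is left alone.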

With ReST, there's no need to use any of those, as the build tools will
already do such conversions when generating html/pdf output.

So, it is better to stick with the ASCII subset in such cases, as it
makes tools like grep more useful and makes it easier to edit such
files in editors like vi, nano, emacs, etc.
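
As a rough illustration of the grep point, the sketch below lists every
non-ASCII character in a piece of text together with its Unicode name,
which makes look-alike symbols easy to spot. The find_non_ascii() name
is just something I made up for this example; it is not part of any
existing kernel script.

```python
import unicodedata


def find_non_ascii(text):
    """Yield (line, column, char, unicode_name) for each non-ASCII char.

    Lines and columns are 1-based. Lines are split on "\n" only, so
    Unicode line separators are reported instead of silently consumed.
    """
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                yield lineno, col, ch, unicodedata.name(ch, "<unnamed>")


if __name__ == "__main__":
    sample = "modprobe foo\nmodprobe \u2013\u2013help\n"
    for lineno, col, ch, name in find_non_ascii(sample):
        print(f"{lineno}:{col}: U+{ord(ch):04X} {name}")
```

A search for "--help" would miss the second line of the sample, which
is exactly the kind of problem this series is trying to remove.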

Thanks,
Mauro