Re: [RFC PATCH 0/2] docs: automarkup.py: Add literal markup of known constants

From: Mauro Carvalho Chehab
Date: Wed Jun 09 2021 - 03:45:44 EST


Hi Nícolas,

Em Tue, 8 Jun 2021 22:43:06 -0300
Nícolas F. R. A. Prado <n@xxxxxxxxxxxxx> escreveu:

> Hi,
>
> a while back Matthew suggested adding automatic markup of known constants, like
> NULL and error codes, as literals [1]. This is a great idea since those literals
> are very frequently used throughout the documentation and are fixed names, so we
> can easily match them without false positives and make the source files less
> cluttered.
>
> Patch 1 adds that feature to automarkup.py, while patch 2 removes the no longer
> needed markup from the xarray.rst file which was given as an example.
>
> Some things I'd like to discuss:
>
> * As a first draft I added the constants I found on xarray.rst plus all error
> codes from errno-base.h and errno.h. We don't need to start with everything,
> but we can certainly do better than this. What are common constants that
> should be added here? (Matthew mentioned ANSI C or POSIX, some reference to
> the constants in those would be great, even more if they are easily parsable)
>
> * The Literals list added is already a bit big with just the error codes, so
> perhaps we should move them to a separate plain text file that is read on
> startup?

If we're willing to keep a long list, they should be parsed from files,
instead of directly included. For things like errno-base.h/errno.h, the
better would be if automarkup would read them directly from the header
file, as otherwise there will be a maintainance burden.

Yet, a regex like "-E[A-Z\d]+\b" would likely get all cases, but this
will produce false positives, like EXAMPLE, EXAMPLES and some other
non-trivial exceptions.

With regards to NULL, I would just use a trivial regex like: "\bNULL\b"
(but see below my notes about using "\b").

> * There was also mention of automatically converting uppercase words to
> literals. I'm not so sure about that one so I left it out for now.
>
> The example given was XA_MARK_0, which is a #define in xarray.h. The way
> to go here is certainly to first use kernel-doc to get it included in the
> documentation, and then cross-reference to it. FWICT at the moment kernel-doc
> for defines should be done like for functions (but the parenthesis can be
> omitted, although they show up in the output):
>
> /**
> * XA_MARK_0 - Brief description.
> *
> * Description of the type.
> */
>
> At the current state, the cross-reference would then need to be done either
> through plain :c:macro:`XA_MARK_0`, or, by relying on automarkup, typedef
> XA_MARK_0 (which is not ideal since this isn't a typedef). Options for
> improvements would be to add 'macro' as a possible prefix (so eg. macro
> XA_MARK_0), or go all out and try to cross-reference on every uppercase word
> like suggested. We should strive for what is the most natural to write, as
> long as the regex isn't too broad.
>
> Since automarkup.py is opt-out rather than opt-in, I think we should be extra
> careful not to make the regexes too broad, even if in this case it relies on a
> C symbol being present.
>
> One other idea I had was that we could limit to all uppercase words as
> long as it has an _ on it, since I don't expect we would get false positives
> with that. The downside is that it might be very confusing to people why their
> MACRONAME isn't being automatically cross-referenced to...

What it can be done would be to first check/apply cross-references. Only if it
fails, then replace everything in uppercase to literals.

There is a drawback, tough: this would cause texts in uppercase. We used to
have lots of them before ReST conversion. I won't doubt that some files
would have kept a few of them. So, in this case, the regex would need to
require at least one _, e. g. something like:

\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b

Again, tests are needed in order to check if the regex won't get anything
that shouldn't be caught. So, I would grep the source codes in order to
check if the regexes won't bring false positives, e. g.:

$ git grep -E "\b[\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*\b" Documentation/


Btw, the "\b" at the end won't actually work, due to things like:

GLOBAL_GEN_STORAGE{0:3}
GAUDI_ENGINE_ID_*

If you take a look at the scripts/get_abi.pl that I wrote, I faced
the same problem there with \b. So, I ended implementing my own set
of \b:

my $start = qr {(^|\s|\() }x;
my $bondary = qr { ([,.:;\)\s]|\z) }x;

Using the above replacements for \b, I would do something like this to
double-check if the regex won't be producing false-positives:

$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$1 \e[0;31m$2\e[0;37m $3\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i; done

Btw, on a quick look, this specific regex:

(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)

Seems to be producing just one false positive: a sequence of _, e. g.:

\b\_+\b

While we could make the regex more complex, I would just test if the second
match is '^\_+$', skipping it if found. On other words, a python-equivalent
to this:

$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq

-

That's said, it is worth to also mention the false negatives:

$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE "^\_+$"; done|sort|uniq >OK_list
$ for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -vE " "; done|sort|uniq >FULL_list
$ grep -v -f OK_list FULL_list >excluded_list
$ wc -l OK_list FULL_list excluded_list
7398 OK_list
14487 FULL_list
7520 excluded_list
29405 total

There are a lot of constants that won't be parsed if we require at least
one '_'. Looking at the excluded_list, indeed there are things there
which should be excluded, like EXAMPLE, EXAMPLES, EXCEPTION, WITH, WITHIN,
WITHOUT..., but also there are a several matches that are constants,
like KERNELRELEASE, DISCONTIGMEM, SIOCGIFINDEX, SIGALRM, SETREGSET,
CDROMREADCOOKED, CDROMREADRAW, ...

I doubt that there would be an easy way to handle such cases, as
a file with a ~7000 constants would be a maintenance nightmare.

---

In summary, my suggestion is that we should stay out of having a big list
of constants. So, I would start with:

1. a regex to get FOO_BAR cases, like:

(^|\s|\()([\-]?[A-Z_][A-Z\d_]*_[A-Z\d_]*)([,.:;\)\s]|\z)

with a test to exclude this pattern:

^\_+$

2. a logic that will pick and use errno codes from the errno*.h
files;

3. an specific handler handler for the NULL special case.

This would get 7000+ different constants, which seems a very good
start. If needed, later we could get a few other relevant constants
by checking the most-used ones with something like:

for i in $(find Documentation/ -name '*.rst'); do perl -ne 'print "$2\n" if /(^|\s|\()([\-]?[A-Z_][A-Z\d_]*)([,.:;\)\s]|\z)/' $i|grep -v "_"; done|sort|uniq -c|sort -n

in order to grab the most common ones.

Regards,
Mauro