Re: [RFC v2 18/38] docs: sphinx/kernel_abi: use AbiParser directly

From: Mauro Carvalho Chehab
Date: Sun Feb 16 2025 - 07:54:42 EST


Em Mon, 10 Feb 2025 17:03:54 +0100
Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> escreveu:

> Em Mon, 10 Feb 2025 07:40:02 -0700
> Jonathan Corbet <corbet@xxxxxxx> escreveu:
>
> > Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> writes:
> >
> > > I took a look on Markus work: it was licensed under GPL 3.0 and it was
> > > written in 2016. There were several changes on kerneldoc since them,
> > > including the addition of a regex that it is not compatible with
> > > Python re[1]:
> > >
> > > $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;
> > >
> > > This one use:
> > >
> > > - recursive patterns: ?1
> > > - atomic grouping (?>...)
> > >
> > > Also, it is hard to map what he does with the existing script. I'm
> > > opting to write a new script from scratch.
> >
> > That's fine, I just wanted to be sure you'd had a chance to look at
> > it...
> >
> > > Another option would be to re-implement such regexes without using
> > > such advanced patterns.
> >
> > Seems like a preferred option if that can be done. Banging one's head
> > against all those regexes is often the hardest part of dealing with that
> > script; anything that makes it simpler is welcome.
>
> Agreed. This one, in special, is very hard for me to understand, as I
> never used recursive patterns or atomic grouping. The net result of
> the struct_group*() handling is that it removes some parameters when
> generating the function prototype. This is done using a complex logic
> on two steps:
>
> # unwrap struct_group():
> # - first eat non-declaration parameters and rewrite for final match
> # - then remove macro, outer parens, and trailing semicolon
> $members =~ s/\bstruct_group\s*\(([^,]*,)/STRUCT_GROUP(/gos;
> $members =~ s/\bstruct_group_attr\s*\(([^,]*,){2}/STRUCT_GROUP(/gos;
> $members =~ s/\bstruct_group_tagged\s*\(([^,]*),([^,]*),/struct $1 $2; STRUCT_GROUP(/gos;
> $members =~ s/\b__struct_group\s*\(([^,]*,){3}/STRUCT_GROUP(/gos;
> $members =~ s/\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;/$2/gos;
>
>
> The first step basically eliminates some members of the function. At the
> places I checked, the second step was just removing parenthesis from the
> macro (and the STRUCT_GROUP name).
>
> I suspect that the same result could be done with a much simpler expression
> like:
>
> $members =~ s/\bSTRUCT_GROUP\((.*)\)[^;]*;/$2/gos;
>
> But maybe there are some corner cases that would cause such simpler
> regex to fail.

A simpler regex didn't work.

Instead of using Python regex module and keep such complex regular
expressions, I opted to implement it on a completely different way.
See patch below.

Basically, I implemented a new class (NestedMatch) just to
handle patterns like:

FOO(.*)

converting them into:

.*

I expect that the new logic there would work properly for any
number of [], {} and () paired delimiters.

I ended writing such class from scratch, based on a suggestion from
stackoverflow.

Using such logic, replacing STRUCT_GROUP(.*) by .* is as simple
as:

sub_nested_prefixes = [
(re.compile(r'\bSTRUCT_GROUP'), r'\1'),
]

nested = NestedMatch()

for search, sub in sub_nested_prefixes:
members = nested.sub(search, sub, members)

This should probably help cleaning up other similar patterns.
Let's see. Anyway, I expect that this would make the C parsing
part easier to maintain.

For the first version, I'll use the new way only when I notice
discrepancies between Perl and Python versions.

Thanks,
Mauro

---

[PATCH] scripts/kernel-doc.py: properly handle struct_group macros

Handing nested parenthesis with regular expressions is not an
easy task. It is even harder with Python's re module, as it
has a limited subset of regular expressions, missing more
advanced features.

We might use instead Python regex module, but still the
regular expressions are very hard to understand. So, instead,
add a logic to properly match delimiters.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>

diff --git a/scripts/kernel-doc.py b/scripts/kernel-doc.py
index 9b10fd05b6af..df824692c4e7 100755
--- a/scripts/kernel-doc.py
+++ b/scripts/kernel-doc.py
@@ -92,6 +92,138 @@ class Re:
def group(self, num):
return self.last_match.group(num)

+class NestedMatch:
+ """
+ Finding nested delimiters is hard with regular expressions. It is
+ even harder on Python with its normal re module, as there are several
+ advanced regular expressions that are missing.
+
+ This is the case of this pattern:
+
+ '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
+
+ which is used to properly match open/close parenthesis of the
+ string search STRUCT_GROUP(),
+
+ Add a class that counts pairs of delimiters, using it to match and
+ replace nested expressions.
+
+ The original approach was suggested by:
+ https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+
+ Although I re-implemented it to make it more generic and match 3 types
+ of delimiters. The logic checks if delimiters are paired. If not, it
+ will ignore the search string.
+ """
+
+ DELIMITER_PAIRS = {
+ '{': '}',
+ '(': ')',
+ '[': ']',
+ }
+
+ RE_DELIM = re.compile(r'[\{\}\[\]\(\)]')
+
+ def _search(self, regex, line):
+ """
+ Finds paired blocks.
+
+ If a given regex is not empty, the logic will first seek for its
+ position, and then handles the next delimiters afterwards
+
+ The suggestion of using finditer to match pairs came from:
+ https://stackoverflow.com/questions/5454322/python-how-to-match-nested-parentheses-with-regex
+ but I ended using a different implementation to align all three types
+ of delimiters and seek for an initial regular expression.
+
+ The algorithm seeks for open/close paired delimiters and place them
+ into a stack, yielding a start/stop position of each match when the
+ stack is zeroed.
+
+ The algorithm shoud work fine for properly paired lines, but will
+ silently ignore end delimiters that preceeds an start delimiter.
+ This should be OK for kernel-doc parser, as unaligned delimiters
+ would cause compilation errors. So, we don't need to rise exceptions
+ to cover such issues.
+ """
+
+ stack = []
+
+ for match_re in regex.finditer(line):
+ offset = match_re.end()
+
+ for match in self.RE_DELIM.finditer(line[offset:]):
+ pos = match.start() + offset
+
+ d = line[pos]
+
+ if d in self.DELIMITER_PAIRS.keys():
+ end = self.DELIMITER_PAIRS[d]
+
+ stack.append(end)
+ continue
+
+ # Does the end delimiter match what it is expected?
+ if stack and d == stack[-1]:
+ stack.pop()
+
+ if not stack:
+ yield offset, pos + 1
+
+ def search(self, regex, line):
+ """
+ This is similar to re.search:
+
+ It matches a regex that it is followed by a delimiter,
+ returning occurrences only if all delimiters are paired.
+ """
+
+ for prev_pos, pos in self._search(regex, line):
+
+ yield line[prev_pos:pos]
+
+ def sub(self, regex, sub, line, count=0):
+ """
+ This is similar to re.sub:
+
+ It matches a regex that it is followed by a delimiter,
+ replacing occurrences only if all delimiters are paired.
+
+ if r'\1' is used, it works just like re: it places there the
+ matched paired data with the delimiter stripped.
+
+ If count is different than zero, it will replace at most count
+ items.
+ """
+ out = ""
+
+ cur_pos = 0
+ n = 0
+
+ for start, end in self._search(regex, line):
+ out += line[cur_pos:start]
+
+ # Value, ignoring start/end delimiters
+ value = line[start + 1:end - 1]
+
+ # replaces \1 at the sub string, if \1 is used there
+ new_sub = sub
+ new_sub = new_sub.replace(r'\1', value)
+
+ out += new_sub
+
+ cur_pos = end
+ n += 1
+
+ if count and count >= n:
+ break
+
+ # Append the remaining string
+ l = len(line)
+ out += line[cur_pos:l]
+
+ return out
+
#
# Regular expressions used to parse kernel-doc markups at KernelDoc class.
#
@@ -667,19 +799,40 @@ class KernelDoc:
# Unwrap struct_group() based on this definition:
# __struct_group(TAG, NAME, ATTRS, MEMBERS...)
# which has variants like: struct_group(NAME, MEMBERS...)
+ #
+ # Parsing them are done on two steps:
+ # 1. drop arguments that aren't struct names;
+ # 2. remove STRUCT_GROUP() ancillary macro.
+ #
+ # NOTE: the original logic which replaces STRUCT_GROUP(.*) by .*
+ # is incompatible with Python re, as it uses:
+ #
+ # - a recursive pattern: (?1)
+ # - an atomic grouping: (?>...)
+ #
+ # This is the original expression:
+ # '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
+ #
+ # I tried a simpler version: but it didn't work either:
+ # \bSTRUCT_GROUP\(([^\)]+)\)[^;]*;
+ #
+ # As it doesn't properly match the end parenthesis.
+ #
+ # So, a better solution was crafted: there's now a NestedMatch
+ # class that ensures that delimiters after a search are properly
+ # matched. So, the implementation to drop STRUCT_GROUP() will be
+ # handled in separate.

(Re(r'\bstruct_group\s*\(([^,]*,)', re.S), r'STRUCT_GROUP('),
(Re(r'\bstruct_group_attr\s*\(([^,]*,){2}', re.S), r'STRUCT_GROUP('),
(Re(r'\bstruct_group_tagged\s*\(([^,]*),([^,]*),', re.S), r'struct \1 \2; STRUCT_GROUP('),
(Re(r'\b__struct_group\s*\(([^,]*,){3}', re.S), r'STRUCT_GROUP('),

- # This is incompatible with Python re, as it uses:
- # recursive patterns ((?1)) and atomic grouping ((?>...)):
- # '\bSTRUCT_GROUP(\(((?:(?>[^)(]+)|(?1))*)\))[^;]*;'
- # Let's see if this works instead:
- (Re(r'\bSTRUCT_GROUP\(([^\)]+)\)[^;]*;', re.S), r'\1'),
-
# Replace macros
+ #
+ # TODO: it is better to also move those to the NestedMatch logic,
+ # to endure that parenthesis will be properly matched.
+
(Re(r'__ETHTOOL_DECLARE_LINK_MODE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, __ETHTOOL_LINK_MODE_MASK_NBITS)'),
(Re(r'DECLARE_PHY_INTERFACE_MASK\s*\(([^\)]+)\)', re.S), r'DECLARE_BITMAP(\1, PHY_INTERFACE_MODE_MAX)'),
(Re(r'DECLARE_BITMAP\s*\(' + args_pattern + r',\s*' + args_pattern + r'\)', re.S), r'unsigned long \1[BITS_TO_LONGS(\2)]'),
@@ -691,9 +844,18 @@ class KernelDoc:
(Re(r'DEFINE_DMA_UNMAP_LEN\s*\(' + args_pattern + r'\)', re.S), r'__u32 \1'),
]

+ sub_nested_prefixes = [
+ (re.compile(r'\bSTRUCT_GROUP'), r'\1'),
+ ]
+
for search, sub in sub_prefixes:
members = search.sub(sub, members)

+ nested = NestedMatch()
+
+ for search, sub in sub_nested_prefixes:
+ members = nested.sub(search, sub, members)
+
# Keeps the original declaration as-is
declaration = members