Re: [PATCH 00/38] docs: several improvements to kernel-doc

From: Mauro Carvalho Chehab

Date: Tue Mar 03 2026 - 09:55:24 EST

On Mon, 23 Feb 2026 15:47:00 +0200
Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx> wrote:

> On Wed, 18 Feb 2026, Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> wrote:
> > As anyone who has worked with kernel-doc is aware, using regexes to
> > handle C input is not great. Instead, we need something closer to how
> > C statements and declarations are handled.
> >
> > Yet, to avoid breaking docs, I avoided touching the regex-based algorithms
> > inside it, with one exception: the struct_group logic was using very complex
> > regexes that are incompatible with Python's built-in "re" module.
> >
> > So, I came up with a different approach: NestedMatch. The logic inside
> > it is meant to properly handle braces, square brackets and parentheses,
> > which is closer to what a C lexer does. At that time, I added
> > a TODO about the need to extend that.
>
> There's always the question, if you're putting a lot of effort into
> making kernel-doc closer to an actual C parser, why not put all that
> effort into using and adapting to, you know, an actual C parser?

Playing with this idea: it is not that hard to write an actual C
parser - or at least a tokenizer. The Python re documentation already
has a tokenizer example at:

https://docs.python.org/3/library/re.html

I did a quick implementation, and it seems to be able to do its job:

$ ./tokenizer.py ./include/net/netlink.h
1: 0 COMMENT '/* SPDX-License-Identifier: GPL-2.0 */'
2: 0 CPP '#ifndef'
2: 8 ID '__NET_NETLINK_H'
3: 0 CPP '#define'
3: 8 ID '__NET_NETLINK_H'
5: 0 CPP '#include'
5: 9 OP '<'
5: 10 ID 'linux'
5: 15 OP '/'
5: 16 ID 'types'
5: 21 PUNC '.'
5: 22 ID 'h'
5: 23 OP '>'
6: 0 CPP '#include'
6: 9 OP '<'
6: 10 ID 'linux'
6: 15 OP '/'
6: 16 ID 'netlink'
6: 23 PUNC '.'
6: 24 ID 'h'
6: 25 OP '>'
7: 0 CPP '#include'
7: 9 OP '<'
7: 10 ID 'linux'
7: 15 OP '/'
7: 16 ID 'jiffies'
7: 23 PUNC '.'
7: 24 ID 'h'
7: 25 OP '>'
8: 0 CPP '#include'
8: 9 OP '<'
8: 10 ID 'linux'
8: 15 OP '/'
8: 16 ID 'in6'
...
12: 1 COMMENT '/**\n * Standard attribute types to specify validation policy\n */'
13: 0 ENUM 'enum'
13: 5 PUNC '{'
14: 1 ID 'NLA_UNSPEC'
14: 11 PUNC ','
15: 1 ID 'NLA_U8'
15: 7 PUNC ','
16: 1 ID 'NLA_U16'
16: 8 PUNC ','
17: 1 ID 'NLA_U32'
17: 8 PUNC ','
18: 1 ID 'NLA_U64'
18: 8 PUNC ','
19: 1 ID 'NLA_STRING'
19: 11 PUNC ','
20: 1 ID 'NLA_FLAG'
...
41: 0 STRUCT 'struct'
41: 7 ID 'netlink_range_validation'
41: 32 PUNC '{'
42: 1 ID 'u64'
42: 5 ID 'min'
42: 8 PUNC ','
42: 10 ID 'max'
42: 13 PUNC ';'
43: 0 PUNC '}'
43: 1 PUNC ';'
45: 0 STRUCT 'struct'
45: 7 ID 'netlink_range_validation_signed'
45: 39 PUNC '{'
46: 1 ID 's64'
46: 5 ID 'min'
46: 8 PUNC ','
46: 10 ID 'max'
46: 13 PUNC ';'
47: 0 PUNC '}'
47: 1 PUNC ';'
49: 0 ENUM 'enum'
49: 5 ID 'nla_policy_validation'
49: 27 PUNC '{'
50: 1 ID 'NLA_VALIDATE_NONE'
50: 18 PUNC ','
51: 1 ID 'NLA_VALIDATE_RANGE'
51: 19 PUNC ','
52: 1 ID 'NLA_VALIDATE_RANGE_WARN_TOO_LONG'
52: 33 PUNC ','
53: 1 ID 'NLA_VALIDATE_MIN'
53: 17 PUNC ','
54: 1 ID 'NLA_VALIDATE_MAX'
54: 17 PUNC ','
55: 1 ID 'NLA_VALIDATE_MASK'
55: 18 PUNC ','
56: 1 ID 'NLA_VALIDATE_RANGE_PTR'
56: 23 PUNC ','
57: 1 ID 'NLA_VALIDATE_FUNCTION'
57: 22 PUNC ','
58: 0 PUNC '}'
58: 1 PUNC ';'

Using it sounds doable and, at least on this example, it
picked up the IDs properly.
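
As a rough illustration of what consuming such a token stream could
look like, here is a minimal sketch that lists the named struct, union
and enum declarations. It assumes the tokens are available as
(type, value) pairs like the ones printed above; declared_names() and
TOKENS are hypothetical names used only for this example and are not
part of kernel-doc.

```python
# Hypothetical sample, copied by hand from the tokenizer output above.
TOKENS = [
    ("ENUM", "enum"), ("PUNC", "{"), ("ID", "NLA_UNSPEC"), ("PUNC", ","),
    ("PUNC", "}"), ("PUNC", ";"),
    ("STRUCT", "struct"), ("ID", "netlink_range_validation"), ("PUNC", "{"),
    ("ID", "u64"), ("ID", "min"), ("PUNC", ","), ("ID", "max"), ("PUNC", ";"),
    ("PUNC", "}"), ("PUNC", ";"),
    ("ENUM", "enum"), ("ID", "nla_policy_validation"), ("PUNC", "{"),
    ("ID", "NLA_VALIDATE_NONE"), ("PUNC", ","),
    ("PUNC", "}"), ("PUNC", ";"),
]

def declared_names(tokens):
    """Yield (keyword, name) for each 'struct|union|enum NAME {' sequence."""
    keywords = {"STRUCT", "UNION", "ENUM"}

    # Slide a three-token window over the stream
    for (k1, v1), (k2, v2), (k3, v3) in zip(tokens, tokens[1:], tokens[2:]):
        if k1 in keywords and k2 == "ID" and (k3, v3) == ("PUNC", "{"):
            yield v1, v2

if __name__ == "__main__":
    for keyword, name in declared_names(TOKENS):
        print(keyword, name)
```

Anonymous declarations, like the first enum in the sample, have no ID
between the keyword and the brace, so they are simply not reported.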

On the other hand, using it would require lots of changes to
kernel-doc. So, I guess I'll add a tokenizer to kernel-doc, but
we should likely start using it gradually.

Maybe starting with NestedSearch and with public/private
comment handling (which is currently half-broken).
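
To give an idea of how both could work on top of tokens, here is a
minimal sketch. It is not the kernel-doc implementation: the helper
names are hypothetical, tokens are assumed to be (type, value) pairs,
parentheses, brackets and braces are assumed to be emitted as PUNC
tokens, and the private/public markers are assumed to be the usual
/* private: */ and /* public: */ comments placed between a struct's
braces.

```python
import re

# Hypothetical helpers for illustration; these are not kernel-doc APIs.
PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRS.values())

PRIVATE_RE = re.compile(r"/\*\s*private:")
PUBLIC_RE = re.compile(r"/\*\s*public:")

def find_matching_close(tokens, start):
    """Return the index of the token closing the delimiter at tokens[start]."""
    kind, value = tokens[start]
    if kind != "PUNC" or value not in PAIRS:
        raise ValueError("tokens[start] is not an opening delimiter")

    expected = []       # stack of closing delimiters still to be seen
    for i in range(start, len(tokens)):
        kind, value = tokens[i]
        if kind != "PUNC":
            continue    # strings and comments are single, opaque tokens
        if value in PAIRS:
            expected.append(PAIRS[value])
        elif value in CLOSERS:
            if not expected or expected.pop() != value:
                raise ValueError(f"unbalanced {value!r} at token {i}")
            if not expected:
                return i
    raise ValueError("no matching closing delimiter")

def drop_private(members):
    """Drop member tokens between a private: comment and the next public: one.

    All comments are dropped from the result as well.
    """
    private = False
    kept = []
    for kind, value in members:
        if kind == "COMMENT":
            if PRIVATE_RE.match(value):
                private = True
            elif PUBLIC_RE.match(value):
                private = False
            continue
        if not private:
            kept.append((kind, value))
    return kept

# Tokens for: struct_group(hdr, u8 a[2];);
TOKENS = [
    ("ID", "struct_group"), ("PUNC", "("), ("ID", "hdr"), ("PUNC", ","),
    ("ID", "u8"), ("ID", "a"), ("PUNC", "["), ("NUMBER", "2"), ("PUNC", "]"),
    ("PUNC", ";"), ("PUNC", ")"), ("PUNC", ";"),
]

# Tokens for: int a; /* private: */ int b; /* public: */ int c;
MEMBERS = [
    ("ID", "int"), ("ID", "a"), ("PUNC", ";"),
    ("COMMENT", "/* private: */"),
    ("ID", "int"), ("ID", "b"), ("PUNC", ";"),
    ("COMMENT", "/* public: */"),
    ("ID", "int"), ("ID", "c"), ("PUNC", ";"),
]

if __name__ == "__main__":
    print(find_matching_close(TOKENS, 1))
    print(drop_private(MEMBERS))
```

Since string literals and comments arrive as single tokens, a brace or
parenthesis inside them can never confuse the depth tracking, which is
the part that is hard to get right with plain regexes.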

For reference, the output above was generated with the code below,
which is based on the tokenizer example in the Python re documentation.

Comments?

---

One side note: right now, we're not using type hints in kernel-doc,
nor really following a proper coding style.

I wanted to use them during the conversion, and to write constants in
uppercase, as this is the current best practice, but doing it while
converting from Perl was very annoying. So, I opted to keep things
simple. Now that the code is in place, perhaps it is time to define a
coding style and apply it to kernel-doc.
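
Just to make the idea concrete, a typed version with uppercase
constants could look roughly like the sketch below. It is only an
illustration, not a proposed style: the names are hypothetical and the
token set is cut down to three patterns to keep it short.

```python
import re
from dataclasses import dataclass
from typing import Iterator

# Module-level constants in uppercase
TOKEN_RE = re.compile(
    r"(?P<ID>[A-Za-z_][A-Za-z0-9_]*)|(?P<PUNC>[;,{}])|(?P<SKIP>\s+)"
)

@dataclass(frozen=True)
class Token:
    """A token with a type name and its raw text."""
    kind: str
    value: str

def tokenize(code: str) -> Iterator[Token]:
    """Yield the ID and PUNC tokens found in code.

    Whitespace is dropped. Characters that match no pattern are
    silently skipped, as finditer() just moves on to the next match.
    """
    for match in TOKEN_RE.finditer(code):
        kind = match.lastgroup
        if kind is None or kind == "SKIP":
            continue
        yield Token(kind, match.group())

if __name__ == "__main__":
    for tok in tokenize("u64 min, max;"):
        print(tok)
```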

--
Thanks,
Mauro

#!/usr/bin/env python3

import sys
import re

class Token():
    """A single token: its type, raw text and position in the input."""

    def __init__(self, type, value, line, column):
        self.type = type
        self.value = value
        self.line = line
        self.column = column

class CTokenizer():
    # Identifiers that are reported with their own token type
    C_KEYWORDS = {
        "struct", "union", "enum",
    }

    # Order matters: the first pattern that matches at a position wins
    TOKEN_LIST = [
        ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),

        ("STRING",  r'"(?:\\.|[^"\\])*"'),
        ("CHAR",    r"'(?:\\.|[^'\\])'"),

        ("NUMBER",  r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
                    r"[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?[fFlL]*"),

        ("ID",      r"[A-Za-z_][A-Za-z0-9_]*"),

        ("OP",      r"\+\+|\-\-|\->|==|\!=|<=|>=|&&|\|\||<<|>>|\+=|\-=|\*=|/=|%="
                    r"|&=|\|=|\^=|=|\+|\-|\*|/|%|<|>|&|\||\^|~|!|\?|\:"),

        # Parentheses are listed here so they don't end up as MISMATCH
        ("PUNC",    r"[;,\.\[\]\(\)\{\}]"),

        ("CPP",     r"#\s*(define|include|ifdef|ifndef|if|else|elif|endif|undef|pragma)"),

        ("HASH",    r"#"),

        ("NEWLINE", r"\n"),

        ("SKIP",    r"[\s]+"),

        ("MISMATCH", r"."),
    ]

    def __init__(self):
        re_tokens = []

        # Build a single alternation with one named group per token type
        for name, pattern in self.TOKEN_LIST:
            re_tokens.append(f"(?P<{name}>{pattern})")

        self.re_scanner = re.compile("|".join(re_tokens),
                                     re.MULTILINE | re.DOTALL)

    def tokenize(self, code):
        # Handle continuation lines
        code = re.sub(r"\\\n", "", code)

        line_num = 1
        line_start = 0

        for match in self.re_scanner.finditer(code):
            # The name of the group that matched is the token type
            kind = match.lastgroup
            value = match.group()
            column = match.start() - line_start

            # Only NEWLINE tokens advance the line counter: newlines inside
            # multi-line comments or matched by SKIP are not counted
            if kind == "NEWLINE":
                line_start = match.end()
                line_num += 1
                continue

            if kind == "SKIP":
                continue

            if kind == "MISMATCH":
                raise RuntimeError(f"Unexpected character {value!r} on line {line_num}")

            # Report struct/union/enum as STRUCT/UNION/ENUM instead of ID
            if kind == "ID" and value in self.C_KEYWORDS:
                kind = value.upper()

            # For all other tokens we keep the raw string value
            yield Token(kind, value, line_num, column)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: python {sys.argv[0]} <fname>")
        sys.exit(1)

    fname = sys.argv[1]

    try:
        with open(fname, 'r', encoding='utf-8') as file:
            sample = file.read()
    except FileNotFoundError:
        print(f"Error: The file '{fname}' was not found.")
        sys.exit(1)
    except Exception as e:
        print(f"An error occurred while reading the file: {str(e)}")
        sys.exit(1)

    print(f"Tokens from {fname}:")

    for tok in CTokenizer().tokenize(sample):
        print(f"{tok.line:3d}:{tok.column:3d} {tok.type:12} {tok.value!r}")
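
One caveat with the code above: the line counter only advances on
NEWLINE tokens, so newlines inside a multi-line comment are not
counted and the reported line numbers can drift from the ones in the
file. The sample output shows it: the three-line comment is reported
at line 12 and the enum right after it at line 13. A possible way to
avoid that, as a sketch only, is to derive the position from the match
offset. line_starts() and position() are hypothetical helpers, and the
offsets must refer to the very same string the table was built from
(so, the one before continuation lines are joined, if file positions
are wanted).

```python
import bisect

def line_starts(code):
    """Return the offsets at which each line of code begins."""
    starts = [0]
    for offset, char in enumerate(code):
        if char == "\n":
            starts.append(offset + 1)
    return starts

def position(starts, offset):
    """Map an offset to a (1-based line, 0-based column) pair."""
    # Index of the last line start that is <= offset
    index = bisect.bisect_right(starts, offset) - 1
    return index + 1, offset - starts[index]

SAMPLE = "/* a\n * b\n */\nenum x;\n"

if __name__ == "__main__":
    starts = line_starts(SAMPLE)
    # The enum keyword comes after a three-line comment
    print(position(starts, SAMPLE.index("enum")))
```

With something like this, tokenize() could call position() with
match.start() instead of keeping line_num and line_start by hand.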