Re: [Intel-wired-lan] [PATCH 00/38] docs: several improvements to kernel-doc
From: Mauro Carvalho Chehab
Date: Fri Mar 13 2026 - 06:49:46 EST
On Wed, 04 Mar 2026 12:07:45 +0200
Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx> wrote:
> On Mon, 23 Feb 2026, Jonathan Corbet <corbet@xxxxxxx> wrote:
> > Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx> writes:
> >
> >> There's always the question, if you're putting a lot of effort into
> >> making kernel-doc closer to an actual C parser, why not put all that
> >> effort into using and adapting to, you know, an actual C parser?
> >
> > Not speaking to the current effort but ... in the past, when I have
> > contemplated this (using, say, tree-sitter), the real problem is that
> > those parsers simply strip out the comments. Kerneldoc without comments
> > ... doesn't work very well. If there were a parser without those
> > problems, and which could be made to do the right thing with all of our
> > weird macro usage, it would certainly be worth considering.
>
> I think e.g. libclang and its Python bindings can be made to work. The
> main problems with that are passing proper compiler options (because
> it'll need to include stuff to know about types etc. because it is a
> proper parser), preprocessing everything is going to take time, you need
> to invest a bunch into it to know how slow exactly compared to the
> current thing and whether it's prohibitive, and it introduces an extra
> dependency.
>
> So yeah, there are definitely tradeoffs there. But it's not like this
> constant patching of kernel-doc is exactly burden free either.
In my tests with a simple C tokenizer:
https://lore.kernel.org/linux-doc/cover.1773326442.git.mchehab+huawei@xxxxxxxxxx/
The tokenizer works fine and doesn't slow things down much: it
increases the time to process the entire kernel tree from 37s to 47s
for man-page generation, and should barely change the time for
htmldocs, as right now only ~4 seconds are needed to read and parse
the files pointed to by Documentation kernel-doc tags.
The code can still be cleaned up, as there are still some things
hardcoded in the various dump_* functions that could be implemented
better (*).
The advantage of this approach is that it allows a gradual
migration to the tokenized code, as the work can be done
incrementally.
(*) For instance, __attribute__ and a couple of other macros are parsed
twice in the dump_struct() logic, in different places.
> I don't
> know, is it just me, but I'd like to think as a profession we'd be past
> writing ad hoc C parsers by now.
Probably not, but I don't think we need a full C parser: kernel-doc
just needs to understand data types (enum, struct, typedef, union,
vars) and function/macro prototypes.
For that purpose, a tokenizer is enough.
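To illustrate the point, here is a hypothetical sketch (not the actual
tokenizer from the series) of how little machinery is needed: a single
regex that classifies comments, strings, identifiers, numbers, and
punctuation is already enough to walk a prototype without a C grammar,
and, unlike a real parser, it keeps the comments kernel-doc depends on:

```python
import re

# Illustrative sketch only, not the code from the patch series: a minimal
# C tokenizer that keeps comments (kernel-doc lives inside them) instead
# of stripping them the way a real C parser front end would.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)              # comments, kept as tokens
  | (?P<string>"(?:\\.|[^"\\])*")                # string literals
  | (?P<ident>[A-Za-z_]\w*)                      # identifiers and keywords
  | (?P<number>\d+\w*)                           # numeric literals
  | (?P<punct>->|[{}()\[\];,*=<>&|!%^~?:./+-])   # punctuation
""", re.DOTALL | re.VERBOSE)

def tokenize(code):
    """Yield (kind, text) pairs; whitespace is skipped implicitly."""
    for m in TOKEN_RE.finditer(code):
        yield m.lastgroup, m.group()

tokens = list(tokenize("struct foo { int bar; /** @bar: a field */ };"))
```

Walking such a token stream, recognizing "struct <ident> {" or
"<type> <ident> (" is straightforward; that is all the structure the
dump_* functions actually consume.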
The current code lives in:
https://github.com/mchehab/linux/blob/tokenizer-v5/tools/lib/python/kdoc/xforms_lists.py
which contains a list of C/gcc/clang keywords that will be
ignored, like:
__attribute__
static
extern
inline
together with sanitized versions of the kernel macros it needs
to handle or ignore:
DECLARE_BITMAP
DECLARE_HASHTABLE
__acquires
__init
__exit
struct_group
...
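Conceptually, the two lists above boil down to a drop table plus a
rewrite table. The sketch below is hypothetical (it does not reproduce
the actual xforms_lists.py format, and the DECLARE_BITMAP expansion is
just an example), but it shows the shape of the transformation:

```python
import re

# Hypothetical sketch, not the real xforms_lists.py layout: keywords the
# prototype parser can simply drop, plus macros rewritten into plain C.
IGNORE_KEYWORDS = {"__attribute__", "static", "extern", "inline",
                   "__init", "__exit"}

MACRO_XFORMS = {
    # Example rewrite: DECLARE_BITMAP(name, bits) -> a plain array decl.
    r"DECLARE_BITMAP\s*\(\s*(\w+)\s*,\s*([^)]+)\)":
        r"unsigned long \1[BITS_TO_LONGS(\2)]",
}

def sanitize(proto):
    """Rewrite known macros, then drop keywords the parser ignores."""
    for pat, repl in MACRO_XFORMS.items():
        proto = re.sub(pat, repl, proto)
    # __attribute__ carries a double-parenthesized argument; strip it too.
    proto = re.sub(r"__attribute__\s*\(\(.*?\)\)", "", proto)
    return " ".join(w for w in proto.split() if w not in IGNORE_KEYWORDS)
```

The point of keeping this as a data table is exactly what the next
paragraph argues: new macros become one-line additions, not parser
changes.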
Once we finish cleaning up kdoc_parser.py to rely only
on it for prototype transformations, this will be the only file
that requires changes when new macros start affecting
kernel-doc.
As this is complex, and may require manual adjustments, it is
probably better not to try to auto-generate the xforms lists at
runtime. A better approach is, IMO, to have C preprocessor code
that helps update them periodically, e.g. via a target like:
make kdoc-xforms
that would use either cpp or clang to generate a patch updating
the xforms_lists content after new macros that affect docs
generation are added.
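One way such a helper could work (this is an assumption about a target
that doesn't exist yet; the cpp invocation and the identifier diff are
mine, not from the series): preprocess a header, then flag identifiers
that appear in the raw source but vanish after preprocessing, since
those are likely macros that may need an xforms entry:

```python
import re
import subprocess

def candidate_macros(raw_src, preprocessed_src):
    """Identifiers present in the raw source but gone after
    preprocessing are likely macros needing an xforms entry."""
    ident = re.compile(r"\b[A-Za-z_]\w*\b")
    raw = set(ident.findall(raw_src))
    post = set(ident.findall(preprocessed_src))
    return sorted(raw - post)

def preprocess(path):
    # Assumes 'cpp' is on PATH; 'clang -E' would work the same way.
    # Real kernel headers would also need the proper -I/-D options.
    return subprocess.run(["cpp", "-P", path],
                          capture_output=True, text=True).stdout
```

The resulting list would still need human review (some identifiers
disappear for other reasons), which fits the "generate a patch, don't
auto-apply" idea above.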
--
Thanks,
Mauro