Re: Kernel docs: muddying the waters a bit

From: Markus Heiser
Date: Mon Apr 18 2016 - 05:50:37 EST


Hi Jonahtan,

Am 12.04.2016 um 17:46 schrieb Jonathan Corbet <corbet@xxxxxxx>:

> On Fri, 8 Apr 2016 17:12:27 +0200
> Markus Heiser <markus.heiser@xxxxxxxxxxx> wrote:
>
>> motivated by this MT, I implemented a toolchain to migrate the kernel’s
>> DocBook XML documentation to reST markup.
>>
>> It converts 99% of the docs well ... to gain an impression how
>> kernel-docs could benefit from, visit my sphkerneldoc project page
>> on github:
>>
>> http://return42.github.io/sphkerneldoc/
>
> So I've obviously been pretty quiet on this recently. Apologies...I've
> been dealing with an extended death-in-the-family experience, and there is
> still a fair amount of cleanup to be done.
>
> Looking quickly at this work, it seems similar to the results I got. But
> there's a lot of code there that came from somewhere?

>From me? ... except the kernel-doc script which is a fork from your

git://git.lwn.net/linux.git doc/sphinx


> I'd put together a
> fairly simple conversion using pandoc and a couple of short sed scripts;
> is there a reason for a more complex solution?

It depends. If you have a simple DocBook with less various markup, maybe not.
May you want to read my remarks about migration tools and especially pandoc:

* https://return42.github.io/sphkerneldoc/articles/dbtools.html#remarks-on-pandoc

A few more words about, what I have done:

I wrote a lib of XML filters which might be also usefully in other
migration projects (dbxml).

* https://github.com/return42/sphkerneldoc/blob/master/scripts/dbxml.py

It uses a xml-parser, pandoc, pandoc-filters and regular expressions. Because
I did not implemented a whole converter, I hacked around pandoc. Thats why
conversion is done in several steps:

1. copy xml file(s) to a cache space

2. substitude unsolved internal and external entities

3. filter all xml files

* run custom hooks on every node

* apply filters on every node and inject reST into the XML-tree where pandoc fails.
https://github.com/return42/sphkerneldoc/blob/master/scripts/dbxml.py#L515

4. convert intermediary XML result with pandoc to json (needed by pandoc filters)

5. apply pandoc-filter and clean up the injected reST markup from step3

6. convert filtered json to reST

7. fix the produce reST with regular expression

... the last step is similar to your sed scripts.


And I wrote a commandline Interface to use this lib (see func db2rst):

* https://github.com/return42/sphkerneldoc/blob/master/scripts/dbtools.py#L146

With this db2rst all kernel DB-XML books could be migrated, except the linux-tv
book, which has much more complexity. For this, there is a separated commandline
called media2rst

* https://github.com/return42/sphkerneldoc/blob/master/scripts/dbtools.py#L107

The media2rst needs several special handlings, which is implemented in
hooks (the dbxml interface method)

* https://github.com/return42/sphkerneldoc/blob/master/scripts/media.py

Summarize, why should one prefer this tools over pandoc + sed?

* Pandoc coverage is less on reading and writing, this is where
dbxml comes into play

- reading DocBook:
https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/DocBook.hs#L23

- writing reST has many bugs and leaks
(you fixed some of them with sed)

* Pandoc does not support external entities (linux-tv), covered by dbxml

* dbxml brings the ability to chunk one large XML book into small
reST chunks e.g. kernel-hacking book:

https://github.com/return42/sphkerneldoc/tree/master/doc/books/kernel-hacking

* dbxml lets you manipulate the XML source before you convert it to reST

this might helpfull e.g. if you have to convert single-column informal-tables
to lists or other things ... in short; dbxml and it's hooks are the key to hack
everything you need in a full automated DocBook-->reST migration workflow.


--Markus--

> Thanks for looking into this, anyway; I hope to be able to focus more on
> it shortly.
>
> jon
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html