Re: [Automated-testing] Call for nommu LTP maintainer [was: Re: [PATCH 00/36] Remove UCLINUX from LTP]

From: Rob Landley
Date: Wed Jan 10 2024 - 14:17:27 EST


On 1/10/24 08:14, Petr Vorel wrote:
> There is MAP_PRIVATE_EXCEPT_UCLINUX constant to avoid using MAP_PRIVATE on
> uClinux, who knows if this is relevant on nommu?

MAP_PRIVATE creates a copy-on-write mapping, and doing copy-on-write requires an
MMU. (You mark it read only in the page table, take a fault when they try to
write, the fault handler allocates a new physical page, copies the old contents
to it, marks it writeable, and returns allowing the write to complete to the new
page.)

On NOMMU you can MAP_SHARED and MAP_ANON, but not MAP_PRIVATE.

Swap is implemented kind of similarly, except when you recycle pages you mark
them as neither readable nor writeable in the page table, schedule the page's
contents to be written to disk, suspend the process so the scheduler can go run
something else, and then when you get the I/O completion interrupt you mark the
page free so whatever else needed a page can use it. And then when the process
tries to access the page the fault handler reverses the process, allocating a
new physical page and load in the contents back in while the process is
suspended waiting for that to finish. Can't do that without an MMU either.

>> 3) what the desired roadmap going forward would be, to continue to support this code.
>
> All LTP tests are being rewritten to use new API since 2016 (new API was
> introduced in 20160510), thus we are loosing the support with old API going
> away. Sure, I can hold on this patchset and we continue removing the
> functionality tests manually. But sooner or later it's gone.

You can't fork() on nommu because copies of the mappings have different
addresses, meaning any pointers in the copied mappings would point into the OLD
mappings (belonging to the parent process), and fixing them up is 100%
equivalent to the "garbage collection in C" problem. (It's AI-complete. Of the
C3PO kind, not the "autocorrect with syntax checking" kind.) People get hung up
on the "it would be very inefficient to do that because no copy-on-write"
problem and miss the "the child couldn't FUNCTION because its pointer variables
all contain parent addresses" problem.

So instead vfork() creates a child with all the same memory mappings (a bit like
a thread) and freezes the parent process until that child discards those
mappings, either by calling exec() or _exit(). (At which point the parent gets
un-suspended.)

The child can do quite a lot of setup before calling exec, it already has its
own filehandle table for example, but it has to be really careful about MEMORY.
Anything it writes to global variables the parent will see, any changes to the
heap persist in the parent, and anything it writes to local variables the parent
MAY see. (Systems have historically differed about whether the vfork() child
gets a new stack like a thread would, or keeps using the parent's mapping since
the new stack would be quickly discarded anyway. If you call into a new setup()
function after vfork() it doesn't matter much either way, but do NOT return from
the function that called vfork(): either your new stack hasn't got anything to
return to or you'll corrupt the parent's stack by overwriting its return address
so when the parent exits from its current function it jumps to la-la land.)

The OTHER fun thing about nommu is you can't run conventional ELF binaries,
because everything is linked at fixed address. So you might be able to run ONE
instance of the program as your init task, assuming those addresses were
available even then, but as soon as you try to run a second one it's a conflict.

The quick and dirty work around is to make PIE binaries, which can relocate
everything into available space, which works but doesn't scale. The problem with
ELF PIE is that everything is linked contiguously from a single base pointer,
meaning your text, rodata, data, and bss segments are all one linear blob. So if
you run two instances of bash, you've loaded two copies of the test and the
rodoata. This fills up your memory fast.

AND PIE requires contiguous memory, which nommu is bad at providing because it
has no page tables to remap stuff. With an mmu it can coalesce scattered
physical pages into a virtually contiguous range, but without an mmu you can
have plenty of memory free but in tiny chunks, none big enough to satisfy an
allocation request.

So they invented FDPIC, which is ELF with FOUR base pointers. Each major section
(rodata, text, data, and bss) has its own base pointer, so you need to find
smaller chunks of memory to load them into (and thus it can work on a more
fragmented system), AND it means that two instances of the same program can
share the read-only sections (rodata and text) so you only need new copies of
the writeable segments (data and bss. And the heap. And the stack.)

(The earlier binflt format is an a.out variant with 4 base pointers. FDPIC is
the ELF version of the same idea. Since a.out went bye-bye binflt is obsolete,
but not everybody's moved off it yet because so many nommu people are still
using 2.6 or even earlier, and also using gcc 3.x or 2.x toolchains.)

Oh, the OTHER thing is none of this is deferred allocation, it's all up front.
On systems with mmu you can populate large empty virtual mappings that read as
zeroed but it's actually redundant copy-on-write instances of the zero page, and
when you write to them it takes a soft fault and the fault handler allocates the
page you dirtied when you dirty it. On nommu, if you want a 4 megabyte mapping
you have to find 4 contiguous megabyte and allocate it immediately, or else the
mmap() or malloc() returns failure. (On systems with mmu malloc() almost never
returns NULL, because you've got virtual address space coming out of your ears
and if you ACTUALLY run out of memory that's happens way later, the OOM killer
triggers long after malloc() returned success. But on a nommu system, malloc()
returns NULL all the time, even if you THINK you have enough memory, because
what's left is too fragmented to contain a free chunk THAT BIG...)

This impacts the stack. On MMU Linux, the default stack size is 8 megs but it's
seldom all used. On nommu linux, that would be RIDICULOUS because A) it would
always be allocated to its full size right up front, B) you'd need contiguous
memory for it. So instead you set the default stack size when building the
linker (you can also set it on the ld command line), and common sizes range from
8k to maybe 256k depending on what you expect to be running. Toybox tries not to
use more than 16k stack, although I usually test it with 32k on nommu. (There's
no guard page to tell you when you went off the edge, because no MMU so no page
faults, but you can check that the stack page at end-16k is still clean at exit
if you like. Some nommu hardware has range registers, but Linux has never
supported them that I'm aware of.)

There's not THAT much to learn about NOMMU. It could all be covered in an hour
presentation at a conference, I expect?

> One can check files which had special handling in the old API:
>
> $ git grep -l UCLINUX 20160126 -- testcases/ | wc -l
> 173
>
> What is supported now:
>
> $ git grep -l UCLINUX -- testcases/ |wc -l
> 55

UCLINUX is a long-dead distro. Linaro died in the dot-com crash and its founder
Jeff Dionne moved to Japan for his next gig and never came back. On the way out
he handed uclinux off to someone else, who didn't do a lot of work maintaining
it. Most of the actual support went "upstream" into various packages (linux and
busybox and gcc and so on) before the handoff, so you didn't NEED uclinux anymore.

The real nail in the coffin is the inheritor of uclinux never migrated it off
CVS, and then the disk holding the CVS archive crashed with no backup. He came
out with a couple more releases after that by monkey-patching the last release's
filesystem, but the writing was on the wall and it rolled to a stop.

I did a triage of its last release (from 2014) as part of my toybox roadmap:

https://landley.net/toybox/roadmap.html#uclinux

> => We have now removed nearly 2/3 of it (this means we're arguing about 1/3 of
> the tests which initially somehow supported nommu).

I'd like to get more tests supporting nommu. Possibly the approach here is just
get ANYTHING working with the new api, and then whack-a-mole more tests in as we go.

Other than lacking fork(), restricted mmap(), different executable packaging,
smaller stack size, having to actually test the return from malloc(), no page
faults if you follow a wild pointer, and the complete lack of swap space...

Unless I missed something, it's otherwise normal Linux? (People comfortable with
threads can still do all their thread tricks on nommu systems. And the Turtle
board I'm using is an SMP nommu system, they do exist. :)

Rob