Shared test cluster for filesystem testing

From: Kent Overstreet
Date: Sun Jul 14 2024 - 19:35:45 EST


Those who know me have oft heard me complain about the state of testing
automation and infrastructure, in filesystem land and the wider kernel.
In short - it sucks.

For some years I've been working, off and on, on my own system, and I
think I've got it to the point where I can start making it available to
the wider filesystem community, and I hope it will be of some use to
people.

Here's my philosophy and requirements:

- Tests should be done with results up in a dashboard _as quickly as
possible_

Nothing's worse than having to wait hours, overnight, or days for test
results - by which time you've context switched onto something else. I
want full test results in 10 minutes.

- Every commit gets tested, and the results are available in a git log
view.

Manual bisection is a timesuck, and every commit should be tested
anyways. I want to be able to churn out code in nice clean simple
commits, push it all out to the CI, and when one of them is broken, be
able to see at a glance which one it is.

- Simple and extensible, and able to do any kernel testing that can be
done in a VM.

kdevops is right out - all the stateful ansible crap is not what I'm
after. Simple and declarative tests that specify how the kernel, qemu
etc. should be configured.

- Available to all developers and maintainers

Maintainers shouldn't be looking at patches that haven't been tested.
Everyone doing filesystem development needs access to this
system, on whatever branches they're working on.

IOW: big cluster of machines watching git branches and uploading results
to a dashboard, with sharding at subtest granularity so we can get
results back _quick_.
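
To make "sharding at subtest granularity" concrete, here's a rough Python
sketch. It is not the actual CI code: make_jobs, assign_round_robin, and
the round-robin policy are all made up for illustration. The point is only
that the unit of work is one (commit, subtest) pair rather than a whole
test suite run, so the work spreads across every machine in the cluster.

```python
from itertools import product


def make_jobs(commits, subtests):
    """Build one job per (commit, subtest) pair.

    Hypothetical helper: the small job size is what lets a big cluster
    finish a full run quickly.
    """
    return list(product(commits, subtests))


def assign_round_robin(jobs, n_workers):
    """Spread jobs across workers in round-robin order.

    Assumption for illustration only; a real scheduler would more likely
    let idle workers pull jobs from a shared queue.
    """
    shards = [[] for _ in range(n_workers)]
    for i, job in enumerate(jobs):
        shards[i % n_workers].append(job)
    return shards


if __name__ == "__main__":
    jobs = make_jobs(["commit-a", "commit-b"],
                     ["generic/001", "generic/002", "generic/003"])
    for worker, shard in enumerate(assign_round_robin(jobs, 4)):
        print(worker, shard)
```

With two commits and three subtests that's six independent jobs, and no
worker has to run a whole suite by itself.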

I've got eight 80-core ARM machines for this so far. We _will_ need more
machines than this, and I'll need funding to pay for those machines, but
this is enough to get started.

A shared cluster of dedicated machines with full sharding means that us
individual developers can get results back _quick_. The CI tests each
branch, newest to oldest, and since we're not all going to be pushing at
the same time, or need the lockdep/kasan variants right away (those run
at a lower priority) - we can all get the results we need (most recent
commit, basic tests) pretty much immediately.
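
The ordering described above can be sketched as a priority queue. Again,
this is an illustration and not the real scheduler: the function name, the
numeric priority values, and the tuple layout are assumptions. The only
things taken from the text are that commits are tested newest to oldest
and that the lockdep/kasan variants run at a lower priority than the basic
tests.

```python
import heapq

# Lower number = runs first. The numeric values are assumptions; the text
# only says that lockdep/kasan run at a lower priority than basic tests.
VARIANT_PRIORITY = {"basic": 0, "lockdep": 1, "kasan": 1}


def schedule(commits_newest_first, variants):
    """Return (commit, variant) pairs in the order they would be run.

    Jobs sort first by variant priority, then by commit age, so the basic
    tests for the newest commit always come out first.
    """
    heap = []
    for age, commit in enumerate(commits_newest_first):
        for variant in variants:
            heapq.heappush(heap, (VARIANT_PRIORITY[variant], age, variant, commit))

    order = []
    while heap:
        _prio, _age, variant, commit = heapq.heappop(heap)
        order.append((commit, variant))
    return order


if __name__ == "__main__":
    for job in schedule(["c3", "c2", "c1"], ["basic", "kasan"]):
        print(job)
```

Here c3 is the newest commit: all the basic runs come out first, newest
commit first, and the kasan runs only start once those are queued ahead of
them.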

I've got fstests test wrappers for bcachefs, btrfs, ext2, ext4, f2fs,
jfs, nfs, nilfs2 and xfs so far, with lockdep, kasan and ubsan variants
for all of those.

The tests the CI runs are easy to run locally, for reproducibility -
ktest was first written for local, interactive use. I suggest you try
it, it's slick [0].

Send me an email with your ssh pubkey and the username you want, and
I'll give you an account - that's how you'll edit the config file that
specifies which tests to run and which branches to test.

And please send me patches to ktest adding tests for more filesystems
and subsystems. This isn't intended to be filesystem specific - the goal
here is one single _quick_ dashboard for anything that can be tested in
a VM.

Results dashboard:
https://evilpiepirate.org/~testdashboard/ci

Results for Linus's tree:
https://evilpiepirate.org/~testdashboard/ci?branch=master

[0] Ktest: https://evilpiepirate.org/git/ktest.git/