RE: RFC - kernel test result specification (KTAP)

From: Tim.Bird
Date: Mon Aug 30 2021 - 18:04:44 EST




> -----Original Message-----
> From: Kees Cook <keescook@xxxxxxxxxxxx>
>
> On Mon, Aug 30, 2021 at 05:48:07PM +0000, Tim.Bird@xxxxxxxx wrote:
> > From: Kees Cook <keescook@xxxxxxxxxxxx>
> > > > Test case result lines
> > > > ----------------------
> > > >
> > > > Test case result lines must have the format:
> > > >
> > > > <result> <number> [-] [<description>] [<directive>] [<diagnostic data>]
> > >
> > > "[<diagnostic data>]" is not defined below. I think what you mean is
> > > "directive details".
> >
> > IMHO it's useful to keep this called 'diagnostic data'.
> >
> > The directive is optional, and having this be diagnostic data means you can
> > put explanatory text, or result details, that are independent of the description,
> > on the same line as the result. This is convenient in some cases.
> >
> > I think diagnostic data should always be prefixed by a hash sign.
> > Or, in the case where diagnostic data appears on the result line,
> > the hash serves as the delimiter. This means that the description
> > text must not include a hash character.
> >
> > For example:
> > ok 5 - check return code # rcode=0
> > or
> > not ok 5 - check return code # rcode=17
>
> The directive is currently used to answer the question "why?" as in,
> "why is this test not ok?" ("a timeout occurred") or "why is this test
> ok?" ("a SKIP")
>
> The "supporting evidence" for the "why?" answer are specific details
> about that: "the exit was was NNN" or "the timeout was YYY seconds".
>
> Test diagnostics are all the lines preceding that (currently prefixed
> with a "#").
>
> > > Yes, though the optional "- " is strictly part of the optional
> > > description.
> >
> > It's mildly annoying that "-" is optional. It's trivial to deal with in the parser
> > to just ignore it if it's present. But it has no semantic meaning whatsoever.
> > IMHO it would be nice to either mandate it or remove it, for consistency's sake.
> > This could be based solely on the consensus for whether it added or detracted
> > from human readability, since parsers don't care.
>
> I have no strong opinion on the "-". I was surprised to encounter it
> as it's not part of the TAP 13 spec. I would prefer to drop it, if I had
> to choose.
>
> > ERROR usually means that there was some kind of failure
> > in the testing process that makes the testcase result indeterminate.
> > (like loss of connection to a server, or crash of the program executing
> > the testcase). This can be due to a bug in the test or the test system.
> > I'm, not sure if ERROR is used this way in kselftest. It is, IMHO, a useful
> > distinction to make. TIMEOUT is a special case of ERROR, IMHO.
>
> Well, I think TIMEOUT is a "why not ok?" answer. "not ok" seems to be
> the same as ERROR, but I guess it's really a catch-all, but must come
> with the "evidence" portion:

'not ok' should not mean the same thing as ERROR. 'not ok' means the
testcase was run successfully, and the result is that the test determined
that the testcase failed. ERROR means that the testcase was not
run successfully, so no result can be assumed.

But ERROR is confusing. Maybe there's a better word for the concept
that is less likely to be conflated with the result.

>
> not ok 1 description goes here # TIMEOUT 45 seconds with no results
> not ok 1 description goes here # ERROR lost connection to server
> not ok 1 description goes here # ERROR non-zero exit code: NNN
>
> I might go so far as to say the directive is _required_ for the "not ok"
> state.
As I said above, I don't think they are the same. I would reserve
ERROR for the case where a test error occurred and the test case
result could not be obtained.

If we go to the trouble of making our own spec, maybe it would
be better to support a third result value: 'unknown'. This would
break all the TAP parsers, but it's much clearer, IMHO.

'ok 1 description' = test case executed successfully, and the testcase condition passed
'not ok 1 description' = test case executed successfully, and the testcase condition failed
'unknown 1 description # SKIP # reason' = test case does not apply for some reason
'unknown 1 description # ERROR # reason' = test case could not be executed, for some reason

In the case of SKIP and ERROR, the result should be ignored as it is either inapplicable
or wasn't obtained.

I remember we had a discussion about whether SKIP should have an "ok" or "not ok"
result. IMHO it doesn't matter, because the result has no meaning when a SKIP directive
is used. (Although it's useful for humans to see why something was skipped.)

>
> > As an aside, I've never liked XFAIL or XPASS. Semantically, I believe they
> > are quite confusing, and depend on the level of testcase execution. XPASS is also
> > redundant. Isn't the default case always XPASS?
>
> Nothing currently uses XPASS, and we should remove it.
>
> XFAIL capture the condition of the test logic in that a test for
> something arm64-specific on a non-arm64 machine is _not_ a "pass". It
> will fail. But it's an expected failure. And, as a directive, must come
> with the evidence bit:
>
> not ok 1 PAN catches bad accessg # XFAIL PAN CPU feature is arm64 only
>
> As a test author, I want to see this as distinct from failure and
> success.
I agree that this case is different from a normal failure.

In the example you provide of a feature that applies to only specific
architectures, I would use a SKIP instead. It's not that the testcase failed.
It's that it does not even make sense for the device under test.

Whether something is 1) a testcase failure or 2) a testcase that does not
apply is a matter of perspective. Is it the case that the feature should exist
on all architectures, and it's just not been written yet? I could call this a
'not ok' or an XFAIL. However, if it's a feature that is never intended to be
written for arm32 or i386, I would report that as a SKIP.

But all of this differs from what I originally described, which is labs using
XFAIL to annotate bugs that they knew about but didn't intend to fix.

>
> > In the case of XFAIL, does it mean that an operation that was expected to fail
> > all the time, failed, and so the test passed? Or does it mean that this is a known
> > bug that a particular developer or site has decided to ignore (as in "we know this
> > is a bug, but we expected it to fail, so just ignore it for now). The latter, I really
> > don't think should be encoded in the tests themselves, but should be put
> > into a layer above the test and parser, that is applied according to developer
> > or testlab policy.
>
> I agree that "just ignore it for now" shouldn't be there.
>
> Other arguments are that xfail tests shouldn't be run at all, but I
> don't think that's right because it makes it very hard to do large-scale
> test result comparisons because random stuff is missing, depending on
> various configs, archs, etc. It blocks a 1-to-1 comparison, and begs
> questions like "why is the test here in one case and not in another?"
I completely agree with this. It really messes up tables of results when
the list of testcases varies depending on some board
or configuration attribute. It's very hard for users to parse and compare
results in that case.

> where as an XFAIL will encode the reason it is an XFAIL in its output,
> providing a self-documenting test result.


>
> > > I still think there should be a distinction between "diagnostic lines"
> > > and "unknown lines". For example, if kselftest is running on a console,
> > > the dmesg output may be intermixed, and possibly temporally offset. Such
> > > lines may not be useful, and may not be strictly correlated, where-as
> > > the output from kselftest is at least self-consistent.
> >
> > I think this would be a very good distinction to make. You might have
> > unknown lines that (inadvertently?) start with a hash, but trying to
> > make sure that the test itself always outputs diagnostic data with a preceding
> > hash gives a reader potentially more information about where the message
> > came from, and could be useful.
>
> Right, so lines with a hash shouldn't be machine-parsed. (And this is a
> mistake I made in the current kselftest output for the LAVA parser.) The
> good news for the kernel, is that nothing else in dmesg leads with a
> "#" currently. :)

Oh. That's actually nice to know. Interesting.

>
> > > No; diagnostic lines must have no meaning: they are for humans and nothing
> > > else. If other details are needed for machines, we should explicitly
> > > create new format lines. I made a mistake when I used a diagnostic line
> > > for embedding the test names. :( There is a need for parsers to discover
> > > the name of a given test, though, so we do likely need something for this.
> >
> > I agree it is handy to have the test name for sub-tests. However, I don't
> > see any semantic difference whether we define sub-test declarations via
> > diagnostic lines with special syntax or by creating a new syntax separate
> > from diagnostic lines. Either way, we have a new syntax for parsers to
> > understand.
>
> Pragmatically, I agree, which is what lead me to my mistake. However, in
> reconsidering it, I realize this leads to a slippery slope of just
> dumping stuff into diagnostic lines and pretending nothing will read it.

Good point. Whether it starts with a hash or not, TAP 13 parsers should
ignore anything that is not a result line (starts with 'ok' or 'not ok').
So I guess we are free to make a new syntax without upsetting any existing
parsers.

>
> I suspect what we need is an optional "starting test" line, like:
>
> test N DESC
> # diag...
> # diag...
> # diag...
> ok N
>
> The nesting looks like:
>
> TAP version 13
> 1..3
> test 1 seccomp
> # TAP version 13
> # 1..27
> # test 1 prctl
> # ok 1
> # test 2 strict
> # ok 2
> # ...
> # test 27 user_notif
> # # eek missing CONFIG_....
> # not ok 27 user_notif # ERROR can't use ioctl
> not ok 1 seccomp # ERROR non-zero exit code 1
> test 2 netfilter
> ok 2 netfilter
> test 3 bpf
> # exciting things happen
> ok 3 bpf
>
> This is what we have now except basically just replacing "# $name" with
> "test $num $name"
>
Actually, I like this idea. Including the number is a nice touch.
I think that will help to distinguish such a line from other 'unknown line' output.
I don't know if I'd leave the description off of the result line, though.
A lot of parsers are looking for that.

> > >
> > > > Diagnostic lines of the format "# Subtest:
> > > > <test_name>" indicate the start of a test with subtests. Also,
> > > > diagnostic lines of the format "# <test_name>: <description>" refer to
> > > > a specific test and tend to occur before the test result line of that
> > > > test but are optional.
> > >
> > > I don't think the above should be included in the spec -- diag lines
> > > should have no parseable meaning.
> > >
> > > >
> > > > Bail out line
> > > > -------------
> > > >
> > > > A bail out line can occur anywhere in the KTAP output and will
> > > > indicate that a test has crashed. The format of a bail out line is
> > > > "Bail out! [<description>]", where the description can give
> > > > information on why the bail out occurred and can be any string.
> > >
> > > I'm not a fan of the Bail out line. It's not a problem, exactly, but I
> > > find it redundant. If we want an "end of test" line, let's make one.
> > > "Bail out" is a special case of exit. If we want to handle test exit,
> > > let's define it. E.g. make kselftest's test summary lines not
> > > diagnostic lines:
> > >
> > > # FAILED: 85 / 87 tests passed.
> > > # Totals: pass:83 fail:2 xfail:0 xpass:0 skip:2 error:0
> >
> > How consistent is it for selftests to produce test summary lines?
>
> Any of the tools built with kselftest.h will spit it out. (Which is to
> say several of the "subtests".) The bulk of kselftest output is from the
> runner, which doesn't produce this summary.
>
> > > Please no. There is no reason to force a nested test to suddenly have
> > > to know about its test execution depth/environment. A subtest is just a
> > > nested TAP result. That it is nested is only meaningful to the parser, not
> > > the executing test, and it must have the same output, nested or not. (e.g.
> > > running the test alone or running the test under a full kselftest run,
> > > the only difference is the indentation level.)
> > I agree that the sub-test should not know about this line. I thought
> > the "parent" test wrote the line.
>
> The suggestion was that the subtest should _not_ emit the "TAP" line,
> which is what I strongly disagree with: it should be a self-contained
> TAP-emitting test.

OK - agreed. Yes. The sub-test should not know whether they are running
standalone or nested (with the exception of some line prefix opaquely conveyed
to them)

>
> > > Again, I see only downsides to this. Nesting for the spec is simple
> > > indentation-based recursion. Let's just keep what we already have:
> > >
> > > TAP version 13
> > > 1..1
> > > # TAP version 13
> > > # 1..2
> > > # ok 1 - test_1
> > > # ok 2 - test_2
> > > ok 1 - test_suite
> >
> > I prefer spaces, but whatever. Is the prefix on the 'TAP version 13' line guaranteed
> > to be the prefix on all lines for the (sub-)test?
>
> That's the current implementation, yes.
>
> As for tabs vs spaces, I think it's fine to swap "# " for " ". We just
> need to patch LAVA to deal with it.
>
> > Just as a side note, in some Fuego tests, it was very useful to include an identifier
> > in thethe prefix nested tests. The output looked like this:
> >
> > TAP version 13
> > 1..2
> > [batch_id 4] TAP version 13
> > [batch_id 4] 1..2
> > [batch_id 4] ok 1 - cyclictest with 1000 cycles
> > [batch_id 4] # problem setting CLOCK_REALTIME
> > [batch_id 4] not ok 2 - cyclictest with CLOCK_REALTIME
> > not ok 1 - check realtime
> > [batch_id 4] TAP version 13
> > [batch_id 4] 1..1
> > [batch_id 4] ok 1 - IOZone read/write 4k blocks
> > ok 2 - check I/O performance
> >
> > Can I propose that the prefix not be fixed by the spec, but that the spec indicates that
> > whatever the prefix is on the TAP version line, that prefix must be used with the output for
> > all lines from the test (with the exception of unknown lines)?
>
> Oh, interesting. This would also allow parallel (unique) test execution
> to be parsable. That sounds workable. (Again, this needs LAVA patching
> again...)
>
> > > > Major differences between TAP 14 and KTAP specification
> > > > -------------------------------------------------------
> > > >
> > > > Note that the major differences between TAP 14 and KTAP specification:
> > > > - yaml and json are not allowed in diagnostic messages
> > >
> > > Agreed -- these are overkill (and very difficult to implement as they
> > > _follow_ the test result line: anything generating them has already
> > > finished running).
> >
> > I thought that some tests put the diagnostic data before the result line.
> >
> > I don't care whether it is before or after (and indeed Fuego's parser
> > can handle either case). But it can't handle it when some testcases
> > put diagnostic data before the result line, and some testcases put
> > it after, in the same test. There's no delimiter for the separation
> > between the two different testcases diagnostic data.
>
> Right. The current implementation puts it always before. TAP 13 (14?) has
> the YAML _after_ the results line, but that's totally unworkable for
> kselftest, which is designed to be first human readable and second machine
> readable. Waiting to see diagnostics from a slow test is terrible --
> we can't collect it and spit it out later. Also, a parser would have
> no idea about when a test is finished if diagnostics follow the last
> test result of a plan.
>
> Making it "always before" is fine by me, and the motivation for making
> the "summary" lines be non-diagnostics because they technically followed
> the last test result.
>
> > Could we possibly have a rule about diagnostic lines positioning relative
> > to the testcase result line? Maybe a rule like this: For a single test,
> > diagnostic lines should either always preceded the testcase they
> > are related to, or follow the testcase they are related to.
>
> For the kernel, I would rather avoid anything after because that just
> confuses things.
>
> > > Yes. (Though "exit=" is a mistake in runner.sh -- this should probably
> > > be reported without the '#')
> >
> > Please No! (Unless I'm misinterpreting that line).
> >
> > If you remove the '#', that makes the 'exit=127' part of the test description.
> > The test description should be invariant between success and failure cases.
> > The exit=127 (if I'm reading that line right) is part of data explaining or
> > describing the failure. It would be 'exit=0' if the test succeeded.
> >
> > In order to correlate test cases (ie show results over time), the description
> > should not change from one invocation of the test to the next. At least Fuego
> > uses the test description this way. That is, the testcase description is used as
> > a testcase identifier, and if it changes from run to run things get confusing.
>
> Agreed, I think it should be part of an ERROR directive:
>
> not ok 14 awesome test # ERROR exit=127

As long as the line above means that testcase '14 awesome test' could not be
run, and exit=127 helps diagnose that, then I agree. If 'exit=127' means that
testcase '14 awesome test' was run and exit=127 was a failing result, I would
say that the result line should just be:
not ok 14 awesome test # exit=127

-- Tim Bird