Re: [PATCH v3 1/2] kunit: add a KUnit test for SLUB debugging functionality

From: Marco Elver
Date: Thu Apr 01 2021 - 05:17:40 EST


[Note, if you'd like me to see future versions, please Cc me, otherwise
it's unlikely I see it in time. Also add kunit-dev@xxxxxxxxxxxxxxxx if
perhaps a KUnit dev should have another look, too.]

On Wed, Mar 31, 2021 at 10:51AM +0200, glittao@xxxxxxxxx wrote:
> From: Oliver Glitta <glittao@xxxxxxxxx>
>
> SLUB has resiliency_test() function which is hidden behind #ifdef
> SLUB_RESILIENCY_TEST that is not part of Kconfig, so nobody
> runs it. KUnit should be a proper replacement for it.
>
> Try changing byte in redzone after allocation and changing
> pointer to next free node, first byte, 50th byte and redzone
> byte. Check if validation finds errors.
>
> There are several differences from the original resiliency test:
> Tests create own caches with known state instead of corrupting
> shared kmalloc caches.
>
> The corruption of freepointer uses correct offset, the original
> resiliency test got broken with freepointer changes.
>
> Scratch changing random byte test, because it does not have
> meaning in this form where we need deterministic results.
>
> Add new option CONFIG_SLUB_KUNIT_TEST in Kconfig.
>
> Add a counter field "errors" to struct kmem_cache to count number
> of errors detected in cache.
>
> Silence bug report in SLUB test. Add SLAB_SILENT_ERRORS debug flag.
> Add SLAB_SILENT_ERRORS flag to SLAB_NEVER_MERGE, SLAB_DEBUG_FLAGS,
> SLAB_FLAGS_PERMITTED macros.
>
> Signed-off-by: Oliver Glitta <glittao@xxxxxxxxx>
> ---
> Changes since v2
>
> Use bit operation & instead of logical && as reported by kernel test
> robot and Dan Carpenter
>
> Changes since v1
>
> Conversion from kselftest to KUnit test suggested by Marco Elver.
> Error silencing.
> Error counting improvements.
>
> include/linux/slab.h | 2 +
> include/linux/slub_def.h | 2 +
> lib/Kconfig.debug | 5 ++
> lib/Makefile | 1 +
> lib/test_slub.c | 124 +++++++++++++++++++++++++++++++++++++++
> mm/slab.h | 7 ++-
> mm/slab_common.c | 2 +-
> mm/slub.c | 64 +++++++++++++-------
> 8 files changed, 184 insertions(+), 23 deletions(-)
> create mode 100644 lib/test_slub.c
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 7ae604076767..ed1a5a64d028 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -25,6 +25,8 @@
> */
> /* DEBUG: Perform (expensive) checks on alloc/free */
> #define SLAB_CONSISTENCY_CHECKS ((slab_flags_t __force)0x00000100U)
> +/* DEBUG: Silent bug reports */
> +#define SLAB_SILENT_ERRORS ((slab_flags_t __force)0x00000200U)

This flag wouldn't be necessary if you do the design using
kunit_resource (see below).

(But perhaps I missed a conversation that said that this flag is
generally useful, but if so, it should probably be in a separate patch
justifying why it is required beyond the test.)

> /* DEBUG: Red zone objs in a cache */
> #define SLAB_RED_ZONE ((slab_flags_t __force)0x00000400U)
> /* DEBUG: Poison objects */
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index dcde82a4434c..e4b51bb5bb83 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h
> @@ -133,6 +133,8 @@ struct kmem_cache {
> unsigned int usersize; /* Usercopy region size */
>
> struct kmem_cache_node *node[MAX_NUMNODES];
> +
> + int errors; /* Number of errors in cache */

So, I think it's bad design to add a new field 'errors', just for the
test. This will increase kmem_cache size for all builds, which is
unnecessary.

Is there use to retrieve 'errors' elsewhere?

While you could guard this with #ifdef CONFIG_SLUB_DEBUG or so, there's
a better design option if this is just for the KUnit test's benefit: use
kunit_resource.

The way it'd work is that for each test (you can add a common init
function) you add a named resource, in this case just an 'int' I guess,
that slab would be able to retrieve if this test is being run.

In the test somewhere, you could add something like this:


static struct kunit_resource resource;
static int slab_errors;

..........

static int test_init(struct kunit *test)
{
slab_errors = 0;
kunit_add_named_resource(test, NULL, NULL, &resource,
"slab_errors", &slab_errors);
return 0;
}

...... tests now check slab_errors .....

and then in slub.c you'd have:

#if IS_ENABLED(CONFIG_KUNIT)
static bool slab_add_kunit_errors(void)
{
struct kunit_resource *resource;

if (likely(!current->kunit_test))
return false;
resource = kunit_find_named_resource(current->kunit_test, "slab_errors");
if (!resource)
return false;
(*(int *)resource->data)++;
kunit_put_resource(resource);
}
#else
static inline bool slab_add_kunit_errors(void) { return false; }
#endif

And anywhere you want to increase the error count, you'd call
slab_add_kunit_errors().

Another benefit of this approach is that if KUnit is disabled, there is
zero overhead and no additional code generated (vs. the current
approach).

> };
>
> #ifdef CONFIG_SLUB_CPU_PARTIAL
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 2779c29d9981..e0dec830c269 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -2371,6 +2371,11 @@ config BITS_TEST
>
> If unsure, say N.
>
> +config SLUB_KUNIT_TEST
> + tristate "KUnit Test for SLUB cache error detection" if !KUNIT_ALL_TESTS
> + depends on SLUB_DEBUG && KUNIT
> + default KUNIT_ALL_TESTS

Help text please.

> +
> config TEST_UDELAY
> tristate "udelay test driver"
> help
> diff --git a/lib/Makefile b/lib/Makefile
> index b5307d3eec1a..e1eb986c0e87 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -352,5 +352,6 @@ obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o
> obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o
> obj-$(CONFIG_BITS_TEST) += test_bits.o
> obj-$(CONFIG_CMDLINE_KUNIT_TEST) += cmdline_kunit.o
> +obj-$(CONFIG_SLUB_KUNIT_TEST) += test_slub.o

Any reason to not name this slub_kunit.c? This is a new test, and it
could follow the naming convention of the other KUnit tests.

Some of it is also documented:
https://www.kernel.org/doc/html/latest/dev-tools/kunit/style.html

> obj-$(CONFIG_GENERIC_LIB_DEVMEM_IS_ALLOWED) += devmem_is_allowed.o
> diff --git a/lib/test_slub.c b/lib/test_slub.c
> new file mode 100644
> index 000000000000..4f8ea3c7d867
> --- /dev/null
> +++ b/lib/test_slub.c
> @@ -0,0 +1,124 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <kunit/test.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include "../mm/slab.h"
> +
> +
> +static void test_clobber_zone(struct kunit *test)
> +{
> + struct kmem_cache *s = kmem_cache_create("TestSlub_RZ_alloc", 64, 0,
> + SLAB_RED_ZONE | SLAB_SILENT_ERRORS, NULL);
> + u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
> +
> + p[64] = 0x12;
> +
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 1, s->errors);
> +
> + kmem_cache_free(s, p);
> + kmem_cache_destroy(s);
> +}
> +
> +static void test_next_pointer(struct kunit *test)
> +{
> + struct kmem_cache *s = kmem_cache_create("TestSlub_next_ptr_free", 64, 0,
> + SLAB_POISON | SLAB_SILENT_ERRORS, NULL);
> + u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
> + unsigned long tmp;
> + unsigned long *ptr_addr;
> +
> + kmem_cache_free(s, p);
> +
> + ptr_addr = (unsigned long *)(p + s->offset);
> + tmp = *ptr_addr;
> + p[s->offset] = 0x12;
> +
> + /*
> + * Expecting two errors.
> + * One for the corrupted freechain and the other one for the wrong
> + * count of objects in use.
> + */
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 2, s->errors);
> +
> + /*
> + * Try to repair corrupted freepointer.
> + * Still expecting one error for the wrong count of objects in use.
> + */
> + *ptr_addr = tmp;
> +
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 1, s->errors);
> +
> + /*
> + * Previous validation repaired the count of objects in use.
> + * Now expecting no error.
> + */
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 0, s->errors);
> +
> + kmem_cache_destroy(s);
> +}
> +
> +static void test_first_word(struct kunit *test)
> +{
> + struct kmem_cache *s = kmem_cache_create("TestSlub_1th_word_free", 64, 0,
> + SLAB_POISON | SLAB_SILENT_ERRORS, NULL);
> + u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
> +
> + kmem_cache_free(s, p);
> + *p = 0x78;
> +
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 1, s->errors);
> +
> + kmem_cache_destroy(s);
> +}
> +
> +static void test_clobber_50th_byte(struct kunit *test)
> +{
> + struct kmem_cache *s = kmem_cache_create("TestSlub_50th_word_free", 64, 0,
> + SLAB_POISON | SLAB_SILENT_ERRORS, NULL);
> + u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
> +
> + kmem_cache_free(s, p);
> + p[50] = 0x9a;
> +
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 1, s->errors);
> + kmem_cache_destroy(s);
> +}
> +
> +static void test_clobber_redzone_free(struct kunit *test)
> +{
> + struct kmem_cache *s = kmem_cache_create("TestSlub_RZ_free", 64, 0,
> + SLAB_RED_ZONE | SLAB_SILENT_ERRORS, NULL);
> + u8 *p = kmem_cache_alloc(s, GFP_KERNEL);
> +
> + kmem_cache_free(s, p);
> + p[64] = 0xab;
> +
> + validate_slab_cache(s);
> + KUNIT_EXPECT_EQ(test, 1, s->errors);
> + kmem_cache_destroy(s);
> +}
> +
> +static struct kunit_case test_cases[] = {
> + KUNIT_CASE(test_clobber_zone),
> + KUNIT_CASE(test_next_pointer),
> + KUNIT_CASE(test_first_word),
> + KUNIT_CASE(test_clobber_50th_byte),
> + KUNIT_CASE(test_clobber_redzone_free),
> + {}
> +};
> +
> +static struct kunit_suite test_suite = {
> + .name = "slub_test",
> + .test_cases = test_cases,

Here you could add a

.init = test_init,

or so, if you go with the kunit_resource design.

> +};
> +kunit_test_suite(test_suite);
> +
> +MODULE_LICENSE("GPL");
> diff --git a/mm/slab.h b/mm/slab.h
> index 076582f58f68..382507b6cab9 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -134,7 +134,8 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
> #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER)
> #elif defined(CONFIG_SLUB_DEBUG)
> #define SLAB_DEBUG_FLAGS (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> - SLAB_TRACE | SLAB_CONSISTENCY_CHECKS)
> + SLAB_TRACE | SLAB_CONSISTENCY_CHECKS | \
> + SLAB_SILENT_ERRORS)
> #else
> #define SLAB_DEBUG_FLAGS (0)
> #endif
> @@ -164,7 +165,8 @@ static inline slab_flags_t kmem_cache_flags(unsigned int object_size,
> SLAB_NOLEAKTRACE | \
> SLAB_RECLAIM_ACCOUNT | \
> SLAB_TEMPORARY | \
> - SLAB_ACCOUNT)
> + SLAB_ACCOUNT | \
> + SLAB_SILENT_ERRORS)
>
> bool __kmem_cache_empty(struct kmem_cache *);
> int __kmem_cache_shutdown(struct kmem_cache *);
> @@ -215,6 +217,7 @@ DECLARE_STATIC_KEY_TRUE(slub_debug_enabled);
> DECLARE_STATIC_KEY_FALSE(slub_debug_enabled);
> #endif
> extern void print_tracking(struct kmem_cache *s, void *object);
> +long validate_slab_cache(struct kmem_cache *s);
> #else
> static inline void print_tracking(struct kmem_cache *s, void *object)
> {
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 88e833986332..239c9095e7ea 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -55,7 +55,7 @@ static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
> */
> #define SLAB_NEVER_MERGE (SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | \
> SLAB_TRACE | SLAB_TYPESAFE_BY_RCU | SLAB_NOLEAKTRACE | \
> - SLAB_FAILSLAB | kasan_never_merge())
> + SLAB_FAILSLAB | SLAB_SILENT_ERRORS | kasan_never_merge())
>
> #define SLAB_MERGE_SAME (SLAB_RECLAIM_ACCOUNT | SLAB_CACHE_DMA | \
> SLAB_CACHE_DMA32 | SLAB_ACCOUNT)
> diff --git a/mm/slub.c b/mm/slub.c
> index 3021ce9bf1b3..7083e89432d0 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -673,14 +673,16 @@ static void slab_bug(struct kmem_cache *s, char *fmt, ...)
>
> static void slab_fix(struct kmem_cache *s, char *fmt, ...)
> {
> - struct va_format vaf;
> - va_list args;
> -
> - va_start(args, fmt);
> - vaf.fmt = fmt;
> - vaf.va = &args;
> - pr_err("FIX %s: %pV\n", s->name, &vaf);
> - va_end(args);
> + if (!(s->flags & SLAB_SILENT_ERRORS)) {

This style of guarding causes lots of line diffs.

Instead, if the whole function is to be skipped, please prefer:

if (skip_things_if_true)
return;

instead of

if (!skip_things_if_true) {
...
}

here and everywhere else.

Specifically, for the kunit_resource design, this would simply become:

if (slab_add_kunit_errors())
return;

Meaning that whenever a KUnit test has a resource "slab_errors", no
messages are printed -- otherwise it'll always print a message.

> + struct va_format vaf;
> + va_list args;
> +
> + va_start(args, fmt);
> + vaf.fmt = fmt;
> + vaf.va = &args;
> + pr_err("FIX %s: %pV\n", s->name, &vaf);
> + va_end(args);
> + }
> }
>
> static bool freelist_corrupted(struct kmem_cache *s, struct page *page,
> @@ -739,8 +741,10 @@ static void print_trailer(struct kmem_cache *s, struct page *page, u8 *p)
> void object_err(struct kmem_cache *s, struct page *page,
> u8 *object, char *reason)
> {
> - slab_bug(s, "%s", reason);
> - print_trailer(s, page, object);
> + if (!(s->flags & SLAB_SILENT_ERRORS)) {
> + slab_bug(s, "%s", reason);
> + print_trailer(s, page, object);
> + }
> }
>
> static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
> @@ -752,9 +756,11 @@ static __printf(3, 4) void slab_err(struct kmem_cache *s, struct page *page,
> va_start(args, fmt);
> vsnprintf(buf, sizeof(buf), fmt, args);
> va_end(args);
> - slab_bug(s, "%s", buf);
> - print_page_info(page);
> - dump_stack();
> + if (!(s->flags & SLAB_SILENT_ERRORS)) {
> + slab_bug(s, "%s", buf);
> + print_page_info(page);
> + dump_stack();
> + }
> }
>
> static void init_object(struct kmem_cache *s, void *object, u8 val)
> @@ -798,11 +804,13 @@ static int check_bytes_and_report(struct kmem_cache *s, struct page *page,
> while (end > fault && end[-1] == value)
> end--;
>
> - slab_bug(s, "%s overwritten", what);
> - pr_err("INFO: 0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n",
> + if (!(s->flags & SLAB_SILENT_ERRORS)) {
> + slab_bug(s, "%s overwritten", what);
> + pr_err("INFO: 0x%p-0x%p @offset=%tu. First byte 0x%x instead of 0x%x\n",
> fault, end - 1, fault - addr,
> fault[0], value);
> - print_trailer(s, page, object);
> + print_trailer(s, page, object);
> + }
>
> restore_bytes(s, what, value, fault, end);
> return 0;
> @@ -964,6 +972,7 @@ static int check_slab(struct kmem_cache *s, struct page *page)
>
> if (!PageSlab(page)) {
> slab_err(s, page, "Not a valid slab page");
> + s->errors += 1;

So to me it looks like errors is set whenever there's a slab_err() or
slab_fix() call. So why not move setting that an error occurred into
those functions?

> return 0;
> }
>
> @@ -971,11 +980,13 @@ static int check_slab(struct kmem_cache *s, struct page *page)
> if (page->objects > maxobj) {
> slab_err(s, page, "objects %u > max %u",
> page->objects, maxobj);
> + s->errors += 1;
> return 0;
> }
> if (page->inuse > page->objects) {
> slab_err(s, page, "inuse %u > max %u",
> page->inuse, page->objects);
> + s->errors += 1;
> return 0;
> }
> /* Slab_pad_check fixes things up after itself */
> @@ -1008,8 +1019,10 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
> page->freelist = NULL;
> page->inuse = page->objects;
> slab_fix(s, "Freelist cleared");
> + s->errors += 1;
> return 0;
> }
> + s->errors += 1;
> break;
> }
> object = fp;
> @@ -1026,12 +1039,14 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search)
> page->objects, max_objects);
> page->objects = max_objects;
> slab_fix(s, "Number of objects adjusted.");
> + s->errors += 1;
> }
> if (page->inuse != page->objects - nr) {
> slab_err(s, page, "Wrong object count. Counter is %d but counted were %d",
> page->inuse, page->objects - nr);
> page->inuse = page->objects - nr;
> slab_fix(s, "Object count adjusted.");
> + s->errors += 1;
> }
> return search == NULL;
> }
> @@ -4629,8 +4644,10 @@ static void validate_slab(struct kmem_cache *s, struct page *page)
> u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
> SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
>
> - if (!check_object(s, page, p, val))
> + if (!check_object(s, page, p, val)) {
> + s->errors += 1;
> break;
> + }
> }
> put_map(map);
> unlock:
> @@ -4650,9 +4667,11 @@ static int validate_slab_node(struct kmem_cache *s,
> validate_slab(s, page);
> count++;
> }
> - if (count != n->nr_partial)
> + if (count != n->nr_partial) {
> pr_err("SLUB %s: %ld partial slabs counted but counter=%ld\n",
> s->name, count, n->nr_partial);
> + s->errors += 1;
> + }
>
> if (!(s->flags & SLAB_STORE_USER))
> goto out;
> @@ -4661,20 +4680,23 @@ static int validate_slab_node(struct kmem_cache *s,
> validate_slab(s, page);
> count++;
> }
> - if (count != atomic_long_read(&n->nr_slabs))
> + if (count != atomic_long_read(&n->nr_slabs)) {
> pr_err("SLUB: %s %ld slabs counted but counter=%ld\n",
> s->name, count, atomic_long_read(&n->nr_slabs));
> + s->errors += 1;

I think these here are a few cases where there's no slab_err() or
slab_fix() call, and open coding setting errors is still required...

> + }
>
> out:
> spin_unlock_irqrestore(&n->list_lock, flags);
> return count;
> }
>
> -static long validate_slab_cache(struct kmem_cache *s)
> +long validate_slab_cache(struct kmem_cache *s)
> {
> int node;
> unsigned long count = 0;
> struct kmem_cache_node *n;
> + s->errors = 0;

This would also go away if you use the kunit_resource design. I also
think it's better if the test fully controls 'slab_errors', and it isn't
reset by this function.

> flush_all(s);
> for_each_kmem_cache_node(s, node, n)
> @@ -4682,6 +4704,8 @@ static long validate_slab_cache(struct kmem_cache *s)
>
> return count;
> }
> +EXPORT_SYMBOL(validate_slab_cache);
> +
> /*
> * Generate lists of code addresses where slabcache objects are allocated
> * and freed.
> --
> 2.17.1