On Wed, 2024-08-07 at 01:24 +0200, Thomas Gleixner wrote:
Cc+: Helge, parisc ML
We're chasing a weird failure which has been tracked down to the
placement of the division library functions (I assume they are
imported from libgcc).
See the thread starting at:
https://lore.kernel.org/all/718b8afe-222f-4b3a-96d3-93af0e4ceff1@xxxxxxxxxxxx
On Tue, Aug 06 2024 at 21:25, Vlastimil Babka wrote:
On 8/6/24 19:33, Thomas Gleixner wrote:
So this change adds 16 bytes to __softirq() which moves the division
functions up by 16 bytes. That's all it takes to make the stupid go
away....
Heh, I was actually wondering if the division is somehow messed up,
because maxobj = order_objects() and order_objects() does a division.
Now I suspect it even more.
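For reference, the division in question is the one inside order_objects(). The following is a user-space sketch modelled on the helper in mm/slub.c. It assumes 4 KiB pages and that the helper simply divides the slab size in bytes by the object size. PA-RISC has no single integer divide instruction, so on parisc the `/` below turns into a call to a millicode routine. The thread shows check_slab() calling $$divoI, which is why the placement of that code matters at all.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, as implied by the 4096/192 arithmetic in the thread. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Modelled on order_objects() in mm/slub.c: the number of objects of
 * `size` bytes that fit into a slab of 2^order pages. The result is
 * truncated toward zero by the unsigned division.
 */
static unsigned int order_objects(unsigned int order, unsigned int size)
{
	assert(size != 0);
	return (SKETCH_PAGE_SIZE << order) / size;
}
```

With order 0 and size 192, the values reported for kmem_cache_node, this yields 21. A slab holding 21 objects should therefore pass the check.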
check_slab() calls into that muck, but I checked the disassembly of a
working and a broken kernel and the only difference there is the
displacement offset when the code calculates the call address, but
that's, as expected, a difference of 16 bytes.
Now it becomes interesting.
I added an unused function after __do_softirq() into the softirq text
section and filled it with ASM nonsense so that it occupies exactly one
page. That moves $$divoI, which is what check_slab() calls, exactly one
page forward:
-0000000041218c70 T $$divoI
+0000000041219c70 T $$divoI
Guess what happens? It falls on its nose again.
Now with that ASM gunk I can steer the size conveniently. It works up to:
0000000041219c50 T $$divoI
and fails for
0000000041219c60 T $$divoI
0000000041219c70 T $$divoI
and works again at
0000000041219c80 T $$divoI
So just on this, you seem to have proved that only exact multiples of
48 work. In terms of how PA-RISC caching works that's completely nuts
... however, there may be something else at work, like stack frame
alignment.
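The address arithmetic behind these observations can be checked with a couple of trivial helpers. page_offset_of() and placement_delta() are hypothetical names used only for illustration, and 4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, consistent with the 0x1000 shift in the nm output. */
#define SKETCH_PAGE_MASK 0xfffULL

/* Offset of a symbol address within its 4 KiB page. */
static uint64_t page_offset_of(uint64_t addr)
{
	return addr & SKETCH_PAGE_MASK;
}

/* Distance in bytes between two placements of the same symbol (lo <= hi). */
static uint64_t placement_delta(uint64_t lo, uint64_t hi)
{
	assert(lo <= hi);
	return hi - lo;
}
```

The one-page shift moves $$divoI from 0x41218c70 to 0x41219c70, a delta of 0x1000. Its in-page offset stays 0xc70. The last working placement before the failures (0xc50) and the first working one after them (0xc80) are 0x30 = 48 bytes apart, which appears to be where the 48 comes from.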
So I added the following:
+extern void testme(void);
+extern unsigned int testsize;
+
+unsigned int testsize = 192;
+
+void __init testme(void)
+{
+ pr_info("TESTME: %lu\n", PAGE_SIZE / testsize);
+}
called that _before_ mm_core_init() from init/main.c and adjusted my
ASM hack to make $$divoI be at:
0000000041219c70 T $$divoI
again and surprisingly the output is:
[ 0.000000] softirq: TESTME: 21
OK, why is that surprising? 4096/192 is 21 due to integer truncation.
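A result like this can be double-checked without trusting the division routine under suspicion. One option is a plain shift-and-subtract (restoring) long division. The helper slow_udiv() below is hypothetical and not a kernel API. It is a generic technique, not how libgcc or the parisc millicode actually implements division.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical debug helper: unsigned division by restoring
 * shift-and-subtract, so a quotient can be cross-checked without
 * going through the compiler's / operator or a millicode routine.
 */
static uint32_t slow_udiv(uint32_t dividend, uint32_t divisor)
{
	uint32_t quotient = 0;
	uint64_t remainder = 0; /* 64 bits so the shift below cannot overflow */

	assert(divisor != 0);
	for (int bit = 31; bit >= 0; bit--) {
		/* Bring down the next dividend bit, most significant first. */
		remainder = (remainder << 1) | ((dividend >> bit) & 1u);
		if (remainder >= divisor) {
			remainder -= divisor;
			quotient |= 1u << bit;
		}
	}
	return quotient;
}
```

Comparing its result with the compiled division inside a debug hook such as testme() could show whether the two ever disagree for a given $$divoI placement.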
Now I went back to the hppa64 gcc version 12.2.0 again and did the
same ASM gunk adjustment so that $$divoI ends up at the offset 0xc70
in the page and the same happens.
So it's not a compiler-dependent problem.
But then I added a testme() call to the error path and got:
[ 0.000000] softirq: TESTME: 21
[ 0.000000] =============================================================================
[ 0.000000] BUG kmem_cache_node (Not tainted): objects 21 > max 16 size 192 sorder 0
Now what's wrong?
Adding more debug:
[ 0.000000] BUG kmem_cache_node (Not tainted): objects 21 > max 16 size 192 sorder 0 21
where the last '21' is the output of the same call which made maxobj
go south:
 static int check_slab(struct kmem_cache *s, struct slab *slab)
 {
 	int maxobj;
@@ -1386,8 +1388,10 @@ static int check_slab(struct kmem_cache
 	maxobj = order_objects(slab_order(slab), s->size);
 	if (slab->objects > maxobj) {
-		slab_err(s, slab, "objects %u > max %u",
-			slab->objects, maxobj);
+		testme();
+		slab_err(s, slab, "objects %u > max %u size %u sorder %u %u",
+			 slab->objects, maxobj, s->size, slab_order(slab),
+			 order_objects(slab_order(slab), s->size));
 		return 0;
 	}
 	if (slab->inuse > slab->objects) {
I don't know and I don't want to know TBH...
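A reduced model of the first check in check_slab() makes the inconsistency explicit. The sketch_* structures below are hypothetical stand-ins that keep only the fields the check reads. The division mirrors order_objects() under the assumption of 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-ins for struct kmem_cache and struct slab. */
struct sketch_cache { unsigned int size; };
struct sketch_slab  { unsigned int order; unsigned int objects; };

/* Mirrors order_objects(): objects of `size` bytes in 2^order pages. */
static unsigned int order_objects(unsigned int order, unsigned int size)
{
	assert(size != 0);
	return (SKETCH_PAGE_SIZE << order) / size;
}

/* Mirrors the "objects > maxobj" test: true if the slab's count is plausible. */
static bool objects_fit(const struct sketch_cache *s,
			const struct sketch_slab *slab)
{
	unsigned int maxobj = order_objects(slab->order, s->size);

	return slab->objects <= maxobj;
}
```

With the kmem_cache_node values from the log (size 192, order 0, 21 objects), the computed maximum is 21 and the check passes. The splat therefore means that the first order_objects() call in check_slab() produced 16, while the identical call in the debug output produced 21.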
OK, so you're telling us we have a problem with slab_order on parisc
... that's folio_order, so it smells like a parisc bug with
folio_test_large? Unfortunately I'm a bit pissed in an airport lounge
on my way to the UK, so I've lost access to my pa test rig and can't
test further for a while.