Re: Known and unfixed active data loss bug in MM + XFS with large folios since Dec 2021 (any kernel from 6.1 upwards)

From: Chris Mason
Date: Mon Sep 30 2024 - 20:56:58 EST


On 9/30/24 7:34 PM, Christian Theune wrote:
> Hi,
>
> we’ve been running a number of VMs since last week on 6.11. We’ve encountered one hung task situation multiple times now that seems to be resolving itself after a bit of time, though. I do not see spinning CPU during this time.
>
> The situation seems to be related to cgroups-based IO throttling / weighting so far:
>
> Here are three examples of similar tracebacks where jobs that do perform a certain amount of IO (either given a weight or given an explicit limit like this:
>
> IOWeight=10
> IOReadIOPSMax=/dev/vda 188
> IOWriteIOPSMax=/dev/vda 188
>
> Telemetry for the affected VM does not show that it actually reaches 188 IOPS (the load is mostly writing) but creates a kind of gaussian curve …
>
> The underlying storage and network was completely inconspicuous during the whole time.

Not disagreeing with Linus at all, but given that you've got IO
throttling too, we might really just be waiting. It's hard to tell
because the hung task timeouts only give you information about one process.

I've attached a minimal version of a script we use here to show all the
D state processes, it might help explain things. The only problem is
you have to actually ssh to the box and run it when you're stuck.

The idea is to print the stack trace of every D state process, and then
also print out how often each unique stack trace shows up. When we're
deadlocked on something, there are normally a bunch of the same stack
(say waiting on writeback) and then one jerk sitting around in a
different stack who is causing all the trouble.

(I made some quick changes to make this smaller, so apologies if you get
silly errors)

Example output:

sudo ./walker.py
15 rcu_tasks_trace_kthread D
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30

1440504 dd D
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

-----
stack summary

1 hit:
[<0>] __wait_rcu_gp+0xab/0x120
[<0>] synchronize_rcu+0x46/0xd0
[<0>] rcu_tasks_wait_gp+0x86/0x2a0
[<0>] rcu_tasks_one_gp+0x300/0x430
[<0>] rcu_tasks_kthread+0x9a/0xb0
[<0>] kthread+0xad/0xe0
[<0>] ret_from_fork+0x1f/0x30

-----
[<0>] folio_wait_bit_common+0x149/0x2d0
[<0>] filemap_read+0x7bd/0xd10
[<0>] blkdev_read_iter+0x5b/0x130
[<0>] __x64_sys_read+0x1ce/0x3f0
[<0>] do_syscall_64+0x3d/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0#!/usr/bin/env python3
#
# this walks all the tasks on the system and prints out a stack trace
# of any tasks waiting in D state. If you pass -a, it will print out
# the stack of every task it finds.
#
# It also makes a histogram of the common stacks so you can see where
# more of the tasks are. Usually when we're deadlocked, we care about
# the least common stacks.
#
import sys
import os
import argparse

parser = argparse.ArgumentParser(description='Show kernel stacks')
parser.add_argument('-a', '--all_tasks', action='store_true', help='Dump all stacks')
parser.add_argument('-p', '--pid', type=str, help='Filter on pid')
parser.add_argument('-c', '--command', type=str, help='Filter on command name')
options = parser.parse_args()

stacks = {}

# parse the units from a number and normalize into KB
def parse_number(s):
try:
words = s.split()
unit = words[-1].lower()
number = int(words[1])
tag = words[0].lower().rstrip(':')

# we store in kb
if unit == "mb":
number = number * 1024
elif unit == "gb":
number = number * 1024 * 1024
elif unit == "tb":
number = number * 1024 * 1024

return (tag, number)
except:
return (None, None)

# read /proc/pid/stack and add it to the hashes
def add_stack(path, pid, cmd, status):
global stacks

try:
stack = open(os.path.join(path, "stack"), 'r').read()
except:
return

if (status != "D" and not options.all_tasks):
return

print("%s %s %s" % (pid, cmd, status))
print(stack)
v = stacks.get(stack)
if v:
v += 1
else:
v = 1
stacks[stack] = v


# worker to read all the files for one individual task
def run_one_task(path):

try:
stat = open(os.path.join(path, "stat"), 'r').read()
except:
return
words = stat.split()
pid, cmd, status = words[0:3]

cmd = cmd.lstrip('(')
cmd = cmd.rstrip(')')

if options.command and options.command != cmd:
return

add_stack(path, pid, cmd, status)

def print_usage():
sys.stderr.write("Usage: %s [-a]\n" % sys.argv[0])
sys.exit(1)

# for a given pid in string form, read the files from proc
def run_pid(name):
try:
pid = int(name)
except:
return

p = os.path.join("/proc", name, "task")
if not os.path.exists(p):
return

try:
for t in os.listdir(p):
run_one_task(os.path.join(p, t))
except:
pass

if options.pid:
run_pid(options.pid)
else:
for name in os.listdir("/proc"):
run_pid(name)

values = {}
for stack, count in stacks.items():
l = values.setdefault(count, [])
l.append(stack)

counts = list(values.keys())
counts.sort(reverse=True)
if counts:
print("-----\nstack summary\n")

for x in counts:
if x == 1:
print("1 hit:")
else:
print("%d hits: " % x)

l = values[x]
for stack in l:
print(stack)
print("-----")