Re: [RFC PATCH 0/1] Large folios in block buffered IO path
From: Bharata B Rao
Date: Tue Dec 03 2024 - 00:02:15 EST
On 02-Dec-24 3:38 PM, Mateusz Guzik wrote:
On Mon, Dec 2, 2024 at 10:37 AM Bharata B Rao <bharata@xxxxxxx> wrote:
On 28-Nov-24 10:01 AM, Mateusz Guzik wrote:
Willy mentioned the folio wait queue hash table could be grown; you
can find it in mm/filemap.c:
#define PAGE_WAIT_TABLE_BITS 8
#define PAGE_WAIT_TABLE_SIZE (1 << PAGE_WAIT_TABLE_BITS)
static wait_queue_head_t folio_wait_table[PAGE_WAIT_TABLE_SIZE] __cacheline_aligned;

static wait_queue_head_t *folio_waitqueue(struct folio *folio)
{
	return &folio_wait_table[hash_ptr(folio, PAGE_WAIT_TABLE_BITS)];
}
Can you collect off cpu time? offcputime-bpfcc -K > /tmp/out
Flamegraph for "perf record --off-cpu -F 99 -a -g --all-kernel
--kernel-callchains -- sleep 120" is attached.
Off-cpu samples were collected for 120s around the 45th minute of the
FIO benchmark, which runs for 1hr in total. This run used a kernel that
had your inode_lock fix but no changes to PAGE_WAIT_TABLE_BITS.
Hopefully this captures a representative sample of the scalability
issue with the folio lock.
Here is the data from an offcputime-bpfcc -K run with the inode_lock fix
and no change to PAGE_WAIT_TABLE_BITS. This data was captured for the
entire duration of the FIO run (1hr). Since the data is huge, I am
pasting only a few relevant entries.
The first entry in the offcputime records
finish_task_switch.isra.0
schedule
irqentry_exit_to_user_mode
irqentry_exit
sysvec_reschedule_ipi
asm_sysvec_reschedule_ipi
- fio (33790)
2
There are thousands of entries for the read and write paths of FIO; I
have shown only the first and last entry for each path here.
First entry for FIO read path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
filemap_get_pages
filemap_read
blkdev_read_iter
vfs_read
ksys_read
__x64_sys_read
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (34143)
3381769535
Last entry for FIO read path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
filemap_get_pages
filemap_read
blkdev_read_iter
vfs_read
ksys_read
__x64_sys_read
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (34171)
3516224519
First entry for FIO write path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
__filemap_get_folio
iomap_get_folio
iomap_write_begin
iomap_file_buffered_write
blkdev_write_iter
vfs_write
ksys_write
__x64_sys_write
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (33842)
48900
Last entry for FIO write path that waits on folio_lock
finish_task_switch.isra.0
schedule
io_schedule
folio_wait_bit_common
__filemap_get_folio
iomap_get_folio
iomap_write_begin
iomap_file_buffered_write
blkdev_write_iter
vfs_write
ksys_write
__x64_sys_write
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- fio (34187)
1815993
The last entry in the offcputime records
finish_task_switch.isra.0
schedule
futex_wait_queue
__futex_wait
futex_wait
do_futex
__x64_sys_futex
x64_sys_call
do_syscall_64
entry_SYSCALL_64_after_hwframe
- multipathd (6308)
3698877753
Regards,
Bharata.