Re: Why inode number is zero in writeback?
From: Andreas Dilger
Date: Wed Jul 25 2018 - 12:56:47 EST
On Jul 25, 2018, at 6:01 AM, Ilya Plenne <libbkmz.dev@xxxxxxxxx> wrote:
> I'm researching the Linux kernel. Right now only v3.10.61, as it's just
> a proof of concept.
> I need to pass hints through to the hardware about what kind of data
> a particular WRITE/READ operation carries, e.g. reading the inode bitmap,
> writing a journal block, writing user data, or something else...
> I'm assuming that at driver level I can reach bio struct from
> request_queue struct just for now. For testing purposes I'm using
> virtio_blk driver under qemu.
Please see the following articles/threads for similar developments
that are already present in Linux:
Author: Jens Axboe <axboe@xxxxxxxxx>
AuthorDate: Tue Jun 27 11:47:04 2017 -0600
Commit: Jens Axboe <axboe@xxxxxxxxx>
CommitDate: Tue Jun 27 12:05:22 2017 -0600
fs: add fcntl() interface for setting/getting write life time hints
This patch landed in kernel 4.13, so you would be well advised to
update your testing to at least that kernel. If there is a specific
reason to use the 3.10 kernel (e.g. RHEL7 requirement) then it would
be best to backport this patch series to the older kernel, rather than
making a different set of interfaces that will conflict with newer
kernels.
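For reference, the userspace side of that interface can be exercised
roughly as in the sketch below. The wrapper names set_write_hint() and
get_write_hint() are made up for illustration. F_SET_RW_HINT,
F_GET_RW_HINT and the RWH_WRITE_LIFE_* values are the ones documented
in fcntl(2); the fallback #defines are only there in case the libc
headers predate them.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

/* Fallbacks for older libc headers; values follow the kernel uapi. */
#ifndef F_SET_RW_HINT
#define F_GET_RW_HINT (1024 + 11)
#define F_SET_RW_HINT (1024 + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_NOT_SET 0
#define RWH_WRITE_LIFE_NONE    1
#define RWH_WRITE_LIFE_SHORT   2
#define RWH_WRITE_LIFE_MEDIUM  3
#define RWH_WRITE_LIFE_LONG    4
#define RWH_WRITE_LIFE_EXTREME 5
#endif

/* Hypothetical wrapper: set the per-inode write lifetime hint.
 * Returns 0 on success, -1 with errno set on failure. */
int set_write_hint(int fd, uint64_t hint)
{
	return fcntl(fd, F_SET_RW_HINT, &hint);
}

/* Hypothetical wrapper: read the hint back into *hint. */
int get_write_hint(int fd, uint64_t *hint)
{
	return fcntl(fd, F_GET_RW_HINT, hint);
}
```

A caller would open the file and then call, for example,
set_write_hint(fd, RWH_WRITE_LIFE_SHORT) before issuing writes.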
If you want to add additional streams (e.g. for different types of
filesystem metadata), you should modify nvme_configure_directives()
to remove the BLK_MAX_WRITE_HINTS limit to nr_streams, and then use
stream IDs > BLK_MAX_WRITE_HINTS - 1 for the filesystem-specific IDs.
Alternately, add a separate value BLK_MIN_WRITE_HINTS = 5, and check
that against ctrl->nssa, and increase the value of BLK_MAX_WRITE_HINTS
to 16 or similar (not too large, as there is an array in the queue
based on this value to track stream usage stats).
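As a rough userspace model of that second option (not the actual
driver code), the stream count could be chosen as below. The function
pick_nr_streams() is hypothetical, the two constants use the values
suggested above rather than the mainline ones, and the "- 1" assumes
that one hint value needs no stream of its own.

```c
/* Values from the suggestion above; not the mainline definitions. */
#define BLK_MIN_WRITE_HINTS 5
#define BLK_MAX_WRITE_HINTS 16

/*
 * nssa: number of streams the controller reports as available.
 * Returns the number of streams to use, or 0 if there are too few
 * to cover even the user-visible hints (streams stay disabled).
 */
unsigned int pick_nr_streams(unsigned int nssa)
{
	if (nssa < BLK_MIN_WRITE_HINTS - 1)
		return 0;
	if (nssa > BLK_MAX_WRITE_HINTS - 1)
		return BLK_MAX_WRITE_HINTS - 1;
	return nssa;
}
```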
You still wouldn't be able to pass larger hints (stream IDs) from
userspace, but it would be enough to pass hints directly from ext4.
Using higher stream IDs would also avoid conflicts with user supplied
IDs, and doesn't affect operation otherwise.
There would need to be some way to map the higher IDs to lower ones
if there weren't enough distinct IDs in the device. One option is to
encode the filesystem stream ID into the high 8 bits of the fields,
and the fallback hint into the low 8 (really 3) bits of the fields.
The fallback hint would be set manually (e.g. journal and bitmap use
the WRITE_LIFE_SHORT hint, inode uses WRITE_LIFE_MEDIUM, etc.).
It isn't clear whether "user data" has a good fallback hint or not,
but that is always true even if you segregate it into its own class
(it will intermingle short- and long-term files), so it could just
fall back to WRITE_LIFE_NONE.
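The high/low byte encoding could look something like the sketch below,
assuming a 16-bit hint field. All of the names here (hint_pack(),
hint_resolve(), the HINT_* macros) are made up for illustration, and
treating a filesystem stream ID of 0 as "none" is an assumption.

```c
#include <stdint.h>

#define HINT_FS_SHIFT      8      /* filesystem stream ID in the high 8 bits */
#define HINT_FALLBACK_MASK 0xffu  /* fallback lifetime hint in the low 8 bits */

/* Combine a filesystem-specific stream ID with its fallback hint. */
uint16_t hint_pack(uint8_t fs_stream, uint8_t fallback)
{
	return (uint16_t)(((unsigned int)fs_stream << HINT_FS_SHIFT) | fallback);
}

unsigned int hint_fs_stream(uint16_t packed)
{
	return packed >> HINT_FS_SHIFT;
}

unsigned int hint_fallback(uint16_t packed)
{
	return packed & HINT_FALLBACK_MASK;
}

/*
 * Pick the ID to hand to the device: the filesystem stream ID if the
 * device exposes enough streams, otherwise the fallback hint.
 */
unsigned int hint_resolve(uint16_t packed, unsigned int nr_streams)
{
	unsigned int fs = hint_fs_stream(packed);

	if (fs != 0 && fs <= nr_streams)
		return fs;
	return hint_fallback(packed);
}
```

With this layout a device with few streams silently degrades to the
ordinary lifetime hint, so the filesystem does not need to know how
many streams are available.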
Alternately, you could expand enum rw_hint to have a few generic
internal IDs not accessible from userspace, like WRITE_LIFE_INODE,
WRITE_LIFE_BITMAP, WRITE_LIFE_JOURNAL, WRITE_LIFE_DIRECTORY, and
WRITE_LIFE_TREE, etc. that could be used by multiple filesystems.
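A sketch of what such an expanded enum might look like, as a
standalone model rather than a kernel patch. The enum name
rw_hint_sketch and both helper functions are hypothetical. The
fallbacks for journal, bitmap and inode follow the examples given
earlier; the default of WRITE_LIFE_NONE for directory and tree blocks
is only a placeholder assumption.

```c
enum rw_hint_sketch {
	WRITE_LIFE_NOT_SET = 0,
	WRITE_LIFE_NONE,
	WRITE_LIFE_SHORT,
	WRITE_LIFE_MEDIUM,
	WRITE_LIFE_LONG,
	WRITE_LIFE_EXTREME,
	/* Hypothetical internal IDs, never accepted from userspace. */
	WRITE_LIFE_INODE,
	WRITE_LIFE_BITMAP,
	WRITE_LIFE_JOURNAL,
	WRITE_LIFE_DIRECTORY,
	WRITE_LIFE_TREE,
};

/* Only the generic lifetime hints may come in through fcntl(). */
int rw_hint_valid_from_user(unsigned long long hint)
{
	return hint <= WRITE_LIFE_EXTREME;
}

/* Map an internal ID to a generic hint when the device lacks streams. */
enum rw_hint_sketch rw_hint_fallback(enum rw_hint_sketch hint)
{
	switch (hint) {
	case WRITE_LIFE_JOURNAL:
	case WRITE_LIFE_BITMAP:
		return WRITE_LIFE_SHORT;
	case WRITE_LIFE_INODE:
		return WRITE_LIFE_MEDIUM;
	case WRITE_LIFE_DIRECTORY:
	case WRITE_LIFE_TREE:
		return WRITE_LIFE_NONE;	/* assumption: no better guess */
	default:
		return hint;		/* generic hints pass through */
	}
}
```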
> Here is what I have done:
> 1. I've created FS with this command: mkfs.ext4 -b 4096 -E
> lazy_itable_init=0,lazy_journal_init=0 -m 0 /dev/vda
> 2. I've mounted it using this command: mount -o
> rw,nosuid,nodev,discard,noauto_da_alloc,data=ordered /dev/vda /mnt
> 3. Added a breakpoint at virtio_blk.c:377, skipping writes related to the journal
> 4. Executed this dd command: dd if=/dev/urandom of=/mnt/foo2 bs=764 count=34
> 5. Waited for the breakpoint to hit and analyzed the backtrace under the writeback subsystem.
> Here is the backtrace:
> #0 virtblk_request (q=0x87ab8000) at drivers/block/virtio_blk.c:377
> #1 0x801c0b4c in __blk_run_queue_uncond (q=<optimized out>) at
> #2 __blk_run_queue (q=0x87ab8000) at block/blk-core.c:329
> #3 0x801c0c94 in queue_unplugged (q=0x87ab8000, depth=<optimized
> out>, from_schedule=<optimized out>) at block/blk-core.c:2920
> #4 0x801c3a98 in blk_flush_plug_list (plug=<optimized out>,
> from_schedule=false) at block/blk-core.c:3030
> #5 0x801c3d9c in blk_finish_plug (plug=0x8785fd8c) at block/blk-core.c:3037
> #6 0x80091644 in generic_writepages (mapping=<optimized out>,
> wbc=0x8785fde0) at mm/page-writeback.c:1910
> #7 0x80092a88 in do_writepages (mapping=<optimized out>,
> wbc=<optimized out>) at mm/page-writeback.c:1923
> #8 0x800e7084 in __writeback_single_inode (inode=0x8740c290,
> wbc=0x8785fde0) at fs/fs-writeback.c:454
> #9 0x800e7360 in writeback_sb_inodes (sb=0x87811000, wb=0x87ab81b0,
> work=0x8785fea4) at fs/fs-writeback.c:678
> #10 0x800e757c in __writeback_inodes_wb (wb=0x1 <__vectors_start>,
> work=0x3ff) at fs/fs-writeback.c:723
> #11 0x800e7760 in wb_writeback (wb=0x87ab81b0, work=0x8785fea4) at
> #12 0x800e838c in wb_check_old_data_flush (wb=<optimized out>) at
> #13 wb_do_writeback (wb=0x87ab81b0, force_wait=0) at fs/fs-writeback.c:1010
> #14 0x800e848c in bdi_writeback_workfn (work=0x87ab81bc) at
> #15 0x80039d34 in process_one_work (worker=0x87816180,
> work=0x87ab81bc) at kernel/workqueue.c:2189
> #16 0x8003a40c in worker_thread (__worker=0x1 <__vectors_start>) at
> #17 0x8003f714 in kthread (_create=0x87845e20) at kernel/kthread.c:200
> #18 0x8000dfb8 in ret_from_fork () at arch/arm/kernel/entry-common.S:91
> Ok, here we go... At frame #8 we have not optimized inode variable for
> p (*inode)->i_ino
> $7 = 0
> Can anyone explain to me what this inode is? Where can I find
> information about this kind of inode? And how can I track the
> inode number for writeback operations?
> PS. That's my first email in linux kernel mailing list... I'm sorry if
> I have done something wrong