Re: [RFC v2 01/83] Introduction and documentation of NOVA filesystem.

From: Randy Dunlap
Date: Mon Mar 19 2018 - 16:44:52 EST


On 03/10/2018 10:17 AM, Andiry Xu wrote:
> From: Andiry Xu <jix024@xxxxxxxxxxx>
>
> NOVA is a log-structured file system tailored for byte-addressable non-volatile memories.
> It was designed and developed at the Non-Volatile Systems Laboratory in the Computer
> Science and Engineering Department at the University of California, San Diego.
> Its primary authors are Andiry Xu <jix024@xxxxxxxxxxxx>, Lu Zhang
> <luzh@xxxxxxxxxxxx>, and Steven Swanson <swanson@xxxxxxxxxxxx>.
>
> These two papers provide a detailed, high-level description of NOVA's design goals and approach:
>
> NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
> In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
> (http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf)
>
> NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
> In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
> (http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf)
>
> This patchset contains features from the FAST paper. We leave NOVA-Fortis features,
> such as snapshot, metadata and data replication and RAID parity for
> future submission.
>
> Signed-off-by: Andiry Xu <jix024@xxxxxxxxxxx>
> ---
> Documentation/filesystems/00-INDEX | 2 +
> Documentation/filesystems/nova.txt | 498 +++++++++++++++++++++++++++++++++++++
> MAINTAINERS | 8 +
> 3 files changed, 508 insertions(+)
> create mode 100644 Documentation/filesystems/nova.txt

> diff --git a/Documentation/filesystems/nova.txt b/Documentation/filesystems/nova.txt
> new file mode 100644
> index 0000000..4728f50
> --- /dev/null
> +++ b/Documentation/filesystems/nova.txt
> @@ -0,0 +1,498 @@
> +The NOVA Filesystem
> +===================
> +
> +NOn-Volatile memory Accelerated file system (NOVA) is a DAX file system
> +designed to provide a high performance and production-ready file system
> +tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
> +and Intel's soon-to-be-released 3DXPoint DIMMs).
> +NOVA combines design elements from many other file systems
> +and adapts conventional log-structured file system techniques to
> +exploit the fast random access that NVMs provide. In particular, NOVA maintains
> +separate logs for each inode to improve concurrency, and stores file data
> +outside the log to minimize log size and reduce garbage collection costs. NOVA's
> +logs provide metadata and data atomicity and focus on simplicity and
> +reliability, keeping complex metadata structures in DRAM to accelerate lookup
> +operations.
> +
> +NOVA was developed by the Non-Volatile Systems Laboratory (NVSL) in
> +the Computer Science and Engineering Department at the University of
> +California, San Diego.
> +
> +A more thorough discussion of NOVA's design is avaialable in these two papers:

available

> +
> +NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories
> +Jian Xu and Steven Swanson
> +In The 14th USENIX Conference on File and Storage Technologies (FAST '16)
> +
> +NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
> +Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase,
> +Tamires Brito Da Silva, Andy Rudoff and Steven Swanson
> +In The 26th ACM Symposium on Operating Systems Principles (SOSP '17)
> +
> +This version of NOVA contains features from the FAST paper.
> +NOVA-Fortis features, such as snapshot, metadata and data protection and replication
> +are left for future submission.
> +
> +The main NOVA features include:
> +
> + * POSIX semantics
> + * Directly access (DAX) byte-addressable NVMM without page caching
> + * Per-CPU NVMM pool to maximize concurrency
> + * Strong consistency guarantees with 8-byte atomic stores
> +
> +
> +Filesystem Design
> +=================
> +
> +NOVA divides NVMM into several regions. NOVA's 512B superblock contains global

(prefer:) 512-byte

> +file system information and the recovery inode. The recovery inode represents a
> +special file that stores recovery information (e.g., the list of unallocated
> +NVMM pages). NOVA divides its inode tables into per-CPU stripes. It also
> +provides per-CPU journals for complex file operations that involve multiple
> +inodes. The rest of the available NVMM stores logs and file data.
> +
> +NOVA is log-structured and stores a separate log for each inode to maximize
> +concurrency and provide atomicity for operations that affect a single file. The
> +logs only store metadata and comprise a linked list of 4 KB pages. Log entries
> +are small â between 32 and 64 bytes. Logs are generally non-contiguous, and log
> +pages may reside anywhere in NVMM.
> +
> +NOVA keeps copies of most file metadata in DRAM during normal
> +operations, eliminating the need to access metadata in NVMM during reads.
> +
> +NOVA supports both copy-on-write and in-place file data updates and appends
> +metadata about the write to the log. For operations that affect multiple inodes

inodes,

> +NOVA uses lightweight, fixed-length journals âone per core.

-- one per core.

> +
> +NOVA divides the allocatable NVMM into multiple regions, one region per CPU
> +core. A per-core allocator manages each of the regions, minimizing contention
> +during memory allocation.
> +
> +After a system crash, NOVA must scan all the logs to rebuild the memory
> +allocator state. Since, there are many logs, NOVA aggressively parallelizes the

Since there are

> +scan.
> +
> +
> +Building and using NOVA
> +=======================
> +
> +To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
> +DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual.
> +
> +NOVA runs on a pmem non-volatile memory region. You can create one of these
> +regions with the `memmap` kernel command line option. For instance, adding
> +`memmap=16G!8G` to the kernel boot parameters will reserve 16GB memory starting
> +from address 8GB, and the kernel will create a `pmem0` block device under the
> +`/dev` directory.
> +
> +After the OS has booted, you can initialize a NOVA instance with the following commands:
> +
> +
> +# modprobe nova
> +# mount -t NOVA -o init /dev/pmem0 /mnt/nova

Hmph, unique in upper-case-ness (at least for in-tree fs-es).
Would you consider "nova" instead?

> +
> +
> +The above commands create a NOVA instance on `/dev/pmem0` and mounts it on
> +`/mnt/nova`.
> +
> +NOVA support several module command line options:

supports

> +
> + * measure_timing: Measure the timing of file system operations for profiling (default: 0)
> +
> + * inplace_data_updates: Update data in place rather than with COW (default: 0)
> +
> +To recover an existing NOVA instance, mount NOVA without the init option, for example:
> +
> +# mount -t NOVA /dev/pmem0 /mnt/nova
> +
> +
> +Sysfs support
> +-------------
> +
> +NOVA provides sysfs support to enable user to get/set information of

enable a user
or enable users

And the line above ends with a trailing space. Please check/remove all of those.

> +a running NOVA instance.
> +After mount, NOVA creates four entries under proc directory /proc/fs/nova/pmem#/:

Above uses lower-case "nova" in /proc/fs/nova/... but the examples below use NOVA.
nova is preferred (IMO).

> +
> +timing_stats IO_stats allocator gc
> +
> +Show NOVA file operation timing statistics:
> +# cat /proc/fs/NOVA/pmem#/timing_stats
> +
> +Clear timing statistics:
> +# echo 1 > /proc/fs/NOVA/pmem#/timing_stats
> +
> +Show NOVA I/O statistics:
> +# cat /proc/fs/NOVA/pmem#/IO_stats
> +
> +Clear I/O statistics:
> +# echo 1 > /proc/fs/NOVA/pmem#/IO_stats
> +
> +Show NOVA allocator information:
> +# cat /proc/fs/NOVA/pmem#/allocator
> +
> +Manual garbage collection:
> +# echo #inode_number > /proc/fs/NOVA/pmem#/gc
> +
> +
> +Source File Structure
> +=====================
> +
> + * nova_def.h/nova.h
> + Defines NOVA macros and key inline functions.
> +
> + * balloc.{h,c}
> + NOVA's pmem allocator implementation.
> +
> + * bbuild.c
> + Implements recovery routines to restore the in-use inode list and the NVMM
> + allocator information.
> +
> + * dax.c
> + Implements DAX read/write and mmap functions to access file data. NOVA uses
> + copy-on-write to modify file pages by default, unless inplace data update is
> + enabled at mount-time.
> +
> + * dir.c
> + Contains functions to create, update, and remove NOVA dentries.
> +
> + * file.c
> + Implements file-related operations such as open, fallocate, llseek, fsync,
> + and flush.
> +
> + * gc.c
> + NOVA's garbage collection functions.
> +
> + * inode.{h,c}
> + Creates, reads, and frees NOVA inode tables and inodes.
> +
> + * ioctl.c
> + Implements some ioctl commands to call NOVA's internal functions.
> +
> + * journal.{h,c}
> + For operations that affect multiple inodes NOVA uses lightweight,
> + fixed-length journals â one per core. This file contains functions to
> + create and manage the lite journals.
> +
> + * log.{h,c}
> + Functions to manipulate NOVA inode logs, including log page allocation, log
> + entry creation, commit, modification, and deletion.
> +
> + * namei.c
> + Functions to create/remove files, directories, and links. It also looks for
> + the NOVA inode number for a given path name.
> +
> + * rebuild.c
> + When mounting NOVA, rebuild NOVA inodes from its logs.
> +
> + * stats.{h,c}
> + Provide routines to gather and print NOVA usage statistics.
> +
> + * super.{h,c}
> + Super block structures and NOVA FS layout and entry points for NOVA
> + mounting and unmounting, initializing or recovering the NOVA super block
> + and other global file system information.
> +
> + * symlink.c
> + Implements functions to create and read symbolic links in the filesystem.
> +
> + * sysfs.c
> + Implements sysfs entries to take user inputs for printing NOVA statistics.

s/sysfs/procfs/

> +
> +
> +Filesystem Layout
> +=================
> +
> +A NOVA file systems resides in single PMEM device. *****
> +NOVA divides the device into 4KB blocks.

4 KB {or use 4KB way up above here}

> +
> + block
> ++---------------------------------------------------------+
> +| 0 | primary super block (struct nova_super_block) |
> ++---------------------------------------------------------+
> +| 1 | Reserved inodes |
> ++---------------------------------------------------------+
> +| 2 - 15 | reserved |
> ++---------------------------------------------------------+
> +| 16 - 31 | Inode table pointers |
> ++---------------------------------------------------------+
> +| 32 - 47 | Journal pointers |
> ++---------------------------------------------------------+
> +| 48 - 63 | reserved |
> ++---------------------------------------------------------+
> +| ... | log and data pages |
> ++---------------------------------------------------------+
> +| n-2 | replica reserved Inodes |
> ++---------------------------------------------------------+
> +| n-1 | replica super block |
> ++---------------------------------------------------------+
> +
> +
> +
> +Superblock and Associated Structures
> +====================================
> +
> +The beginning of the PMEM device hold the super block and its associated

holds

> +tables. These include reserved inodes, a table of pointers to the journals
> +NOVA uses for complex operations, and pointers to inodes tables. NOVA
> +maintains replicas of the super block and reserved inodes in the last two
> +blocks of the PMEM area.
> +
> +
> +Block Allocator/Free Lists
> +==========================
> +
> +NOVA uses per-CPU allocators to manage free PMEM blocks. On initialization,> +NOVA divides the range of blocks in the PMEM device among the CPUs, and those
> +blocks are managed solely by that CPU. We call these ranges of "allocation regions".
> +Each allocator maintains a red-black tree of unallocated ranges (struct
> +nova_range_node).
> +
> +Allocation Functions
> +--------------------
> +
> +NOVA allocate PMEM blocks using two mechanisms:

allocates

> +
> +1. Static allocation as defined in super.h
> +
> +2. Allocation for log and data pages via nova_new_log_blocks() and
> +nova_new_data_blocks().
> +
> +
> +PMEM Address Translation
> +------------------------
> +
> +In NOVA's persistent data structures, memory locations are given as offsets
> +from the beginning of the PMEM region. nova_get_block() translates offsets to
> +PMEM addresses. nova_get_addr_off() performs the reverse translation.
> +
> +
> +Inodes
> +======
> +
> +NOVA maintains per-CPU inode tables, and inode numbers are striped across the
> +tables (i.e., inos 0, n, 2n,... on cpu 0; inos 1, n + 1, 2n + 1, ... on cpu 1).
> +
> +The inodes themselves live in a set of linked lists (one per CPU) of 2MB
> +blocks. The last 8 bytes of each block points to the next block. Pointers to
> +heads of these list live in PMEM block INODE_TABLE_START.

lists

> +Additional space for inodes is allocated on demand.
> +
> +To allocate inodes, NOVA maintains a per-cpu "inuse_list" in DRAM holds a RB

s/cpu/CPU/g
s/a RB/an RB/

but that isn't quite a sentence. Please fix it.

> +tree that holds ranges of allocated inode numbers.
> +
> +
> +Logs
> +====
> +
> +NOVA maintains a log for each inode that records updates to the inode's
> +metadata and holds pointers to the file data. NOVA makes updates to file data
> +and metadata atomic by atomically appending log entries to the log.
> +
> +Each inode contains pointers to head and tail of the inode's log. When the log
> +grows past the end of the last page, nova allocates additional space. For
> +short logs (less than 1MB) , it doubles the length. For longer logs, it adds a
> +fixed amount of additional space (1MB).
> +
> +Log space is reclaimed during garbage collection.
> +
> +Log Entries
> +-----------
> +
> +There are four kinds of log entry, documented in log.h. The log entries have
> +several entries in common:
> +
> + 1. 'epoch_id' gives the epoch during which the log entry was created.
> + Creating a snapshot increments the epoch_id for the file systems.

file system. (?)
or do multiple epochs (snapshots) => multiple fs-es?

> + Currently disabled (always zero).
> +
> + 2. 'trans_id' is per-inode, monotone increasing, number assigned each
> + log entry. It provides an ordering over FS operations on a single inode.
> +
> + 3. 'invalid' is true if the effects of this entry are dead and the log
> + entry can be garbage collected.
> +
> + 4. 'csum' is a CRC32 checksum for the entry. Currently it is disabled.
> +
> +Log structure
> +-------------
> +
> +The logs comprise a linked list of PMEM blocks. The tail of each block
> +contains some metadata about the block and pointers to the next block and
> +block's replica (struct nova_inode_page_tail).
> +
> ++----------------+
> +| log entry |
> ++----------------+
> +| log entry |
> ++----------------+
> +| ... |
> ++----------------+
> +| tail |
> +| metadata |
> +| -> next block |
> ++----------------+
> +
> +
> +Journals
> +========
> +
> +NOVA uses a lightweight journaling mechanisms to provide atomicity for

mechanism

> +operations that modify more than one on inode. The journals providing logging

end of that "sentence" (above) is confusing or missing something.

> +for two operations:
> +
> +1. Single word updates (JOURNAL_ENTRY)
> +2. Copying inodes (JOURNAL_INODE)
> +
> +The journals are undo logs: NOVA creates the journal entries for an operation,
> +and if the operation does not complete due to a system failure, the recovery
> +process rolls back the changes using the journal entries.
> +
> +To commit, NOVA drops the log.
> +
> +NOVA maintains one journal per CPU. The head and tail pointers for each
> +journal live in a reserved page near the beginning of the file system.
> +
> +During recovery, NOVA scans the journals and undoes the operations described by
> +each entry.
> +
> +
> +File and Directory Access
> +=========================
> +
> +To access file data via read(), NOVA maintains a radix tree in DRAM for each
> +inode (nova_inode_info_header.tree) that maps file offsets to write log
> +entries. For directories, the same tree maps a hash of filenames to their
> +corresponding dentry.
> +
> +In both cases, the nova populates the tree when the file or directory is opened

the nova fs (?)

> +by scanning its log.
> +
> +
> +MMap and DAX
> +============
> +
> +NOVA leverages the kernel's DAX mechanisms for mmap and file data access.
> +NOVA supports DAX-style mmap, i.e. mapping NVM pages directly to the
> +application's address space.
> +
> +
> +Garbage Collection
> +==================
> +
> +NOVA recovers log space with a two-phase garbage collection system. When a log
> +reaches the end of its allocated pages, NOVA allocates more space. Then, the
> +fast GC algorithm scans the log to remove pages that have no valid entries.
> +Then, it estimates how many pages the logs valid entries would fill. If this
> +is less than half the number of pages in the log, the second GC phase copies
> +the valid entries to new pages.
> +
> +For example (V=valid; I=invalid):
> +
> ++---+ +---+ +---+
> +| I | | I | | V |
> ++---+ +---+ Thorough +---+
> +| V | | V | GC | V |
> ++---+ +---+ =====> +---+
> +| I | | I | | V |
> ++---+ +---+ +---+
> +| V | | V | | V |
> ++---+ +---+ +---+
> + | |
> + V V
> ++---+ +---+
> +| I | | V |
> ++---+ +---+
> +| I | fast GC | I |
> ++---+ ====> +---+
> +| I | | I |
> ++---+ +---+
> +| I | | V |
> ++---+ +---+
> + |
> + V
> ++---+
> +| V |
> ++---+
> +| I |
> ++---+
> +| I |
> ++---+
> +| V |
> ++---+
> +
> +
> +Umount and Recovery
> +===================
> +
> +Clean umount/mount
> +------------------
> +
> +On a clean unmount, NOVA saves the contents of many of its DRAM data structures
> +to PMEM to accelerate the next mount:
> +
> +1. NOVA stores the allocator state for each of the per-cpu allocators to the
> + log of a reserved inode (NOVA_BLOCK_NODE_INO).
> +
> +2. NOVA stores the per-CPU lists of alive inodes (the inuse_list) to the
> + NOVA_BLOCK_INODELIST_INO reserved inode.
> +
> +After a clean unmount, the following mount restores these data and then
> +invalidates them.
> +
> +Recovery after failures
> +-----------------------
> +
> +In case of a unclean dismount (e.g., system crash), NOVA must rebuild these

of an unclean

> +DRAM structures by scanning the inode logs. NOVA log scanning is fast because
> +per-CPU inode tables and per-inode logs allow for parallel recovery.
> +
> +The number of live log entries in an inode log is roughly the number of extents
> +in the file. As a result, NOVA only needs to scan a small fraction of the NVMM
> +during recovery.
> +
> +The NOVA failure recovery consists of two steps:
> +
> +First, NOVA checks its lite weight journals and rolls back any uncommitted

should be one word: lightweight (or liteweight)

> +transactions to restore the file system to a consistent state.
> +
> +Second, NOVA starts a recovery thread on each CPU and scans the inode tables in
> +parallel, performing log scanning for every valid inode in the inode table.
> +NOVA use different recovery mechanisms for directory inodes and file inodes:

and file inodes.

> +For a directory inode, NOVA scans the log's linked list to enumerate the pages
> +it occupies, but it does not inspect the log's contents. For a file inode,
> +NOVA reads the write entries in the log to enumerate the data pages.
> +
> +During the recovery scan NOVA builds a bitmap of occupied pages, and rebuilds
> +the allocator based on the result. After this process completes, the file
> +system is ready to accept new requests.
> +
> +During the same scan, it rebuilds the list of available inodes.
> +
> +
> +Gaps, Missing Features, and Development Status
> +==============================================
> +
> +Although NOVA is a fully-functional file system, there is still much work left
> +to be done. In particular, (at least) the following items are currently missing:
> +
> +1. Snapshot, metadata and data replication and protection are left for future submission.
> +2. There is no mkfs or fsck utility (`mount` takes `-o init` to create a NOVA file system).
> +3. NOVA only works on x86-64 kernels.
> +4. NOVA does not currently support extended attributes or ACL.
> +5. NOVA doesn't provide quota support.
> +6. Moving NOVA file systems between machines with different numbers of CPUs does not work.

You could artificially limit the number of "known" CPUs so that a NOVA fs could be
moved from a 16-CPU system to an 8-CPU system by telling NOVA to use only 8 CPUs
(as an example). Just a thought.

> +
> +None of these are fundamental limitations of NOVA's design.
> +
> +NOVA is complete and robust enough to run a range of complex applications, but
> +it is not yet ready for production use. Our current focus is on adding a few
> +missing features from the list above and finding/fixing bugs.
> +
> +
> +Hacking and Contributing
> +========================
> +
> +If you find bugs, please report them at https://github.com/NVSL/linux-nova/issues.
> +
> +If you have other questions or suggestions you can contact the NOVA developers
> +at cse-nova-hackers@xxxxxxxxxxxxx


--
~Randy