[RFC v2 00/83] NOVA: a new file system for persistent memory

From: Andiry Xu
Date: Sat Mar 10 2018 - 13:20:19 EST


From: Andiry Xu <jix024@xxxxxxxxxxx>

This is the second version of the RFC patch series that implements
NOVA (NOn-Volatile memory Accelerated file system), a new file system built for
persistent memory (PMEM).

NOVA's goal is to provide a high-performance, production-ready
file system tailored for byte-addressable non-volatile memories (e.g., NVDIMMs
and Intel's soon-to-be-released 3D XPoint DIMMs).

NOVA was developed at the Non-Volatile Systems Laboratory in the Computer
Science and Engineering Department at the University of California, San Diego.
Its primary authors are Andiry Xu <jix024@xxxxxxxxxxx>, Lu Zhang
<luzh@xxxxxxxxxxxx>, and Steven Swanson <swanson@xxxxxxxxxxxx>.

NOVA is stable enough to run complex applications, but there is substantial
work left to do. This RFC is intended to gather feedback to guide its
development toward eventual inclusion upstream.

The patches are based on Linux 4.16-rc4.


Changes from v1:

* Remove snapshot, metadata replication and data parity for future submission.
This significantly reduces complexity and LOC: 22129 -> 13834.

* Break down the code in a more reviewer-friendly way:
The patchset starts with a simple skeleton and adds features gradually.
Each patch leaves the tree in a compilable and working state, and is
self-contained and small, which makes it easier to review.

* Fix bugs so that NOVA passes xfstests: https://github.com/NVSL/xfstests


Overview
========

NOVA is primarily a log-structured file system, but rather than maintain a
single global log for the entire file system, it maintains a separate log for
each inode. NOVA breaks the logs into 4KB pages, which need not be
contiguous in memory. The logs contain only metadata.

File data pages reside outside the log, and log entries for write operations
point to the data pages they modify. File data can be modified either in place
or with copy-on-write (COW); COW provides atomic file updates.
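
The relationship between a per-inode log and its data pages can be sketched
as a toy in-memory model. This is only an illustration of the idea described
above, not code from the patches: the names ToyFS, Inode, WriteEntry and
cow_write are hypothetical, and real log entries carry more fields.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # page size mentioned in the overview


@dataclass
class WriteEntry:
    """Log entry for one write: metadata only, pointing at data pages."""
    pgoff: int        # file page number covered by this write
    data_page: int    # id of the data page holding the new contents


@dataclass
class Inode:
    """Hypothetical model of one inode with its own private log."""
    log: list = field(default_factory=list)       # append-only log entries
    page_map: dict = field(default_factory=dict)  # file page -> data page id


class ToyFS:
    def __init__(self):
        self.data = {}     # data page id -> bytes, stored outside any log
        self.next_id = 0

    def _alloc(self, content):
        page_id = self.next_id
        self.next_id += 1
        self.data[page_id] = content
        return page_id

    def cow_write(self, inode, pgoff, content):
        """Copy-on-write update of one whole file page."""
        if len(content) != PAGE_SIZE:
            raise ValueError("this sketch only handles whole-page writes")
        new_page = self._alloc(content)                 # 1. write to a fresh page
        inode.log.append(WriteEntry(pgoff, new_page))   # 2. append the log entry
        old_page = inode.page_map.get(pgoff)
        inode.page_map[pgoff] = new_page                # 3. point the index at it
        return old_page                                 # old page can be reclaimed
```

Because the old data page is never overwritten, a reader sees either the old
or the new contents of the page, which is the property the COW mode relies on.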

For file operations that involve multiple inodes, NOVA uses small, fixed-size
redo logs to atomically append log entries to the logs of the inodes involved.
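
A minimal sketch of how a small redo journal can make appends to several
inode logs atomic. It assumes, purely for illustration, that the journal
records the new log-tail value of each inode; the names LoggedInode,
RedoJournal and append_atomically are hypothetical and the layout in the
patches may differ.

```python
class LoggedInode:
    def __init__(self):
        self.slots = []   # log storage; entries past `tail` are not visible
        self.tail = 0     # number of committed entries

    def committed(self):
        return self.slots[:self.tail]


class RedoJournal:
    """Small fixed-size journal of (inode, new_tail) records."""
    CAPACITY = 4  # arbitrary size chosen for this sketch

    def __init__(self):
        self.records = []
        self.valid = False   # set once all records have been written

    def begin(self, updates):
        if len(updates) > self.CAPACITY:
            raise ValueError("transaction too large for journal")
        self.records = list(updates)
        self.valid = True    # from here on, recovery will redo the updates

    def replay(self):
        """Apply the recorded tails; harmless to repeat after a crash."""
        if self.valid:
            for inode, new_tail in self.records:
                inode.tail = new_tail
            self.records = []
            self.valid = False


def append_atomically(journal, entries, crash_before_apply=False):
    """entries: list of (inode, entry) pairs with distinct inodes.

    Either every entry becomes visible or none does.
    """
    updates = []
    for inode, entry in entries:
        del inode.slots[inode.tail:]   # drop leftovers of an aborted attempt
        inode.slots.append(entry)      # written, but still beyond the tail
        updates.append((inode, inode.tail + 1))
    journal.begin(updates)
    if crash_before_apply:
        return                         # simulated crash; recovery calls replay()
    journal.replay()
```

In this model a rename-like operation would pass one entry for the source
directory and one for the destination; a crash after begin() is repaired by
calling replay() during recovery.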

This structure keeps logs small and makes garbage collection very fast. It also
enables enormous parallelism during recovery from an unclean unmount, since
threads can scan logs in parallel.

Documentation/filesystems/nova.txt contains some lower-level implementation and
usage information. A more thorough discussion of NOVA's goals and design is
available in two papers:

NOVA: A Log-structured File system for Hybrid Volatile/Non-volatile Main Memories
http://cseweb.ucsd.edu/~swanson/papers/FAST2016NOVA.pdf
Jian Xu and Steven Swanson
Published in FAST 2016

NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
http://cseweb.ucsd.edu/~swanson/papers/SOSP2017-NOVAFortis.pdf
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah,
Amit Borase, Tamires Brito Da Silva, Andy Rudoff, Steven Swanson
Published in SOSP 2017

This version contains the features from the FAST paper. We leave the
NOVA-Fortis features for future submissions.


Build and Run
=============

To build NOVA, build the kernel with PMEM (`CONFIG_BLK_DEV_PMEM`),
DAX (`CONFIG_FS_DAX`) and NOVA (`CONFIG_NOVA_FS`) support. Install as usual.

NOVA runs on a pmem non-volatile memory region created by the memmap kernel
option. For instance, adding 'memmap=16G!8G' to the kernel boot parameters
reserves 16GB of memory starting at physical address 8GB, and the kernel creates
a pmem0 block device under the /dev directory.
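
To make the size/start arithmetic of the memmap option concrete, here is a
small helper that computes the reserved physical range. It is an illustrative
sketch only: parse_memmap is a hypothetical name, and it handles just decimal
numbers with K/M/G suffixes, which is narrower than what the kernel accepts.

```python
import re

UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_memmap(param):
    """Parse 'memmap=<size>!<start>' and return (start, end) in bytes."""
    m = re.fullmatch(r"memmap=(\d+)([KMG]?)!(\d+)([KMG]?)", param)
    if m is None:
        raise ValueError(f"unsupported memmap value: {param!r}")
    size = int(m.group(1)) * UNITS[m.group(2)]
    start = int(m.group(3)) * UNITS[m.group(4)]
    return start, start + size


start, end = parse_memmap("memmap=16G!8G")
print(f"reserved range: {start >> 30}GB to {end >> 30}GB")
```

For 'memmap=16G!8G' the range runs from 8GB up to 24GB, so the example
presumably needs a machine whose RAM covers that whole range.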

After the OS has booted, initialize a NOVA instance with the following commands:

# modprobe nova
# mount -t NOVA -o init /dev/pmem0 /mnt/nova

The above commands create a NOVA instance on /dev/pmem0 and mount it on
/mnt/nova. NOVA does not currently have mkfs or fsck support.


Performance
===========

Compared to other DAX file systems such as ext4-DAX and xfs-DAX,
NOVA performs metadata operations at fine, byte-level granularity,
and it performs better in metadata-intensive and write-intensive applications.
NOVA also excels at the append-fsync access pattern (i.e., write-ahead logging),
which is very common in DBMSs and key-value stores.
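
For readers unfamiliar with the append-fsync pattern, the following sketch
shows what an application doing write-ahead logging typically issues: one
append followed by one fsync per record. The file name wal.log and the
helper names are made up for the example; this is not taken from any of the
benchmarked applications.

```python
import os
import tempfile


def wal_append(fd, record):
    """Append one record and make it durable before returning."""
    data = record + b"\n"
    written = 0
    while written < len(data):            # os.write may write partially
        written += os.write(fd, data[written:])
    os.fsync(fd)                          # one fsync per appended record


def demo(directory):
    path = os.path.join(directory, "wal.log")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for i in range(3):
            wal_append(fd, b"set key%d" % i)
    finally:
        os.close(fd)
    with open(path, "rb") as f:
        return f.read().splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(demo(d))
```

Each wal_append() call grows the file slightly and then forces it to stable
storage, so the cost of a small append plus fsync dominates this workload.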

The following tests were performed on an Intel i7-3770K with 16GB of DRAM
and 8GB of PMEM emulated with DRAM. The kernel is 64-bit 4.16-rc4 on
Ubuntu 16.04. Performance may vary on different platforms.


Filebench throughput (ops/s):
               xfs-DAX    ext4-DAX        NOVA
Fileserver       86971      177826      334166
Varmail         148032      288033      999794
Webserver       370245      370144      374130
Webproxy        315084      737544      927216

Webserver is read-intensive and all the file systems have similar performance.


SQLite test:
The test covers four SQLite journaling modes:
  Delete:   delete the undo log file after transaction commit
  Truncate: truncate the undo log file to zero after transaction commit
  Persist:  write a flag at the beginning of the log file after transaction commit
  WAL:      write-ahead logging

SQLite insert (transactions/s):
               xfs-DAX    ext4-DAX        NOVA
Delete           18525       23615       45289
Truncate         21930       26391       52046
Persist          58053       56106       50554
WAL              38622       62703       85395

NOVA performs poorly in Persist mode because it uses copy-on-write for file
writes, so even a sub-page write rewrites a full 4KB page.
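
The cost of sub-page writes under copy-on-write can be estimated with a
simple model in which every page touched by a write is rewritten in full.
This is a back-of-the-envelope sketch, not a measurement; the function names
are hypothetical and the model ignores log and metadata traffic.

```python
PAGE_SIZE = 4096


def cow_bytes_written(offset, length, page_size=PAGE_SIZE):
    """Data bytes written if every touched page is rewritten in full."""
    if length <= 0:
        return 0
    first = offset // page_size
    last = (offset + length - 1) // page_size
    return (last - first + 1) * page_size


def amplification(offset, length):
    """Ratio of bytes written to bytes requested by the application."""
    return cow_bytes_written(offset, length) / length


# A 16-byte flag at the start of a file costs one full page in this model.
print(cow_bytes_written(0, 16), amplification(0, 16))
```

Under this model a small flag written at the beginning of the log file, as
in Persist mode, pays for a whole 4KB page on every transaction.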


Redis: fsync the WAL file after every set.
Redis set throughput (trans/s):
               xfs-DAX    ext4-DAX        NOVA
                 49771       88308      102560


RocksDB fillunique test (ops/s):
               xfs-DAX    ext4-DAX        NOVA
WAL sync         33563       62066      295655
WAL nosync      254533      288106      393713

Both ext4-DAX and xfs-DAX suffer from high fsync overhead.

More test results are available in the two NOVA papers.

NOVA uses per-inode logs and per-CPU inode tables and journals to avoid lock
contention. We use the FxMark test suite (https://github.com/sslab-gatech/fxmark)
to test the file system's scalability. The results are at
http://cseweb.ucsd.edu/~jix024/sc.pdf


Thanks,
Andiry

---

Andiry Xu (83):
Introduction and documentation of NOVA filesystem.
Add nova_def.h.
Add super.h.
NOVA inode definition.
Add NOVA filesystem definitions and useful helper routines.
Add inode get/read methods.
Initialize inode_info and rebuild inode information in nova_iget().
NOVA superblock operations.
Add Kconfig and Makefile
Add superblock integrity check.
Add timing and I/O statistics for performance analysis and profiling.
Add timing for mount and init.
Add remount_fs and show_options methods.
Add range node kmem cache.
Add free list data structure.
Initialize block map and free lists in nova_init().
Add statfs support.
Add freelist statistics printing.
Add pmem block free routines.
Pmem block allocation routines.
Add log structure.
Inode log pages allocation and reclaimation.
Save allocator to pmem in put_super.
Initialize and allocate inode table.
Support get normal inode address and inode table extentsion.
Add inode_map to track inuse inodes.
Save the inode inuse list to pmem upon umount
Add NOVA address space operations
Add write_inode and dirty_inode routines.
New NOVA inode allocation.
Add new vfs inode allocation.
Add log entry definitions.
Inode log and entry printing for debug purpose.
Journal: NOVA light weight journal definitions.
Journal: Lite journal helper routines.
Journal: Lite journal recovery.
Journal: Lite journal create and commit.
Journal: NOVA lite journal initialization.
Log operation: dentry append.
Log operation: file write entry append.
Log operation: setattr entry append
Log operation: link change append.
Log operation: in-place update log entry
Log operation: invalidate log entries
Log operation: file inode log lookup and assign
Dir: Add Directory radix tree insert/remove methods.
Dir: Add initial dentries when initializing a directory inode log.
Dir: Readdir operation.
Dir: Append create/remove dentry.
Inode: Add nova_evict_inode.
Rebuild: directory inode.
Rebuild: file inode.
Namei: lookup.
Namei: create and mknod.
Namei: mkdir
Namei: link and unlink.
Namei: rmdir
Namei: rename
Namei: setattr
Add special inode operations.
Super: Add nova_export_ops.
File: getattr and file inode operations
File operation: llseek.
File operation: open, fsync, flush.
File operation: read.
Super: Add file write item cache.
Dax: commit list of file write items to log.
File operation: copy-on-write write.
Super: Add module param inplace_data_updates.
File operation: Inplace write.
Symlink support.
File operation: fallocate.
Dax: Add iomap operations.
File operation: Mmap.
File operation: read/write iter.
Ioctl support.
GC: Fast garbage collection.
GC: Thorough garbage collection.
Normal recovery.
Failure recovery: bitmap operations.
Failure recovery: Inode pages recovery routines.
Failure recovery: Per-CPU recovery.
Sysfs support.

Documentation/filesystems/00-INDEX | 2 +
Documentation/filesystems/nova.txt | 498 +++++++++++++
MAINTAINERS | 8 +
fs/Kconfig | 2 +
fs/Makefile | 1 +
fs/nova/Kconfig | 15 +
fs/nova/Makefile | 8 +
fs/nova/balloc.c | 730 ++++++++++++++++++
fs/nova/balloc.h | 96 +++
fs/nova/bbuild.c | 1437 ++++++++++++++++++++++++++++++++++++
fs/nova/bbuild.h | 28 +
fs/nova/dax.c | 970 ++++++++++++++++++++++++
fs/nova/dir.c | 520 +++++++++++++
fs/nova/file.c | 728 ++++++++++++++++++
fs/nova/gc.c | 459 ++++++++++++
fs/nova/inode.c | 1310 ++++++++++++++++++++++++++++++++
fs/nova/inode.h | 277 +++++++
fs/nova/ioctl.c | 184 +++++
fs/nova/journal.c | 412 +++++++++++
fs/nova/journal.h | 56 ++
fs/nova/log.c | 1111 ++++++++++++++++++++++++++++
fs/nova/log.h | 417 +++++++++++
fs/nova/namei.c | 848 +++++++++++++++++++++
fs/nova/nova.h | 566 ++++++++++++++
fs/nova/nova_def.h | 128 ++++
fs/nova/rebuild.c | 499 +++++++++++++
fs/nova/stats.c | 600 +++++++++++++++
fs/nova/stats.h | 178 +++++
fs/nova/super.c | 1063 ++++++++++++++++++++++++++
fs/nova/super.h | 171 +++++
fs/nova/symlink.c | 133 ++++
fs/nova/sysfs.c | 379 ++++++++++
32 files changed, 13834 insertions(+)
create mode 100644 Documentation/filesystems/nova.txt
create mode 100644 fs/nova/Kconfig
create mode 100644 fs/nova/Makefile
create mode 100644 fs/nova/balloc.c
create mode 100644 fs/nova/balloc.h
create mode 100644 fs/nova/bbuild.c
create mode 100644 fs/nova/bbuild.h
create mode 100644 fs/nova/dax.c
create mode 100644 fs/nova/dir.c
create mode 100644 fs/nova/file.c
create mode 100644 fs/nova/gc.c
create mode 100644 fs/nova/inode.c
create mode 100644 fs/nova/inode.h
create mode 100644 fs/nova/ioctl.c
create mode 100644 fs/nova/journal.c
create mode 100644 fs/nova/journal.h
create mode 100644 fs/nova/log.c
create mode 100644 fs/nova/log.h
create mode 100644 fs/nova/namei.c
create mode 100644 fs/nova/nova.h
create mode 100644 fs/nova/nova_def.h
create mode 100644 fs/nova/rebuild.c
create mode 100644 fs/nova/stats.c
create mode 100644 fs/nova/stats.h
create mode 100644 fs/nova/super.c
create mode 100644 fs/nova/super.h
create mode 100644 fs/nova/symlink.c
create mode 100644 fs/nova/sysfs.c

--
2.7.4