Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

From: Haozhong Zhang
Date: Wed Oct 12 2016 - 11:00:49 EST


On 10/12/16 05:32 -0600, Jan Beulich wrote:
On 12.10.16 at 12:33, <haozhong.zhang@xxxxxxxxx> wrote:
The layout is shown in the following diagram.

+---------------+-----------+-------+----------+--------------+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|   by kernel   |   Table   | Block | for Xen  |              |
+---------------+-----------+-------+----------+--------------+
                \_____________________ _______________________/
                                      V
                                 /dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.


Isn't this OS-reserved area still not OS-agnostic, as it requires the
OS to know where the reserved area is? Or do you mean it is OS-agnostic
if it's defined by a protocol that is accepted by all OSes?

Let me list two other methods that just came to my mind.

1. The first method extends the usage of the super block used by
the current Linux kernel to reserve space on pmem.

The current Linux kernel places a super block with the following
structure near the beginning of a pmem namespace.

struct nd_pfn_sb {
    u8 signature[PFN_SIG_LEN];
    u8 uuid[16];
    u8 parent_uuid[16];
    __le32 flags;
    __le16 version_major;
    __le16 version_minor;
    __le64 dataoff; /* relative to namespace_base + start_pad */
    __le64 npfns;
    __le32 mode;
    /* minor-version-1 additions for section alignment */
    __le32 start_pad;
    __le32 end_trunc;
    /* minor-version-2 record the base alignment of the mapping */
    __le32 align;
    u8 padding[4000];
    __le64 checksum;
};

Two interesting fields here are 'dataoff' and 'mode':
- 'dataoff' indicates the offset where the data area starts,
  i.e., IIUC, the part that can be accessed via /dev/pmemN or
  /dev/daxN.
- 'mode' indicates whether Linux puts struct page for this
  namespace in RAM (= PFN_MODE_RAM) or on the device (=
  PFN_MODE_PMEM).
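
To make the two fields concrete, here is a minimal userspace sketch
(not kernel code) that decodes 'dataoff' and 'mode' from a raw copy of
the super block. It assumes PFN_SIG_LEN is 16; the names pfn_sb_raw,
get_le, pfn_sb_dataoff and pfn_sb_mode are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PFN_SIG_LEN 16 /* assumption: value used by the Linux nvdimm code */

/*
 * Byte-level mirror of struct nd_pfn_sb as quoted above. Every field is a
 * byte array, so there is no compiler padding and no endianness dependence.
 */
struct pfn_sb_raw {
    uint8_t signature[PFN_SIG_LEN];
    uint8_t uuid[16];
    uint8_t parent_uuid[16];
    uint8_t flags[4];
    uint8_t version_major[2];
    uint8_t version_minor[2];
    uint8_t dataoff[8];
    uint8_t npfns[8];
    uint8_t mode[4];
    uint8_t start_pad[4];
    uint8_t end_trunc[4];
    uint8_t align[4];
    uint8_t padding[4000];
    uint8_t checksum[8];
};

/* The quoted fields add up to exactly 4 KiB. */
static_assert(sizeof(struct pfn_sb_raw) == 4096, "unexpected super block size");

/* Decode a little-endian value of 'len' bytes (len <= 8). */
static uint64_t get_le(const uint8_t *p, unsigned len)
{
    uint64_t v = 0;

    while (len--)
        v = (v << 8) | p[len];
    return v;
}

/* Offset of the data area, relative to namespace_base + start_pad. */
uint64_t pfn_sb_dataoff(const struct pfn_sb_raw *sb)
{
    return get_le(sb->dataoff, 8);
}

/* Raw 'mode' value; compare against PFN_MODE_RAM / PFN_MODE_PMEM. */
uint32_t pfn_sb_mode(const struct pfn_sb_raw *sb)
{
    return (uint32_t)get_le(sb->mode, 4);
}
```

Everything between the end of the super block and 'dataoff' is what the
rest of this mail treats as the reserved area.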

Currently for Linux, only 'mode' is customizable, while 'dataoff'
is not. If mode == PFN_MODE_RAM, no reservation for struct page is
made on the device, and the data area starts almost immediately
after the super block, apart from a small reserved area in between
for other structures and alignment. If mode == PFN_MODE_PMEM, the
size of the reservation is decided by the kernel, i.e. 64 bytes per
struct page.
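
As a rough illustration of how large the PFN_MODE_PMEM reservation
gets, the arithmetic can be sketched as below. The 64 bytes per struct
page is the figure given above; the page size is passed in by the
caller (4 KiB in the example), and alignment padding is ignored. The
function name is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define STRUCT_PAGE_BYTES 64u /* per-page metadata size stated above */

/*
 * Approximate number of bytes needed on the device to hold one struct page
 * per page of the namespace. Ignores alignment and other small structures.
 */
uint64_t struct_page_reservation(uint64_t namespace_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);

    uint64_t npfns = namespace_bytes / page_bytes;
    return npfns * STRUCT_PAGE_BYTES;
}
```

With 4 KiB pages this is 1/64 of the namespace, so a 1 TiB namespace
would need about 16 GiB for struct page alone.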

I propose to make the size of the reserved area customizable,
e.g. via ioctl and ndctl.
- If mode == PFN_MODE_PMEM and
  * if the given reserved size is large enough to hold what an OS
    (not limited to Linux) wants to put in, then the OS just
    starts using it as desired;
  * if the given reserved size is not enough, then the OS reports
    an error and may take other fallback actions.
- If mode == PFN_MODE_RAM and
  * if the reserved size is zero, then it's the current way that
    Linux uses the device;
  * if the reserved size is non-zero, I would like to reserve this
    case for hypervisor (right now, namely the Xen hypervisor)
    usage. That is, the OS should not use the reserved area. For
    Xen, we could add a function to the Xen driver in the kernel
    to report the reserved area to the hypervisor.
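
The four cases above can be sketched as a small decision function.
This is only an illustration of the proposal, not existing kernel
code: the enum values and the function name are hypothetical, and
MODE_RAM / MODE_PMEM merely stand in for PFN_MODE_RAM / PFN_MODE_PMEM.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for PFN_MODE_RAM / PFN_MODE_PMEM; values are illustrative. */
enum sketch_mode { MODE_RAM, MODE_PMEM };

enum reserve_action {
    RESERVE_UNUSED,         /* RAM mode, size 0: current Linux behaviour   */
    RESERVE_FOR_HYPERVISOR, /* RAM mode, size > 0: OS must not touch area  */
    RESERVE_USED_BY_OS,     /* PMEM mode, area is large enough for the OS  */
    RESERVE_TOO_SMALL       /* PMEM mode, too small: report error/fallback */
};

/*
 * reserved: size of the reserved area recorded on the device, in bytes
 * needed:   size the running OS wants for its own metadata, in bytes
 */
enum reserve_action classify_reservation(enum sketch_mode mode,
                                         uint64_t reserved, uint64_t needed)
{
    assert(mode == MODE_RAM || mode == MODE_PMEM);

    if (mode == MODE_PMEM)
        return reserved >= needed ? RESERVE_USED_BY_OS : RESERVE_TOO_SMALL;

    /* MODE_RAM: the OS keeps its metadata in RAM and needs nothing here. */
    return reserved == 0 ? RESERVE_UNUSED : RESERVE_FOR_HYPERVISOR;
}
```

In the RESERVE_FOR_HYPERVISOR case, the Xen driver in Dom0 would be the
one reporting the area to the hypervisor.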

I guess this might be the OS-agnostic way Jan expects, but Dan may
object to it.


2. Lay another pseudo device on the block device (e.g. /dev/pmemN)
provided by the NVDIMM driver.

This pseudo device can reserve space of whatever size the user
requires. The reservation information can be persistently
recorded in a super block before the reserved area.

This pseudo device also implements another pseudo block device to
allow the non-reserved area to be accessed as a block device (we can
even implement it as DAX-capable).

                                              pseudo block device
                                            /---------^-----------\
+------------------+-------+---------------+-----------------------+
| whatever used    | Super | reserved by   |                       |
| by NVDIMM driver | Block | pseudo device |                       |
+------------------+-------+---------------+-----------------------+
                    \_____________________ _______________________/
                                          V
                                     /dev/pmem0
                             (provided by NVDIMM driver)

In order to make this work across different OSes, other OSes must
recognize the same types of pmem block devices made by Linux and
implement the driver for the pseudo device.
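
To illustrate the layout in the diagram, the offsets the pseudo device
would have to derive could look like the sketch below. Nothing here
exists today: the structure, the function name and the idea that the
super block sits at offset 0 of /dev/pmem0 are all assumptions made
for the sake of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result: byte ranges within /dev/pmem0. */
struct pseudo_layout {
    uint64_t reserved_off; /* start of the area reserved by the pseudo device */
    uint64_t reserved_len;
    uint64_t data_off;     /* start of the area exposed as pseudo block device */
    uint64_t data_len;
};

/*
 * Assumed layout: [super block][reserved area][non-reserved area].
 * Returns 0 on success, -1 if super block plus reservation do not fit.
 */
int pseudo_layout_compute(uint64_t dev_bytes, uint64_t sb_bytes,
                          uint64_t reserved_bytes, struct pseudo_layout *out)
{
    assert(out != NULL);

    /* Written this way to avoid overflow in sb_bytes + reserved_bytes. */
    if (sb_bytes > dev_bytes || reserved_bytes > dev_bytes - sb_bytes)
        return -1;

    out->reserved_off = sb_bytes;
    out->reserved_len = reserved_bytes;
    out->data_off = sb_bytes + reserved_bytes;
    out->data_len = dev_bytes - out->data_off;
    return 0;
}
```

Another OS implementing the pseudo device driver would need to agree on
the same super block format to arrive at the same offsets.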

This is inspired by Dan's reply at
https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg00651.html.

However, it's essentially the same as my partition solution, so I guess
Jan will still dislike it.


Any comments?

The assumption of course is that the reserved area holds no
persistent data. If that assumption didn't hold, you'd have to
have per-OS reserved areas anyway (as many of them as
there might be OSes [planned to get] installed on a particular
system).


No persistent data should be placed in the reserved area.

Thanks,
Haozhong