[RFC] persistent store

From: Luck, Tony
Date: Sat Nov 20 2010 - 18:48:28 EST


Here's a patch based on some discussions I had with Thomas
Gleixner at plumbers conference that implements a generic
layer for persistent storage usable to pass tens or hundreds
of kilobytes of data from the dying breath of a crashing
kernel to its successor.

The usage model I'm envisioning is that a platform driver
will register with this code to provide the actual storage.
I've tried to make this interface general, but I'm working
from a sample of one (the ACPI ERST code), so if anyone else
has some persistent store that can't be handled by this code,
speak up and we can put in the necessary tweaks.

My assumptions are that the data that Linux cares about will
be wrapped in some error record structure with a header, and
possibly a footer that the device code needs. So the driver
specifies how much padding to put around a buffer to make
life easy for it. It also specifies the maximum number of
bytes that can be saved in one record.
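
To make the padding idea concrete, here is a minimal user-space sketch of
the buffer layout. It is not part of the patch: "struct layout" and both
helper names are made up for illustration, and only the three size fields
correspond to what a driver would specify.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical mirror of the three sizes a backend driver declares. */
struct layout {
	size_t header_size;	/* bytes reserved in front of the payload */
	size_t data_size;	/* maximum payload bytes in one record */
	size_t footer_size;	/* bytes reserved behind the payload */
};

/* Total number of bytes needed for one record buffer. */
static size_t record_buf_size(const struct layout *l)
{
	return l->header_size + l->data_size + l->footer_size;
}

/*
 * Copy a payload into the data area of buf, leaving the header area
 * untouched for the backend to fill in. Returns the number of bytes
 * stored, or 0 if the payload is larger than the backend allows.
 */
static size_t place_payload(const struct layout *l, char *buf,
			    const char *payload, size_t len)
{
	if (len > l->data_size)
		return 0;
	memcpy(buf + l->header_size, payload, len);
	return len;
}
```

The generic code only ever touches the bytes starting at offset
header_size; everything outside that window belongs to the driver.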

There are three callback functions from the generic code to
the driver:

"reader" which iterates over all records currently in the
store - returning type, size and a record identifier as
well as the actual data.

"writer" which writes a record with a type to the persistent store

"eraser" which takes a record identifier, and clears that item
from the store.
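
As a rough illustration of what a backend has to supply, here is a
user-space mock of the three callbacks on top of a small in-memory array.
Everything named mock_* is hypothetical, uint64_t stands in for the
kernel's u64, and the mock assumes header and footer sizes of zero. The
return convention for the reader (positive while records remain, zero at
the end) is my reading of how the generic code consumes it, not a
documented contract.

```c
#include <stdint.h>
#include <string.h>

#define MOCK_SLOTS 4	/* hypothetical capacity of the fake store */
#define MOCK_DATA  32	/* hypothetical maximum payload per record */

struct mock_record {
	int used;
	uint64_t id;
	int type;
	unsigned long size;
	char data[MOCK_DATA];
};

static struct mock_record mock_store[MOCK_SLOTS];
static uint64_t mock_next_id = 1;
static int mock_cursor;		/* next slot the reader will examine */

/* Return 1 and fill the out-parameters for the next record, 0 when done. */
static int mock_reader(uint64_t *id, int *type, char *buf, unsigned long *size)
{
	while (mock_cursor < MOCK_SLOTS) {
		struct mock_record *r = &mock_store[mock_cursor++];

		if (!r->used)
			continue;
		*id = r->id;
		*type = r->type;
		*size = r->size;
		memcpy(buf, r->data, r->size);
		return 1;
	}
	return 0;
}

/* Store one record in the first free slot; -1 if full or too large. */
static int mock_writer(int type, char *buf, unsigned long size)
{
	int i;

	if (size > MOCK_DATA)
		return -1;
	for (i = 0; i < MOCK_SLOTS; i++) {
		struct mock_record *r = &mock_store[i];

		if (r->used)
			continue;
		r->used = 1;
		r->id = mock_next_id++;
		r->type = type;
		r->size = size;
		memcpy(r->data, buf, size);
		return 0;
	}
	return -1;
}

/* Clear the record with the given identifier; -1 if it does not exist. */
static int mock_eraser(uint64_t id)
{
	int i;

	for (i = 0; i < MOCK_SLOTS; i++) {
		if (mock_store[i].used && mock_store[i].id == id) {
			mock_store[i].used = 0;
			return 0;
		}
	}
	return -1;
}
```

A real driver would talk to its device instead of an array, but the shape
of the three entry points would be the same.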


The Linux user-visible interface is via /sys (similar to
the "efivars" interface):

# ls -l /sys/firmware/pstore
total 0
-r--r--r-- 1 root root 0 2010-11-20 11:03 dmesg-0
--w------- 1 root root 0 2010-11-20 11:03 erase

The "type" of error record I mentioned earlier is used to
name the files ... saved console logs from kmsg_dump() are
named with a "dmesg" prefix as shown above.
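
The naming rule can be sketched in user space as below. The function name
and the REC_* constants are invented for this example (their values match
the type numbers used in the patch), and the sketch uses a bounded
snprintf() so that an unusually large type or sequence number is reported
as an error rather than overrunning the 16-byte name buffer.

```c
#include <stdio.h>

#define NAMELEN 16	/* same size as the name buffer in the patch */

enum { REC_DMESG = 0, REC_MCE = 1 };	/* hypothetical stand-ins */

/*
 * Build the file name for a record from its type and a sequence number.
 * Returns 0 on success, -1 if the name did not fit in the buffer.
 */
static int record_name(char *name, size_t len, int type, int seq)
{
	int n;

	switch (type) {
	case REC_DMESG:
		n = snprintf(name, len, "dmesg-%d", seq);
		break;
	case REC_MCE:
		n = snprintf(name, len, "mce-%d", seq);
		break;
	default:
		/* Unknown types still get a predictable name. */
		n = snprintf(name, len, "type%d-%d", type, seq);
		break;
	}
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```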

Once an error record has been viewed, analysed, and saved, the
user can request that it be cleared by writing its name to the
"erase" file:

# echo "dmesg-0" > erase
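
A tool that archives records could do the same thing from C. The helper
below is only a sketch: its name is made up, and the directory is passed
in as a parameter rather than hard-coded, since the final pathname is
still open to discussion.

```c
#include <stdio.h>

/*
 * Ask for a record to be erased by writing its name, followed by a
 * newline, to the "erase" file in the given directory.
 * Returns 0 on success, -1 on any error.
 */
static int pstore_request_erase(const char *dir, const char *record)
{
	char path[4096];
	FILE *f;
	int n, rc = 0;

	n = snprintf(path, sizeof(path), "%s/erase", dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fprintf(f, "%s\n", record) < 0)
		rc = -1;
	/* fclose() flushes, so a failed write may only show up here. */
	if (fclose(f) != 0)
		rc = -1;

	return rc;
}
```

For the listing shown earlier, the call would be
pstore_request_erase("/sys/firmware/pstore", "dmesg-0").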

Answers to a few questions that I think you might ask:

1) "Why do you only allow one platform driver to register?"
I only have one such driver. Adding more is easy from the "read" side
(just collect all the records from all devices and remember where they
came from so you can call the correct "eraser" function). But the "write"
side opens up questions that I don't have good answers for:
- Which device(s) should error records be written to?
All of them? Start with one and move on when it is
full? Write some types of records to one device?
If someone has a machine with multiple persistent storage devices -
then we can talk about how to answer these questions.

2) "Why do you read in all the data from the device when it
registers and save it in memory? Couldn't you just get the
list of records and pick up the data from the device when
the user reads the file?"
I don't think this is going to be very much data, just a few hundred
kilobytes (i.e. less than $0.01 worth of memory, even expensive server
memory). The memory is freed when the record is erased ... which is
likely to be soon after boot.

3) "/sys/firmware/pstore is the wrong pathname for this".
You are probably right. I put it under "firmware" because that's where
the "efivars" driver put its top level directory. In my case the ERST
back end is firmware, so there is some vague logic to it ... but better
suggestions are welcome. Perhaps /sys/devices/platform/pstore?

4) "/sys is the wrong place for this."
Perhaps. I definitely want to use some sort of filesystem interface (so
each record shows up as a file to the user). This seems a lot cleaner
than trying to map the semantics of actual persistent storage devices
onto a character device. The "sysfs_create_bin_file()" API seems very
well designed for this usage. If not /sys, then where? "debugfs"
would work - but not everyone mounts debugfs. Creating a whole new
filesystem for this seems like overkill.

5) "Why is the record identifier type 'u64'?"
This is one place where I knowingly let the ERST implementation bleed
all the way up to the top - it uses 64-bit record numbers. It would be
possible to map these to something smaller like "int" ... but the code
to do so would be far larger than the memory saved. The most common
usage case is likely to be a software crash with just one "dmesg" record.

6) "Is this widely useful? How many systems have persistent storage?"
Although ERST was only added to the ACPI spec earlier this year, it
merely documents existing functionality required for WHEA (Windows
Hardware Error Architecture). So most modern server systems should
have it (my test system has it, and it has a BIOS written in mid 2008).
Sorry desktops & laptops - no love for you here.

No-sign-off-yet-this-is-just-RFC

-Tony

---
diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index e8b6a13..06afe40 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -134,4 +134,16 @@ config ISCSI_IBFT
detect iSCSI boot parameters dynamically during system boot, say Y.
Otherwise, say N.

+config PSTORE
+ tristate "Persistent store support via /sys"
+ default n
+ help
+ This option enables generic access to platform level persistent
+ storage via /sys/firmware/pstore. Only useful if you have a
+ platform level driver that registers with pstore to provide
+ the data, so you probably should just go say "Y" (or "M") to
+ a platform specific persistent store driver (e.g. ACPI_APEI on
+ X86) which will select this for you. If you don't have a platform
+ persistent store driver, say N.
+
endmenu
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index 1c3c173..ba19784 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_DMIID) += dmi-id.o
obj-$(CONFIG_ISCSI_IBFT_FIND) += iscsi_ibft_find.o
obj-$(CONFIG_ISCSI_IBFT) += iscsi_ibft.o
obj-$(CONFIG_FIRMWARE_MEMMAP) += memmap.o
+obj-$(CONFIG_PSTORE) += pstore.o
diff --git a/drivers/firmware/pstore.c b/drivers/firmware/pstore.c
new file mode 100644
index 0000000..e11b454
--- /dev/null
+++ b/drivers/firmware/pstore.c
@@ -0,0 +1,313 @@
+/*
+ * Persistent Storage - pstore.c
+ *
+ * Copyright (C) 2010 Intel Corporation <tony.luck@xxxxxxxxx>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via sysfs.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/atomic.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kmsg_dump.h>
+#include <linux/module.h>
+#include <linux/pstore.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+MODULE_AUTHOR("Tony Luck <tony.luck@xxxxxxxxx>");
+MODULE_DESCRIPTION("sysfs interface to persistent storage");
+MODULE_LICENSE("GPL");
+
+static DEFINE_SPINLOCK(pstore_lock);
+static LIST_HEAD(pstore_list);
+static struct kset *pstore_kset;
+
+#define PSTORE_NAMELEN 16
+
+struct pstore_entry {
+ struct bin_attribute attr;
+ char name[PSTORE_NAMELEN];
+ u64 id;
+ int type;
+ int size;
+ struct list_head list;
+ char data[];
+};
+
+static int pstore_create_sysfs_entry(struct pstore_entry *new_pstore);
+
+static struct pstore_info *psinfo;
+
+static char *pstore_buf;
+
+/*
+ * callback from kmsg_dump. (s2,l2) has the most recently
+ * written bytes, older bytes are in (s1,l1). Save as much
+ * as we can from the end of the buffer.
+ */
+static void
+pstore_dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
+ const char *s1, unsigned long l1,
+ const char *s2, unsigned long l2)
+{
+ unsigned long s1_start, s2_start;
+ unsigned long l1_cpy, l2_cpy;
+ char *dst = pstore_buf + psinfo->header_size;
+
+ /* Don't dump oopses to persistent store */
+ if (reason == KMSG_DUMP_OOPS)
+ return;
+
+ l2_cpy = min(l2, psinfo->data_size);
+ l1_cpy = min(l1, psinfo->data_size - l2_cpy);
+
+ s2_start = l2 - l2_cpy;
+ s1_start = l1 - l1_cpy;
+
+ memcpy(dst, s1 + s1_start, l1_cpy);
+ memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+
+ psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
+}
+
+static struct kmsg_dumper pstore_dumper = {
+ .dump = pstore_dump,
+};
+
+/*
+ * platform specific persistent storage driver registers with
+ * us here. Read out all the records right away and install
+ * them in /sys. Register with kmsg_dump to save last part
+ * of console log on panic.
+ */
+int
+pstore_register(struct pstore_info *psi)
+{
+ struct pstore_entry *new_pstore;
+ int rc = 0, type;
+ unsigned long size;
+ u64 id;
+ unsigned long ps_maxsize;
+
+ spin_lock(&pstore_lock);
+ if (psinfo) {
+ spin_unlock(&pstore_lock);
+ return -EBUSY;
+ }
+ psinfo = psi;
+ spin_unlock(&pstore_lock);
+
+ ps_maxsize = psi->header_size + psi->data_size + psi->footer_size;
+ pstore_buf = kzalloc(ps_maxsize, GFP_KERNEL);
+ if (!pstore_buf)
+ return -ENOMEM;
+ for (;;) {
+ if (psi->reader(&id, &type, pstore_buf, &size) <= 0)
+ break;
+ new_pstore = kzalloc(sizeof(struct pstore_entry) + size,
+ GFP_KERNEL);
+ if (!new_pstore) {
+ rc = -ENOMEM;
+ break;
+ }
+ new_pstore->id = id;
+ new_pstore->type = type;
+ new_pstore->size = size;
+ memcpy(new_pstore->data, pstore_buf + psi->header_size, size);
+ if (pstore_create_sysfs_entry(new_pstore)) {
+ kfree(new_pstore);
+ rc = -EINVAL;
+ break;
+ }
+ }
+
+ kobject_uevent(&pstore_kset->kobj, KOBJ_ADD);
+
+ kmsg_dump_register(&pstore_dumper);
+
+ return rc;
+}
+EXPORT_SYMBOL_GPL(pstore_register);
+
+int
+pstore_write(int type, char *buf, unsigned long size)
+{
+ if (!psinfo->writer)
+ return -ENODEV;
+ if (size > psinfo->data_size)
+ return -EFBIG;
+
+ memcpy(pstore_buf + psinfo->header_size, buf, size);
+ return psinfo->writer(type, pstore_buf, size);
+}
+EXPORT_SYMBOL_GPL(pstore_write);
+
+#define to_pstore_entry(obj) container_of(obj, struct pstore_entry, attr)
+
+/*
+ * "read" function for files containing persistent store records
+ */
+static ssize_t pstore_show(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr, char *buf,
+ loff_t offset, size_t count)
+{
+ struct pstore_entry *ps = to_pstore_entry(bin_attr);
+
+ return memory_read_from_buffer(buf, count, &offset,
+ ps->data, ps->size);
+}
+
+/*
+ * Erase records by writing their filename to the "erase" file. E.g.
+ * # echo "dmesg-0" > erase
+ */
+static ssize_t pstore_erase(struct file *filp, struct kobject *kobj,
+ struct bin_attribute *bin_attr,
+ char *buf, loff_t pos, size_t count)
+{
+ struct pstore_entry *search_pstore, *n;
+ int len1, len2, found = 0;
+
+ len1 = count;
+ if (buf[len1 - 1] == '\n')
+ len1--;
+
+ spin_lock(&pstore_lock);
+
+ /*
+ * Find this record
+ */
+ list_for_each_entry_safe(search_pstore, n, &pstore_list, list) {
+ len2 = strlen(search_pstore->name);
+ if (len1 == len2 && memcmp(buf, search_pstore->name,
+ len1) == 0) {
+ found = 1;
+ break;
+ }
+ }
+ if (!found) {
+ spin_unlock(&pstore_lock);
+ return -EINVAL;
+ }
+
+ if (psinfo->eraser)
+ if (psinfo->eraser(search_pstore->id)) {
+ spin_unlock(&pstore_lock);
+ return -EIO;
+ }
+
+ list_del(&search_pstore->list);
+
+ spin_unlock(&pstore_lock);
+
+ sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
+
+ return count;
+}
+
+static struct bin_attribute attr_erase = {
+ .attr = {.name = "erase", .mode = 0200},
+ .write = pstore_erase,
+};
+
+static int
+pstore_create_sysfs_entry(struct pstore_entry *new_pstore)
+{
+ static atomic_t next;
+ int error, seq;
+
+ seq = atomic_add_return(1, &next);
+
+ switch (new_pstore->type) {
+ case PSTORE_DMESG:
+ sprintf(new_pstore->name, "dmesg-%d", seq);
+ break;
+ case PSTORE_MCE:
+ sprintf(new_pstore->name, "mce-%d", seq);
+ break;
+ default:
+ sprintf(new_pstore->name, "type%d-%d", new_pstore->type, seq);
+ break;
+ }
+
+ sysfs_attr_init(&new_pstore->attr.attr);
+ new_pstore->attr.size = 0;
+ new_pstore->attr.read = pstore_show;
+ new_pstore->attr.attr.name = new_pstore->name;
+ new_pstore->attr.attr.mode = 0444;
+ error = sysfs_create_bin_file(&pstore_kset->kobj, &new_pstore->attr);
+ if (!error) {
+ spin_lock(&pstore_lock);
+ list_add(&new_pstore->list, &pstore_list);
+ spin_unlock(&pstore_lock);
+ }
+ return error;
+}
+
+static int __init
+pstore_init(void)
+{
+ int error = 0;
+
+ /* Register the pstore directory at /sys/firmware/pstore */
+ pstore_kset = kset_create_and_add("pstore", NULL, firmware_kobj);
+ if (!pstore_kset) {
+ printk(KERN_ERR "pstore: Subsystem registration failed.\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * Add attribute to allow records to be erased from persistent store
+ */
+ error = sysfs_create_bin_file(&pstore_kset->kobj,
+ &attr_erase);
+ if (error) {
+ printk(KERN_ERR "pstore: unable to create 'erase' sysfs file"
+ " due to error %d\n", error);
+ kset_unregister(pstore_kset);
+ }
+
+ return error;
+}
+
+static void __exit
+pstore_exit(void)
+{
+ struct pstore_entry *entry, *n;
+
+ if (psinfo)
+ kmsg_dump_unregister(&pstore_dumper);
+
+ list_for_each_entry_safe(entry, n, &pstore_list, list) {
+ spin_lock(&pstore_lock);
+ list_del(&entry->list);
+ spin_unlock(&pstore_lock);
+ sysfs_remove_bin_file(&pstore_kset->kobj, &entry->attr);
+ }
+ sysfs_remove_bin_file(&pstore_kset->kobj, &attr_erase);
+
+ kset_unregister(pstore_kset);
+
+ kfree(pstore_buf);
+}
+
+module_init(pstore_init);
+module_exit(pstore_exit);
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
new file mode 100644
index 0000000..785ad86
--- /dev/null
+++ b/include/linux/pstore.h
@@ -0,0 +1,54 @@
+/*
+ * Persistent Storage - pstore.h
+ *
+ * Copyright (C) 2010 Intel Corporation <tony.luck@xxxxxxxxx>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via sysfs.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+#ifndef _LINUX_PSTORE_H
+#define _LINUX_PSTORE_H
+
+/* types */
+#define PSTORE_DMESG 0
+#define PSTORE_MCE 1
+
+struct pstore_info {
+ unsigned long header_size;
+ unsigned long data_size;
+ unsigned long footer_size;
+ int (*reader)(u64 *id, int *type, char *buf, unsigned long *size);
+ int (*writer)(int type, char *buf, unsigned long size);
+ int (*eraser)(u64 id);
+};
+
+#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)
+extern int pstore_register(struct pstore_info *);
+extern int pstore_write(int type, char *buf, unsigned long size);
+#else
+static inline int
+pstore_register(struct pstore_info *psi)
+{
+ return -ENODEV;
+}
+static inline int
+pstore_write(int type, char *buf, unsigned long size)
+{
+ return -ENODEV;
+}
+#endif
+
+#endif /*_LINUX_PSTORE_H*/