[RFC PATCH 0/2] ivring: Add IVRing driver

From: Yoshihiro YUNOMAE
Date: Tue Jun 05 2012 - 07:00:23 EST


Hi All,

The following patch set provides a new communication path, "IVRing", which lets
a host collect kernel log or tracing data from guests without using the network
in a virtualization environment. Log or tracing data is usually written to a
file and then collected over the network. However, I/O resources such as network
and block devices are shared with other guests, so they should not be consumed
by logging or tracing. Moreover, sending the data over the network puts a high
load on guest applications that use network I/O, because the data passes
through many network stack layers. A communication method that collects the
data without using I/O resources is therefore needed.

There are two requirements for a host to collect kernel log or tracing data:
(1) To minimize the overhead for user applications in a guest
- not using I/O resources
(2) To implement a recording buffer such as a ring buffer
- keep on recording log data or trace data
To meet these requirements, IVRing, a ring buffer implemented as a device
driver for guest OSs, is built on the Inter-VM shared memory (IVShmem) device.
IVShmem, implemented in QEMU, is a virtual PCI RAM device that uses POSIX shared
memory on a host. It was originally intended as a virtual device for
low-overhead communication between two guests. Here, in contrast, IVShmem is
used as a communication path between a guest and a host for collecting data.
IVRing is a buffer for logging or tracing data in a guest, and IVRing-reader,
which opens the shared memory as IVRing on a host, reads the data without
copying memory between the guest and the host. Thus, both requirements for
collecting kernel log or tracing data are met.
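
As a rough illustration of the idea (not the code in this patch set), the sketch
below models a ring buffer whose header and data area live in one memory region
that a writer and a reader both access. The struct layout, RING_CAP, and the
function names are assumptions made for this example; the real layout is the one
defined in drivers/ivshmem/ivring.h. Memory barriers and locking are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_CAP 16u            /* tiny capacity, for illustration only */

/* Hypothetical layout: counters followed by the data area. */
struct ring {
    uint64_t head;              /* total bytes written (writer-owned) */
    uint64_t tail;              /* total bytes read (reader-owned) */
    unsigned char data[RING_CAP];
};

static void ring_init(struct ring *r)
{
    memset(r, 0, sizeof(*r));
}

/* Writer side: never waits; old data is overwritten when the ring is full. */
static void ring_write(struct ring *r, const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i;

    for (i = 0; i < len; i++)
        r->data[(r->head + i) % RING_CAP] = p[i];
    r->head += len;
}

/* Reader side: returns the number of bytes copied into buf (0 if empty). */
static size_t ring_read(struct ring *r, void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t avail, i;

    /* If the writer lapped the reader, skip to the oldest surviving byte. */
    if (r->head - r->tail > RING_CAP)
        r->tail = r->head - RING_CAP;

    avail = (size_t)(r->head - r->tail);
    if (len > avail)
        len = avail;
    for (i = 0; i < len; i++)
        p[i] = r->data[(r->tail + i) % RING_CAP];
    r->tail += len;
    return len;
}
```

In this model the writer corresponds to the guest driver and the reader to
IVRing-reader on the host; because both sides touch the same region, no data has
to be sent through an I/O device to reach the reader.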

We will talk about IVRing at LinuxCon Japan 2012:
https://events.linuxfoundation.org/events/linuxcon-japan
Title: Low-Overhead Ring-Buffer of Kernel Tracing &
Tracing Across Host OS and Guest OS
Speakers: Yoshihiro Yunomae and Akihiro Nagai
You can download our slides about IVRing from the schedule page.

***Evaluation***
This evaluation compares the performance of IVRing with that of the network
when a host collects tracing data from a guest.

<environment>
The overview of this evaluation is as follows:
(a) A guest on KVM is prepared.
- One physical CPU is dedicated to the guest as its virtual CPU (VCPU).

(b) The guest starts to write tracing data to a SystemTap buffer.
- The SystemTap probe points are all tracepoints of sched, timer,
and kmem.

(c) The tracing data is either recorded to IVRing, which shares memory with a
host, or sent to a host via the network.
- 3 patterns, IVRing, NFS, and SSH, are measured.
Each method is explained later.

(d) While trace data is being written, Dhrystone 2 from UnixBench is executed
as a benchmark tool in the guest.
- Dhrystone 2 reports system performance as a score by repeating integer
arithmetic.
- Since a higher score means better system performance, a score lower than
that of the bare environment indicates that some operation disturbs
the integer arithmetic. We therefore define the overhead of
transporting trace data as follows:
OVERHEAD = (1 - SCORE_OF_A_METHOD/BARE_SCORE) * 100.

The performance of each method is compared as follows:
[1] IVRing
- A SystemTap script in a guest records trace data to IVRing.
- An IVRing-reader on a host reads the data.
[2] NFS
- A directory in a guest is shared with that in a host via NFS.
- A SystemTap script in a guest records trace data to a file
in the directory.
[3] SSH
- A SystemTap script in a guest outputs trace data to a host using
standard output via SSH.

Other information is as follows:
- host
kernel: 3.3.1-5 (Fedora16)
CPU: Intel Xeon x5660@xxxxxxx(6core)
Memory: 50GB

- guest(only booting one guest)
kernel: 3.4.0+ (Fedora16)
CPU: 1VCPU(dedicated)
Memory: 2GB

<result>
The results of the 3 patterns, relative to the bare environment, are as follows:
            Scores     Overhead against [0] Bare
[0] Bare    29043600   -
[1] IVRing  28565398    1.6[%]
[2] NFS     22000508   24.3[%]
[3] SSH     10246792   64.7[%]
The overhead of IVRing is much lower than that of the other methods, which use
the network. This is because the IVRing method only records trace data to a
ring buffer, whereas the other methods read trace data from a SystemTap buffer
into userland and send it to a host via the network. Therefore, IVRing
minimizes the overhead of transporting trace data from a guest to a host.
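
For reference, the overhead column can be recomputed from the scores with the
OVERHEAD formula defined in the environment description. The following is a
minimal sketch; the function names are made up for this example, and the scores
are the ones reported in the result table.

```c
#include <stddef.h>
#include <stdio.h>

/* OVERHEAD = (1 - SCORE_OF_A_METHOD / BARE_SCORE) * 100, in percent. */
static double overhead_percent(double method_score, double bare_score)
{
    return (1.0 - method_score / bare_score) * 100.0;
}

/* Print the overhead of each measured method against the bare score. */
static void print_overheads(void)
{
    const double bare = 29043600.0;
    const struct {
        const char *name;
        double score;
    } rows[] = {
        { "IVRing", 28565398.0 },
        { "NFS",    22000508.0 },
        { "SSH",    10246792.0 },
    };
    size_t i;

    for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
        printf("%-6s %5.1f%%\n", rows[i].name,
               overhead_percent(rows[i].score, bare));
}
```

Calling print_overheads() from main() prints one line per method; to one
decimal place the values agree with the overhead column of the result table.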

***How to use***
This section briefly explains how to use IVRing and IVRing-reader.

1. Prepare any distribution that includes a qemu-kvm binary of version 0.13.0
or later. IVShmem has been in qemu-kvm mainline since version 0.13.0.
Recent Fedora or Ubuntu releases can be used.

2. Boot a guest that has the IVRing driver installed, with the device option.
A device option is needed as follows:
-device ivshmem,size=<shm_size in MB>,shm=<shm_obj>
shm_obj, the shared memory object path, is used later to share the memory
region with the reader on a host. For example, a device option looks like this:
-device ivshmem,size=2,shm=/ivshmem
IVShmem supports an interrupt mode using ivshmem_server, and this IVRing driver
can ring a doorbell to the reader as an experimental feature.
This feature will be used in the near future.

3. Run IVRing-reader on a host.
To share the memory region with IVShmem, the -s option must specify the same
shm_obj as in step 2, as below:
./ivring_reader -m 2 -f /tmp/log.txt -S 10 -N 2 -s /ivshmem
Each option is described in detail in the 2nd patch.
IVRing-reader then starts to read data from IVRing, but the ring buffer is
still empty:
shared object size: 2097152 (bytes)
Ring header is already initialized
reader -1, writer 0, pos 20074a9f
ivring_init_hdr: 0x7f128417d000
Receive an interrupt 2
Try to read buffer.
Receive an interrupt 2
no data
__ivring_read ret=0
Try to read buffer.
no data
__ivring_read ret=0
Try to read buffer.
...
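
Since IVShmem is backed by POSIX shared memory on the host, a host-side program
can reach the same region with shm_open() and mmap(). The sketch below shows
only that mapping step and assumes the object (for example /ivshmem) already
exists; map_shm_object() is a hypothetical helper written for this example, not
a function of ivring_reader.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map an existing POSIX shared memory object read/write.
 * Returns the mapping and stores its size in *len, or returns NULL on failure.
 */
static void *map_shm_object(const char *name, size_t *len)
{
    struct stat st;
    void *p;
    int fd;

    fd = shm_open(name, O_RDWR, 0);
    if (fd < 0)
        return NULL;

    /* The object size tells us how many bytes to map. */
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
             MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return NULL;

    *len = (size_t)st.st_size;
    return p;
}
```

With size=2 in the device option, the mapped length would be expected to match
the "shared object size: 2097152 (bytes)" line in the output above. On older
glibc versions, programs using shm_open() need to be linked with -lrt.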

4. Start to record logging or tracing data on a guest.
The IVRing driver provides the following API for kernel programming:
ivring_write(int ID, void *buf, size_t size).

It is used for kernel logging as follows:

int len;
char buf[1024];
len = sprintf(buf, "hogehoge\n", ...);
ivring_write(0, buf, len);
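
Because sprintf() does not check the size of buf, a caller may prefer a bounded
variant. The following userspace sketch wraps the call in a hypothetical helper,
ivring_printf(), which is not part of this patch set. ivring_write() is
replaced here by a stand-in that only captures the record, and its return value
(the number of bytes written) is an assumption of this example.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the driver's ivring_write(); it keeps the last record. */
static char last_record[1024];
static size_t last_len;

static int ivring_write(int id, void *buf, size_t size)
{
    (void)id;
    if (size > sizeof(last_record))
        size = sizeof(last_record);
    memcpy(last_record, buf, size);
    last_len = size;
    return (int)size;           /* assumed return value */
}

/* Hypothetical helper: format into a bounded buffer, then write it. */
static int ivring_printf(int id, const char *fmt, ...)
{
    char buf[1024];
    va_list ap;
    int len;

    va_start(ap, fmt);
    len = vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    if (len < 0)
        return len;             /* formatting error */
    if ((size_t)len >= sizeof(buf))
        len = (int)sizeof(buf) - 1;     /* output was truncated */

    return ivring_write(id, buf, (size_t)len);
}
```

In kernel code the same pattern could use vscnprintf(), which returns the
number of characters actually stored in the buffer, so the truncation check
would not be needed.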

When SystemTap is used as a tracer, a sample script is as follows:

%{
extern int ivring_write(int id, void *buf, size_t size);
%}

function ivring_print(str:string) %{
ivring_write(0, THIS->str, strlen(THIS->str));
%}

probe kernel.trace("sched*") {
ivring_print(sprintf("%u: %s(%s)\n", gettimeofday(), pn(), $$parms))
}

The script is executed as follows:
stap -vg ivring_writer_sample.stp

When data is successfully recorded to IVRing, the reader outputs the following:
Try to read buffer.
__ivring_read ret=4096
__ivring_read ret=4096
__ivring_read ret=313
Try to read buffer.
__ivring_read ret=4096
__ivring_read ret=4096
__ivring_read ret=632
Try to read buffer.
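
The output above (two reads of 4096 bytes followed by a shorter one) is
consistent with a reader that drains the ring in fixed-size chunks until a read
comes back short. Assuming that behavior, such a loop could be sketched as
below; CHUNK, read_fn, drain(), and mock_read() are all hypothetical names used
only for this example.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4096u             /* assumed size of one read request */

/* Hypothetical read callback: returns bytes copied into buf (0 if empty). */
typedef size_t (*read_fn)(void *ctx, void *buf, size_t len);

/*
 * Read CHUNK bytes at a time until a read returns less than CHUNK.
 * Returns the total bytes read; *calls counts the non-empty reads.
 */
static size_t drain(read_fn rd, void *ctx, unsigned *calls)
{
    unsigned char buf[CHUNK];
    size_t total = 0;
    size_t n;

    *calls = 0;
    do {
        n = rd(ctx, buf, sizeof(buf));
        if (n == 0)
            break;
        /* A real reader would append buf[0..n) to its log file here. */
        total += n;
        (*calls)++;
    } while (n == sizeof(buf));

    return total;
}

/* Mock data source: ctx points to the number of bytes still available. */
static size_t mock_read(void *ctx, void *buf, size_t len)
{
    size_t *left = ctx;
    size_t n = *left < len ? *left : len;

    memset(buf, 'x', n);
    *left -= n;
    return n;
}
```

For example, with 8505 bytes pending in the mock source, drain() performs reads
of 4096, 4096, and 313 bytes, matching the first pass in the output above.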

***Future Work***
The following features will be implemented as future work:
1. To implement notification from a guest to a host
2. To implement a user I/F on a guest
3. To be usable from existing in-kernel tracing systems
4. To be usable in SMP environments
(lockless ring buffer like ftrace, one ring buffer per CPU)
5. To design for Live Migration

Thank you,

---

Yoshihiro YUNOMAE (2):
ivring: Add a ring-buffer reader tool
ivring: Add a ring-buffer driver on IVShmem


drivers/Kconfig | 1
drivers/Makefile | 1
drivers/ivshmem/Kconfig | 9 +
drivers/ivshmem/Makefile | 5
drivers/ivshmem/ivring.c | 551 +++++++++++++++++++++++++++++++++++++++++
drivers/ivshmem/ivring.h | 77 ++++++
tools/Makefile | 1
tools/ivshmem/Makefile | 19 +
tools/ivshmem/ivring_reader.c | 516 ++++++++++++++++++++++++++++++++++++++
tools/ivshmem/ivring_reader.h | 15 +
tools/ivshmem/pr_msg.c | 125 +++++++++
tools/ivshmem/pr_msg.h | 19 +
12 files changed, 1339 insertions(+), 0 deletions(-)
create mode 100644 drivers/ivshmem/Kconfig
create mode 100644 drivers/ivshmem/Makefile
create mode 100644 drivers/ivshmem/ivring.c
create mode 100644 drivers/ivshmem/ivring.h
create mode 100644 tools/ivshmem/Makefile
create mode 100644 tools/ivshmem/ivring_reader.c
create mode 100644 tools/ivshmem/ivring_reader.h
create mode 100644 tools/ivshmem/pr_msg.c
create mode 100644 tools/ivshmem/pr_msg.h

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae.ez@xxxxxxxxxxx