Re: [trace-cmd Patch RFC] trace-cmd: top: A new interface to detect peak memory

From: Pratyush Anand
Date: Thu Apr 27 2017 - 10:03:00 EST


Hi Steve,

On Wednesday 26 April 2017 07:31 PM, Steven Rostedt wrote:

Sorry for the late reply. I finally have time to start looking at
trace-cmd again.

Thanks a lot for your review :-)


On Wed, 1 Mar 2017 20:32:37 +0530
Pratyush Anand <panand@xxxxxxxxxx> wrote:


[...]

This is a nice idea, but it needs some cleanups. I tried it out just
to play with it.

First, I just did:

trace-cmd top

And nothing happened. Either a help message needs to be printed, or it
starts something. Maybe have it be interactive, that is, print out data
as it comes in?


[...]


But then for some reason -p stopped doing anything, and -e didn't
stop it either. After killing it manually and removing the pid
file, it appeared to work normally again. I can't seem to reproduce it, so
let's not worry about that now.


Yes, this version was sharing info between the two processes using files. I have a new version [1] which shares it through a socket, and that works better. The help issue is also fixed in that version.
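
Roughly, the idea is that the collecting process listens on an AF_UNIX socket and the querying invocation (-p/-e) connects to it to fetch the stats. A minimal sketch of the listening side (not the exact code from [1]; TOP_SOCK_PATH and the helper name are just placeholders):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define TOP_SOCK_PATH "/tmp/trace-cmd-top.sock"

/* create the listening socket the collecting process waits on */
static int top_open_listen_socket(void)
{
	struct sockaddr_un addr;
	int fd;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strncpy(addr.sun_path, TOP_SOCK_PATH, sizeof(addr.sun_path) - 1);
	unlink(TOP_SOCK_PATH);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 1) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

The querying side then just connect()s to the same path and reads back the formatted stats.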

More below.


[...]

+void top_disable_events(void)
+{
+ char *path;
+ char c;
+ int fd;
+ int ret;
+
+ /*
+ * WARNING: We want only a few events related to memory allocation to
+ * be enabled. Disable all events here. Later, selective events
+ * will be enabled

trace-cmd reset does this. You may want to call that instead of doing
it manually here. Someone else wanted to pull out trace-cmd reset into
its own file. I'll have to get those patches. But then we can call that
and enable tracing again.

Or better yet, if we are only tracing memory events, just create your
own instance and use that. Why bother with normal tracing?

mkdir /sys/kernel/debug/tracing/instances/trace-cmd-memory

Yes, it seems a good idea. Will do.
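
Something along these lines, assuming the usual debugfs mount point (the instance name and the helper are only illustrative):

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define TOP_INSTANCE "/sys/kernel/debug/tracing/instances/trace-cmd-memory"

/* write a value into a file inside the private instance */
static int write_instance_file(const char *file, const char *val)
{
	char path[PATH_MAX];
	int fd, ret;

	snprintf(path, sizeof(path), "%s/%s", TOP_INSTANCE, file);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, val, strlen(val));
	close(fd);
	return ret < 0 ? -1 : 0;
}

static int top_create_instance(void)
{
	/* mkdir() under instances/ creates a new tracing instance */
	if (mkdir(TOP_INSTANCE, 0750) < 0)
		return -1;
	/* enable only the memory allocation events inside it */
	if (write_instance_file("events/kmem/enable", "1") < 0)
		return -1;
	return write_instance_file("tracing_on", "1");
}

That way the top-level buffer and its event selection are left untouched.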


+ */
+ c = '0';
+ path = get_instance_file(&top_instance, "events/enable");
+ fd = open(path, O_WRONLY);
+ if (fd < 0)
+ die("failed to open '%s'", path);
+ ret = write(fd, &c, 1);
+ close(fd);
+ tracecmd_put_tracing_file(path);
+ if (ret < 1)
+ die("failed to write 0 to events/enable");
+
+ path = get_instance_file(&top_instance, "tracing_on");
+ fd = open(path, O_WRONLY);
+ if (fd < 0)
+ die("failed to open '%s'", path);
+ ret = write(fd, &c, 1);
+ close(fd);
+ tracecmd_put_tracing_file(path);
+ if (ret < 1)
+ die("failed to write 0 to tracing_on\n");
+
+ path = get_instance_file(&top_instance, "set_event");
+ fd = open(path, O_WRONLY);
+ if (fd < 0)
+ die("failed to open '%s'", path);
+ close(fd);
+ tracecmd_put_tracing_file(path);
+}
+
+void top_exit(void)
+{
+ int i;
+ /*
+ * free all memory that may have been dynamically allocated by
+ * this program
+ */
+ if (kbuf)
+ kbuffer_free(kbuf);
+ if (record)
+ free_record(record);
+ if (pevent)
+ pevent_free(pevent);

The above three functions all can handle a NULL pointer. Remove the if
statements.

ok
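
So the start of top_exit() would then simply be:

	/* kbuffer_free(), free_record() and pevent_free() all accept NULL */
	kbuffer_free(kbuf);
	free_record(record);
	pevent_free(pevent);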


+ if (top_task_stats) {
+ for (i = 0; i < top_task_cnt; i++)
+ free(top_task_stats[i].comm);
+ free(top_task_stats);
+ }
+
+ top_disable_events();
+ remove(TRACE_CMD_TOP_PID_FILE);
+}
+
+void top_initialize_events(void)
+{
+ char *path;
+ int fd;
+ char c;
+ int ret;
+ char *buf;
+ int size;
+ enum kbuffer_long_size long_size;
+ enum kbuffer_endian endian;
+ char id_str[8];
+

Might want to look at trace_stream_init() and use that instead. It is
made for live streaming of events.

OK, will look into trace_stream_init().


+ /* trace data is read from the trace_raw_pipe directly */
+ if (tracecmd_host_bigendian())
+ endian = KBUFFER_ENDIAN_BIG;
+ else
+ endian = KBUFFER_ENDIAN_LITTLE;
+

[...]

+void top_start(void)
+{
+ pid_t pid;
+ int fd, ret;
+
+ top_disable_events();
+ top_initialize_events();
+ page_size = getpagesize();
+
+ pid = fork();
+ if (!pid) {
+ do {
+ top_process_events();
+ sleep(1);

We should probably just sleep on the buffer, waiting for data.

OK
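
Right, something like poll()ing the per_cpu trace_pipe_raw descriptors should work, e.g. (trace_pipe_fds[] and cpu_count are placeholders for whatever the streaming setup ends up providing):

#include <poll.h>

/* block until at least one per-cpu buffer has data */
static void top_wait_for_data(int *trace_pipe_fds, int cpu_count)
{
	struct pollfd fds[cpu_count];
	int i;

	for (i = 0; i < cpu_count; i++) {
		fds[i].fd = trace_pipe_fds[i];
		fds[i].events = POLLIN;
	}

	poll(fds, cpu_count, -1);
}

The loop below would then call that instead of sleep(1).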


+ if (set_print_only) {
+ top_print_stats();
+ set_print_only = false;
+ }
+ if (set_print_and_exit) {
+ top_print_stats();
+ top_exit();
+ break;
+ }
+ } while (1);
+ } else {
+ fd = open(TRACE_CMD_TOP_PID_FILE, O_WRONLY | O_CREAT, 0644);
+ if (fd < 0)
+ goto kill_child;
+ ret = write(fd, &pid, sizeof(pid));
+ if (ret < (int)sizeof(pid))
+ goto close_fd;
+ close(fd);
+ }
+
+ return;
+
+close_fd:
+ close(fd);
+kill_child:
+ kill(pid, SIGUSR2);
+ die("Failed to start trace-top");
+ return;
+}
+

[...]

Looks good, I think we can extend on this.

Thanks :-)

I will implement your review comments and send the next revision.
However, I had a couple of observations which I was unable to justify:

# ./trace-cmd top -s /tmp/test
# ./trace-cmd top -p /tmp/test | grep trace-cmd
15292 trace-cmd 22144 15808
Here, 15292 is the pid of the trace-cmd task,
22144 KB is the peak memory usage, and
15808 KB is the current memory usage.

Now check rss component from statm
# cat /proc/15292/statm
50 35 23 7 0 12 0 36

This is a result on ARM64 with a 64KB page size. Therefore, as per statm, the rss is 35 pages = 35*64 = 2240 KB.
I patched my kernel [2] for testing purposes so that statm reports peak memory as well. Here, the last (extra) entry in the statm output is the peak memory, and it is 36 pages = 2304 KB.
So, there is a huge difference between what is reported by statm and what we get from trace-cmd.
I understand that the `trace-cmd top` output would only be approximate, because memory could be allocated by the task and freed in interrupt context, so the numbers can differ at times. But is such a huge difference justified? If it is, can we count on the output of this utility to find early boot-time OOM issues?


[1] https://github.com/pratyushanand/trace-cmd/commit/602c2cd96aa613633ad20c6d382e41f7db37a2a4
[2] https://github.com/pratyushanand/linux/commit/197e2045361b6b70085c46c78ea6607d8afce517