Re: x86, ptrace: support for branch trace store(BTS)

From: Ingo Molnar
Date: Tue Dec 11 2007 - 09:53:39 EST



* Metzger, Markus T <markus.t.metzger@xxxxxxxxx> wrote:

> That would be a variation on Andi's zero-copy proposal, wouldn't it?
>
> The user supplies the BTS buffer and the kernel manages DS.
>
> Andi further suggested a vDSO to interpret the data and translate the
> hardware format into a higher level user format.
>
> I take it that you would leave that inside ptrace.

yeah - i think both zero-copy and vdso are probably overkill for this.

On the highest level, there are two main use cases of BTS that i can
think of: debugging [a user-space task crashes and the developer would
like to see the last few branches taken - possibly extended to kernel
space crashes as well], and instrumentation.

In the first use-case (debugging) zero-copy is just an unnecessary
complication.

In the second use-case (tracing, profiling, call coverage metrics), we
could live without zero-copy, as long as the buffer could be made "large
enough". The current 4000 records limit seems rather low (and arbitrary)
and probably makes the mechanism unsuitable for, say, call coverage
profiling purposes. There's also no real mechanism that i can see to
create a guaranteed flow of this information between the debugger and
debuggee (unless i missed something): the code appears to overflow the
array and destroy earlier entries, right? That's "by design" for
debugging, but quite a limitation for instrumentation, which might want
to have a reliable stream of the data (and would like the originating
task to block until the debugger had an opportunity to siphon out the
data).
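
To make the overflow behaviour concrete, here is a minimal user-space
sketch of a fixed-size record ring that overwrites its oldest entries
once full. It is not the code from the patch: the record layout, the
bts_ring names and the tiny capacity are assumptions made up for
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tiny on purpose; the limit discussed above is 4000 records. */
#define BTS_RING_RECORDS 4

/* Hypothetical record layout, not the hardware BTS format. */
struct bts_record {
    uint64_t from;
    uint64_t to;
};

struct bts_ring {
    struct bts_record rec[BTS_RING_RECORDS];
    size_t head;   /* next slot to write */
    size_t count;  /* valid records, capped at BTS_RING_RECORDS */
    size_t lost;   /* records overwritten before anyone read them */
};

/* Append a record; once the ring is full the oldest entry is destroyed. */
static void bts_ring_push(struct bts_ring *r, uint64_t from, uint64_t to)
{
    assert(r != NULL);
    if (r->count == BTS_RING_RECORDS)
        r->lost++;
    else
        r->count++;
    r->rec[r->head].from = from;
    r->rec[r->head].to = to;
    r->head = (r->head + 1) % BTS_RING_RECORDS;
}

/* Return the oldest surviving record, or NULL if the ring is empty. */
static const struct bts_record *bts_ring_oldest(const struct bts_ring *r)
{
    assert(r != NULL);
    if (r->count == 0)
        return NULL;
    return &r->rec[(r->head + BTS_RING_RECORDS - r->count) % BTS_RING_RECORDS];
}
```

A debugger only cares about the last few entries, so silently bumping
"lost" is fine there. An instrumentation user would rather have the
producer stop (or block) at the point where this sketch increments
"lost".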

> I need to look more into mlock. So far, I found a system call in
> /usr/include/sys/mman.h and two functions sys_mlock() and
> user_shm_lock() in the kernel. Is there a memory expert around who
> could point me to some interesting places to look at?

sys_mlock() is what i meant - you could just call it internally from
ptrace and fail the call if sys_mlock() returns -EPERM. This keeps all
the "there's too much memory pinned down" details out of the ptrace
code.
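
As a rough user-space analogue of that "try to lock, fail the request
if it is refused" pattern, the sketch below wraps the mlock() system
call and hands the error code back to the caller. The helper names are
hypothetical, and the in-kernel version would of course call
sys_mlock() directly rather than go through libc.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try to lock a caller-supplied buffer into RAM.
 * Returns 0 on success or a negative errno value (for example -EPERM
 * or -ENOMEM when the caller may not lock that much memory), so the
 * caller can simply fail the whole request.
 */
static int lock_trace_buffer(const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (mlock(buf, len) != 0)
        return -errno;
    return 0;
}

/* Hypothetical counterpart: release the lock taken above. */
static int unlock_trace_buffer(const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (munlock(buf, len) != 0)
        return -errno;
    return 0;
}
```

All the policy (rlimits, privileges) stays inside mlock(); the caller
only looks at the return value.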

> Can we distinguish kernel-locked memory from user-locked memory? I
> could imagine a malicious user to munlock() the buffer he provided to
> ptrace.

yeah. Once mlock()-ed, you need to "pin it" via get_user_pages(). That
gives a permanent reference count to those pages.

> Is there a real difference between mlock()ing user memory and
> allocating kernel memory? There would be if we could page out
> mlock()ed memory when the user thread is not running. We would need to
> disable DS before paging out, and page in before enabling it. If we
> cannot, then kernel allocated memory would require less space in
> physical memory.

mlock() would in essence just give you an easy "does this user have
enough privilege to lock this many pages" API. The real pinning would be
done by get_user_pages(). Once you have those pages, they won't be
swapped out.

Ingo