Re: [PATCH 1/2] x86/mm: don't mmap() over 4GB with compat syscall

From: Dmitry Safonov
Date: Thu Jan 12 2017 - 09:15:23 EST


On 01/12/2017 01:26 AM, Andy Lutomirski wrote:
On Wed, Jan 11, 2017 at 10:17 AM, Dmitry Safonov <dsafonov@xxxxxxxxxxxxx> wrote:
While fixing CRIU bugs in the ZDTM tests for 32-bit C/R, I found that
the compat ia32/x32 syscalls mmap() and mmap2() can return an address
above 4 GB in x86_64 applications, which results in returning the
lower 4 bytes of the address while dropping the higher bytes.
It happens because the mmap() upper limit doesn't differ between
native and compat syscalls for a 64-bit task: it's
(TASK_UNMAPPED_BASE + random_factor), which is
PAGE_ALIGN(TASK_SIZE / 3) + random_factor
(in the legacy mmap case it's just TASK_SIZE).
This patch limits the highest address that can be mmaped with compat
syscalls in 64-bit applications to IA32_PAGE_OFFSET (+ randomization).

Signed-off-by: Dmitry Safonov <dsafonov@xxxxxxxxxxxxx>
---
arch/x86/kernel/sys_x86_64.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index a55ed63b9f91..0893725db6e6 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -100,7 +100,7 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
static void find_start_end(unsigned long flags, unsigned long *begin,
unsigned long *end)
{
- if (!test_thread_flag(TIF_ADDR32) && (flags & MAP_32BIT)) {
+ if (!test_thread_flag(TIF_ADDR32)) {
/* This is usually used needed to map code in small
model, so it needs to be in the first 31bit. Limit
it to that. This means we need to move the
@@ -109,14 +109,24 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
malloc knows how to fall back to mmap. Give it 1GB
of playground for now. -AK */
*begin = 0x40000000;
- *end = 0x80000000;
- if (current->flags & PF_RANDOMIZE) {
- *begin = randomize_page(*begin, 0x02000000);
+
+ if (flags & MAP_32BIT) {
+ if (current->flags & PF_RANDOMIZE)
+ *begin = randomize_page(*begin, 0x02000000);
+ *end = 0x80000000;
+ return;
+ }
+ if (current->thread.status & TS_COMPAT) {
+ if (current->flags & PF_RANDOMIZE)
+ *begin = randomize_page(*begin,
+ 1UL << mmap_rnd_compat_bits);
+ *end = IA32_PAGE_OFFSET;
+ return;
}
- } else {
- *begin = current->mm->mmap_legacy_base;
- *end = TASK_SIZE;
}
+
+ *begin = current->mm->mmap_legacy_base;
+ *end = TASK_SIZE;
}

unsigned long
@@ -187,10 +197,21 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

+ if (current->thread.status & TS_COMPAT) {

in_compat_syscall(), please.

Also, we need to verify that, if this is called during execve(), it
does the right thing.

+ if (current->flags & PF_RANDOMIZE) {
+ unsigned long rnd = 1UL << mmap_rnd_compat_bits;
+
+ info.high_limit =
+ randomize_page(IA32_PAGE_OFFSET - rnd, rnd);
+ } else {
+ info.high_limit = IA32_PAGE_OFFSET;
+ }
+ } else {
+ info.high_limit = mm->mmap_base;
+ }

This code was incomprehensible before and it's worse now. Could you
try to clean it up a bit? For example, a patch that simply folds
find_start_end() into its sole caller as the first patch in the series
without changing any semantics would probably help.

So, I debugged it a little more.
Here it is: we really should mmap here not just up to IA32_PAGE_OFFSET,
but also account for the stack of compat applications. We need a gap
at the end of the address space, the same way it's computed in
mmap_base().
And here is another bug (but less painful): for 32-bit applications
that set a 64-bit CS and do native x86_64 syscalls, it's not possible
to map above 4 GB (or thereabouts), because of the 32-bit mmap_base.

I think there are two ways to fix these issues:
1. Introduce mmap_compat_base (and mmap_compat_legacy_base, I guess).
Initialize only one of mmap_{,compat}_base in arch_pick_mmap_layout()
and initialize the second only when a 64-bit app does a 32-bit mmap()
and vice versa. Use these two mmap bases accordingly.
2. Save random_factor in arch_pick_mmap_layout() and do the
mmap_base() calculation during mmap(), using the compat or native
task size according to in_compat_syscall(). The performance overhead
in mmap_base() doesn't look that big (but I may be wrong). We'd also
need to be careful with RLIMIT_STACK and fall back to bottom-up
allocation when it is unlimited.

So, does (1) or (2) make sense, or maybe there is some simpler solution
to these bugs?

--
Dmitry