1. 13 Jun, 2019 3 commits
    • lxc_clone: get rid of some indirection · 8a545679
      Tycho Andersen authored
      We have a do_clone(), which just calls a void f(void *) that it gets
      passed. We build up a struct consisting of two args that are just the
      actual arg and actual function. Let's just have the syscall do this for us.
      Signed-off-by: 's avatarTycho Andersen <tycho@tycho.ws>
    • doc: add a little note about shared ns + LSMs · 4f7e281f
      Tycho Andersen authored
      We should add a little not about the race in the previous patch.
      Signed-off-by: 's avatarTycho Andersen <tycho@tycho.ws>
    • lxc_clone: pass non-stack allocated stack to clone · 68a1966d
      Tycho Andersen authored
      There are two problems with this code:
      
      1. The math is wrong. We allocate a char *foo[__LXC_STACK_SIZE]; which
         means it's really sizeof(char *) * __LXC_STACK_SIZE, instead of just
         __LXC_STACK SIZE.
      
      2. We can't actually allocate it on our stack. When we use CLONE_VM (which
         we do in the shared ns case) that means that the new thread is just
         running one page lower on the stack, but anything that allocates a page
         on the stack may clobber data. This is a pretty short race window since
         we just do the shared ns stuff and then do a clone without CLONE_VM.
      
      However, it does point out an interesting possible privilege escalation if
      things aren't configured correctly: do_share_ns() sets up namespaces while
      it shares the address space of the task that spawned it; once it enters the
      pid ns of the thing it's sharing with, the thing it's sharing with can
      ptrace it and write stuff into the host's address space. Since the function
      that does the clone() is lxc_spawn(), it has a struct cgroup_ops* on the
      stack, which itself has function pointers called later in the function, so
      it's possible to allocate shellcode in the address space of the host and
      run it fairly easily.
      
      ASLR doesn't mitigate this since we know exactly the stack offsets; however
      this patch has the kernel allocate a new stack, which will help. Of course,
      the attacker could just check /proc/pid/maps to find the location of the
      stack, but they'd still have to guess where to write stuff in.
      
      The thing that does prevent this is the default configuration of apparmor.
      Since the apparmor profile is set in the second clone, and apparmor
      prevents ptracing things under a different profile, attackers confined by
      apparmor can't do this. However, if users are using a custom configuration
      with shared namespaces, care must be taken to avoid this race.
      
      Shared namespaces aren't widely used now, so perhaps this isn't a problem,
      but with the advent of crio-lxc for k8s, this functionality will be used
      more.
      Signed-off-by: 's avatarTycho Andersen <tycho@tycho.ws>
  2. 21 May, 2019 1 commit
  3. 18 May, 2019 36 commits