1. 31 Aug, 2017 6 commits
    • network: rework network creation · 74c6e2b0
      Christian Brauner authored
      - On unprivileged veth network creation have lxc-user-nic send the names of the
        veth devices and their respective ifindices. The advantage of retrieving this
        information from lxc-user-nic is that we spare ourselves sending around more
        stuff via the netpipe in start.c. Also, lxc-user-nic operates in both namespaces
        (the container's namespace and the host's namespace) via setns and so is
        guaranteed to retrieve the correct ifindex via if_nametoindex() which is a
        network namespace aware ioctl() call. While I'm pretty sure the ifindices for
        veth devices are identical across network namespaces I'm wary of relying on
        this. We need the ifindices to guarantee safe deletion of unprivileged
        network devices via lxc-user-nic later on since we use them to identify the
        network devices in their corresponding network namespaces.
      - Move the network device logging from the child to the parent. The child does
        not have all of the information about the network devices available, only the
        few bits it actually needs to know. The monitor process is the only process
        that needs all this information.
      - The network creation code for privileged and unprivileged networks was
        previously mangled into one single function but at the same time some of the
        privileged code had additional functions that were called in other places in
        start.c. Let's divide and conquer and split out the privileged and
        unprivileged network creation into completely separate functions. This makes
        what's happening way more clear. This will also have no performance impact
        since either you are privileged and only execute the privileged network
        creation functions or you are unprivileged and only execute the unprivileged
        network creation functions.
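The first item above relies on the ifindex to identify a veth device before acting on it. A minimal sketch of such a check follows, assuming the caller has already entered the relevant network namespace (for example via setns()). The helper name ifindex_matches() is hypothetical and is not taken from the LXC sources.

```c
#include <net/if.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not from LXC): check that the interface called
 * `ifname` in the caller's current network namespace still has the
 * ifindex that was recorded earlier. if_nametoindex() returns 0 when no
 * interface with that name exists in this namespace.
 */
static bool ifindex_matches(const char *ifname, unsigned int expected)
{
    unsigned int current = if_nametoindex(ifname);

    /* A name that does not resolve never matches, whatever was recorded. */
    return current != 0 && current == expected;
}
```

A caller would refuse to delete a device when this returns false, since the name may by then refer to a different device.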
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • network: document all fields in struct lxc_netdev · 085bb443
      Christian Brauner authored
      This is menial work but I'll thank myself later... a lot.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • network: add ifindex field for host veth device · 4239e9c3
      Christian Brauner authored
      We should not just record the ifindex for the container's veth device but also
      for the host's veth device. This is useful when {configuring,deconfiguring}
      veth devices and becomes crucial when calling our lxc-user-nic setuid helper
      where we rely on the ifindex to make decisions about whether we are licensed to
      perform certain operations on the veth device in question.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • network: log veth_attr.pair and veth_attr.veth1 · 8ce727fc
      Christian Brauner authored
      If the user specified the lxc.net.[i].veth.pair attribute to request that the
      host side of a veth pair be given a specific name, let's log it at the trace
      level. Otherwise, if the user didn't specify lxc.net.[i].veth.pair,
      veth_attr.veth1 will contain the name of the host side veth device.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • lxc-user-nic: test privilege over netns on delete · 1bd8d726
      Christian Brauner authored
      When lxc-user-nic is called with the "delete" subcommand we need to make sure
      that we are actually privileged over the network namespace for which we are
      supposed to delete devices on the host. To this end we require that the path to
      the affected network namespace is passed. We then setns() to the network
      namespace and drop privilege to the caller's real user id. Then we try to delete
      the loopback interface, which is never possible. If we are privileged over the
      network namespace this operation will fail with ENOTSUP. If we are not
      privileged over the network namespace we will get EPERM.
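The probe described above boils down to reading the errno left behind by the failed loopback deletion. A minimal sketch of that interpretation step, assuming the deletion attempt itself (a netlink request) has already been made and its errno captured. The function name and the three-way return convention are assumptions for illustration, not LXC's API.

```c
#include <errno.h>

/*
 * Hypothetical helper (not from LXC): interpret the errno produced by a
 * failed attempt to delete the loopback device inside a network namespace.
 *
 * Returns  1 if the caller is privileged over the namespace (ENOTSUP),
 *          0 if the caller is not privileged (EPERM),
 *         -1 if the error does not allow a conclusion either way.
 */
static int netns_privilege_from_errno(int err)
{
    switch (err) {
    case ENOTSUP:
        /* The kernel got as far as refusing the operation itself. */
        return 1;
    case EPERM:
        /* The request was rejected by the permission check. */
        return 0;
    default:
        return -1;
    }
}
```

Treating any other error as inconclusive lets the caller fail closed rather than assume privilege.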
      
      This is the first part of the commit. As of now nothing guarantees that the
      caller does not just give us a random path to a network namespace it is
      privileged over.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  2. 30 Aug, 2017 3 commits
    • Merge pull request #1769 from brauner/2017-08-30/improve_empty_cgroup_deletion · 70a49815
      Stéphane Graber authored
      Revert "cgfsng: try to delete parent cgroups"
    • confile: remove unnecessary cleanup code · cf7faeb3
      Christian Brauner authored
      set_config_string_item() already free()s before setting the new value.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • Revert "cgfsng: try to delete parent cgroups" · 308a6c94
      Christian Brauner authored
      This reverts commit 92c590ae.
      
      Problem:
      
          Commit 92c590ae introduced the following
          behavior:
      
          > cgfsng: try to delete parent cgroups
          >
          > Say we have
          >
          >     lxc.uts.name = c1
          >     lxc.cgroup.dir = lxd/a/b/c
          >
          > the path for the container's cgroup would be
          >
          >     lxd/a/b/c/c1
          >
          > When the container is shutdown we should not just try to delete "c1" we
          > should also try to delete "c", "b", "a", and "lxd". This is to ensure
          > that we don't leave empty cgroups around thereby increasing the chance
          > that we run into trouble with cgroup limits. The algorithm for this isn't
          > too costly since we can simply stop walking upwards at the first rmdir()
          > failure.
      
          The algorithm employs recursive_destroy() which opens each directory
          specified in lxc.cgroup.dir and tries to delete each directory within that
          directory. For example, assume "/sys/fs/cgroup/memory/lxd/a/b/c" only
          contains the cgroup "c1" for container "c1". Assume that "c1" calls
          recursive_destroy() to clean up its cgroups. It will first delete "c1" and
          anything underneath it. This is perfectly fine since anything underneath
          that cgroup is under its control. The new algorithm will then tell it to
          "recurse upwards". So recursive_destroy() will try to delete
          "/sys/fs/cgroup/memory/lxd/a/b/c" next. Now assume that a second container "c2"
          has "lxc.cgroup.dir = lxd/a/b/c" set in its config file and calls
          cgroup_create(). This will create the *empty* cgroup
          "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Now assume that after having created
          "c2" container "c1"'s call to recursive_destroy() reaches
          "/sys/fs/cgroup/memory/lxd/a/b/c/c2" before it is populated. Then the
          cgroup "c2" will be removed. Now "c2" calls cgroup_enter() to enter its
          created cgroup. This will fail since c1 deleted the cgroup "c2". (As a
          sidenote: This is in the set of the few race conditions that are actually
          easy to describe.)
      
      Possible Solution:
      
          Instead of calling recursive_destroy() on all cgroups specified in
          lxc.cgroup.dir we only call recursive_destroy() on the container's own
          cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1". When we start to recurse
          upwards we only call unlinkat(AT_FDCWD, path, AT_REMOVEDIR). This should
          avoid the race described above. My argument is as follows. Assume that the
          container c1 has created the cgroup "/sys/fs/cgroup/memory/lxd/a/b/c/c1" for
          itself. Now c1 calls cgroup_destroy(). First, recursive_destroy() will be
          called on the cgroup "c1" which will delete any empty cgroup directories
          underneath "c1" and finally "c1" itself. This is fine since everything
          under "c1" is container c1's sole property. Now container c1 will call
          unlinkat() on "/sys/fs/cgroup/memory/lxd/a/b/c":
          - Assume that in the meantime container c2 has created the cgroup
            "/sys/fs/cgroup/memory/lxd/a/b/c/c2". Then c1's unlinkat() will fail.
            This will stop c1 from recursing upwards. So c2's cgroup_enter() call
            will find all its cgroups intact and well. unlinkat() will come with the
            appropriate in-kernel locking which will stop it from racing with
            mkdir().
          - There's still a subtle race left. c2 might be calling an implementation
            of mkdir -p to try and create e.g. the cgroup
            "/sys/fs/cgroup/memory/lxd/a/b/c". Let's assume "b" exists. Then c2 will
            receive EEXIST on "b" and move on to create "c". Let's further assume c1
            has already deleted "c". c1 will now be able to delete
            "/sys/fs/cgroup/memory/lxd/a/b/" and c2's call to create "c" will fail.
      
      The latter subtle race makes me rethink this approach. For now we'll just leave
      empty cgroups behind since I don't want to start locking stuff.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  3. 29 Aug, 2017 3 commits
  4. 28 Aug, 2017 7 commits
  5. 27 Aug, 2017 13 commits
  6. 25 Aug, 2017 8 commits