Commit 133d9608 by Stéphane Graber, committed by GitHub

Merge pull request #3290 from brauner/2020-03-11/fixes

pidfds: switch infrastructure to rely on pidfds whenever possible
parents c6a63531 f3741b92
...@@ -35,17 +35,16 @@ mirror the current mount and umount API of the kernel. ...@@ -35,17 +35,16 @@ mirror the current mount and umount API of the kernel.
## seccomp\_allow\_nesting ## seccomp\_allow\_nesting
This adds support for seccomp filters to be stacked regardless of whether a seccomp profile is This adds support for seccomp filters to be stacked regardless of whether a seccomp profile is already loaded. This allows nested containers to load their own seccomp profile.
already loaded. This allows nested containers to load their own seccomp profile.
## seccomp\_notify ## seccomp\_notify
This adds "notify" as seccomp action that will cause LXC to register a seccomp listener and retrieve This adds "notify" as seccomp action that will cause LXC to register a seccomp listener and retrieve a listener file descriptor from the kernel. When a syscall is made that is registered as "notify" the kernel will generate a poll event and send a message over the file descriptor.
a listener file descriptor from the kernel. When a syscall is made that is registered as "notify"
the kernel will generate a poll event and send a message over the file descriptor.
The caller can read this message, inspect the syscalls including its arguments. Based on this information the caller is expected to send back a message informing the kernel which action to take. Until that message is sent the kernel will block the calling process. The format of the messages to read and sent is documented in seccomp itself. The caller can read this message, inspect the syscalls including its arguments. Based on this information the caller is expected to send back a message informing the kernel which action to take. Until that message is sent the kernel will block the calling process. The format of the messages to read and sent is documented in seccomp itself.
A new API function `seccomp_notify_fd()` has been added which allows callers to retrieve the notifier fd for the container's seccomp filter.
## network\_veth\_routes ## network\_veth\_routes
This introduces the `lxc.net.[i].veth.ipv4.route` and `lxc.net.[i].veth.ipv6.route` properties This introduces the `lxc.net.[i].veth.ipv4.route` and `lxc.net.[i].veth.ipv6.route` properties
...@@ -97,24 +96,25 @@ This is primarily intended for use with layer 3 networking devices, such as IPVL ...@@ -97,24 +96,25 @@ This is primarily intended for use with layer 3 networking devices, such as IPVL
This introduces the ability to specify a custom MTU for `phys` and `macvlan` devices using the This introduces the ability to specify a custom MTU for `phys` and `macvlan` devices using the
`lxc.net.[i].mtu` property. `lxc.net.[i].mtu` property.
# network\_veth\_router ## network\_veth\_router
This introduces the ability to specify a `lxc.net.[i].veth.mode` setting, which takes a value of "bridge" or "router". This defaults to "bridge".
In "router" mode static routes are created on the host for the container's IP addresses pointing to the host side veth interface. In addition to the routes, a static IP neighbour proxy is added to the host side veth interface for the IPv4 and IPv6 gateway IPs.
## cgroup2\_devices
This introduces the ability to specify a `lxc.net.[i].veth.mode` setting, which takes a value of This enables `LXC` to make use of the new devices controller in the unified cgroup hierarchy. `LXC` will now create, load, and attach bpf program to the cgroup of the container when the controller is available.
"bridge" or "router". This defaults to "bridge".
In "router" mode static routes are created on the host for the container's IP addresses pointing to ## cgroup2
the host side veth interface. In addition to the routes, a static IP neighbour proxy is added to
the host side veth interface for the IPv4 and IPv6 gateway IPs.
This enables `LXC` to make complete use of the unified cgroup hierarchy. With this extension it is possible to run `LXC` containers on systems that use a pure unified cgroup layout.
# cgroup2\_devices ## init\_pidfd
This enables `LXC` to make use of the new devices controller in the unified This adds a new API function `init_pidfd()` which allows to retrieve a pidfd for the container's init process allowing process management interactions such as sending signal to be completely reliable and race-free.
cgroup hierarchy. `LXC` will now create, load, and attach bpf program to the
cgroup of the container when the controller is available.
# cgroup2 ## pidfd
This enables `LXC` to make complete use of the unified cgroup hierarchy. With When running on kernels that support pidfds LXC will rely on them for most operations. This makes interacting with containers not just more reliable it also makes it significantly safer and eliminates various races inherent to PID-based kernel APIs. LXC will require that the running kernel at least support `pidfd_send_signal()`, `CLONE_PIDFD`, `P_PIDFD`, and pidfd polling support. Any kernel starting with `Linux 5.4` should have full support for pidfds.
this extension it is possible to run `LXC` containers on systems that use
a pure unified cgroup layout.
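The `pidfd` extension described above comes down to a simple pattern: signal and wait through a pidfd when one is available, and fall back to PID-based calls otherwise. The following is a minimal standalone sketch of that pattern, not code from this merge request. All `sketch_*` names are hypothetical, the raw syscall numbers are a fallback assumption for older headers (they hold on most architectures), and `pidfd_open()` is used here only because the sketch has no `CLONE_PIDFD` clone wrapper.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed fallback syscall numbers for headers that predate pidfds. */
#ifndef __NR_pidfd_send_signal
#define __NR_pidfd_send_signal 424
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif

/* Hypothetical helper: get a pidfd for an existing process; -1 if unsupported. */
static int sketch_pidfd_open(pid_t pid)
{
	return (int)syscall(__NR_pidfd_open, pid, 0U);
}

/* Signal through the pidfd when we have one, otherwise fall back to kill(). */
static int sketch_send_signal(int pidfd, pid_t pid, int sig)
{
	if (pidfd >= 0)
		return (int)syscall(__NR_pidfd_send_signal, pidfd, sig, NULL, 0U);

	return kill(pid, sig);
}

/*
 * A pidfd becomes readable once the process has exited.
 * Returns 1 on exit, 0 on timeout, -1 on error.
 * Note that poll() takes its timeout in milliseconds.
 */
static int sketch_poll_exit(int pidfd, int timeout_ms)
{
	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
	int ret;

	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);
	if (ret < 0)
		return -1;

	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Fork a child that sleeps, terminate it, and reap it. Returns 0 on success. */
static int sketch_demo(void)
{
	int pidfd, ret, status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		for (;;)
			pause();
	}

	pidfd = sketch_pidfd_open(pid);
	if (sketch_send_signal(pidfd, pid, SIGTERM) < 0)
		kill(pid, SIGKILL); /* never leave the child running */

	if (pidfd >= 0) {
		(void)sketch_poll_exit(pidfd, 5000);
		close(pidfd);
	}

	do {
		ret = waitpid(pid, &status, 0);
	} while (ret < 0 && errno == EINTR);

	return (ret == pid && WIFSIGNALED(status)) ? 0 : -1;
}
```

Calling `sketch_demo()` from a test program should return 0 whether or not the running kernel supports pidfds, because the PID-based path is taken when `sketch_pidfd_open()` fails. A caller that holds its timeout in seconds would need to convert it before passing it to `poll()`.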
...@@ -11,6 +11,13 @@ ...@@ -11,6 +11,13 @@
#include "macro.h" #include "macro.h"
#include "state.h" #include "state.h"
/*
* Value command callbacks should return when they want the client fd to be
* cleaned up by the main loop. This is most certainly what you want unless you
* have specific reasons to keep the file descriptor alive.
*/
#define LXC_CMD_REAP_CLIENT_FD 1
typedef enum { typedef enum {
LXC_CMD_CONSOLE, LXC_CMD_CONSOLE,
LXC_CMD_TERMINAL_WINCH, LXC_CMD_TERMINAL_WINCH,
...@@ -30,6 +37,7 @@ typedef enum { ...@@ -30,6 +37,7 @@ typedef enum {
LXC_CMD_FREEZE, LXC_CMD_FREEZE,
LXC_CMD_UNFREEZE, LXC_CMD_UNFREEZE,
LXC_CMD_GET_CGROUP2_FD, LXC_CMD_GET_CGROUP2_FD,
LXC_CMD_GET_INIT_PIDFD,
LXC_CMD_MAX, LXC_CMD_MAX,
} lxc_cmd_t; } lxc_cmd_t;
...@@ -77,6 +85,7 @@ extern char *lxc_cmd_get_config_item(const char *name, const char *item, const c ...@@ -77,6 +85,7 @@ extern char *lxc_cmd_get_config_item(const char *name, const char *item, const c
extern char *lxc_cmd_get_name(const char *hashed_sock); extern char *lxc_cmd_get_name(const char *hashed_sock);
extern char *lxc_cmd_get_lxcpath(const char *hashed_sock); extern char *lxc_cmd_get_lxcpath(const char *hashed_sock);
extern pid_t lxc_cmd_get_init_pid(const char *name, const char *lxcpath); extern pid_t lxc_cmd_get_init_pid(const char *name, const char *lxcpath);
extern int lxc_cmd_get_init_pidfd(const char *name, const char *lxcpath);
extern int lxc_cmd_get_state(const char *name, const char *lxcpath); extern int lxc_cmd_get_state(const char *name, const char *lxcpath);
extern int lxc_cmd_stop(const char *name, const char *lxcpath); extern int lxc_cmd_stop(const char *name, const char *lxcpath);
......
...@@ -185,5 +185,6 @@ int lxc_add_state_client(int state_client_fd, struct lxc_handler *handler, ...@@ -185,5 +185,6 @@ int lxc_add_state_client(int state_client_fd, struct lxc_handler *handler,
move_ptr(newclient); move_ptr(newclient);
move_ptr(tmplist); move_ptr(tmplist);
return log_trace(MAX_STATE, "Added state client %d to state client list", state_client_fd); TRACE("Added state client fd %d to state client list", state_client_fd);
return MAX_STATE;
} }
...@@ -611,6 +611,16 @@ static pid_t do_lxcapi_init_pid(struct lxc_container *c) ...@@ -611,6 +611,16 @@ static pid_t do_lxcapi_init_pid(struct lxc_container *c)
WRAP_API(pid_t, lxcapi_init_pid) WRAP_API(pid_t, lxcapi_init_pid)
static int do_lxcapi_init_pidfd(struct lxc_container *c)
{
if (!c)
return ret_errno(EBADF);
return lxc_cmd_get_init_pidfd(c->name, c->config_path);
}
WRAP_API(int, lxcapi_init_pidfd)
static bool load_config_locked(struct lxc_container *c, const char *fname) static bool load_config_locked(struct lxc_container *c, const char *fname)
{ {
if (!c->lxc_conf) if (!c->lxc_conf)
...@@ -1966,8 +1976,9 @@ static bool lxcapi_create(struct lxc_container *c, const char *t, ...@@ -1966,8 +1976,9 @@ static bool lxcapi_create(struct lxc_container *c, const char *t,
static bool do_lxcapi_reboot(struct lxc_container *c) static bool do_lxcapi_reboot(struct lxc_container *c)
{ {
__do_close_prot_errno int pidfd = -EBADF;
pid_t pid = -1;
int ret; int ret;
pid_t pid;
int rebootsignal = SIGINT; int rebootsignal = SIGINT;
if (!c) if (!c)
...@@ -1976,18 +1987,23 @@ static bool do_lxcapi_reboot(struct lxc_container *c) ...@@ -1976,18 +1987,23 @@ static bool do_lxcapi_reboot(struct lxc_container *c)
if (!do_lxcapi_is_running(c)) if (!do_lxcapi_is_running(c))
return false; return false;
pid = do_lxcapi_init_pid(c); pidfd = do_lxcapi_init_pidfd(c);
if (pid <= 0) if (pidfd < 0) {
return false; pid = do_lxcapi_init_pid(c);
if (pid <= 0)
return false;
}
if (c->lxc_conf && c->lxc_conf->rebootsignal) if (c->lxc_conf && c->lxc_conf->rebootsignal)
rebootsignal = c->lxc_conf->rebootsignal; rebootsignal = c->lxc_conf->rebootsignal;
ret = kill(pid, rebootsignal); if (pidfd >= 0)
if (ret < 0) { ret = lxc_raw_pidfd_send_signal(pidfd, rebootsignal, NULL, 0);
WARN("Failed to send signal %d to pid %d", rebootsignal, pid); else
return false; ret = kill(pid, rebootsignal);
} if (ret < 0)
return log_warn(false, "Failed to send signal %d to pid %d",
rebootsignal, pid);
return true; return true;
} }
...@@ -1996,10 +2012,11 @@ WRAP_API(bool, lxcapi_reboot) ...@@ -1996,10 +2012,11 @@ WRAP_API(bool, lxcapi_reboot)
static bool do_lxcapi_reboot2(struct lxc_container *c, int timeout) static bool do_lxcapi_reboot2(struct lxc_container *c, int timeout)
{ {
int killret, ret; __do_close_prot_errno int pidfd = -EBADF, state_client_fd = -EBADF;
pid_t pid; int rebootsignal = SIGINT;
int rebootsignal = SIGINT, state_client_fd = -1; pid_t pid = -1;
lxc_state_t states[MAX_STATE] = {0}; lxc_state_t states[MAX_STATE] = {0};
int killret, ret;
if (!c) if (!c)
return false; return false;
...@@ -2007,9 +2024,12 @@ static bool do_lxcapi_reboot2(struct lxc_container *c, int timeout) ...@@ -2007,9 +2024,12 @@ static bool do_lxcapi_reboot2(struct lxc_container *c, int timeout)
if (!do_lxcapi_is_running(c)) if (!do_lxcapi_is_running(c))
return true; return true;
pid = do_lxcapi_init_pid(c); pidfd = do_lxcapi_init_pidfd(c);
if (pid <= 0) if (pidfd < 0) {
return true; pid = do_lxcapi_init_pid(c);
if (pid <= 0)
return true;
}
if (c->lxc_conf && c->lxc_conf->rebootsignal) if (c->lxc_conf && c->lxc_conf->rebootsignal)
rebootsignal = c->lxc_conf->rebootsignal; rebootsignal = c->lxc_conf->rebootsignal;
...@@ -2035,21 +2055,18 @@ static bool do_lxcapi_reboot2(struct lxc_container *c, int timeout) ...@@ -2035,21 +2055,18 @@ static bool do_lxcapi_reboot2(struct lxc_container *c, int timeout)
} }
/* Send reboot signal to container. */ /* Send reboot signal to container. */
killret = kill(pid, rebootsignal); if (pidfd >= 0)
if (killret < 0) { killret = lxc_raw_pidfd_send_signal(pidfd, rebootsignal, NULL, 0);
if (state_client_fd >= 0) else
close(state_client_fd); killret = kill(pid, rebootsignal);
if (killret < 0)
WARN("Failed to send signal %d to pid %d", rebootsignal, pid); return log_warn(false, "Failed to send signal %d to pid %d", rebootsignal, pid);
return false;
}
TRACE("Sent signal %d to pid %d", rebootsignal, pid); TRACE("Sent signal %d to pid %d", rebootsignal, pid);
if (timeout == 0) if (timeout == 0)
return true; return true;
ret = lxc_cmd_sock_rcv_state(state_client_fd, timeout); ret = lxc_cmd_sock_rcv_state(state_client_fd, timeout);
close(state_client_fd);
if (ret < 0) if (ret < 0)
return false; return false;
...@@ -2064,10 +2081,11 @@ WRAP_API_1(bool, lxcapi_reboot2, int) ...@@ -2064,10 +2081,11 @@ WRAP_API_1(bool, lxcapi_reboot2, int)
static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout) static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout)
{ {
int killret, ret; __do_close_prot_errno int pidfd = -EBADF, state_client_fd = -EBADF;
pid_t pid; int haltsignal = SIGPWR;
int haltsignal = SIGPWR, state_client_fd = -EBADF; pid_t pid = -1;
lxc_state_t states[MAX_STATE] = {0}; lxc_state_t states[MAX_STATE] = {0};
int killret, ret;
if (!c) if (!c)
return false; return false;
...@@ -2075,6 +2093,7 @@ static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout) ...@@ -2075,6 +2093,7 @@ static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout)
if (!do_lxcapi_is_running(c)) if (!do_lxcapi_is_running(c))
return true; return true;
pidfd = do_lxcapi_init_pidfd(c);
pid = do_lxcapi_init_pid(c); pid = do_lxcapi_init_pid(c);
if (pid <= 0) if (pid <= 0)
return true; return true;
...@@ -2085,8 +2104,10 @@ static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout) ...@@ -2085,8 +2104,10 @@ static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout)
else if (task_blocks_signal(pid, (SIGRTMIN + 3))) else if (task_blocks_signal(pid, (SIGRTMIN + 3)))
haltsignal = (SIGRTMIN + 3); haltsignal = (SIGRTMIN + 3);
/* Add a new state client before sending the shutdown signal so that we
* don't miss a state. /*
* Add a new state client before sending the shutdown signal so
* that we don't miss a state.
*/ */
if (timeout != 0) { if (timeout != 0) {
states[STOPPED] = 1; states[STOPPED] = 1;
...@@ -2103,24 +2124,47 @@ static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout) ...@@ -2103,24 +2124,47 @@ static bool do_lxcapi_shutdown(struct lxc_container *c, int timeout)
if (ret < MAX_STATE) if (ret < MAX_STATE)
return false; return false;
}
/* Send shutdown signal to container. */ if (pidfd >= 0) {
killret = kill(pid, haltsignal); struct pollfd pidfd_poll = {
if (killret < 0) { .events = POLLIN,
if (state_client_fd >= 0) .fd = pidfd,
close(state_client_fd); };
WARN("Failed to send signal %d to pid %d", haltsignal, pid); killret = lxc_raw_pidfd_send_signal(pidfd, haltsignal,
return false; NULL, 0);
if (killret < 0)
return log_warn(false, "Failed to send signal %d to pidfd %d",
haltsignal, pidfd);
TRACE("Sent signal %d to pidfd %d", haltsignal, pidfd);
/*
* No need for going through all of the state server
* complications anymore. We can just poll on pidfds. :)
*/
if (timeout != 0) {
ret = poll(&pidfd_poll, 1, timeout);
if (ret < 0 || !(pidfd_poll.revents & POLLIN))
return false;
TRACE("Pidfd polling detected container exit");
}
} else {
killret = kill(pid, haltsignal);
if (killret < 0)
return log_warn(false, "Failed to send signal %d to pid %d",
haltsignal, pid);
TRACE("Sent signal %d to pid %d", haltsignal, pid);
}
} }
TRACE("Sent signal %d to pid %d", haltsignal, pid);
if (timeout == 0) if (timeout == 0)
return true; return true;
ret = lxc_cmd_sock_rcv_state(state_client_fd, timeout); ret = lxc_cmd_sock_rcv_state(state_client_fd, timeout);
close(state_client_fd);
if (ret < 0) if (ret < 0)
return false; return false;
...@@ -5323,6 +5367,7 @@ struct lxc_container *lxc_container_new(const char *name, const char *configpath ...@@ -5323,6 +5367,7 @@ struct lxc_container *lxc_container_new(const char *name, const char *configpath
c->console = lxcapi_console; c->console = lxcapi_console;
c->console_getfd = lxcapi_console_getfd; c->console_getfd = lxcapi_console_getfd;
c->init_pid = lxcapi_init_pid; c->init_pid = lxcapi_init_pid;
c->init_pidfd = lxcapi_init_pidfd;
c->load_config = lxcapi_load_config; c->load_config = lxcapi_load_config;
c->want_daemonize = lxcapi_want_daemonize; c->want_daemonize = lxcapi_want_daemonize;
c->want_close_all_fds = lxcapi_want_close_all_fds; c->want_close_all_fds = lxcapi_want_close_all_fds;
......
...@@ -848,7 +848,23 @@ struct lxc_container { ...@@ -848,7 +848,23 @@ struct lxc_container {
int (*umount)(struct lxc_container *c, const char *target, int (*umount)(struct lxc_container *c, const char *target,
unsigned long mountflags, struct lxc_mount *mnt); unsigned long mountflags, struct lxc_mount *mnt);
/*!
* \brief Retrieve a file descriptor for the container's seccomp filter.
*
* \param c Container
*
* \return file descriptor for container's seccomp filter
*/
int (*seccomp_notify_fd)(struct lxc_container *c); int (*seccomp_notify_fd)(struct lxc_container *c);
/*!
* \brief Retrieve a pidfd for the container's init process.
*
* \param c Container.
*
* \return pidfd of init process of the container.
*/
int (*init_pidfd)(struct lxc_container *c);
}; };
/*! /*!
......
...@@ -7,6 +7,7 @@ ...@@ -7,6 +7,7 @@
#define _GNU_SOURCE 1 #define _GNU_SOURCE 1
#endif #endif
#include <sched.h> #include <sched.h>
#include <stdbool.h>
#include <stdio.h> #include <stdio.h>
#include <stdlib.h> #include <stdlib.h>
#include <signal.h> #include <signal.h>
...@@ -18,6 +19,11 @@ ...@@ -18,6 +19,11 @@
#define CLONE_PIDFD 0x00001000 #define CLONE_PIDFD 0x00001000
#endif #endif
/* waitid */
#ifndef P_PIDFD
#define P_PIDFD 3
#endif
/* /*
* lxc_raw_clone() - create a new process * lxc_raw_clone() - create a new process
* *
......
...@@ -1659,6 +1659,10 @@ static int lxc_spawn(struct lxc_handler *handler) ...@@ -1659,6 +1659,10 @@ static int lxc_spawn(struct lxc_handler *handler)
} }
TRACE("Cloned child process %d", handler->pid); TRACE("Cloned child process %d", handler->pid);
/* Verify that we can actually make use of pidfds. */
if (!lxc_can_use_pidfd(handler->pidfd))
close_prot_errno_disarm(handler->pidfd);
ret = snprintf(pidstr, 20, "%d", handler->pid); ret = snprintf(pidstr, 20, "%d", handler->pid);
if (ret < 0 || ret >= 20) if (ret < 0 || ret >= 20)
goto out_delete_net; goto out_delete_net;
......
...@@ -12,6 +12,7 @@ ...@@ -12,6 +12,7 @@
#include <inttypes.h> #include <inttypes.h>
#include <libgen.h> #include <libgen.h>
#include <pthread.h> #include <pthread.h>
#include <signal.h>
#include <stddef.h> #include <stddef.h>
#include <stdio.h> #include <stdio.h>
#include <stdlib.h> #include <stdlib.h>
...@@ -291,6 +292,20 @@ again: ...@@ -291,6 +292,20 @@ again:
return 0; return 0;
} }
int wait_for_pidfd(int pidfd)
{
int ret;
siginfo_t info = {
.si_signo = 0,
};
do {
ret = waitid(P_PIDFD, pidfd, &info, __WALL | WEXITED);
} while (ret < 0 && errno == EINTR);
return !ret && WIFEXITED(info.si_status) && WEXITSTATUS(info.si_status) == 0;
}
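For readers unfamiliar with `waitid()` and `P_PIDFD`, the sketch below shows one way to interpret the result. It is an illustration under stated assumptions, not the code from this commit: `waitid()` reports the outcome in `siginfo_t`, where `si_code` says how the child ended and `si_status` carries the exit value or signal number, so this version checks `si_code == CLD_EXITED` instead of applying the `W*` macros meant for `wait()`-style status words. The `sketch_*` names and the fallback syscall number are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Older libc headers may lack these; the values are assumptions for such systems. */
#ifndef P_PIDFD
#define P_PIDFD 3
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif

/*
 * Wait for the process referred to by pidfd.
 * Returns its exit status (0-255), 256 if it did not exit normally,
 * or -1 if waitid() failed (for example on kernels without P_PIDFD).
 */
static int sketch_wait_pidfd_status(int pidfd)
{
	siginfo_t info = { .si_signo = 0 };
	int ret;

	do {
		ret = waitid((idtype_t)P_PIDFD, (id_t)pidfd, &info, WEXITED);
	} while (ret < 0 && errno == EINTR);
	if (ret < 0)
		return -1;

	if (info.si_code == CLD_EXITED)
		return info.si_status;

	return 256;
}

/* Fork a child that exits with 'code' and report the status the parent observes. */
static int sketch_child_exit_status(int code)
{
	int pidfd, ret, status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0)
		_exit(code);

	pidfd = (int)syscall(__NR_pidfd_open, pid, 0U);
	if (pidfd >= 0) {
		ret = sketch_wait_pidfd_status(pidfd);
		close(pidfd);
		if (ret >= 0)
			return ret;
	}

	/* PID-based fallback when pidfds or P_PIDFD are unavailable. */
	do {
		ret = waitpid(pid, &status, 0);
	} while (ret < 0 && errno == EINTR);
	if (ret < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : 256;
}
```

Returning the raw exit status leaves the success convention to the caller, which matters here because `wait_for_pid()` style helpers commonly return 0 on success while a boolean-style helper returns nonzero on success.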
int lxc_wait_for_pid_status(pid_t pid) int lxc_wait_for_pid_status(pid_t pid)
{ {
int status, ret; int status, ret;
...@@ -1846,3 +1861,34 @@ int lxc_setup_keyring(char *keyring_label) ...@@ -1846,3 +1861,34 @@ int lxc_setup_keyring(char *keyring_label)
return ret; return ret;
} }
bool lxc_can_use_pidfd(int pidfd)
{
int ret;
if (pidfd < 0)
return log_error(false, "Kernel does not support pidfds");
/*
* We don't care whether or not children were in a waitable state. We
* just care whether waitid() recognizes P_PIDFD.
*
* Btw, while I have your attention, the above waitid() code is an
* excellent example of how _not_ to do flag-based kernel APIs. So if
* you ever go into kernel development or are already and you add this
* kind of flag potpourri even though you have read this comment shame
* on you. May the gods of operating system development have mercy on
* your soul because I won't.
*/
ret = waitid(P_PIDFD, pidfd, NULL,
/* Type of children to wait for. */
__WALL |
/* How to wait for them. */
WNOHANG | WNOWAIT |
/* What state to wait for. */
WEXITED | WSTOPPED | WCONTINUED);
if (ret < 0)
return log_error_errno(false, errno, "Kernel does not support waiting on processes through pidfds");
return log_trace(true, "Kernel supports pidfds");
}
...@@ -85,6 +85,7 @@ static inline void __auto_lxc_pclose__(struct lxc_popen_FILE **f) ...@@ -85,6 +85,7 @@ static inline void __auto_lxc_pclose__(struct lxc_popen_FILE **f)
*/ */
extern int wait_for_pid(pid_t pid); extern int wait_for_pid(pid_t pid);
extern int lxc_wait_for_pid_status(pid_t pid); extern int lxc_wait_for_pid_status(pid_t pid);
extern int wait_for_pidfd(int pidfd);
#if HAVE_OPENSSL #if HAVE_OPENSSL
extern int sha1sum_file(char *fnam, unsigned char *md_value, unsigned int *md_len); extern int sha1sum_file(char *fnam, unsigned char *md_value, unsigned int *md_len);
...@@ -236,5 +237,6 @@ extern int lxc_set_death_signal(int signal, pid_t parent, int parent_status_fd); ...@@ -236,5 +237,6 @@ extern int lxc_set_death_signal(int signal, pid_t parent, int parent_status_fd);
extern int fd_cloexec(int fd, bool cloexec); extern int fd_cloexec(int fd, bool cloexec);
extern int recursive_destroy(const char *dirname); extern int recursive_destroy(const char *dirname);
extern int lxc_setup_keyring(char *keyring_label); extern int lxc_setup_keyring(char *keyring_label);
extern bool lxc_can_use_pidfd(int pidfd);
#endif /* __LXC_UTILS_H */ #endif /* __LXC_UTILS_H */