runc container breakouts via procfs writes

Full Report

The report discusses three vulnerabilities found in runc, the low-level container runtime used by Docker and Podman. All three allow writes to the /proc file system that lead to a container escape.

The first issue concerns masked paths. runc masks several sensitive files, which in practice means bind-mounting the container's /dev/null over each of them. The creation of that bind mount is racy, though: an attacker who wins the race can swap in a symlink, so that the mount ends up exposing a target of the attacker's choosing. The ol' switcheroo! With read/write access to /proc/sys/kernel/core_pattern obtained through this trick, a container escape is possible via the kernel's privileged coredump upcall: a core_pattern that starts with a pipe character names a program that the kernel runs as root, outside the container, whenever a process dumps core.

There was a second variant of this issue. If /dev/null is deleted inside the container, runc ignores the resulting error and the masking step becomes a no-op. In practice, this means that an attacker can read the /proc files that were supposed to be masked. This variant was found after the first one and was also fixed.

The second full issue is similar to the first: a TOCTOU (time-of-check to time-of-use) race in the /dev/console bind mount. While runc is setting up the bind mount involving /dev/pts/$n, an attacker can replace /dev/pts/$n with a symlink. Naturally, this allows for writing to files on the host machine. The bug occurs after pivot_root, but the core_pattern trick from above can still be used.

The author also found some stress-inducing issues around os.Create(). Although these were not directly exploitable, they were fixed anyway, and additional protections were added against race conditions on /dev/pts/$n writes. A single bug really should trigger a broad set of security improvements while you are in the area.

The final vulnerability is a more sophisticated variant of CVE-2019-16884. Linux Security Modules (LSMs) attach labels or metadata to every process and file on the system. The original vulnerability tricked runc into writing these labels to a dummy tmpfs instead of the correct location, which bypassed the protections the labels were supposed to put in place.
The trick was to have the image's startup instructions mount a tmpfs over /proc. The patch for the original vulnerability made runc check that the target was on a real procfs file system before performing the LSM label write. The new variant gets around that check by redirecting the write to a real procfs file where it is effectively a no-op. For instance, a symlink can force the write to land in /proc/self/sched: runc thinks it is writing to /proc/self/attr/exec, but the data goes to another file, so the label is never applied. An attacker could also redirect the write to a malicious target that affects the host system. With such a file write, a container escape is likely possible.

The development team was concerned that other write operations might be redirected in the same way, so they conducted further analysis to determine whether this was possible. They hope to write custom linters in the future to help prevent this class of bug.

youki, LXC, and crun were found to have very similar flaws, which required coordinating patches across all of the projects. Interestingly enough, LXC doesn't consider these attacks part of its threat model, because it treats non-user-namespaced containers as fundamentally insecure. All of these attacks must be carried out at container startup; they cannot be launched from inside an already-running container. Overall, a great set of bugs!
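
To make the redirected-write idea concrete, here is a minimal Python sketch of checking where a file descriptor actually landed before writing through it. It is not runc's patch (runc is written in Go and its real fix is more involved). The helper name open_checked is hypothetical, and the sketch assumes a Linux host whose own /proc is trustworthy, which is exactly what a container runtime cannot take for granted.

```python
import os


def open_checked(path: str, expected: str, flags: int = os.O_RDONLY) -> int:
    """Open `path`, then confirm the descriptor refers to `expected`.

    Hypothetical helper for illustration only. It relies on the
    Linux-specific /proc/self/fd/<n> magic links to learn which file
    the kernel actually opened after resolving any symlinks.
    """
    fd = os.open(path, flags | os.O_CLOEXEC)
    try:
        actual = os.readlink(f"/proc/self/fd/{fd}")
    except OSError:
        os.close(fd)
        raise
    if actual != expected:
        # The path was redirected (for example by a symlink), so refuse
        # to hand back a descriptor that points somewhere else.
        os.close(fd)
        raise PermissionError(
            f"{path!r} resolved to {actual!r}, expected {expected!r}"
        )
    return fd
```

The point of the design is that the check happens on the already-opened descriptor rather than on the path, so there is no window between checking and using in which the path can be swapped again.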

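The masked-path issues (a symlink swapped in for /dev/null, or /dev/null missing altogether) can be illustrated with a small fail-closed check. This is a sketch under assumptions, not the fix that shipped in runc: open_verified_null is a hypothetical helper, and the device numbers 1 and 3 are the Linux values for the null device.

```python
import os
import stat

# Linux major/minor device numbers of the null character device.
NULL_MAJOR, NULL_MINOR = 1, 3


def open_verified_null(path: str = "/dev/null") -> int:
    """Open `path` and confirm it really is the null device.

    Hypothetical helper for illustration only. O_NOFOLLOW makes the open
    fail if the final path component is a symlink, and the fstat() check
    runs on the opened descriptor, so a later swap of the path cannot
    change the answer. A missing file raises FileNotFoundError instead of
    being silently ignored.
    """
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW | os.O_CLOEXEC)
    st = os.fstat(fd)
    is_null = (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == NULL_MAJOR
        and os.minor(st.st_rdev) == NULL_MINOR
    )
    if not is_null:
        os.close(fd)
        raise OSError(f"{path!r} is not the null device")
    return fd
```

Note that O_NOFOLLOW only protects the final path component; symlinks in parent directories are still followed, so a real runtime needs stricter path resolution than this sketch provides.
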
Analysis Summary