Full Report
io_uring is a relatively new subsystem in the Linux kernel built for fast IO. Normally, a program has to make a privilege transition via a syscall for every IO operation, often many times over. With io_uring, a series of IO operations can instead be submitted together and performed in parallel. Rapid development means more bugs, though, and complex code with a ton of asynchronous operations tends to have security bugs as well. Many bugs within io_uring have been used to break out of the Container-Optimized OS run by kCTF, making this a good attack surface.
When the function io_req_init_async is called, it assigns the identity of the calling task to the IO request, assuming the caller is the submitter. However, if two threads submit IO requests to the same io_uring at the same time, the requests are attached to the same work queue even though they belong to different identities. If one of the threads exits, all of the IO events are reaped. In this process, the exiting thread's identity gets assigned to a request instead of the identity of the request's submitter. The fact that one identity ends up being used for requests from two different submitters is what causes the very subtle security issue.
Why does this matter? One part of the code treats the identity as a heap object, while another treats it as a pointer into the middle of a structure. In other words, we have a type confusion that creates an invalid free.
How exploitable is this? Because of CONFIG_HARDENED_USERCOPY (which is enabled on the Container-Optimized OS), the function used to copy data from userland (copy_from_user) cannot be used across slot boundaries. So, the typical method of placing a msg_msg object and corrupting it will not work. It's possible to spray this area with objects we don't own, but it's not trivial.
What's the strategy then? Allocate the victim object in an invalid slot (straddling two real slots), then use the overlapping parts of the upper and lower slots to corrupt it. The object timerfd_ctx lives in the kmalloc-256 cache and has plenty of pointers, making it a prime target for exploitation within our fake slot.
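
The slot-boundary restriction and the fake-slot idea can be illustrated with a small Python model. This is only a sketch under stated assumptions: the slab base address, the misalignment of the fake slot, and the helper names `usercopy_allowed` and `fake_slot_split` are hypothetical, and the real hardened usercopy check in the kernel is more involved than this single rule.

```python
from typing import Tuple

SLOT_SIZE = 256  # kmalloc-256, the cache named in the write-up


def usercopy_allowed(addr: int, length: int, slab_base: int) -> bool:
    """Simplified rule: a user copy must stay inside one real slot."""
    if length <= 0:
        return False
    offset_in_slot = (addr - slab_base) % SLOT_SIZE
    return offset_in_slot + length <= SLOT_SIZE


def fake_slot_split(fake_addr: int, slab_base: int) -> Tuple[int, int]:
    """Bytes of a misaligned slot that fall in the lower and upper real slots."""
    offset_in_slot = (fake_addr - slab_base) % SLOT_SIZE
    lower_bytes = SLOT_SIZE - offset_in_slot
    upper_bytes = SLOT_SIZE - lower_bytes
    return lower_bytes, upper_bytes


# Hypothetical addresses, chosen only for illustration.
SLAB_BASE = 0x10000
MISALIGNMENT = 0x40
fake_slot = SLAB_BASE + 2 * SLOT_SIZE + MISALIGNMENT

lower_bytes, upper_bytes = fake_slot_split(fake_slot, SLAB_BASE)

# One copy covering the whole fake slot crosses a real slot boundary.
whole_copy_ok = usercopy_allowed(fake_slot, SLOT_SIZE, SLAB_BASE)
# Two copies, each confined to one real neighbour, together cover it.
lower_copy_ok = usercopy_allowed(fake_slot, lower_bytes, SLAB_BASE)
upper_copy_ok = usercopy_allowed(fake_slot + lower_bytes, upper_bytes, SLAB_BASE)

print(lower_bytes, upper_bytes, whole_copy_ok, lower_copy_ok, upper_copy_ok)
```

In this model a single write over the fake slot is rejected, while writes into the two real neighbouring slots are each allowed and together reach every byte of the victim object, which is the intuition behind corrupting it from the upper and lower slots.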
For the fake slot, the author decided to fill the upper and lower slots with msg_msgseg objects, which contain mostly user-controlled data. Once the heap feng shui is done, we can get an information leak from the object. First, the linked list within timerfd_ctx points back to itself (heap), leading to a nice heap leak through the msg_msgseg object. For breaking KASLR, arming the timer sets a function pointer that points into the .text section.
Hijacking code execution is easy via the function pointer within the timer, but this leads to a ton of issues. So, they decided to free the timer and attack the allocator's freelist instead. The CONFIG_SLAB_FREELIST_HARDENED flag is turned on, which is a type of pointer encoding that requires the attacker to know the storage address of the pointer, a random value and the new pointer itself. By filling up the entire slab, we can force the pointer to be NULL, leak its encoded form and calculate the random value, which lets us write the pointer ourselves. By hijacking the freelist, we now have a completely functional arbitrary write primitive.
Since they wanted a container escape (and more money), they targeted the way Linux loads executables via binfmt. The structures used for loading executables are writable! Using the primitive from above, the load_binary callback function can be abused to get PC control and ROP. Game over, right? This worked on the author's machine but not on the kCTF machine. The only writable part of the system was tmpfs, which was not compatible with the exploit, since the exploit needed the O_DIRECT file flag. Only a few files in the container could be opened with this flag and they were all very small, making the exploit unreliable. After playing with the heap feng shui and the freelist, they decided to go with a different strategy: use timerfd_ctx to ROP instead. Using this, the same controlled binfmt overwrite could be used to get code execution.
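
The freelist pointer encoding described above can be sketched in Python. The write-up only names the three ingredients (storage address, random value, pointer); the exact formula used here, an XOR of the pointer with the random value and the byte-swapped storage address, is an assumption about how the hardening combines them, and all addresses and the secret are hypothetical values.

```python
MASK64 = (1 << 64) - 1


def swab64(value: int) -> int:
    """Reverse the byte order of a 64-bit value."""
    return int.from_bytes((value & MASK64).to_bytes(8, "little"), "big")


def encode_freelist_ptr(ptr: int, ptr_addr: int, random_value: int) -> int:
    """Assumed encoding: ptr XOR random XOR byte-swapped storage address."""
    return (ptr ^ random_value ^ swab64(ptr_addr)) & MASK64


def decode_freelist_ptr(stored: int, ptr_addr: int, random_value: int) -> int:
    """XOR is its own inverse, so decoding reuses the same operation."""
    return encode_freelist_ptr(stored, ptr_addr, random_value)


def recover_random(leaked_null_encoding: int, ptr_addr: int) -> int:
    """With ptr == NULL the stored value is random XOR swab64(ptr_addr)."""
    return leaked_null_encoding ^ swab64(ptr_addr)


# Hypothetical values for illustration only.
SECRET_RANDOM = 0x1234567890ABCDEF   # per-cache secret, unknown to the attacker
FREE_PTR_ADDR = 0xFFFF888012345680   # storage address, known from the heap leak
TARGET = 0xFFFFFFFF82A00000          # address the attacker wants allocated

# Full slab: the next-free pointer is NULL, and its encoding is leaked.
leaked = encode_freelist_ptr(0, FREE_PTR_ADDR, SECRET_RANDOM)
recovered = recover_random(leaked, FREE_PTR_ADDR)

# Forge an encoded pointer that the allocator will decode to TARGET.
forged = encode_freelist_ptr(TARGET, FREE_PTR_ADDR, recovered)
decoded = decode_freelist_ptr(forged, FREE_PTR_ADDR, SECRET_RANDOM)
print(hex(recovered), hex(decoded))
```

The point of the sketch is that a leaked encoding of a NULL pointer, combined with a known storage address, leaves the random value as the only unknown, so it falls out of a single XOR.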
Another novel technique was calling msleep to gracefully end the ROP chain in interrupt context, which keeps things from crashing. Amazing article! Great background, nice references and I love the ups & downs included in the article. The thought process behind every decision is very clear, regardless of whether the thing worked or not. Great exploit and definitely worth the 90K from Google.
Analysis Summary
# Vulnerability: Linux Kernel io_uring Identity Confusion Leading to Type Confusion and Arbitrary Write
## CVE Details
- CVE ID: CVE-2022-1786
- CVSS Score: Information not provided in the text (Severity is implied to be High due to container escape success)
- CWE: CWE-843 (Access of Resource Using Incompatible Type, 'Type Confusion'), resulting in CWE-763 (Release of Invalid Pointer or Reference)
## Affected Systems
- Products: Linux Kernel
- Versions: Specifically mentioned vulnerable in kernel v5.10.
- Configurations: Relevant when a ring is created with `IORING_SETUP_IOPOLL`. The exploitation focused on hardened systems such as Container-Optimized OS (COS), with features like `CONFIG_HARDENED_USERCOPY` and `CONFIG_SLAB_FREELIST_HARDENED` enabled.
## Vulnerability Description
The vulnerability stems from an incorrect assumption in the `io_req_init_async` function within the io_uring subsystem. When two threads concurrently submit IO requests to the same io_uring, they can be attached to the same work queue but associated with different thread identities. If one of these threads exits while `IORING_SETUP_IOPOLL` is active, the thread reaping mechanism reuses the exiting thread's identity for the remaining request. This identity reuse causes a confusion where one part of the kernel code interprets this identity as a heap object, while another part treats it as a pointer to the middle of a structure, leading to a **Type Confusion** bug and ultimately an **Invalid Free**.
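
To make the type confusion concrete, here is a toy Python model of what happens when a pointer into the middle of a structure is handed to code that frees it as a standalone heap object. Everything in it is a hypothetical illustration: the `ToyCache` class, the slot size, and the offset of the embedded identity are assumptions, and the real allocator is far more complex.

```python
SLOT_SIZE = 256          # hypothetical slot size of the cache
IDENTITY_OFFSET = 0x40   # hypothetical offset of the embedded identity


class ToyCache:
    """A toy slab cache that trusts whatever address it is asked to free."""

    def __init__(self, base: int) -> None:
        self.base = base
        self.free_list = []

    def kfree(self, addr: int) -> None:
        # No validation: the freed address goes straight onto the freelist.
        self.free_list.append(addr)

    def kmalloc(self) -> int:
        # Hand out the most recently freed address.
        return self.free_list.pop()

    def is_slot_start(self, addr: int) -> bool:
        return (addr - self.base) % SLOT_SIZE == 0


cache = ToyCache(base=0x20000)

# A properly aligned container object with an identity embedded inside it.
container = cache.base + 1 * SLOT_SIZE
identity_ptr = container + IDENTITY_OFFSET

# Code path A: treats the identity as a member and recovers the container.
recovered_container = identity_ptr - IDENTITY_OFFSET

# Code path B: treats the same pointer as its own heap object and frees it.
cache.kfree(identity_ptr)

# The next allocation now starts in the middle of a real slot.
next_alloc = cache.kmalloc()
print(hex(next_alloc), cache.is_slot_start(next_alloc))
```

The misaligned address returned by the final allocation is the kind of invalid slot that the exploitation steps build on.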
## Exploitation
- Status: Successfully **Exploited** (in a controlled environment/CTF setting, leading to container escape). PoC details are documented in the source article.
- Complexity: **High**. Although the initial primitive (type confusion/invalid free) was found quickly, bypassing modern hardening features (`CONFIG_HARDENED_USERCOPY`, `CONFIG_SLAB_FREELIST_HARDENED`) required complex **heap feng shui** techniques, including allocating objects (`timerfd_ctx`, `msg_msgseg`) into invalid slots to corrupt them, eventually leading to an arbitrary write primitive.
- Attack Vector: **Local** (Container Escape). The exploit targeted structures related to executable loading (`binfmt`) within the compromised container context.
## Impact
- Confidentiality: **High** (Achieved through information leak used to break KASLR).
- Integrity: **High** (Achieved an arbitrary write primitive and arbitrary code execution).
- Availability: **High** (Denial of Service through kernel crash/memory corruption, though the goal was escalation).
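
The KASLR break mentioned under confidentiality comes down to simple arithmetic once a pointer into the kernel's `.text` section has been leaked. The sketch below assumes hypothetical values throughout: the leaked pointer, the symbol offsets, and the 2 MiB base alignment used as a sanity check are illustrative assumptions, not figures from the source article.

```python
ASSUMED_BASE_ALIGNMENT = 0x200000  # assumption: kernel base aligned to 2 MiB


def kernel_base_from_leak(leaked_fn_ptr: int, fn_offset: int) -> int:
    """Subtract the function's known offset to get the randomized base."""
    return leaked_fn_ptr - fn_offset


def rebase(kernel_base: int, symbol_offset: int) -> int:
    """Compute the runtime address of another symbol from its offset."""
    return kernel_base + symbol_offset


# Hypothetical values for illustration only.
TIMER_FN_OFFSET = 0x123450      # offset of the leaked timer callback
OTHER_SYMBOL_OFFSET = 0x1A2B3C0  # offset of some other symbol of interest
leaked_fn_ptr = 0xFFFFFFFF9A323450

kernel_base = kernel_base_from_leak(leaked_fn_ptr, TIMER_FN_OFFSET)
looks_aligned = kernel_base % ASSUMED_BASE_ALIGNMENT == 0
other_symbol = rebase(kernel_base, OTHER_SYMBOL_OFFSET)

print(hex(kernel_base), looks_aligned, hex(other_symbol))
```

A single leaked code pointer is enough here because the offsets between kernel symbols stay fixed for a given kernel build; only the base moves.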
## Remediation
### Patches
- The vulnerability was reported and subsequently **fixed** by the Linux kernel security team. Users must update to a patched kernel version where the faulty identity assignment logic in `io_req_init_async` is corrected (or where the conditions leading to the state are mitigated).
### Workarounds
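- `IORING_SETUP_IOPOLL` was key to triggering the thread exit/reaping condition behind the bug, but it is a flag chosen by the caller of `io_uring_setup`, so it cannot simply be switched off for untrusted code. Where patching is not immediately possible, restrict untrusted access to io_uring instead (for example, with a seccomp filter that blocks `io_uring_setup`).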
- Ensure the kernel is built with the latest security mitigations enabled (although the exploit successfully bypassed some of them).
## Detection
- The scenario involves concurrent IO submission, thread exit under I/O polling context, and subsequent re-use of identity structures, followed by heap manipulation targeting `timerfd_ctx` and `msg_msgseg`.
- Detection methods would involve monitoring for unusual heap operations, unexpected memory accesses related to io_uring work queues, or kernel memory corruption tracing tools.
## References:
- Vendor Advisory: [CVE-2022-1786] (Implied from context)
- Relevant Links:
  - hXXps://blog.kylebot.net/