blob: b0002a964653e88ec854c3aefec3f66e57fc2931 [file] [log] [blame]
{"name":"mTCP","tagline":"mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems","body":"***\r\n\r\n### What is mTCP?\r\n\r\nmTCP is a high-performance user-level TCP stack for multicore systems. Scaling the performance of short TCP connections is fundamentally challenging due to inefficiencies in the kernel. mTCP addresses these inefficiencies from the ground up - from packet I/O and TCP connection management all the way to the application interface.\r\n\r\n<div align=\"center\">\r\n <img src=\"\" height=\"350\">\r\n <div><b> Figure 1. mTCP overview </b></div>\r\n</div>\r\n\r\nBesides adopting well-known techniques, our mTCP stack (1) translates expensive system calls to shared memory access between two threads within the same CPU core, (2) allows efficient flow-level event aggregation, and (3) performs batch processing of RX/TX packets for high I/O efficiency. mTCP on an 8-core machine improves the performance of small message transactions by a factor 25 (compared with the latest Linux TCP stack (kernel version 3.10.12)) and 3 (compared with with the best-performing research system). It also improves the performance of various popular applications by 33% (SSLShader) to 320% (lighttpd) compared with those on the Linux stack.\r\n\r\n### Why User-level TCP?\r\n\r\nMany high-performance network applications spend a significant portion of CPU cycles for TCP processing in the kernel. (e.g., ~80% inside kernel for lighttpd) Even worse, these CPU cycles are not utilized effectively; according to our measurements, Linux spends more than 4x the cycles than mTCP in handling the same number of TCP transactions.\r\n\r\nThen, can we design a user-level TCP stack that incorporates all existing optimizations into a single system? Can we bring the performance of existing packet I/O libraries to the TCP stack? To answer these questions, we build a TCP stack in the user level. User-level TCP is attractive for many reasons.\r\n\r\n* Easily depart from the kernel's complexity\r\n* Directly benefit from the optimizations in the high performance packet I/O libraries\r\n* Naturally aggregate flow-level events by packet-level I/O batching\r\n* Easily preserve the existing application programming interface\r\n\r\n### Event-driven Packet I/O Library\r\n\r\nSeveral packet I/O systems allow high-speed packet I/O (~100M packets/s) from a user-level application. However, they are not suitable for implementing a transport layer because (i) they waste CPU cycles by polling NICs and (ii) they do not allow multiplexing between RX and TX. To address these challenges, we extend <a href=\"\">PacketShader I/O engine (PSIO)</a> for efficient event-driven packet I/O. The new event-driven interface, `ps_select()`, works similarly to select() except that it operates on TX/RX queues of interested NIC ports. For example, mTCP specifies interested NIC interfaces for RX and/or TX events with a timeout in microseconds, and `ps_select()` returns immediately if any events of interests are available.\r\n\r\nThe use of PSIO brings the opportunity to amortize the overhead of various system calls and context switches throughout the system, in addition to eliminating the per-packet memory allocation and DMA overhead. For more detail about the PSIO, please refer to the <a href=\"\">PacketShader project page</a>.\r\n\r\n### User-level TCP Stack\r\n\r\nmTCP is implemented as a separate-TCP-thread-per-application-thread model.Since coupling TCP jobs with the application thread could break time-based operations such as handling TCP retransmission timeouts, we choose to create a separate TCP thread for each application thread affinitized to the same CPU core. Figure 2 shows how mTCP interacts with the application thread. Applications can communicate with the mTCP threads via library functions that grant safe sharing of the internal TCP data.\r\n\r\n<div align=\"center\">\r\n <img src=\"\" height=\"180\">\r\n <div><b> Figure 2. Thread model of mTCP </b></div>\r\n</div>\r\n\r\nWhile designing the TCP stack, we consider following primitives for performance scalability and efficient event delivery.\r\n\r\n* Thread mapping and flow-level core affinity\r\n* Multicore and cache-friendly data structures\r\n* Batched event handling\r\n* Optimizations for short-lived connections\r\n\r\nOur TCP implementation follows the original TCP specification, RFC793. It supports basic TCP features such as connection management, reliable data transfer, flow control, and congestion control. mTCP also implements popular options such as timestamp, MSS, and window scaling. For congestion control, mTCP implements NewReno.\r\n\r\n### Application Interface\r\n\r\nOur programming interface preserves as much as possible the most commonly-used semantics for easy migration of applications. We introduce our user-level socket API and an event system as below.\r\n\r\n**User-level socket API**\r\n\r\nmTCP provides a BSD-like socket interface; for each BSD socket function, we have a corresponding function call (e.g., `accept()` -> `mtcp_accept()`). In addition, we provide some of the `fcntl()` or `ioctl()` functionalities that are frequently used with sockets (e.g., setting socket as nonblocking, getting/setting the socket buffer size) and event systems as below.\r\n\r\n**User-level event system**\r\n\r\nAs shown in Figure 3, we provide an epoll-like event system. Applications can fetch the events through `mtcp_epoll_wait()` and register events through `mtcp_epoll_ctl()`, which correspond to `epoll_wait()` and `epoll_ctl()` in Linux.\r\n\r\n```\r\nmctx_t mctx = mtcp_create_context();\r\nint ep_id = mtcp_epoll_create(mctx, N);\r\nmtcp_listen(mctx, listen_id, 4096);\r\nwhile (1) {\r\n n = mtcp_epoll_wait(mctx, ep_id, events, N, -1);\r\n for (i = 0; i < n; i++) {\r\n sockid = events[i].data.sockid;\r\n if (sockid == listen_id) {\r\n c = mtcp_accept(mctx, listen_id, NULL);\r\n mtcp_setsock_nonblock(mctx, c);\r\n = EPOLLIN | EPOLLOUT);\r\n = c;\r\n mtcp_epoll_ctl(mctx, ep_id, EPOLL_CTL_ADD, c, &ev);\r\n } else if (events[i].events == EPOLLIN) {\r\n r = mtcp_read(mctx, sockid, buf, LEN);\r\n if (r == 0)\r\n mtcp_close(mctx, sockid);\r\n } else if (events[i].events == EPOLLOUT) {\r\n mtcp_write(mctx, sockid, buf, len);\r\n }\r\n }\r\n}\r\n```\r\n<div align=\"center\">\r\n <b> Figure 3. Sample event-driven mTCP application </b>\r\n</div>\r\n\r\nAs in Figure 2, you can program with mTCP just as you do with Linux `epoll` and sockets. One difference is that the mTCP functions require `mctx` (mTCP thread context) for all functions, managing resources independently among different threads for core-scalability.\r\n\r\n### Performance\r\n\r\nWe first show mTCP's scalability with a benchmark for a server sending a short (64B) message. All servers are multi-threaded with a single listening port. Figure 3 shows the performance as a function of the number of CPU cores. While Linux shows poor scaling due to a shared accept queue, and Linux with `SO_REUSEPORT` scales but not linearly, mTCP scales almost linearly with the number of CPU cores. On 8 cores, mTCP shows 25x, 5x, 3x higher performance over Linux, Linux+`SO_REUSEPORT`, and MegaPipe, respectively.\r\n\r\n<div align=\"center\">\r\n <img src=\"\" height=\"240\">\r\n <div><b> Figure 4. Small message transaction benchmark </b></div>\r\n</div>\r\n\r\nTo gauge the performance of lighttpd in a realistic setting, we run a test by extracting the static file workload from SpecWeb2009 as <a href=\"\">Affinity-Accept</a> and <a href=\"\">MegaPipe</a> did. Figure 4 shows that mTCP improves the throughput by 3.2x, 2.2x, 1.5x over Linux, REUSEPORT, and MegaPipe, respectively.\r\n\r\nFor lighttpd, we changed only ~65 LoC to use mTCP-specific event and socket function calls. For multi-threading, a total of ~800 lines were modified out of lighttpd's ~40,000 LoC.\r\n\r\n<div align=\"center\">\r\n <img src=\"\" height=\"240\">\r\n <div><b> Figure 5. Performance of lighttpd for static file workload from SpecWeb2009 </b></div>\r\n</div>\r\n\r\n**Experiment setup:**\r\n\r\n1 Intel Xeon E5-2690 @ 2.90 GHz (octacore)<br>\r\n32 GB RAM (4 memory channels)<br>\r\n1~2 Intel dual port 82599 10 GbE NIC<br>\r\nLinux 2.6.32 (for mTCP), Linux 3.1.3 (for MegaPipe), Linux 3.10.12<br>\r\nixgbe-3.17.3\r\n\r\n### Publications\r\n\r\n**[mTCP: A Highly Scalable User-level TCP Stack for Multicore Systems](**<br>\r\nEunYoung Jeong, Shinae Woo, Muhammad Jamshed, Haewon Jeong, Sunghwan Ihm, Dongsu Han, and KyoungSoo Park <br>\r\nIn Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI'14)\r\nSeattle, WA, April 2014\r\n","google":"","note":"Don't delete this file! It's used internally to help with page regeneration."}