Raw Socket Mastery: High-Performance TCP Load Balancing in C
If you're still using fork() for every incoming connection, your server is going to crawl and die the moment you get Slashdotted. We're in 1998, and the "C10k" wall is real. Linux 2.0.36 is stable, but its scheduler isn't ready to juggle 5,000 processes. To scale a web cluster today, you need a single-process multiplexing engine.
The secret is select(). We hand the kernel a bitmask of file descriptors (an fd_set), and it tells us which sockets have data ready to be drained. With one process there are no per-connection context switches and no per-process memory bloat.
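To make the readiness idea concrete, here is a minimal sketch that asks select() about a single descriptor. The helper name wait_readable() and its millisecond timeout are assumptions made for illustration, not part of the balancer itself.

```c
#include <sys/select.h>
#include <sys/time.h>   // struct timeval on older systems
#include <unistd.h>     // older systems declare select() here

// Hypothetical helper: 1 if fd is readable within timeout_ms,
// 0 on timeout, -1 on error or if fd cannot live in an fd_set.
int wait_readable(int fd, int timeout_ms) {
    fd_set rfds;
    struct timeval tv;
    int n;

    if (fd < 0 || fd >= FD_SETSIZE) return -1;

    FD_ZERO(&rfds);          // clear every bit
    FD_SET(fd, &rfds);       // set the one bit we care about
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    // first argument is the highest descriptor plus one
    n = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (n < 0) return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}
```

On an empty pipe this should report a timeout. Once a byte has been written to the other end, the read side should come back as ready.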
The Non-Blocking Socket Setup
First, forget blocking I/O. We want our load balancer to never stall on a single socket; the only place it should ever sleep is select(). Every socket (both the listening side and the backend connections) must be set to O_NONBLOCK.
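#include <fcntl.h>
#include <sys/socket.h>

// Put fd into non-blocking mode. Returns 0 on success, -1 on error.
int set_nonblocking(int fd) {
    int opts = fcntl(fd, F_GETFL, 0);   // read the current status flags
    if (opts < 0) return -1;
    opts |= O_NONBLOCK;                 // keep the old flags, add non-blocking
    if (fcntl(fd, F_SETFL, opts) < 0) return -1;
    return 0;
}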
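For context, this is roughly how the listening side might be brought up before it goes into the multiplexing set. It is a sketch, not the article's own setup code: make_listener(), its backlog parameter, and the SO_REUSEADDR choice are assumptions, and set_nonblocking() is repeated so the block compiles on its own.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

static int set_nonblocking(int fd) {
    int opts = fcntl(fd, F_GETFL, 0);
    if (opts < 0) return -1;
    return fcntl(fd, F_SETFL, opts | O_NONBLOCK) < 0 ? -1 : 0;
}

// Hypothetical helper: returns a non-blocking listening socket bound to
// every local address on the given port, or -1 on failure.
int make_listener(unsigned short port, int backlog) {
    struct sockaddr_in addr;
    int yes = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    // lets the balancer restart without waiting out old TIME_WAIT sockets
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0 ||
        set_nonblocking(fd) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A non-blocking listener matters because a client can vanish between select() reporting it and accept() running. A blocking accept() would then hang the whole loop.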
The Multiplexing Loop
We maintain a master_fds set. When select() returns, we iterate through the active bits. If it's the listener, we accept(). If it's an established connection, we shovel bytes between the client and our backend farm.
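#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

fd_set master_fds;          // every descriptor we are watching
fd_set read_fds;            // scratch copy; select() overwrites it
int fd_max;                 // highest descriptor in master_fds
int listener;               // listening socket, already non-blocking
int peer_map[FD_SETSIZE];   // fd -> fd on the other side; unused slots are -1
struct sockaddr_in remoteaddr;
socklen_t addrlen;
int newfd;
// ... initialization ...
for (;;) {
    read_fds = master_fds; // copy it
    if (select(fd_max + 1, &read_fds, NULL, NULL, NULL) == -1) {
        if (errno == EINTR) continue; // a signal interrupted us, retry
        perror("select");
        exit(1);
    }
    for (int i = 0; i <= fd_max; i++) {
        if (!FD_ISSET(i, &read_fds)) continue;
        if (i == listener) {
            // handle new connection
            addrlen = sizeof(remoteaddr);
            newfd = accept(listener, (struct sockaddr *)&remoteaddr, &addrlen);
            if (newfd == -1) continue;
            if (newfd >= FD_SETSIZE || set_nonblocking(newfd) < 0) {
                close(newfd); // too large for an fd_set, or fcntl() failed
                continue;
            }
            FD_SET(newfd, &master_fds);
            if (newfd > fd_max) fd_max = newfd;
        } else {
            // handle data from client or backend
            char buf[2048];
            int peer = peer_map[i];
            int nbytes = recv(i, buf, sizeof(buf), 0);
            if (nbytes < 0 && (errno == EAGAIN || errno == EINTR)) continue;
            if (nbytes <= 0) {
                // EOF or hard error: drop this side and its peer
                close(i);
                FD_CLR(i, &master_fds);
                peer_map[i] = -1;
                if (peer >= 0) {
                    close(peer);
                    FD_CLR(peer, &master_fds);
                    FD_CLR(peer, &read_fds); // don't visit it later in this pass
                    peer_map[peer] = -1;
                }
            } else {
                // find the peer (backend or client) and send
                // real hackers use a lookup table here
                // a non-blocking send() can be short; production code must queue the rest
                send(peer, buf, nbytes, 0);
            }
        }
    }
}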
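The peer_map lookup in the loop is left to the reader, so here is one way it could look: a flat array indexed by descriptor, with -1 meaning "no peer". The function names peers_init(), peers_link(), and peers_unlink() are hypothetical, and sizing the table to FD_SETSIZE is an assumption that fits a select()-based loop.

```c
#include <sys/select.h>   // FD_SETSIZE

static int peer_map[FD_SETSIZE];

static int fd_in_range(int fd) { return fd >= 0 && fd < FD_SETSIZE; }

// Mark every slot as unpaired. Call once at startup.
void peers_init(void) {
    for (int i = 0; i < FD_SETSIZE; i++) peer_map[i] = -1;
}

// Pair a client with a backend in both directions.
// Returns 0 on success, -1 if either fd is out of range or already paired.
int peers_link(int client_fd, int backend_fd) {
    if (!fd_in_range(client_fd) || !fd_in_range(backend_fd)) return -1;
    if (client_fd == backend_fd) return -1;
    if (peer_map[client_fd] != -1 || peer_map[backend_fd] != -1) return -1;
    peer_map[client_fd] = backend_fd;
    peer_map[backend_fd] = client_fd;
    return 0;
}

// Break the pairing that fd belongs to. Returns the former peer
// (so the caller can close it too), or -1 if fd had none.
int peers_unlink(int fd) {
    int peer;
    if (!fd_in_range(fd)) return -1;
    peer = peer_map[fd];
    peer_map[fd] = -1;
    if (peer >= 0) peer_map[peer] = -1;
    return peer;
}
```

Because the table is symmetric, the loop never needs to know whether a readable descriptor is a client or a backend. The lookup is one array index either way.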
Memory Efficiency: The Real Bottleneck
In a production environment, you cannot afford a malloc() for every read. We pre-allocate a static pool of fixed-size buffers at startup, and each connection holds a pointer to one buffer from that pool. If your load balancer is swapping to disk, you've already lost. We keep the state machine lean: just a struct per connection tracking the client FD, the backend FD, and a small byte-count offset.
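A minimal sketch of that idea follows, assuming a free-list stack over a static array. POOL_BUFS, BUF_SIZE, the pool_* and conn_open() names, and the len and buf fields are illustrative assumptions; only the two descriptors and the offset come from the description above.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BUFS 64     // assumed pool size
#define BUF_SIZE  2048   // assumed to match the recv() buffer size

struct conn {
    char *buf;              // buffer borrowed from the pool
    int client_fd;
    int backend_fd;
    unsigned short offset;  // bytes of buf already forwarded
    unsigned short len;     // bytes of buf currently filled (assumption)
};

static char pool_mem[POOL_BUFS][BUF_SIZE];
static int free_list[POOL_BUFS];   // stack of free buffer indexes
static int free_top;

void pool_init(void) {
    for (int i = 0; i < POOL_BUFS; i++) free_list[i] = i;
    free_top = POOL_BUFS;
}

// Returns a free buffer, or NULL when the pool is exhausted.
char *pool_get(void) {
    if (free_top == 0) return NULL;
    return pool_mem[free_list[--free_top]];
}

// Returns a buffer to the pool. -1 if buf is not the start of a pool
// buffer or the free list is already full. Does not catch every double free.
int pool_put(char *buf) {
    uintptr_t base = (uintptr_t)&pool_mem[0][0];
    uintptr_t p = (uintptr_t)buf;
    if (p < base || p >= base + sizeof(pool_mem)) return -1;
    if ((p - base) % BUF_SIZE != 0) return -1;
    if (free_top >= POOL_BUFS) return -1;
    free_list[free_top++] = (int)((p - base) / BUF_SIZE);
    return 0;
}

// Fill in a connection record. -1 means the pool is empty and the
// caller should refuse the connection instead of allocating.
int conn_open(struct conn *c, int client_fd, int backend_fd) {
    c->buf = pool_get();
    if (c->buf == NULL) return -1;
    c->client_fd = client_fd;
    c->backend_fd = backend_fd;
    c->offset = 0;
    c->len = 0;
    return 0;
}
```

Running out of buffers becomes an explicit admission-control decision at accept() time, which is easier to reason about than a malloc() failure in the middle of a transfer.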
Linux 2.2 is on the horizon, promising better threading, but for now, the single-threaded select() loop is the fastest path to high throughput on a single box. select() cannot watch a descriptor numbered FD_SETSIZE or higher (1024 by default), so if you need more than that, you'll have to raise the limit and rebuild, which on current kernels can mean recompiling the kernel itself, or start looking at poll(), though support for it is still spotty across different Unices.
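For comparison, here is a rough sketch of the same readiness check written against poll(), which takes an array of descriptors instead of a fixed-size bitmask. The readable_fds() name, the MAX_WATCH cap, and the static watch array are assumptions chosen to keep the example allocation-free.

```c
#include <poll.h>

#define MAX_WATCH 4096   // assumed cap; bounded by this array, not FD_SETSIZE

static struct pollfd watch[MAX_WATCH];

// Hypothetical helper: copies into ready[] every descriptor from fds[]
// that is readable (or hung up / in error) within timeout_ms.
// Returns how many were copied, or -1 on error.
int readable_fds(const int *fds, int nfds, int timeout_ms, int *ready) {
    int i, n, count = 0;

    if (nfds < 0 || nfds > MAX_WATCH) return -1;

    for (i = 0; i < nfds; i++) {
        watch[i].fd = fds[i];      // negative fds are ignored by poll()
        watch[i].events = POLLIN;
        watch[i].revents = 0;
    }

    n = poll(watch, (nfds_t)nfds, timeout_ms);
    if (n < 0) return -1;

    for (i = 0; i < nfds && count < n; i++) {
        if (watch[i].revents & (POLLIN | POLLHUP | POLLERR))
            ready[count++] = watch[i].fd;
    }
    return count;
}
```

The descriptor values themselves can be arbitrarily large here. Only the number of watched entries is capped, and that cap is ours to choose.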
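Data alignment matters. On Alpha or SPARC, an unaligned access can SIGBUS your process. Even on x86, it's a performance hit. Don't pack your structs so tightly that fields straddle their natural boundaries; order the members largest-first so the compiler has nothing to pad, and keep your hot loops in the L1 cache.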
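A small illustration of both points, with made-up field names (state and flags are hypothetical additions to the connection record). The two structs hold the same members; only the order differs. load_u32() shows one portable way to pull a 32-bit value out of a byte buffer at an arbitrary offset without an unaligned load.

```c
#include <stddef.h>
#include <string.h>

// Small fields wedged between ints force the compiler to insert padding.
struct conn_sloppy {
    char state;
    int client_fd;
    char flags;
    int backend_fd;
    short offset;
};

// Same members, largest first: every field lands on its natural boundary.
struct conn_tidy {
    int client_fd;
    int backend_fd;
    short offset;
    char state;
    char flags;
};

// Read an unsigned int from any address. memcpy() lets the compiler emit
// whatever access is legal on the target instead of a raw pointer cast.
unsigned int load_u32(const void *p) {
    unsigned int v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```

On a typical ABI with 4-byte int and 2-byte short, the sloppy layout usually comes out at 20 bytes and the tidy one at 12. Check sizeof() on your own compiler rather than trusting those figures.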