C++ Socket Server - Unable to saturate CPU

C++ Socket Server - Unable to saturate CPU - c++

I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients and I've been unable to get close to saturating the CPU. I'm testing on a Amazon EC2 instance, and getting about 50% usage of one cpu, 20% of another, and the remaining two are idle (according to htop).
Details:
The server fires up one thread per core
Requests are received, parsed, processed, and responses are written out
The requests are for data, which is read out of memory (read-only for this test)
I'm 'loading' the server using two machines, each running a java application, running 25 threads, sending requests
I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)
So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
Ideas I've had:
The requests are very small, and often fulfilled in a few ms, I could modify the client to send/compose bigger requests (perhaps using batching)
I could modify the HTTP server to use the Select design pattern, is this appropriate here?
I could do some profiling to try to understand what the bottleneck's are/is

boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).
Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.

As you are using EC2, all bets are off.
Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.
I have not yet worked out what EC2 is useful for, if someone find out, please let me know.

From your comments on network utilization,
You do not seem to have much network movement.
3 + 2.5 MiB/sec is around the 50Mbps ball-park (compared to your 1Gbps port).
I'd say you are having one of the following two problems,
Insufficient work-load (low request-rate from your clients)
Blocking in the server (interfered response generation)
Looking at cmeerw's notes and your CPU utilization figures
(idling at 50% + 20% + 0% + 0%)
it seems most likely a limitation in your server implementation.
I second cmeerw's answer (+1).

230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of un-needed locking may get things up to speed.
This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?

ASIO is fine for small to medium tasks but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows but if you are experienced you will always be better than ASIO. Either way there is a lot of overhead with all of those methods, just more with ASIO.
For what it is worth. using raw socket calls on my custom HTTP can serve 800K dynamic requests per second with a 4 core I7. It is serving from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second, its performance is quite variable and mostly bound in my app. The post by #cmeerw mostly explains why.
One way to improve performance is by implementing a UDP proxy. Intercepting HTTP requests and then routing them over UDP to your backend UDP-HTTP server you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of a HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. And FWIW You should always have a frontend-backend setup for any performant site because the frontends can cache data without slowing down the more important dynamic requests backend.
The future seems to be writing your own driver that implements its own network stack so you can get as close to the requests as possible and implement your own protocol there. Which probably isn't what most programmers want to hear as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, however you will need more servers - though if you are doing this many requests per second you will usually need multiple network cards and multiple frontends to handle the bandwidth so having a couple lightweight UDP proxies in there isn't that big a deal.
Hope some of this can be useful to you.

How many instances of io_service do you have? Boost asio has an example that creates an io_service per CPU and use them in the manner of RoundRobin.
You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.

Related

Data distribution fairness: is TCP and websocket a good choice?

I am learning about servers and data distribution. Much of what I have read from various sources (here is just one) talks about how market data is distributed over UDP to take advantage of multicasting. Indeed, in this video at this point about building a trading exchange, the presenter mentions how TCP is not the optimal choice to distribute data because it means having to "loop over" every client then send the data to each in turn, meaning that the "first in the list" of clients has a possibly unfair advantage.
I was very surprised then when I learned that I could connect to the Binance feed of market data using a websocket connection, which is TCP, using a command such as
websocat_linux64 wss://stream.binance.com:9443/ws/btcusdt#trade --protocol ws
Many other sources mention Websockets, so they certainly seem to be a common method of delivering market data, indeed this states
"Cryptocurrency trading applications often have real-time market data
streamed to trader front-ends via websockets"
I am confused. If Binance distributes over TCP, is "fairness" really a problem as the YouTube video seems to suggest?
So, overall, my main question is that if I want to distribute data (of any kind generally, but we can keep the market data theme if it helps) to multiple clients (possibly thousands) over the internet, should I use UDP or TCP, and is there any specific technique that could be employed to ensure "fairness" if that is relevant?
I've added the C++ tag as I would use C++, lots of high performance servers are written in C++, and I feel there's a good chance that someone will have done something similar and/or accessed the Binance feeds using C++.

The argument on fairness due to looping, in code, is ridiculous.
The whole field of trading where decisions need to be made quickly, where you need to use new information before someone else does is called: low-latency trading.
This tells you what's important: reducing the latency to a minimum. This is why UDP is used over TCP. TCP has flow control, re-sends data and buffers traffic to deliver it in order. This would make it terrible for low-latency trading.
WebSockets, in addition to being built on top of TCP are heavier and slower simply due to the extra amount of data (and needed processing to read/write it).
So even though the looping would be a tiny marginal latency cost, there's plenty of other reasons to pick UDP over TCP and even more over WebSockets.
So why does Binance does it? Their market is not institutional traders with hardware located at the exchanges. It's for traders that are willing to accept some latency. If you don't trade to the millisecond, then some extra latency is acceptable. It makes it much easier to integrate different piece of software together. It also makes fairness, in latency, not so important. If Alice is 0.253 seconds away and Bob is 0.416 seconds away, does it make any difference who I tell first (by a few microseconds)? Probably not.

C/C++ technologies involved in sending data across networks very fast

In terms of low latency (I am thinking about financial exchanges/co-location- people who care about microseconds) what options are there for sending packets from a C++ program on two Unix computers?
I have heard about kernel bypass network cards, but does this mean you program against some sort of API for the card? I presume this would be a faster option in comparison to using the standard Unix berkeley sockets?
I would really appreciate any contribution, especially from persons who are involved in this area.
EDITED from milliseconds to microseconds
EDITED I am kinda hoping to receive answers based more upon C/C++, rather than network hardware technologies. It was intended as a software question.

UDP sockets are fast, low latency, and reliable enough when both machines are on the same LAN.
TCP is much slower than UDP but when the two machines are not on the same LAN, UDP is not reliable.

Software profiling will stomp obvious problems with your program. However, when you are talking about network performance, network latency is likely to be you largest bottleneck. If you are using TCP, then you want to do things that avoid congestion and loss on your network to prevent retransmissions. There are a few things to do to cope:
Use a network with bandwidth and reliability guarantees.
Properly size your TCP parameters to maximize utilization without incurring loss.
Use error correction in your data transmission to correct for the small amount of loss you might encounter.
Or you can avoid using TCP altogether. But if reliability is required, you will end up implementing much of what is already in TCP.
But, you can leverage existing projects that have already thought through a lot of these issues. The UDT project is one I am aware of, that seems to be gaining traction.

At some point in the past, I worked with a packet sending driver that was loaded into the Windows kernel. Using this driver it was possible to generate stream of packets something 10-15 times stronger (I do not remember exact number) than from the app that was using the sockets layer.
The advantage is simple: The sending request comes directly from the kernel and bypasses multiple layers of software: sockets, protocol (even for UDP packet simple protocol driver processing is still needed), context switch, etc.

Usually reduced latency comes at a cost of reduced robustness. Compare for example the (often greatly advertised) fastpath option for ADSL. The reduced latency due to shorter packet transfer times comes at a cost of increased error susceptibility. Similar technologies migt exist for a large number of network media. So it very much depends on the hardware technologies involved. Your question suggests you're referring to Ethernet, but it is unclear whether the link is Ethernet-only or something else (ATM, ADSL, …), and whether some other network technology would be an option as well. It also very much depends on geographical distances.
EDIT:
I got a bit carried away with the hardware aspects of this question. To provide at least one aspect tangible at the level of application design: have a look at zero-copy network operations like sendfile(2). They can be used to eliminate one possible cause of latency, although only in cases where the original data came from some source other than the application memory.

As my day job, I work for a certain stock exchange. Below answer is my own opinion from the software solutions which we provide exactly for this kind of high throughput low latency data transfer. It is not intended in any way to be taken as marketing pitch(please i am a Dev.)This is just to give what are the Essential components of the software stack in this solution for this kind of fast data( Data could be stock/trading market data or in general any data):-
1] Physical Layer - Network interface Card in case of a TCP-UDP/IP based Ethernet network, or a very fast / high bandwidth interface called Infiniband Host Channel Adaptor. In case of IP/Ethernet software stack, is part of the OS. For Infiniband the card manufacturer (Intel, Mellanox) provide their Drivers, Firmware and API library against which one has to implement the socket code(Even infiniband uses its own 'socketish' protocol for network communications between 2 nodes.
2] Next layer above the physical layer we have is a Middleware which basically abstracts the lower network protocol nittigritties, provides some kind of interface for data I/O from physical layer to application layer. This layer also provides some kind of network data quality assurance (IF using tCP)
3] Last layer would be a application which we provide on top of middleware. Any one who gets 1] and 2] from us, can develop a low latency/hight throughput 'data transfer of network' kind of app for stock trading, algorithmic trading kind os applications using a choice of programming language interfaces - C,C++,Java,C#.
Basically a client like you can develop his own application in C,C++ using the APIs we provide, which will take care of interacting with the NIC or HCA(i.e. the actual physical network interface) to send and receive data fast, really fast.
We have a comprehensive solution catering to different quality and latency profiles demanded by our clients - Some need Microseconds latency is ok but they need high data quality/very little errors; Some can tolerate a few errors, but need nano seconds latency, Some need micro seconds latency, no errors tolerable, ...
If you need/or are interested in any way in this kind of solution , ping me offline at my contacts mentioned here at SO.

epoll vs select for very small number of connections

I have been using select to handle connections, recently there was a change an our socket library and select was replaced by epoll for linux platform.
my application architecture is such that I make only one or at max 2 socket connections and epoll/select on them in a single thread.
now with recent switch to epoll i noticed that performance of application has diminshed, I was actually surprised and was expecting performance go up or reamin same. I tried looking at various other parts and this is the only peice of code that has changed.
does epoll have performance penalty in terms of speed if used for very small number of sockets (like 1 or 2).
also anoher thing to note that I run around 125 such processes on same box (8 cpu cores).
could this be case that too many processes doing epoll_wait on same machine, this setup was similar when i was using select.
i noticed on box that load average is much higher but cpu usage was quite the same which makes me think that more time is spend in I/O and probaly coming from epoll related changes.
any ideas/pointers on what should i look more to identify the problem.
although absolute latency increased is quite small like average 1 millisec but this is a realtime system and this kind of latencies are generally unaccpetable.
Thanks
Hi,
Updating this question on latest findinds, apart from switching from select to epoll I found another relate change, earlier timeout with select was 10 millis but with epoll the way timeout is way smaller than before (like 1 micro..), can setting too low timeout in select or epoll result on decreased performance in anyway?
thanks

From the sounds of it, throughput may be unaffected with epoll() vs select(), but you're finding extra latency in individual requests that seems to be related to the use of epoll().
I think that in the case of watching only one or two sockets, epoll() should perform much like select(). epoll() is supposed to scale linearly as you watch more descriptors, whereas select() scales badly (& may even have a hard limit on #/descriptors). So it's not that epoll() has a penalty for a small # of descriptors, but it loses its performance advantage over select() in this case.
Can you change the code so you can easily go back & forth between the two event notification mechanisms? Get more data about the performance difference. If you conclusively find that select() has less latency & same throughput in your situation, then I'd just switch back to the "old & deprecated" API without hesitation :) To me it's fairly conclusive if you measure a performance difference from this specific code change. Perhaps previous testing of epoll() versus select() has focused on throughput versus latency of individual requests?

Reducing network latency in communication of high volume intranet applications

We have a set of server applications which receive measurement data from equipment/tools. Message transfer time is currently our main bottleneck, so we are interested in reducing it to improve the process. The communication between the tools and server applications is via TCP/IP sockets made using C++ on Redhat Linux.
Is it possible to reduce the message transfer time using hardware, by changing the TCP/IP configuration settings or by tweaking tcp kernel functions? (we can sacrifice security for speed, since communication is on a secure intranet)

Depending on the workload, disabling Nagle's Algorithm on the socket connection can help a lot.
When working with high volumes of small messages, i found this made a huge difference.
From memory, I believe the socket option for C++ was called TCP_NODELAY

As #Jerry Coffin proposed, you can switch to UDP. UDP is unreliable protocol, this means you can lose your packets, or they can arrive in wrong order, or be duplicated. So you need to handle these cases on application level. Since you can lose some data (as you stated in your comment) no need for retransmission (the most complicated part of any reliable protocol). You just need to drop outdated packets. Use simple sequence numbering and you're done.
Yes, you can use RTP (it has sequence numbering) but you don't need it. RTP looks like an overkill for your simple case. It has many other features and is used mainly for multimedia streaming.
[EDIT] and similar question here

On the hardware side try Intel Server NICs and make sure the TCP offload Engine (ToE) is enabled.
There is also an important decision to make between latency and goodput, if you want better latency at expense of goodput consider reducing the interrupt coalescing period. Consult the Intel documentation for further details as they offer quite a number of configurable parameters.

If you can, the obvious step to reduce latency would be to switch from TCP to UDP.

Yes.
Google "TCP frame size" for details.

Communication between processes

I'm looking for some data to help me decide which would be the better/faster for communication between two independent processes on Linux:
TCP
Named Pipes
Which is worse: the system overhead for the pipes or the tcp stack overhead?
Updated exact requirements:
only local IPC needed
will mostly be a lot of short messages
no cross-platform needed, only Linux

In the past I've used local domain sockets for that sort of thing. My library determined whether the other process was local to the system or remote and used TCP/IP for remote communication and local domain sockets for local communication. The nice thing about this technique is that local/remote connections are transparent to the rest of the application.
Local domain sockets use the same mechanism as pipes for communication and don't have the TCP/IP stack overhead.

I don't really think you should worry about the overhead (which will be ridiculously low). Did you make sure using profiling tools that the bottleneck of your application is likely to be TCP overhead?
Anyways as Carl Smotricz said, I would go with sockets because it will be really trivial to separate the applications in the future.

I discussed this in an answer to a previous post. I had to compare socket, pipe, and shared memory communications. Pipes were definitely faster than sockets (maybe by a factor of 2 if I recall correctly ... I can check those numbers when I return to work). But those measurements were just for the pure communication. If the communication is a very small part of the overall work, then the difference will be negligible between the two types of communication.
Edit
Here are some numbers from the test I did a few years ago. Your mileage may vary (particularly if I made stupid programming errors). In this specific test, a "client" and "server" on the same machine echoed 100 bytes of data back and forth. It made 10,000 requests. In the document I wrote up, I did not indicate the specs of the machine, so it is only the relative speeds that may be of any value. But for the curious, the times given here are the average cost per request:
TCP/IP: .067 ms
Pipe with I/O Completion Ports: .042 ms
Pipe with Overlapped I/O: .033 ms
Shared Memory with Named Semaphore: .011 ms

There will be more overhead using TCP - that will involve breaking the data up into packets, calculating checksums and handling acknowledgement, none of which is necessary when communicating between two processes on the same machine. Using a pipe will just copy the data into and out of a buffer.

I don't know if this suites you, but a very common way of IPC (interprocess communication) under linux is by using the shared memory. It's actually ultra fast (I didn't profiled this, but this is just shared data on RAM with strong processing around it).
The main problem around this approuch is the semaphore, you must build a little system around it so you must make sure a process is not writing at the same time the other one is trying to read.
A very simple starter tutorial is at here
This is not as portable as using sockets, but the concept would be the same, so if you're migrating this to Windows, you will just have to change the shared memory create/attach layer.

Two things to consider:
Connection setup cost
Continuous Communication cost
On TCP:
(1) more costly - 3way handshake overhead required for (potentially) unreliable channel.
(2) more costly - IP level overhead (checksum etc.), TCP overhead (sequence number, acknowledgement, checksum etc.) pretty much all of which aren't necessary on the same machine because the channel is supposed to be reliable and not introduce network related impairments (e.g. packet reordering).
But I would still go with TCP provided it makes sense (i.e. depends on the situation) because of its ubiquity (read: easy cross-platform support) and the overhead shouldn't be a problem in most cases (read: profile, don't do premature optimization).
Updated: if cross-platform support isn't required and the accent is on performance, then go with named/domain pipes as I am pretty sure the platform developers will have optimize-out the unnecessary functionality deemed required for handling network level impairments.

unix domain socket is a very goog compromise. Not the overhead of tcp, but more evolutive than the pipe solution. A point you did not consider is that socket are bidirectionnal, while named pipes are unidirectionnal.

I think the pipes will be a little lighter, but I'm just guessing.
But since pipes are a local thing, there's probably a lot less complicated code involved.
Other people might tell you to try and measure both to find out. It's hard to go wrong with this answer, but you may not be willing to invest the time. That would leave you hoping my guess is correct ;)

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js