Speed of HTTP GET Method in C++

I have been using certain libraries in a C++ program to connect to and fetch different websites, mainly Chillkat and Curl. Recently, however, I started writing my own HTTP fetcher with the help of MSDN and the Winsock2 library.
I programmed my software to open a SOCK_STREAM (IPv4) socket, establish a connection with the required website, and send a GET request with "Host:" and "Connection: close" headers to the server.
Everything seems to work fine; however, the performance is not what I expected. The bundled Chillkat library still performs better than mine, even though I have optimized mine as much as I can.
I notice that when I send the request, some servers take longer to respond, and once they do, they send everything at once, chunked. So how can I compose a request whose headers prompt a fast response? Speed matters a lot for my program.

If you are seeing performance differences on a modern machine with low volumes, the most likely problem is that you have forgotten to turn off the Nagle algorithm. Use setsockopt() to set TCP_NODELAY to 1. HTTP is not Telnet.
I wouldn't worry about explicit flushing, buffer management, or anything like that until you see a performance problem and have enough volume to notice it, other than writing your request in a single write call.
For download speed, window size makes a difference. You can tune SO_SNDBUF and SO_RCVBUF. Bear in mind that the values that make your benchmarks go fast might make your real-world performance slow.
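If it helps, here is a minimal Winsock2 sketch of both suggestions on an already-created socket (the 256 KB buffer value is only illustrative, not a recommendation):

    // Sketch (Winsock2): disable Nagle and enlarge the receive buffer on an
    // already-created socket. The 256 KB value is illustrative only.
    #include <winsock2.h>
    #include <ws2tcpip.h>

    bool tune_socket(SOCKET s)
    {
        BOOL noDelay = TRUE;   // TCP_NODELAY: push small writes out immediately
        if (setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                       reinterpret_cast<const char*>(&noDelay), sizeof(noDelay)) != 0)
            return false;

        int rcvBuf = 256 * 1024;   // larger receive window can help download speed
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                       reinterpret_cast<const char*>(&rcvBuf), sizeof(rcvBuf)) != 0)
            return false;

        return true;
    }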

Honestly, HTTP is a complex standard, and there are many ways to optimize an implementation. However, it is highly unlikely that you will have enough time to optimize it better than an already packaged library such as Chillkat or Curl. If you did want to go about it, I would suggest reducing the number of headers you send, and flushing the socket buffer (bypassing Nagle's algorithm) after writing the status line to the socket. This gives a properly coded server slightly longer (several ms at most) to respond to your request. But even that may blow up in your face if your network configuration is not "ideal".
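As a rough sketch of the "fewer headers, single write" part of that advice (Winsock2; the suggestion above about flushing right after the request line is not shown in this simple version):

    // Sketch (Winsock2): a minimal GET request with few headers, handed to one
    // send() call so the whole request leaves the application at once.
    #include <string>
    #include <winsock2.h>

    bool send_get(SOCKET s, const std::string& host, const std::string& path)
    {
        std::string req = "GET " + path + " HTTP/1.1\r\n"
                          "Host: " + host + "\r\n"
                          "Connection: close\r\n"
                          "\r\n";
        int len = static_cast<int>(req.size());
        return send(s, req.data(), len, 0) == len;
    }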
Finally, keep in mind that when it comes to networks, there is a very large margin of error, and you may get different results using different tactics with different servers, networks, and even OSs.

Related

Data distribution fairness: is TCP and websocket a good choice?

I am learning about servers and data distribution. Much of what I have read from various sources (here is just one) talks about how market data is distributed over UDP to take advantage of multicasting. Indeed, in this video about building a trading exchange, the presenter mentions at one point that TCP is not the optimal choice for distributing data because it means having to "loop over" every client and then send the data to each in turn, meaning that the "first in the list" of clients has a possibly unfair advantage.
I was very surprised then when I learned that I could connect to the Binance feed of market data using a websocket connection, which is TCP, using a command such as
websocat_linux64 wss://stream.binance.com:9443/ws/btcusdt#trade --protocol ws
Many other sources mention WebSockets, so they certainly seem to be a common method of delivering market data; indeed, this states: "Cryptocurrency trading applications often have real-time market data streamed to trader front-ends via websockets".
I am confused. If Binance distributes over TCP, is "fairness" really a problem as the YouTube video seems to suggest?
So, overall, my main question is: if I want to distribute data (of any kind generally, but we can keep the market data theme if it helps) to multiple clients (possibly thousands) over the internet, should I use UDP or TCP, and is there any specific technique that could be employed to ensure "fairness", if that is relevant?
I've added the C++ tag as I would use C++, lots of high performance servers are written in C++, and I feel there's a good chance that someone will have done something similar and/or accessed the Binance feeds using C++.
The argument on fairness due to looping, in code, is ridiculous.
The whole field of trading where decisions need to be made quickly, where you need to use new information before someone else does, is called low-latency trading.
This tells you what's important: reducing the latency to a minimum. This is why UDP is used over TCP. TCP has flow control, re-sends data and buffers traffic to deliver it in order. This would make it terrible for low-latency trading.
WebSockets, in addition to being built on top of TCP, are heavier and slower simply due to the extra amount of data (and the processing needed to read and write it).
So even though the looping would be a tiny marginal latency cost, there's plenty of other reasons to pick UDP over TCP and even more over WebSockets.
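To make the difference concrete, here is a minimal POSIX sketch: one sendto() to a multicast group reaches every subscribed receiver, whereas TCP would need a send() per connected client. The group address, port, and payload are made up.

    // Sketch (POSIX): publish one market-data update to a UDP multicast group.
    // The group address, port, and payload are illustrative only.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string>
    #include <sys/socket.h>
    #include <unistd.h>

    int main()
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        sockaddr_in group{};
        group.sin_family = AF_INET;
        group.sin_port = htons(5000);                       // illustrative port
        inet_pton(AF_INET, "239.0.0.1", &group.sin_addr);   // illustrative group

        std::string tick = "BTCUSDT 42000.17";              // one update, one send
        sendto(fd, tick.data(), tick.size(), 0,
               reinterpret_cast<sockaddr*>(&group), sizeof(group));

        close(fd);
    }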
So why does Binance do it? Their market is not institutional traders with hardware located at the exchanges. It's for traders who are willing to accept some latency. If you don't trade to the millisecond, then some extra latency is acceptable. It makes it much easier to integrate different pieces of software together. It also makes fairness, in latency, not so important. If Alice is 0.253 seconds away and Bob is 0.416 seconds away, does it make any difference who I tell first (by a few microseconds)? Probably not.

Detecting maximum throughput of my network using C++

My application aims to detect the network throughput. The C++ code aside, I am looking for a reliable way of throttling my network that returns the exact maximum accepted baud rate; I can write the code later.
After checking several ideas from the internet, I didn't find one that suits my case.
I tried to send as much data as possible using TCP/IP and then check the baud rate after every 10 MB sent. Here is the pseudocode for my algorithm:
    while (send(...)) {
        if (tempSendBytes > 10MB)
            if (baudrate > predefinedThreshold)
                usleep(calculateNeededTime());
        tempSendBytes += sentBytes;
    }
But when predefinedThreshold is reached, the buffers fill up and my program gets stuck without returning any error. As a matter of fact, checking the baud rate on every sent message would reduce my bandwidth to its minimum, so I preferred to check every 10 MB.
PS: There are no other technical problems in my code, nor any memory leak. In addition, my program runs normally (sending and receiving data 100%) if I decrease predefinedThreshold.
My question:
Is there a way to detect the maximum bandwidth (on both loopback and a real network) without overflowing buffers or getting stuck?
Yes, you can detect maximum throughput on both loopback and real interface. The real interface may require a TCP server running remotely with sufficient bandwidth to provide an accurate estimate of max throughput.
If you are strictly looking for a theoretical number, you may be able to run the test on the real interface the same way you run it on localhost, with the server bound to the real interface's IP and the client running on the same computer. I'm not sure what your OS's networking stack will do with this traffic, but it should treat it as if it were coming from off-box.
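For illustration, here is a rough POSIX sketch of such a measurement; the port, the 100 MB target, and a discard server listening on the other end (e.g. nc -l 5001 > /dev/null) are all assumptions:

    // Sketch (POSIX): time a bulk send to a discard server and report the achieved
    // throughput. Address, port, and sizes are illustrative; a real test should run
    // long enough for TCP to open its window fully.
    #include <arpa/inet.h>
    #include <chrono>
    #include <cstdio>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <vector>

    int main()
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in srv{};
        srv.sin_family = AF_INET;
        srv.sin_port = htons(5001);                         // illustrative port
        inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);     // loopback test
        if (connect(fd, reinterpret_cast<sockaddr*>(&srv), sizeof(srv)) != 0)
            return 1;

        std::vector<char> buf(64 * 1024, 'x');
        const long long target = 100LL * 1024 * 1024;       // push 100 MB
        long long sent = 0;
        auto start = std::chrono::steady_clock::now();
        while (sent < target) {
            ssize_t n = send(fd, buf.data(), buf.size(), 0);
            if (n <= 0) break;
            sent += n;
        }
        double secs = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%.1f MiB/s\n", sent / secs / (1024 * 1024));
        close(fd);
    }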
A lot of factors contribute to max theoretical throughput for TCP. TCP MTU, OS send buffer, OS receive buffer, etc. Wikipedia has a good high level overview, but it sounds like you may have already read it. http://en.wikipedia.org/wiki/Measuring_network_throughput You might also find this TCP tuning overview helpful, http://en.wikipedia.org/wiki/TCP_tuning
IPerf is commonly used to accurately measure bandwidth and the techniques it uses are rather comprehensive. It is written in C++ and its code base may be a good starting point for you. https://github.com/esnet/iperf
I know none of this provides an exact discussion of the theory, but hopefully helps clarify some things.

How can I recv TCP socket data in one package without dividing

I created a TCP socket, and it works fine when sending small amounts of data: no fragmentation, all the data arrives in one package. But as the data gets bigger and bigger, the TCP payload is divided into pieces, which is really annoying. Is there any option I can set on the socket so that it automatically reassembles the pieces into one package for me?
It's a byte stream. All the bytes will arrive correctly and in the right order, but not necessarily when you want them. If you need to send anything more complex than one byte, you need another protocol on top of TCP. That's why there are all those other TCP/IP protocols like HTTP, SMTP etc.
No, there is not. There are even situations where you might receive just 1 byte.
Consider using higher level messaging libraries like ZMQ. It handles all the message packing and unpacking for you.
TCP provides you with a reliable bi-directional byte stream. It takes care of sequencing, transport-layer packetization, retransmission, and flow control. Decades of research went into optimizing its performance. Pretty nifty. The small price you pay for all this convenience is that you have to write and read the stream in a loop, watching for a complete application-protocol message you can process when receiving, and flushing not-yet-sent bytes when sending.
Welcome to socket programming!
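To make the "read in a loop until you have a whole message" point concrete, here is a minimal POSIX sketch assuming a made-up framing of a 4-byte big-endian length prefix (TCP itself gives you no framing at all):

    // Sketch (POSIX): read one length-prefixed message from a TCP socket. The
    // 4-byte big-endian length prefix is an assumed application protocol; TCP
    // itself provides nothing of the sort.
    #include <arpa/inet.h>
    #include <cstdint>
    #include <sys/socket.h>
    #include <vector>

    // Keep calling recv() until exactly `len` bytes have arrived, or the peer
    // closes the connection / an error occurs.
    static bool recv_all(int fd, char* buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(fd, buf + got, len - got, 0);
            if (n <= 0) return false;            // closed or error
            got += static_cast<size_t>(n);
        }
        return true;
    }

    bool recv_message(int fd, std::vector<char>& out)
    {
        uint32_t netLen = 0;
        if (!recv_all(fd, reinterpret_cast<char*>(&netLen), sizeof(netLen)))
            return false;
        out.resize(ntohl(netLen));               // the prefix says how much follows
        return recv_all(fd, out.data(), out.size());
    }

Each recv() call may return anywhere from 1 byte up to what you asked for; the loop is what turns the byte stream back into messages.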
I'll chime in here and say that there's pretty much nothing you can do to solve your issue without adding extra dependencies on libraries which handle application protocols for you. There are some lower-level message-packing libraries (Google's Protocol Buffers, among others) which may help.
It's probably most beneficial to get used to reading and writing TCP data in a loop. It's proven and very portable, even if you pay a small price in writing the streaming codecs yourself.
Try it a few times. It's a useful experience which you can re-use, and it's really not as difficult and annoying once you get the hang of it (like anything else, really).
Furthermore, it's fairly easy to unit-test (rather than dealing with esoteric libraries and uncommon protocols with badly or sparsely documented options).
You can optimize socket reads to return larger chunks, on platforms that support it, by setting a low watermark using setsockopt() and SO_RCVLOWAT. But you will still have to handle the possibility of getting fewer bytes than the watermark.
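A minimal sketch of that option (POSIX; whether setting it is honoured is platform-dependent):

    // Sketch (POSIX): ask the kernel not to wake a blocking recv() until at least
    // `lowat` bytes are available, on platforms that honour setting this option.
    // recv() can still return fewer bytes than the watermark.
    #include <sys/socket.h>

    bool set_recv_low_watermark(int fd, int lowat)
    {
        return setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT, &lowat, sizeof(lowat)) == 0;
    }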
I think you want SOCK_SEQPACKET (or possibly SOCK_RDM). See socket(2).

Protection against cracking - specifically ways to make a program harder to decompile

I'm making a commercial product that will have a client and a server side. The client is totally dependent on the server, just to make it harder to crack/pirate. The problem is, even so, there is a chance that someone will reverse engineer the protocol and make their own server.
I've thought about encrypting the connection, either with SSL or with another algorithm, so it won't be so easy to figure out the protocol just from sniffing the traffic between the client and the server.
Now the only thing I can think of that pirates would use is to decompile the program, remove the encryption and try to see the "plain text" protocol in order to reverse engineer it.
I have read previous topics and I know that it's impossible to make it impossible to crack, but what tweaks can we programmers bring to our code to make it a huge headache for crackers?
Read how Skype did it:
The binary is decrypted into memory at startup.
The import table is overwritten.
The startup code is erased from memory.
Code integrity checks bust most debuggers: at random points in the code it computes a checksum of some other chunk of code and uses the checksum for an indirect jump to the next instruction. (Explanation: most debuggers implement breakpoints by changing the instruction at the breakpoint address. This check detects that.)
If a debugger is detected, it scrambles the registers and jumps to a random page.
It obfuscates code: call destination addresses are dynamically computed; there are dummy branches that are never executed; it raises SEH exceptions where the handler sets some registers and resumes execution.
Keep in mind that these or other techniques make reverse engineering harder, but not impossible. Also, you should never rely on any of these for security.
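As a rough, heavily simplified illustration of the code-integrity idea above (the function name, length, and expected value are made up; real protectors generate and hide these at build time):

    // Rough illustration of a code-integrity check: hash the bytes of a function
    // at run time and compare with a value recorded at build time. check_me,
    // kLength, and kExpected are made up; a software breakpoint (0xCC) written
    // into the function would change the checksum.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    void check_me()                            // the function we pretend to protect
    {
        std::puts("doing real work");
    }

    static std::uint32_t checksum(const std::uint8_t* p, std::size_t n)
    {
        std::uint32_t h = 2166136261u;         // FNV-1a
        for (std::size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 16777619u; }
        return h;
    }

    int main()
    {
        const std::size_t   kLength   = 64;    // assumed size of check_me's machine code
        const std::uint32_t kExpected = 0;     // placeholder; record the real value at build time
        const std::uint8_t* code =
            reinterpret_cast<const std::uint8_t*>(reinterpret_cast<std::uintptr_t>(&check_me));
        std::uint32_t h = checksum(code, kLength);
        if (kExpected != 0 && h != kExpected)
            return 1;                          // code was patched: bail out
        check_me();
    }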
IMO your best option is to design your servers to provide some useful functionality (SaaS). Your clients will essentially be paying for using that functionality. If your client app is dumb enough, you won't care about it being open source.
One thing you need to be aware of is that most packers/cryptors cause false positives with virus scanners. And that can be pretty annoying, because people complain all the time that your software contains a virus (they don't get the concept of false positives).
And for protocol obfuscation, don't rely on SSL: it is trivial for an attacker to intercept the plaintext at the point where you call Send with it. Use SSL for securing the connection, and obfuscate the data before sending it. The obfuscation algorithm doesn't need to be cryptographically secure.
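A minimal sketch of that kind of lightweight obfuscation (the LCG keystream is an arbitrary choice and deliberately not cryptographically secure):

    // Sketch: lightweight application-level obfuscation applied before the data
    // goes over the already SSL-protected connection. Not cryptographically
    // secure; it only keeps the plaintext out of a trivial Send() hook.
    #include <cstdint>
    #include <vector>

    std::vector<std::uint8_t> obfuscate(std::vector<std::uint8_t> data, std::uint32_t seed)
    {
        std::uint32_t state = seed;
        for (std::uint8_t& b : data) {
            state = state * 1664525u + 1013904223u;     // simple LCG keystream
            b ^= static_cast<std::uint8_t>(state >> 24);
        }
        return data;    // running it again with the same seed restores the input
    }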
This might be helpful: http://www.woodmann.com/crackz/Tutorials/Protect.htm
IMHO, it's difficult to hide the actual plain code. What most packers do is to make it difficult to patch. However, in your case, Themida could do the trick.
Here are some nice tips about writing a good protection: http://www.inner-smile.com/nocrack.phtml

C++ Socket Server - Unable to saturate CPU

I've developed a mini HTTP server in C++, using boost::asio, and now I'm load testing it with multiple clients, and I've been unable to get close to saturating the CPU. I'm testing on an Amazon EC2 instance and getting about 50% usage of one CPU, 20% of another, and the remaining two are idle (according to htop).
Details:
The server fires up one thread per core
Requests are received, parsed, processed, and responses are written out
The requests are for data, which is read out of memory (read-only for this test)
I'm 'loading' the server using two machines, each running a Java application with 25 threads sending requests
I'm seeing about 230 requests/sec throughput (this is application requests, which are composed of many HTTP requests)
So, what should I look at to improve this result? Given the CPU is mostly idle, I'd like to leverage that additional capacity to get a higher throughput, say 800 requests/sec or whatever.
Ideas I've had:
The requests are very small and often fulfilled in a few ms; I could modify the client to send/compose bigger requests (perhaps using batching)
I could modify the HTTP server to use the select() design pattern; is this appropriate here?
I could do some profiling to try to understand what the bottlenecks are
boost::asio is not as thread-friendly as you would hope - there is a big lock around the epoll code in boost/asio/detail/epoll_reactor.hpp which means that only one thread can call into the kernel's epoll syscall at a time. And for very small requests this makes all the difference (meaning you will only see roughly single-threaded performance).
Note that this is a limitation of how boost::asio uses the Linux kernel facilities, not necessarily the Linux kernel itself. The epoll syscall does support multiple threads when using edge-triggered events, but getting it right (without excessive locking) can be quite tricky.
BTW, I have been doing some work in this area (combining a fully-multithreaded edge-triggered epoll event loop with user-scheduled threads/fibers) and made some code available under the nginetd project.
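For reference, a minimal sketch of the edge-triggered, one-shot registration that makes multi-threaded epoll_wait() workable on Linux (the one-shot flag is one common way to keep two threads from handling the same socket; re-arming and the wait loop itself are omitted):

    // Sketch (Linux): register a socket with an epoll instance for edge-triggered,
    // one-shot notifications, so several threads can call epoll_wait() on the same
    // epoll fd without two of them picking up the same ready socket.
    #include <sys/epoll.h>

    bool watch_socket(int epollFd, int sockFd)
    {
        epoll_event ev{};
        ev.events = EPOLLIN | EPOLLET | EPOLLONESHOT;  // must be re-armed after handling
        ev.data.fd = sockFd;
        return epoll_ctl(epollFd, EPOLL_CTL_ADD, sockFd, &ev) == 0;
    }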
As you are using EC2, all bets are off.
Try it using real hardware, and then you might be able to see what's happening. Trying to do performance testing in VMs is basically impossible.
I have not yet worked out what EC2 is useful for; if someone finds out, please let me know.
From your comments on network utilization, you do not seem to have much network movement: 3 + 2.5 MiB/sec is in the 50 Mbps ball-park (compared to your 1 Gbps port).
I'd say you are having one of the following two problems:
Insufficient work-load (a low request rate from your clients)
Blocking in the server (something interfering with response generation)
Looking at cmeerw's notes and your CPU utilization figures (idling at 50% + 20% + 0% + 0%), it seems most likely a limitation in your server implementation.
I second cmeerw's answer (+1).
230 requests/sec seems very low for such simple async requests. As such, using multiple threads is probably premature optimisation - get it working properly and tuned in a single thread, and see if you still need them. Just getting rid of un-needed locking may get things up to speed.
This article has some detail and discussion on I/O strategies for web server-style performance circa 2003. Anyone got anything more recent?
ASIO is fine for small to medium tasks, but it isn't very good at leveraging the power of the underlying system. Neither are raw socket calls, or even IOCP on Windows, but if you are experienced you will always do better than ASIO. Either way there is a lot of overhead with all of those methods, just more with ASIO.
For what it is worth, using raw socket calls, my custom HTTP server can serve 800K dynamic requests per second on a 4-core i7. It is serving from RAM, which is where you need to be for that level of performance. At this level of performance the network driver and OS are consuming about 40% of the CPU. Using ASIO I can get around 50 to 100K requests per second; its performance is quite variable and mostly bound in my app. The post by cmeerw mostly explains why.
One way to improve performance is by implementing a UDP proxy. By intercepting HTTP requests and then routing them over UDP to your backend UDP-HTTP server, you can bypass a lot of TCP overhead in the operating system stacks. You can also have front ends which pipe through on UDP themselves, which shouldn't be too hard to do yourself. An advantage of an HTTP-UDP proxy is that it allows you to use any good frontend without modification, and you can swap them out at will without any impact. You just need a couple more servers to implement it. This modification on my example lowered the OS CPU usage to 10%, which increased my requests per second to just over a million on that single backend. And FWIW, you should always have a frontend-backend setup for any performant site, because the frontends can cache data without slowing down the more important dynamic-request backend.
The future seems to be writing your own driver that implements its own network stack, so you can get as close to the requests as possible and implement your own protocol there. That probably isn't what most programmers want to hear, as it is more complicated. In my case I would be able to use 40% more CPU and move to over 1 million dynamic requests per second. The UDP proxy method can get you close to optimal performance without needing to do this, though you will need more servers. But if you are doing this many requests per second, you will usually need multiple network cards and multiple frontends to handle the bandwidth anyway, so having a couple of lightweight UDP proxies in there isn't that big a deal.
Hope some of this can be useful to you.
How many instances of io_service do you have? Boost.Asio has an example that creates an io_service per CPU and uses them in a round-robin manner.
You can still create four threads and assign one per CPU, but each thread can poll on its own io_service.
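A minimal sketch of that layout using the older io_service API (the round-robin hand-off of new connections is only hinted at in a comment):

    // Sketch (Boost.Asio, legacy io_service API): one io_service per core, each
    // driven by its own thread. New connections would be assigned round-robin.
    #include <algorithm>
    #include <boost/asio.hpp>
    #include <memory>
    #include <thread>
    #include <vector>

    int main()
    {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());

        std::vector<std::unique_ptr<boost::asio::io_service>> services;
        std::vector<std::unique_ptr<boost::asio::io_service::work>> work;
        for (unsigned i = 0; i < n; ++i) {
            services.emplace_back(new boost::asio::io_service);
            // keep run() from returning while a service has no pending handlers yet
            work.emplace_back(new boost::asio::io_service::work(*services.back()));
        }

        std::vector<std::thread> threads;
        for (unsigned i = 0; i < n; ++i)
            threads.emplace_back([&services, i] { services[i]->run(); });

        // ... accept connections and hand each one to services[next++ % n] ...

        for (auto& t : threads) t.join();
    }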