Boost asio::async_write sending thousands of small packets - c++

In my application I have to send about 10 kB/s per connection. All packets are put in a std::deque. One thread iterates over the deque and sends packet data via asio::async_write.
My question is: how many connections can I handle simultaneously in one thread? Can I send, say, 20 Mbytes/s?

The website of Boost.Asio's author, Christopher Kohlhoff, has a performance page. Looking at the Linux-perf-11 graph, he gets a throughput of ~300 Mb/sec with 1000 connections on a single CPU, which is well above your target of 10 kB/s per connection. For scale, 20 Mbytes/s at 10 kB/s per connection works out to about 2000 simultaneous connections.
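As an aside, since only one async_write may be outstanding per socket at a time, the usual shape for "deque of packets per connection" is a chained write queue. Below is a minimal sketch of that pattern; the Connection class and its names are illustrative, and it assumes a single io_context thread so the deque needs no locking:

```cpp
// Minimal sketch of a per-connection write queue: each async_write completion
// handler pops the deque and starts the next write, so at most one write is
// ever in flight on the socket.
#include <boost/asio.hpp>
#include <deque>
#include <memory>
#include <string>

class Connection : public std::enable_shared_from_this<Connection> {
    boost::asio::ip::tcp::socket socket_;
    std::deque<std::string> outbox_;   // packets waiting to go out
public:
    explicit Connection(boost::asio::ip::tcp::socket socket)
        : socket_(std::move(socket)) {}

    void send(std::string packet)
    {
        bool idle = outbox_.empty();
        outbox_.push_back(std::move(packet));
        if (idle) write_next();        // kick the chain only when it was idle
    }
private:
    void write_next()
    {
        auto self = shared_from_this();
        boost::asio::async_write(socket_, boost::asio::buffer(outbox_.front()),
            [self](boost::system::error_code ec, std::size_t /*bytes*/) {
                if (ec) return;        // real code: close/cleanup here
                self->outbox_.pop_front();
                if (!self->outbox_.empty()) self->write_next();
            });
    }
};
```

If other threads produce packets, post send() through the connection's executor so the deque is never touched concurrently.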

Related

How to create multiple TCP connections within 1 gRPC stream

I'm using a gRPC stream to transfer data from server to client, and am suffering low throughput. One thing specific to my case: my client only sends control messages (e.g., start, pause, resume), and the server streams messages back until the end.
One thing I think could help is parallelization, to make full use of the bandwidth.
But one consideration is that the messages sent are ordered, which means that if I open multiple gRPC streams, I have no way to tell their order.
My question is: is there a way in gRPC to open multiple TCP connections?
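One commonly suggested workaround, sketched below under the assumption that channels whose channel arguments differ do not share a subchannel (and therefore each get their own TCP connection), is to open several channels; the argument key used here is arbitrary and purely illustrative:

```cpp
// Hedged sketch: gRPC multiplexes streams over one TCP connection per channel,
// so creating several channels with distinct channel args is a common way to
// force distinct connections.
#include <grpcpp/grpcpp.h>
#include <memory>
#include <string>
#include <vector>

std::vector<std::shared_ptr<grpc::Channel>>
make_channels(const std::string& target, int count)
{
    std::vector<std::shared_ptr<grpc::Channel>> channels;
    for (int i = 0; i < count; ++i) {
        grpc::ChannelArguments args;
        // Distinct args => distinct subchannels => distinct TCP connections
        // (key name is illustrative, any unique per-channel arg works).
        args.SetInt("grpc.force_distinct_channel", i);
        channels.push_back(grpc::CreateCustomChannel(
            target, grpc::InsecureChannelCredentials(), args));
    }
    return channels;
}
```

For the ordering concern, the usual approach is to tag each message with a sequence number on the server and reorder on the client.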

Video stream server C++

I am trying to develop a multi-threaded server to handle video streams. Multiple users can be connected at the same time. Each user sends a frame from his camera to the server, the server processes it, sends a response to the user, and so on. There were 2 options:
Use the UDP protocol. The server receives frames in the main thread, throws them into a thread pool for processing, and from there, when ready, sends a response to the client. This approach turned out badly: receiving a frame in the main thread takes a long time, the users fell out of sync, and clients hung.
Use the TCP protocol. The server listens for client connections in the main thread and hands the thread pool not the processing of a single frame, but the processing of all frames from that client. This approach solved the synchronization problem, but slowed the work down by a factor of 5-6. It is not clear why: the image processing happens on the GPU, which forces all frames to be processed sequentially anyway, so the thread pool does not speed up the calculations; it only provides multiuser support. So I concluded that the slowdown is due to TCP (on localhost there was almost no change in running time - a difference of about 1 fps for TCP).
I would like to figure out whether such a slowdown is really possible and what to do about it in this case.
All frames are sent uncompressed; this was chosen empirically, because compression took longer than sending the raw frames.
P.S. It is very disappointing when the main algorithm runs in 14 ms and all the optimization effort went into it, while the bottleneck turned out to be data transfer - about 250-500 ms.
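For what it's worth, one commonly suggested culprit for hundreds-of-milliseconds TCP stalls in request/response loops is Nagle's algorithm interacting with delayed ACKs. Below is a minimal sketch of ruling it out, assuming the server uses Boost.Asio (with raw BSD sockets the equivalent is setsockopt with TCP_NODELAY); the buffer sizes are illustrative:

```cpp
// Minimal sketch: disable Nagle's algorithm, a frequent cause of large stalls
// when request/response pairs ping-pong over TCP, and enlarge kernel buffers
// for large uncompressed frames.
#include <boost/asio.hpp>

void configure_stream_socket(boost::asio::ip::tcp::socket& socket)
{
    // Send writes immediately instead of waiting to coalesce small segments.
    socket.set_option(boost::asio::ip::tcp::no_delay(true));

    // Optional: room for whole uncompressed frames (1 MiB here, illustrative).
    socket.set_option(boost::asio::socket_base::send_buffer_size(1 << 20));
    socket.set_option(boost::asio::socket_base::receive_buffer_size(1 << 20));
}
```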

Queueing or not queueing for low latency

I'm writing a low-latency program in C++ which receives data from a source, processes the data, and sends it to a target via a TCP socket. I have a separate thread for each of these 3 modules: a receiver thread, a processor thread, and a sender thread. The threads communicate through lock-free queues.
Do you think that sending the message directly and not using the queue for the sender part would give lower latency? Does it affect performance stability?
Thanks
If the three threads are pinned to different physical cores, having a separate sender thread will give lower latency than having the processor thread do the send operation, especially if retries happen during the send. Even if it is a best-effort async send, you still save the marginal time it takes to write to the socket.
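A minimal sketch of that handoff, assuming Boost.Lockfree's spsc_queue, a POSIX socket, and a hypothetical Message type (the real program's types and error handling will differ):

```cpp
// Processor -> sender handoff over a wait-free single-producer/single-consumer
// queue: the processor never blocks on the socket, and the sender absorbs any
// retries on its own core.
#include <boost/lockfree/spsc_queue.hpp>
#include <sys/socket.h>
#include <atomic>

struct Message { char payload[64]; };   // placeholder for the real message type

boost::lockfree::spsc_queue<Message, boost::lockfree::capacity<4096>> tx_queue;
std::atomic<bool> running{true};

// Processor thread: enqueue and return immediately.
bool enqueue(const Message& msg) { return tx_queue.push(msg); }  // false if full

// Sender thread, pinned to its own core.
void sender_loop(int socket_fd)
{
    Message msg;
    while (running.load(std::memory_order_relaxed)) {
        while (tx_queue.pop(msg)) {
            ssize_t n = ::send(socket_fd, msg.payload, sizeof msg.payload, 0);
            (void)n;  // real code: handle partial sends / EAGAIN retries here
        }
        // Busy-spin rather than sleep: trades one core for minimum latency.
    }
}
```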

Multiple tcp sockets, one stalled

I'm trying to find a starting point for understanding what could cause a socket stall, and I would appreciate any insights you might have.
So, the server is a modern dual-socket Xeon (2 x 6 cores @ 3.5 GHz) running Windows 2012. In a single process there are 6 blocking TCP sockets with default options, each running on its own thread (not NUMA/core pinned). 5 of them are connected to the same remote server and receive very heavy loads (hundreds of thousands of small ~75-byte msgs per second). The last socket is connected to a different server with a very light send/receive load for administrative messaging.
The problem I ran into was a 5-second stall on the admin messaging socket. Multiple send calls to the socket returned successfully; however, nothing was received from the remote server (a protocol ack should arrive within milliseconds), nor was anything received BY the remote admin server, for 5 seconds. It was as if that socket just turned off for a bit. After the 5-second stall passed, all of the acks came in a burst, and afterwards everything continued normally. During this, the other sockets were receiving much higher numbers of messages than normal, yet there was no indication of any interruption or stall, as the data logs showed nothing unusual (light logging, maybe 500 msgs/sec).
From what I understand, the socket send call does not ensure that data has gone out on the wire, just that the transfer to the TCP stack was successful. So I'm trying to understand the different scenarios that could have caused a 5-second stall on the admin socket. Is it possible that, due to the large amount of data being received, the TCP stack was essentially overwhelmed and prioritized the sockets that were being most heavily utilized? What other situations could have potentially caused this?
Thank you!
If the sockets are receiving hundreds of thousands of 75-byte messages per second, there is a possibility that the server has hit the limit of some resource. Probably not bandwidth: at 100K messages/sec of ~75 bytes each, the payload is only about 7.5 MB/s (roughly 60 Mbit/s), well within a gigabit link even with protocol overhead. But it could be CPU utilization.
You should use two tools to understand your problem:
perfmon, to see the utilization of CPU (user and privileged: https://technet.microsoft.com/en-us/library/aa173932(v=sql.80).aspx), memory, bandwidth, and disk queue length. You can also check the number of interrupts and context switches with perfmon.
A sniffer like Wireshark, to see whether at the TCP level data is being transmitted and responses received.
Something else I would do is write a timestamp right after the send call and right before and after the read call in the thread in charge of the admin socket (see the sketch after this answer). Maybe it is a coding problem.
The fact that send calls return successfully doesn't mean the data was immediately sent. In TCP, the data is stored in the send buffer, and from there the TCP stack sends it to the other end.
If your system is CPU bound (perfmon will tell you whether this is true), then you should pay attention to the comments written by @EJP; this is something that can happen when the machine is under heavy load. With the tools I mentioned, you can see whether the receive window on the admin socket has closed, or whether it is just that the socket read on the admin socket is taking a long time.
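A minimal sketch of that timestamping, assuming POSIX-style blocking send/recv calls (Winsock's send/recv behave equivalently here); socket_fd and the buffer handling are illustrative:

```cpp
// Bracket the admin socket's send/recv calls with timestamps to rule out a
// coding problem: if send/recv themselves return quickly, the stall is in the
// network stack or the peer, not in this code.
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <sys/socket.h>

using clock_type = std::chrono::steady_clock;

static double ms_since(clock_type::time_point t0)
{
    return std::chrono::duration<double, std::milli>(clock_type::now() - t0).count();
}

void admin_send(int socket_fd, const char* data, std::size_t len)
{
    auto t0 = clock_type::now();               // right before the send call
    ssize_t sent = ::send(socket_fd, data, len, 0);
    std::printf("send returned %zd after %.3f ms\n", sent, ms_since(t0));
}

ssize_t admin_recv(int socket_fd, char* buf, std::size_t len)
{
    auto t0 = clock_type::now();               // right before the read call
    ssize_t got = ::recv(socket_fd, buf, len, 0);
    std::printf("recv returned %zd after %.3f ms\n", got, ms_since(t0));
    return got;
}
```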

How to design a client-server architecture

I'd like to know a server (TCP-based) architecture that supports a large number of clients (at least 10K) to implement a FIX server. My points are:
How do we design it?
How do we listen on the open port? Use select, or poll, or some other function?
How do we process the clients' responses? At large scale we cannot create one thread per client.
Should the response processing live in a different executable, sharing requests and responses with the server executable through IPC?
There is much more to it. I would appreciate it if anyone could explain this or provide any links.
Thanks
An excellent resource for information on this topic is The C10K problem. Although the numbers there are a little dated, the techniques are still applicable today.
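For the select/poll question specifically, the C10K-era answer on Linux is epoll. A minimal sketch of a single-threaded event loop, with error handling omitted and names illustrative:

```cpp
// One thread monitoring many sockets with epoll: the listening socket and all
// client sockets share one epoll instance.
#include <sys/epoll.h>
#include <unistd.h>

void event_loop(int listen_fd)
{
    int ep = epoll_create1(0);

    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event events[1024];
    for (;;) {
        int n = epoll_wait(ep, events, 1024, -1);   // one thread, many sockets
        for (int i = 0; i < n; ++i) {
            if (events[i].data.fd == listen_fd) {
                // New client: accept and register it for read events.
                int client = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {
                char buf[4096];
                ssize_t got = read(events[i].data.fd, buf, sizeof buf);
                if (got <= 0) close(events[i].data.fd);  // peer gone
                // else: hand buf off to a worker queue (see the answer below)
            }
        }
    }
}
```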
The architecture depends on what you want to do with the clients' incoming data. My guess is that for every incoming message you will perform some computation and probably also return a response.
In that case I would create 1 main listener thread that receives all the incoming messages. (Actually, if your hardware has more than 1 physical network device, I would use a listener thread per device and make sure each one is listening to a specific device.)
Get the number of CPUs on your machine, create a worker thread for each CPU, and bind each thread to one CPU, as in the sketch below. (The number of worker threads should perhaps be num_of_cpus - 1, to leave an available CPU for the listener and dispatcher.)
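A minimal sketch of that pinning, assuming Linux and pthreads; worker_main stands in for the queue-draining loop described next:

```cpp
// Start one worker thread per CPU (leaving CPU 0 free for the listener and
// dispatcher) and pin each thread to its CPU.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

void worker_main(unsigned /*cpu*/)
{
    /* pop requests from this worker's queue, compute, push responses */
}

std::vector<std::thread> start_pinned_workers()
{
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned cpu = 1; cpu < n; ++cpu) {
        workers.emplace_back(worker_main, cpu);
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(workers.back().native_handle(), sizeof set, &set);
    }
    return workers;
}
```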
Each thread has a queue and a semaphore; the main listener thread just pushes the incoming data into those queues. There are many ways to perform load balancing (more on that later).
Each worker thread just works on the requests given to it and puts the response on another queue that is read by the dispatcher.
The dispatcher - there are 2 options here: use a dedicated thread for the dispatcher (or a thread per network device, as for the listeners), or have the dispatcher actually be the same thread as the listener.
There is some advantage to putting both on the same thread, since that makes it easier to detect lost socket connections and lets you use the same fds for both reading and writing without thread synchronization. However, it could be that using 2 different threads gives better performance; it needs to be tested.
Note about load balancing:
This is a topic of its own.
The simplest thing is to use 1 queue for all worker threads, but then they have to lock in order to pop items, and the locking can hurt performance. (But you get the most balanced load.)
Another quite simple approach is to give every worker a private queue and insert round-robin. After every X cycles, check the sizes of all the queues; if some queues are much larger than the others, leave them out for the next X cycles and then recheck them. This is not the best approach, but it is simple to implement and gives some load balancing with no locking needed.
By the way - there is a way to implement a queue between 2 threads without blocking (sketched below), but that is also another topic.
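For completeness, a minimal sketch of such a non-blocking queue, assuming a single producer and a single consumer and a power-of-two capacity:

```cpp
// Lock-free single-producer/single-consumer ring buffer: one atomic index per
// side, monotonically increasing, masked into the buffer on access.
#include <atomic>
#include <cstddef>

template <typename T, std::size_t CapacityPow2>
class SpscRing {
    static_assert((CapacityPow2 & (CapacityPow2 - 1)) == 0,
                  "capacity must be a power of two");
    T buffer_[CapacityPow2];
    std::atomic<std::size_t> head_{0};   // advanced by the consumer
    std::atomic<std::size_t> tail_{0};   // advanced by the producer
public:
    bool push(const T& item)             // producer thread only
    {
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail - head_.load(std::memory_order_acquire) == CapacityPow2)
            return false;                // full
        buffer_[tail & (CapacityPow2 - 1)] = item;
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& item)                    // consumer thread only
    {
        std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire))
            return false;                // empty
        item = buffer_[head & (CapacityPow2 - 1)];
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
};
```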
I hope it helps,
Guy
If the client and server are on a secure network, the security aspect can be minimal - to the extent that the transfers are encrypted. If the clients and the server are not on a secure network, you first want the server and client to authenticate each other and then initiate encrypted data transfer. For the data transfer, server-side authentication should suffice. At the end of this authentication, use the session key to generate an encrypted (symmetric) data stream. Consider using TFTP; it is simple to implement and scales reasonably well.