Methodologies to handling multiple clients in C++ winsock - c++

I'm developing a peer to peer message parsing application. So one peer may need to handle many clients. And also there is a possibility to send and receive large data (~20 MB data as one message). There can be situations like many peers send large data to the same peer. I heard there are many solutions to handle these kind of a situation.
Use thread per peer
Using a loop to go through the peers and if there are data we can
recive
Using select function
etc.
What is the most suitable methodology or most common and accepted way to handle these kind of a situation? Any advice or hint are welcome.
Updated: Is there a good peer to peer distributed computing library or framework for C++ on windows platform

Don't use a thread per peer; past the number of processors, additional threads is likely only to hurt performance. You'd also have been expected to tweak the dwStackSize so that 1000 idle peers doesn't cost you 1000MB of RAM.
You can use a thread-pool (X threads handling Y sockets) to get a performance boost (or, ideally, IO Completion Ports), but this tends to work incredibly well for certain kinds of applications, and not at all for other kinds of applications. Unless you're certain that yours is suited for this, I wouldn't justify taking the risk.
It's entirely permissible to use a single thread and poll/send from a large quantity of sockets. I don't know precisely when large would have a concernable overhead, but I'd (conservatively) ballpark it somewhere between 2k-5k sockets (on below average hardware).
The workaround for WSAEWOULDBLOCK is to have a std::queue<BYTE> of bytes (not a queue of "packet objects") for each socket in your application (you populate this queue with the data you want to send), and have a single background-thread whose sole purpose is to drain the queues into the respective socket send (X bytes at a time); you can use blocking socket for this now (since it's a background-worker), but if you do use a non-blocking socket and get WSAEWOULDBLOCK you can just keep trying to drain the queue (here it won't obstruct the flow of your application).

You could use libtorrent.org which is built on top of boost (boost-asio ). It's focusing on efficiency and scalability.
I have not much experience in developing a socket in C++ but in C# I had really good experience accepting connections asynchronously and pass them to an own thread from a threadpool.

Related

Is there a better way to use asynchronous TCP sockets in C++ rather than poll or select?

I recently started writing some C++ code that uses sockets, which I'd like to be asynchronous. I've read many posts about how poll and select can be used to make my sockets asynchronous (using poll or select to wait for a send or recv buffer), but on my server side I have an array of struct pollfd, where every time the listening socket accepts a connection, it adds it to the array of struct pollfd so that it can monitor that socket's recv (POLLIN).
My problem is that if I have 5000 sockets connected to my listening socket on my server, then the array of struct pollfd would be of size 5000, since it would be monitoring all the connected sockets BUT the only way I know how to check if a recv for a socket is ready, is by looping through all the items in the array of struct pollfd to find the ones whose revents equals POLLIN. This just seems kind of inefficient, when the number of connected sockets because very large. Is there a better way to do this?
How does the boost::asio library handle async_accept, async_send, etc...? How should I handle it?
What the heck, I will go ahead and write up an answer.
I am going to ignore the "asynchronous" vs "non-blocking" terminology because I believe it is irrelevant to your question.
You are worried about performance when handling thousands of network clients, and you are right to be worried. You have rediscovered the C10K problem. Back when the Web was young, people saw a need for a small number of fast servers to handle a large number of (relatively) slow clients. The existing select/poll type interfaces require linear scans -- in both kernel and user space -- across all sockets to determine which are ready. If many sockets are often idle, your server can wind up spending more time figuring out what work to do than doing actual work.
Fast-forward to today, where we have basically two approaches for dealing with this problem:
1) Use one thread per socket and just issue blocking reads and writes. This is usually the simplest to code, in my opinion, and modern operating systems are quite good at letting idle threads sleep peacefully out of the way without imposing any significant performance overhead. In my experience, this approach works very well for hundreds of clients; I cannot personally say how it will work for thousands.
2) Use one of the platform-specific interfaces that were introduced to tackle the C10K problem. That means epoll (Linux), kqueue (BSD/Mac), or completion ports (Windows). (If you think epoll is the same as poll, look again.) All of these will only notify your application about sockets that are actually ready, avoiding the wasteful linear scan across idle connections. There are several libraries that make these platform-specific interfaces easier to use, including libevent, libev, and Boost.Asio. You will find that all of them ultimately invoke epoll on Linux, kqueue on BSD, and so on, whenever such interfaces are available.

Can single-buffer blocking WSASend deliver partial data?

I've pretty much always used send() with sockets and now I'm moving onto the WSA functions. With send(), I have a sendall() helper that ensured all data is delivered even if it didn't happen in one try and a partial send occurred on first call.
So, instead of learning the hard way or over-complicating code when I don't have to, decided to ask you:
Can a blocking WSASend() send partial data or does it send everything before it returns or fails? Or should I check the bytes sent vs. expected to send and keep at it until everything is delivered?
ANSWER: Overlapped WSASend() does not send partial data but if it does, it means the connection has terminated. I've never encountered the case yet.
From the WSASend docs:
If the socket is non-blocking and stream-oriented, and there is not sufficient space in the transport's buffer, WSASend will return with only part of the application's buffers having been consumed. Given the same buffer situation and a blocking socket, WSASend will block until all of the application buffer contents have been consumed.
I haven't tried this behavior though. BTW, why do you rewrite your code to use WSA functions? Switching from standard bsd socket api just to use the socket basically with the same blocking behavior doesn't really seem to be a good idea for me. Just keep the old blocking code with send with the "retry code", this way its portable and bulletproof. It is not saving 1-2 comparisons is that makes your IO code performant.
Switch to specialized WSA functions only if you are trying to exploit some windows specific strengths, or if you want to use for non-blocking sockets with WSAWaitForMultipleObjects that is a bit better than the standard select but even in that case you can simply go with send and recv as I did it.
In my opinion using epoll/kqueue/iocp (or a library that abstracts these away) with sockets are the way to go. There are some very basic tasks that can be done with blocking sockets but if you cross the line and you need nonblocking socks then switching straight to epoll/kqueue/iocp is the way to go instead of programming painful select or WSAWaitForMultipleObjects based apis. epoll/kqueue/iocp are not only better but also easier to program than the select based alternatives. Really. They are more modern apis that were invented based on more experience. (Although they are not crossplatform, but even select has portability issues...).
The previously mentioned apis for linux/bsd/windows are based on the same concept but in my opinion the simplest and easiest to learn is the epoll api of linux. It is ways better than a select call but its 100x easier to program once you get the idea. If you start using IOCP on windows than it my seem a bit more complicated.
If you haven't yet used these apis then definitely give epoll a go if you are familiar with linux and then on windows implement the same with IOCP that is based on a similar concept with a bit more complicated overlapped IO programming. With IOCP you will have a reason for using WSASend because you can not start overlapped IO on a socket with send but you can do that with WSASend (or WriteFile).
EDIT: If you are going for max performance with IOCP then here are some additional hints:
Drop blocking operations. This is very important. A serious networking engine can not afford blocking IO. It simply doesn't scale on any of the platforms. Do overlapped operations for both send and receive, overlapped IO is the big gun of windows.
Setup a thread pool that processes the completed IO operations. Setup test clients that bomb your server with real-world-usage-like messages and parallel connection counts and under stress tweak the buffer sizes and thread counts for your actual target hardware.
Set the SO_RCVBUF and SO_SNDBUF sizes of your sockets to zero and play around with the size of the buffers that you are using to send and receive data. Setting the rcv/send buf of the socket handle to zero allows the tcp stack to receive/send data directly to/from your buffers avoiding an additional copy between your userspace buffers and the socket buffers. The optimal size for these buffers is also subject to tweaking. I usually use at least a few ten K buffers sizes but sometimes in case of large volume transfers 1-2M buffer sizes are better depending on the number of parallel busy connections. Again, tweak the values while stressing the server with some test clients that do activity similar to real world clients. When you are ready with the first working version of your network engine on top of it lets build a test client that can simulate many (maybe thousands of) parallel clients depending on the real world usage of your server.
You will need "per connection software send buffers" inside your network engine and you may (or may not) want to control the max size of the send buffers. In case of reaching the max send buffer size you may want to block or discard messages/data depending on what you want to do, encapsulate this special buffer and provide two nice interfaces to it: one for the threads that are putting data into this buffer and another interface that is used by the IOCP sender code. This buffer is usually a very critical part of the whole thing and I usually had a lot of bugs around this part of the code so make sure to design its interface nicely to minimize the number of bugs. Depending on how your application constructs and puts messages into the queue you can play around a lot with the internal implementation (size of storage chunks, nagle-like optimizations, ...).

Designing a multi-client tcp server to process data

I am attempting to rewrite my current project to include more features and stability, and need some help designing it. Here is the jist of it (for linux):
TCP_SERVER receives connection (auth packet)
TCP_SERVER starts a new (thread/fork) to handle the new client
TCP_SERVER will be receiving many packets from client > which will be added to a circular buffer
A separate thread will be created for that client to process those packets and build a list of objects
Another thread should be created to send parts of the list of objects to another client
The reason to separate all the processing into threads is because server will be getting many packets and the processing wont be able to keep up (which needs to be quick, as its time sensitive) (im not sure if tcp will drop packets if the internal buffer gets too large?), and another thread to send to another client to keep the processing fast as possible.
So for each new connection, 3 threads should be created. 1 to receive packets, 1 to process them, and 1 to send the processed data to another client (which is technically the same person/ip just on a different device)
And i need help designing this, as how to structure this, what to use (forks/threads), what libraries to use.
Trying to do this yourself is going to cause you a world of pain. Focus on your actual application, and leverage an existing socket handling framework. For example, you said:
for each new connection, 3 threads should be created
That statement says the following:
1. You haven't done this before, at scale, and haven't realized the impact all these threads will have.
2. You've never benchmarked thread creation or synchronous operations.
3. The number of things that can go wrong with this approach is pretty overwhelming.
Give some serious thought to using an existing library that does most of this for you. Getting the scaffolding right around this can literally take years, and you're better off focusing on your code rather than all the random plumbing.
The Boost C++ libraries seem to have a nice Async C++ socket handling infrastructure. Combine this with some of the existing C++ thread pools and you could likely have a highly performant solution up fairly quickly.
I would also question your use of C++ for this. Java and C# both do highly scalable socket servers pretty well, and some of the higher level language tooling (Spring, Guarva, etc) can be very, very valuable. If you ever want to secure this, via TLS or another mechanism, you'll also probably find this much easier in Java or C# than in C++.
Some of the major things you'll care about:
1. True Async I/O will be a huge perf and scalability win. Try really hard to do this. The boost asio library looks pretty nice.
2. Focus on your features and stability, rather than building a new socket handling platform.
3. Threads are expensive, avoid creating them. Thread pools are your friend.
You plan to create one-or-more threads for every connection your server handles. Threads are not free, they come with a memory and CPU overhead, and when you have many active threads you also begin to have resource contention.
What usage pattern do you anticipate? Do you expect that when you have 8 connections, all 8 network threads will be consuming 100% of a cpu core pushing/pulling packets? Or do you expect them to have a relatively low turn-around?
As you add more threads, you will begin to have to spend more time competing for resources in things like mutexes etc.
A better pattern is to have one or more thread for network io - most os'es have mechanisms for saying "tell me when one or more of these network connections has io" which is an efficiency saving over having lots of individual threads all doing the same thing for just one connection.
Then for actual processing, spin up a pool of worker threads to do actual work, allowing you to minimize the competition for resources. You can monitor work load to determine if you need to spin up more to meet delivery requirements.
You might also want to look into something to implement the network IO infrastructure for you; I've had really good performance results with libevent but then I've only had to deal with very high performance/reliability networking systems.

Can I use Boost::Asio and not worry about network programming problems?

I have to make a server in my current project but I don't have any or little experience in this area. My question is, can I just use Asio in my project and it will simply handle any problems a normal server has to face (partial reads, multithreading problems, ...)?
(My server will have to handle hundreds of clients at the same time)
ASIO takes care of the low-level socket programming and polling code. You still have to provide all the functionality to process raw network data. Ultimately, you get an unpredictable number of bytes from the network any time a read callback is called, and it is up to you to take those bytes and reconstruct your application message from them.
But indeed, as far as receiving an unspecified number of bytes is concerned, you won't have to worry about how that is implemented.
Multithreading is "easy" in the sense that you can run the ASIO processor multiple times concurrently, but it is your responsibility to provide a read callback that can deal with being run multiple times at once.
Asio is intentionally not multithreaded. It handles concurrency by multiplexing via the operating system's select(), kqueue, epoll, or other mechanism.
As for partial receives, there is no automatic way to get TCP to respect message boundaries. Asio can't do anything about that, so you'll need some technique at the application level to indicate completion. HTTP traditionally handles this by closing the socket when it's finished, though it's also possible to pre-send the size of the message.

C++ Sockets Send() Thread-Safety

I am coding sockets server for 1000 clients maxmimum, the server is about my game, i'm using non-blocking sockets and about 10 threads that receive data simultaneously from different sockets (first thread receives from 0-100,second from 101-200 and so on..)
but if thread 1 wants to send data to all 1000 clients and thread 2 also wants to send data to all 1000 clients at the same time, is that safe? are there any chances of the data being messed in the other (client) side?
if yes, i guess the only problem that can happen is that sometimes client would receive 2 or 10 packets as 1 packet, is that correct? if yes, is there any solution to that :(
The usual pattern of dealing with many sockets is to have a dedicated thread polling for I/O events with select(2), poll(2), or better kqueue(2) or epoll(4) (depending on the platform) acting as socket event dispatcher. The sockets are usually handled in non-blocking mode. Then one might have pool of threads reacting to the events and either do reads and writes directly or via lower level buffers/queues.
All sorts of techniques are applicable here - from queues to event subscription whiteboards. It gets tricky with multiplexing accepts/reads/writes/EOFs on the I/O level and with event arbitration on the application level. Several libraries like libevent and boost::asio help structure the lower level (the ACE library is also in this space, but I'd hate recommending it to anybody). You would have to come up with application-level protocols and state machines yourself (again boost::statechart might be of help).
Some good links to get better understanding of what you are up against (this is probably the millionth time they are mentioned here on SO):
The C10K problem
High-Performance Server Architecture
Apologies for not offering a concrete solution, but this is a very wide design question and most decisions depend heavily on the context (lots of fun though). Hope this helps a bit.
Since you are sending data using different sockets, there must not be any problem. Rather when these different threads access same data you have to ensure data integrity.
Are you using UDP or TCP sockets?
If UDP, each write should be encapsulated in a separate packet and should be carried to the other side intact. The order may be swapped (as it may for any UDP packet) but they should be whole.
If TCP, there's no concept of packets on the transport layer and any 10 writes on one side may be bundled up on the other side in one read. TCP writes may also only accept part of your buffer so even if the send() function is atomic, your write isn't necessarily. In this case you'd need to synchronize it.
send() is not atomic in most implementations, so sending to 1000 different sockets from multiple threads could lead to mixed-up messages arriving on the client side, and all kinds of weirdness. (I know nothing, see Nicolai's and Robert's comments below the rest of my comment still stands though (in terms of being a solution to your problem))
What I would do is use threads for sending like you use them for receiving. One thread to manage sending to one (or more) sockets that ensures that you don't write to one socket from multiple threads at the same time.
Also look here for some additional discussion and more interesting links.
If you're on windows, the winsock programmers faq is an invaluable resource, for your issue see here.