I am writing an application which works on 3 tiers, each having multiple instances and forming a mesh-like relationship.
E.g., let us consider a matrix-type situation:
L11 L12
L21 L22 L23 L24
L31 L32 L33
L11 has connections with L21, L22, L23, L24
L12 has connections with L21, L22, L23, L24
L21 has connections with L11, L12, L31, L32, L33 ... and so on
Presently I am using UNIX domain sockets for communication, with a select call (to determine where the data has come from). How does a timeout on the select call affect CPU usage?
Will it be better to spawn a new thread per connection, or persist with the select model? Typically the horizontal scalability at L1 may be around 5, at L2 around 5, and at L3 around 10.
Is there a penalty for the select system call if the bit fields passed to select are large?
In the case of a multithreaded application, should I go with blocking or non-blocking sockets?
Is there a difference between UDP and TCP in UNIX-domain sockets?
Note: I will be using systems with 12-24 cores ... probably this is what made me think: why not use separate threads instead of a select call?
Note 2: Will there be buffer overflows in Unix domain sockets? As far as I read, these are lossless and in-order ...
I am writing an application which works on 3 tiers, each having multiple instances and forming a mesh-like relationship.
Are all of the nodes in your Matrix running on the same host, or are they running on multiple hosts across a network?
How does a timeout on the select call affect CPU usage?
Only in the obvious way, in that every time your select() call times out you're going to have to go for another (otherwise maybe avoidable) spin around your event loop. So, for example always passing a timeout of zero (or near-zero) would be bad because then your thread would be wasting CPU cycles busy-looping. But the occasional timeout shouldn't be a problem. My recommendation is to be purely event-based as much as possible, and only use timeouts when they are unavoidable (e.g. because such-and-such an event has to happen at a particular time, unrelated to any I/O traffic).
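To make that concrete, here is a minimal sketch of such an event loop (the single descriptor and the 100ms timeout are arbitrary placeholder choices, not from the question):

#include <stdio.h>
#include <sys/select.h>

// Minimal sketch: an event loop that select()s on one descriptor with a
// modest timeout.  The 100ms period is an arbitrary placeholder value.
void event_loop(int fd)
{
   for (;;) {
      fd_set readSet;
      FD_ZERO(&readSet);
      FD_SET(fd, &readSet);

      struct timeval timeout;
      timeout.tv_sec  = 0;          // re-arm each pass; select() may modify it
      timeout.tv_usec = 100*1000;   // 100ms -- near-zero here would busy-loop

      int ret = select(fd+1, &readSet, NULL, NULL, &timeout);
      if (ret < 0) {perror("select"); break;}
      if (ret == 0) continue;       // timed out: one extra spin, no data to read
      if (FD_ISSET(fd, &readSet)) {
         // data is ready; recv() and handle it here
      }
   }
}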
Will it be better to spawn a new thread per connection, or persist with the select model? Typically the horizontal scalability at L1 may be around 5, at L2 around 5, and at L3 around 10.
For this scale of operation (dozens of connections but not hundreds) select() will work fine. It's only when you get up into hundreds or thousands of connections that select() starts to fall down, and even then you could move to something like kqueue() or epoll() rather than going multithreaded.
Is there a penalty for the select system call if the bit fields passed to select are large?
I'm not 100% sure what you mean by "the bit fields are large", but if you mean "there are lots of sockets specified", then the main penalty is that select() will only work with file descriptors whose integer values are less than FD_SETSIZE (typically 1024) -- which means, by implication, that select() also can't track more than FD_SETSIZE sockets at a time. (Exception: under Windows, select() will handle sockets with arbitrary integer values, although it still won't handle more than FD_SETSIZE sockets at a time.)
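To make the limit concrete, here is a sketch of the guard you'd want before adding a descriptor to an fd_set, since FD_SET() with a descriptor at or above FD_SETSIZE is undefined behavior on POSIX systems (function name is mine):

#include <stdio.h>
#include <sys/select.h>

// Sketch: refuse descriptors that select() cannot represent.
int AddToFDSet(int fd, fd_set * set)
{
   if ((fd < 0)||(fd >= FD_SETSIZE)) {   // FD_SETSIZE is typically 1024
      fprintf(stderr, "fd %i is out of range for select()\n", fd);
      return -1;
   }
   FD_SET(fd, set);
   return 0;
}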
In the case of a multithreaded application, should I go with blocking or non-blocking sockets?
If it were me, I would still use non-blocking I/O even in a multithreaded application, because blocking I/O makes certain things, like a clean shutdown of the application, problematic. For example, if you have a thread that is blocked inside recv(), how do you safely get that thread to exit so that you can clean up any shared resources? You can't send a signal because you don't know which thread would receive it, and you can't rely on recv() returning naturally within a finite amount of time either. On the other hand, if your target thread only ever blocks inside select(), you can send the thread a byte on one of the FDs it is select()'ing on to wake it up and tell it to go away.
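One common way to implement that wake-up byte is the self-pipe trick; a rough sketch (function names are mine, error checking omitted):

#include <unistd.h>
#include <sys/select.h>

// Self-pipe trick sketch: the network thread select()s on its socket plus the
// read end of a pipe; any other thread can wake it by writing one byte.
static int wakePipe[2];   // [0]=read end (select on this), [1]=write end

void SetupWakeup()     {pipe(wakePipe);}
void RequestShutdown() {char junk = 'S'; write(wakePipe[1], &junk, 1);}

// Returns 1 if (sockFD) has data ready, or 0 if we were told to go away
int WaitForData(int sockFD)
{
   fd_set readSet;
   FD_ZERO(&readSet);
   FD_SET(sockFD, &readSet);
   FD_SET(wakePipe[0], &readSet);
   const int maxFD = (sockFD > wakePipe[0]) ? sockFD : wakePipe[0];
   select(maxFD+1, &readSet, NULL, NULL, NULL);
   return FD_ISSET(wakePipe[0], &readSet) ? 0 : 1;
}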
Is there a difference between UDP and TCP in UNIX-domain sockets?
Unix-Domain sockets don't use UDP or TCP (since they never go across the network). They do use SOCK_STREAM and SOCK_DGRAM, though, which behave similarly (in most respects) to using TCP and UDP sockets to localhost.
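For reference, the choice is made by the second argument to socket(); a quick sketch:

#include <sys/socket.h>

// Sketch: the stream/datagram distinction for Unix-domain sockets
int streamSock = socket(AF_UNIX, SOCK_STREAM, 0);  // TCP-like: connected byte-stream
int dgramSock  = socket(AF_UNIX, SOCK_DGRAM,  0);  // UDP-like: preserves message boundaries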
Note: I will be using systems with 12-24 cores ... probably this is what made me think: why not use separate threads instead of a select call?
The question to ask yourself is, is your application likely to be compute-bound or I/O bound? If it's compute-bound (and the computations can be parallelized) you might benefit from farming out the computations to multiple threads. If it's I/O bound, OTOH, there won't be any advantage to going multithreaded, because in that case the bottleneck is likely to be your network card (and/or the network itself), and your gigabit-Ethernet jack is still going to be sending a maximum of 1Gb/sec whether it is fed by one thread or many. If all your communication is with other processes on the same host, OTOH, there might be some benefit to multiple threads, but my guess is that it would be minimal, since it looks like you will already be using all of your cores for the dozen or so processes that you have.
Note 2: Will there be buffer overflows in Unix domain sockets? As far as I read, these are lossless and in-order ...
There shouldn't be. SOCK_STREAM Unix-domain sockets work very much like TCP sockets in my experience. (I've never used SOCK_DGRAM Unix-domain sockets, but I would imagine they'd behave similarly to UDP sockets, and would even drop packets sometimes e.g. when the receiving process can't keep up with the sender)
Related
I recently started writing some C++ code that uses sockets, which I'd like to be asynchronous. I've read many posts about how poll and select can be used to make my sockets asynchronous (using poll or select to wait for a send or recv buffer), but on my server side I have an array of struct pollfd, where every time the listening socket accepts a connection, it adds it to the array of struct pollfd so that it can monitor that socket's recv (POLLIN).
My problem is that if I have 5000 sockets connected to my listening socket on my server, then the array of struct pollfd would be of size 5000, since it would be monitoring all the connected sockets. BUT the only way I know to check whether a recv for a socket is ready is by looping through all the items in the array of struct pollfd to find the ones whose revents equals POLLIN. This just seems kind of inefficient when the number of connected sockets becomes very large. Is there a better way to do this?
How does the boost::asio library handle async_accept, async_send, etc...? How should I handle it?
What the heck, I will go ahead and write up an answer.
I am going to ignore the "asynchronous" vs "non-blocking" terminology because I believe it is irrelevant to your question.
You are worried about performance when handling thousands of network clients, and you are right to be worried. You have rediscovered the C10K problem. Back when the Web was young, people saw a need for a small number of fast servers to handle a large number of (relatively) slow clients. The existing select/poll type interfaces require linear scans -- in both kernel and user space -- across all sockets to determine which are ready. If many sockets are often idle, your server can wind up spending more time figuring out what work to do than doing actual work.
Fast-forward to today, where we have basically two approaches for dealing with this problem:
1) Use one thread per socket and just issue blocking reads and writes. This is usually the simplest to code, in my opinion, and modern operating systems are quite good at letting idle threads sleep peacefully out of the way without imposing any significant performance overhead. In my experience, this approach works very well for hundreds of clients; I cannot personally say how it will work for thousands.
2) Use one of the platform-specific interfaces that were introduced to tackle the C10K problem. That means epoll (Linux), kqueue (BSD/Mac), or completion ports (Windows). (If you think epoll is the same as poll, look again.) All of these will only notify your application about sockets that are actually ready, avoiding the wasteful linear scan across idle connections. There are several libraries that make these platform-specific interfaces easier to use, including libevent, libev, and Boost.Asio. You will find that all of them ultimately invoke epoll on Linux, kqueue on BSD, and so on, whenever such interfaces are available.
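For a taste of what the epoll approach looks like on Linux, here is a minimal sketch (function name and the accept/recv handling are placeholders; error checking omitted):

#include <sys/epoll.h>

#define MAX_EVENTS 64

// Minimal Linux epoll sketch: register sockets once, then each wait returns
// only the descriptors that are actually ready -- no linear scan over idle fds.
void EpollLoop(int listenFD)
{
   int epollFD = epoll_create1(0);

   struct epoll_event ev;
   ev.events  = EPOLLIN;
   ev.data.fd = listenFD;
   epoll_ctl(epollFD, EPOLL_CTL_ADD, listenFD, &ev);

   struct epoll_event events[MAX_EVENTS];
   while(true) {
      const int numReady = epoll_wait(epollFD, events, MAX_EVENTS, -1);
      for (int i=0; i<numReady; i++) {
         const int fd = events[i].data.fd;
         if (fd == listenFD) {
            // accept() the new connection here and EPOLL_CTL_ADD it
         } else {
            // recv() from (fd); it is known to be ready
         }
      }
   }
}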
I understand that for most cases using threads in Qt networking is overkill and unnecessary, especially if you do it the proper way and use the readyRead() signal. However, my "client" application will have multiple sockets open (about 5) at one time. It is possible for there to be data coming in on all sockets at the same time. I am really not going to be doing any intense processing with the incoming data. Simply reading it in and then sending out a signal to update the GUI with the newly received data. Do you think a single thread application should be able to handle all of the data coming in?
I understand that I haven't shown you any code and that my description is pretty vague and it could very well depend on how it performs once implemented, but from a general design perspective and your guys' expertise, what is your opinion?
Unless you are receiving really high-bandwidth streams (e.g. megabytes per second rather than kilobytes per second), a single-threaded design should be sufficient. Keep in mind that the OS's networking stack is running "in the background" at all times, receiving TCP packets and storing the received data inside fixed-size in-kernel memory buffers. This happens in parallel with your program's execution, so in most cases the fact that your program is single-threaded and busy dealing with a GUI update (or another socket) won't hamper your computer's reception of TCP packets.
The case where a single-threaded design would cause a slowdown of TCP traffic is if your program (via Qt) didn't call recv() quickly enough, such that the kernel's TCP-receive buffer for a socket became entirely filled with data. At that point the kernel would have no choice but to start dropping incoming TCP packets for that socket, which would cause the server to have to re-send those TCP packets, and that would cause the socket's TCP receive rate to slow down, at least temporarily. However, that problem can be avoided by making sure the buffers never (or at least rarely) get full.
The obvious way to do that is to ensure that your program reads all of the incoming data as quickly as possible -- something that QTcpSocket does by default. The only thing you need to do is make sure that your GUI updates don't take an inordinate amount of time -- and Qt's widget-update routines are fairly efficient, so they shouldn't, unless you have a really elaborate GUI or an inefficient custom paintEvent() routine, etc.
If that's not sufficient, the next thing you could do (if necessary) is tell the OS's TCP stack to increase the size of its in-kernel TCP receive buffer, e.g. by doing:
#include <sys/socket.h>   // for setsockopt() and SO_RCVBUF

int fd = (int) myQTcpSocketObject.socketDescriptor();   // the underlying OS socket descriptor
int newBufSizeBytes = 128*1024;   // request a 128kB kernel recv-buffer for this socket
if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &newBufSizeBytes, sizeof(newBufSizeBytes)) != 0) perror("setsockopt");
Doing that would give your (single) thread more time to react before incoming packets start getting dropped for lack of in-kernel buffer space.
If, after trying all that, you still aren't getting the network performance you need, then you can try going multithreaded. I doubt it will come to that, but if it does, it needn't affect your program's design too much; you'd just write a wrapper class (called SocketThread or something) that holds your QTcpSocket object and runs an internal thread that handles the reading from the socket, and emits a bytesReceived(QByteArray) signal whenever the thread reads data from the socket. The rest of your code would remain approximately the same; just modify it to hold the SocketThread object instead of a QTcpSocket, connect the SocketThread's bytesReceived(QByteArray) signal to a corresponding slot (via a QueuedConnection, of course, for thread safety), and use that instead of responding directly to readyRead().
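A rough sketch of what that wrapper might look like (class and signal names follow the description above; error handling and shutdown logic omitted):

#include <QThread>
#include <QTcpSocket>
#include <QByteArray>

// Sketch: a thread that owns a QTcpSocket internally, blocks on it, and
// emits the received bytes; connect the signal with Qt::QueuedConnection.
class SocketThread : public QThread
{
   Q_OBJECT
public:
   explicit SocketThread(qintptr socketDescriptor, QObject * parent = nullptr)
      : QThread(parent), _socketDescriptor(socketDescriptor) {}

signals:
   void bytesReceived(const QByteArray & bytes);

protected:
   virtual void run() override
   {
      QTcpSocket sock;   // must be created inside the thread that uses it
      sock.setSocketDescriptor(_socketDescriptor);
      while(sock.waitForReadyRead(-1))   // -1 == no timeout
         emit bytesReceived(sock.readAll());
   }

private:
   const qintptr _socketDescriptor;
};

// Usage sketch, e.g.:
//   connect(thread, &SocketThread::bytesReceived,
//           receiver, &MyWidget::onBytes, Qt::QueuedConnection);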
Implement it without threads, using a thread-considerate design(*), measure the delay your data experiences, and decide if it is within acceptable bounds. Then decide whether you need threads to capture it more rapidly.
From your description, the key bottleneck is going to be the GUI's reception of the "data ready" signal and the re-render it triggers. If you send lots of these signals, your GUI is going to be doing more re-renders.
If you use a single-threaded approach, you can marshal the network reads, gather all the updates, and then refresh the GUI directly. As you've described it, this sounds like it will have the least degree of contention.
(* try to avoid constructs that will require an entire rewrite if you go threaded, but don't put so much effort into making it thread-proof that it will actually need threads to be efficient; e.g. don't wrap everything with mutex calls)
I do not know much about Qt, but this could be a typical scenario where you use select() to multiplex multiple socket accesses with a single thread.
If the selecting thread is used mainly for handling the data from/to the sockets, you will be very fast (as you will have fewer context switches). So if you are not transferring really huge amounts of data, it is quite possible that a single-threaded solution will be faster.
That being said, I would go with the solution that best fits your needs, something that you can implement in a fair amount of time. Implementing select (async) can be quite a hassle, and may be overkill that isn't needed.
It's a C-like approach, but I hope I could help anyway.
I'm writing a TCP server on Windows Server 2k8. This server receives data, parses it, and then sinks it into a database. I'm now doing some tests, but the results surprise me.
The application is written in C++ and uses, directly, Winsocks and the Windows API. It creates a thread for each client connection. For each client it reads and parses the data, then insert it into the database.
Clients always connect in simultaneous chunks - i.e., every once in a while about 5 (I control this parameter) clients will connect simultaneously and feed the server with data.
I've been timing both the reading-and-parsing and the database-related stages of each thread.
The first stage (reading-and-parsing) has a curious behavior. The amount of time each thread takes is roughly the same across threads, but is also proportional to the number of threads connected. The server is not CPU-starved: it has 8 cores and always fewer than 8 threads are connected to it.
For example, with 3 simultaneous threads, 100k rows (per thread) will be read-and-parsed in about 4.5s. But with 5 threads it will take 9.1s on average!
A friend of mine suggested this scaling behavior might be related to the fact that I'm using blocking sockets. Is this right? If not, what might be the reason for this behavior?
If it is, I'd be glad if someone could point me to good resources for understanding non-blocking sockets on Windows.
Edit:
Each client thread reads a line (i.e., all chars until a '\n') from the socket, then parses it, then reads again, until the parse fails or a terminator character is found. My readline routine is based on this:
http://www.cis.temple.edu/~ingargio/cis307/readings/snaderlib/readline.c
With static variables being declared as __declspec(thread).
The parsing, judging from the non-networking version, is efficient (about 2s for 100k rows). I therefore assume the problem is in the multithreaded/network version.
If your lines are ~120–150 characters long, you are actually saturating the network!
There's no issue with sockets. Simply transferring 3 × 100k lines of 150 bytes each over a 100 Mbps line (taking 10 bits per byte to account for protocol overhead) will take... 4.5s! Work it out: 3 × 100,000 × 150 bytes = 45 MB, which at 10 bits per byte is 450 Mbit on the wire, and 450 Mbit ÷ 100 Mbps = 4.5 s. There is no problem with sockets, blocking or otherwise; you've simply hit the limit of how much data you can feed the link.
Non-blocking sockets are only useful if you want one thread to service multiple connections. Using non-blocking sockets and a poll/select loop means that your thread does not sit idle while waiting for new connections.
In your case this is not an issue, since there is only one connection per thread, so there is no problem with your thread waiting for input.
Which leads to your original question of why things slow down when you have more connections. Without further information, the most likely culprit is that you are network-limited: i.e., your network cannot feed your server data fast enough.
If you are interested in non-blocking sockets on Windows, do a search on MSDN for the OVERLAPPED APIs.
You could be running into other threading related issues, like deadlocks/race conditions/false sharing, which could be destroying the performance.
One thing to keep in mind is that although you have one thread per client, Windows will not automatically ensure they all run on different cores. If some of them are running on the same core, it is possible (although unlikely) to have your server in a sort of CPU-starved state, with some cores at 100% load and others idle. There are simply no guarantees as to how the OS spreads the load (in the default case).
To explicitly assign threads to particular cores, you can use SetThreadAffinityMask. It may or may not be worth having a bit of a play around with this to see if it helps.
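If you do experiment with it, the call itself is a one-liner; a sketch (the function name is mine, and the choice of core is arbitrary):

#include <windows.h>

// Sketch: pin the calling thread to one core.  SetThreadAffinityMask()
// returns the previous affinity mask on success, or 0 on failure.
void PinCurrentThreadToCore(int whichCore)
{
   DWORD_PTR mask = ((DWORD_PTR)1) << whichCore;
   if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
      // the call failed; the thread keeps its previous affinity
   }
}

// e.g. PinCurrentThreadToCore(2);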
On the other hand, this may not have anything to do with it at all. YMMV.
I have an application processing network communication with blocking calls. Each thread manages a single connection. I've added a timeout on the read and write operations by using select prior to the read or write on the socket.
Select is known to be inefficient when dealing with a large number of sockets. But is it OK, in terms of performance, to use it with a single socket, or are there more efficient methods for adding timeout support to single-socket calls? The benefit of select is that it is portable.
Yes, that's no problem, and you do want some timeout mechanism so that you don't leak resources to badly behaved clients, etc.
Note that having a large number of threads is even more inefficient than having select deal with a large number of sockets.
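For reference, the single-socket pattern under discussion looks like this (a sketch; the function name, return-value convention, and timeout policy are mine):

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>

// Sketch: recv() with a timeout via select() on a single socket.
// Returns bytes read, 0 on orderly shutdown, -1 on error, -2 on timeout.
ssize_t RecvWithTimeout(int fd, void * buf, size_t len, int timeoutSeconds)
{
   fd_set readSet;
   FD_ZERO(&readSet);
   FD_SET(fd, &readSet);

   struct timeval tv;
   tv.tv_sec  = timeoutSeconds;
   tv.tv_usec = 0;

   const int ret = select(fd+1, &readSet, NULL, NULL, &tv);
   if (ret < 0)  return -1;   // select() error
   if (ret == 0) return -2;   // timed out
   return recv(fd, buf, len, 0);
}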
If you think select is inefficient with a large number of sockets, try handling a large number of sockets with one thread per socket. You are in for a world of pain; you will have problems scaling to 1000 threads.
What I have done in the past is this:
Group sockets in groups of X (512, 1024).
Have one or two threads run along those groups and select() on them - then hand off the sockets with new data into a queue.
Have a number of worker threads work off those sockets with new data. How many depends on how much I need to max out the CPU ;) (A sketch of this layout follows below.)
This way I don't have a super über select() with TONS of items, and I also don't waste ridiculous amounts of memory on threads (hint: every thread needs its own stack; at only 2MB each, that is 2GB for 1000 threads - talk about inefficient) or huge amounts of CPU on useless context switches.
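Something like this, as a hedged sketch (group sizes, the queue, and the synchronization details are illustrative, and real code must also avoid re-selecting a socket that is already sitting in the queue):

#include <queue>
#include <vector>
#include <mutex>
#include <condition_variable>
#include <sys/select.h>

// Sketch: one selector thread per group of (<= FD_SETSIZE) sockets pushes
// ready descriptors into a shared queue; worker threads pop and service them.
static std::queue<int>         readyQueue;
static std::mutex              queueMutex;
static std::condition_variable queueCond;

void SelectorThread(const std::vector<int> & group)   // e.g. 512 sockets per group
{
   while(true) {
      fd_set readSet;
      FD_ZERO(&readSet);
      int maxFD = 0;
      for (int fd : group) {FD_SET(fd, &readSet); if (fd > maxFD) maxFD = fd;}

      if (select(maxFD+1, &readSet, NULL, NULL, NULL) <= 0) continue;

      std::lock_guard<std::mutex> lock(queueMutex);
      for (int fd : group) if (FD_ISSET(fd, &readSet)) readyQueue.push(fd);
      queueCond.notify_all();
   }
}

void WorkerThread()   // run as many of these as needed to max out the CPU
{
   while(true) {
      int fd;
      {
         std::unique_lock<std::mutex> lock(queueMutex);
         queueCond.wait(lock, []{return !readyQueue.empty();});
         fd = readyQueue.front();
         readyQueue.pop();
      }
      // recv() from (fd) and process its data here
   }
}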
The question with threads/select is whether you want to avoid clients blocking each other. If this is not an issue, then work single-threaded. If it is, choose an appropriate threading scheme (1 thread per connection, worker threads per connection, worker thread per request,...).
When working with 1 thread per connection, a select per read/write is a decent solution, but generally speaking it is better to use non-blocking sockets in combination with select, to avoid blocking in situations where only part of the expected message arrives, and to do a select after a partial write.
I have a program (process) which needs to listen on 3 ports... two are TCP and the other is UDP.
The two TCP ports are going to be receiving large amounts of data every so often (could be as little as every 5 minutes or as often as every 20 seconds). The third (UDP) port is receiving constant data. Now, does it make sense to have these listening on different threads?
For instance, when I receive a large amount of data from one of the TCP ports, I don't want my UDP stream interrupted... are these common concerns for network programming?
I'll be using the Boost library on Windows if that has any bearing.
I'm just looking for some thoughts/ideas/guidance on this issue and how to manage multiple connections.
In general, avoid threads unless necessary. On a modern machine you will get better performance by using a single thread together with the OS's I/O readiness/completion features: on Windows, I/O Completion Ports; on Mac OS X and FreeBSD, kqueue(2); on Solaris, event ports; on Linux, epoll; on VMS, QIO; etc.
In Boost, this is abstracted by boost::asio.
The place where threads would be useful is where you must do significant processing or perform a blocking operating system call that would add unacceptable latency to the rest of the networking processing.
Using threads, one per receiving connection, will help keep the throughput high and prevent one port's data from blocking the processing of another port's.
This is a good idea, especially since you're talking about only having 3 connections. Spawning three threads to handle the communication will make your application much easier to maintain and keep performant.
First,
...are these common concerns for network programming?
Yes, threading issues are very common concerns in network programming.
Second, your design idea of using three threads for listening on three different ports would work, allowing you to listen on all three ports simultaneously. As pointed out in the comments, this isn't the only way to do it.
Last, one design that's common in network programming is to have one thread listen for a connection, then spawn a new helper thread to handle processing the connection. The original listener thread just passes the socket connection off to the helper, then goes right back to listening for new connections.
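In sketch form (using POSIX sockets and std::thread; function names are mine, error handling omitted, and a real server would also want a clean way to shut these threads down):

#include <sys/socket.h>
#include <unistd.h>
#include <thread>

// Sketch: the listener thread accept()s, hands the new socket to a detached
// helper thread, and immediately goes back to listening for connections.
void HandleConnection(int clientFD)
{
   // read from / write to (clientFD) here, then close it
   close(clientFD);
}

void ListenerLoop(int listenFD)
{
   while(true) {
      const int clientFD = accept(listenFD, NULL, NULL);
      if (clientFD >= 0) std::thread(HandleConnection, clientFD).detach();
   }
}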
Multiple threads do not scale. Operating system threads are far too heavyweight at the scale of thousands or tens of thousands of connections. Granted, you only have 3, so it's no big deal. In fact, I might even recommend it in your case, in the name of simplicity, if you are certain your application will not need to scale.
But at scale, you'd want to use select()/poll()/epoll()/libevent/etc. On modern Linux systems, epoll() is by far the most robust and is blazingly fast. One thread polls for socket readiness on all sockets simultaneously, and then sockets that signal as ready are either handled immediately by the single thread, or more often than not handed off to a thread pool. The thread pool has a limited number of threads (usually some ratio of the number of compute cores available on the local machine), wherein a free thread is grabbed whenever a socket is ready.
Again, you can use a thread per connection, but if you're interested in learning the proper way to build scalable network systems, don't. Building a select()/poll() style server is a fantastic learning experience.
For reference, see the C10K problem:
http://www.kegel.com/c10k.html