Multiple threads and blocking sockets - c++

I'm writing a TCP server on Windows Server 2k8. This server receives data, parses it, and then stores it in a database. I'm running some tests now, but the results surprise me.
The application is written in C++ and uses Winsock and the Windows API directly. It creates a thread for each client connection. For each client it reads and parses the data, then inserts it into the database.
Clients always connect in simultaneous batches, i.e. every once in a while about 5 clients (I control this parameter) will connect simultaneously and feed the server with data.
I've been timing both the reading-and-parsing and the database-related stages of each thread.
The first stage (reading-and-parsing) has a curious behavior. The time taken is roughly the same for every thread, but it is also proportional to the number of threads connecting. The server is not CPU starved: it has 8 cores and there are always fewer than 8 threads connected to it.
For example, with 3 simultaneous threads, 100k rows (per thread) are read and parsed in about 4.5 s. But with 5 threads it takes 9.1 s on average!
A friend of mine suggested this scaling behavior might be related to the fact that I'm using blocking sockets. Is this right? If not, what might be the reason for this behavior?
If it is, I'd be glad if someone could point me to good resources for understanding non-blocking sockets on Windows.
Edit:
Each client thread reads a line (i.e., all chars until a '\n') from the socket, then parses it, then reads again, until the parse fails or a terminator character is found. My readline routine is based on this:
http://www.cis.temple.edu/~ingargio/cis307/readings/snaderlib/readline.c
With static variables being declared as __declspec(thread).
The parsing, judging from the non-networking version, is efficient (about 2 s for 100k rows). I therefore assume the problem is in the multithreaded/network version.

If your lines are ~120–150 characters long, you are actually saturating the network!
There's no issue with sockets. Simply transferring 3 times 100k lines, 150 bytes each, over a 100 Mbps line (I take 10 bits/byte to account for headers) will take... 4.5 s! There is no problem with sockets, blocking or otherwise. You've simply hit the limit of how much data you can feed it.
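To make that arithmetic easy to repeat with other numbers, here is a minimal sketch. The function name transfer_seconds and its parameters are made up for illustration; the 150-byte line length, the 100 Mbps link and the 10 bits per byte are the assumptions stated in the answer above, not measured values.

```cpp
#include <cassert>

// Estimated seconds to push `clients` streams of `lines` lines, each
// `bytes_per_line` bytes long, through one shared link.
// `wire_bits_per_byte` is the rough 10 bits per payload byte used above
// (8 data bits plus an allowance for protocol headers).
double transfer_seconds(int clients, long lines, int bytes_per_line,
                        double link_bits_per_sec,
                        double wire_bits_per_byte = 10.0) {
    assert(link_bits_per_sec > 0.0);
    const double total_bits = static_cast<double>(clients) *
                              static_cast<double>(lines) *
                              bytes_per_line * wire_bits_per_byte;
    return total_bits / link_bits_per_sec;
}
```

Under these assumptions, transfer_seconds(3, 100000, 150, 100e6) gives 4.5 and transfer_seconds(5, 100000, 150, 100e6) gives 7.5, so the time grows with the number of clients simply because they share one link.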

Non-blocking sockets are only useful if you want one thread to service multiple connections. Using non-blocking sockets and a poll/select loop means that your thread does not sit idle while waiting for new connections.
In your case this is not an issue since there is only one connection per thread so there is no issue if your thread is waiting for input.
Which leads to your original question of why things slow down when you have more connections. Without further information, the most likely culprit is that you are network limited, i.e. your network cannot feed your server data fast enough.
If you are interested in non-blocking sockets on Windows, search MSDN for the OVERLAPPED APIs.

You could be running into other threading related issues, like deadlocks/race conditions/false sharing, which could be destroying the performance.

One thing to keep in mind is that although you have one thread per client, Windows will not automatically ensure they all run on different cores. If some of them are running on the same core, it is possible (although unlikely) to have your server in a sort of CPU-starved state, with some cores at 100% load and others idle. There are simply no guarantees as to how the OS spreads the load (in the default case).
To explicitly assign threads to particular cores, you can use SetThreadAffinityMask. It may or may not be worth having a bit of a play around with this to see if it helps.
On the other hand, this may not have anything to do with it at all. YMMV.
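If you want to experiment with affinity, a rough sketch might look like the following. SetThreadAffinityMask and GetCurrentThread are the real Windows calls; the helper names single_core_mask and pin_current_thread are invented here, and the non-Windows branch exists only so the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Bit mask selecting a single logical processor (bit N = processor N).
std::uint64_t single_core_mask(unsigned core_index) {
    assert(core_index < 64);
    return std::uint64_t{1} << core_index;
}

// Pin the calling thread to one core. Returns true on success.
// On non-Windows builds this does nothing and returns false.
bool pin_current_thread(unsigned core_index) {
#ifdef _WIN32
    const DWORD_PTR mask = static_cast<DWORD_PTR>(single_core_mask(core_index));
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)core_index;
    return false;
#endif
}
```

Each client thread could call pin_current_thread(i % 8) as its first action, with i being a per-connection index you maintain yourself. Whether this helps at all is something only measurement will tell.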

Related

multilevel threads takes time

I have created a module to transfer data over multiple sockets using TCP client/server communication. It transfers a 20 MB file in 10 seconds.
Each socket sends/receives its data in a separate thread.
When I launch the module from another worker thread, the time taken to send the same file increases to 40 seconds.
Please let me know of any solutions to avoid this slowdown.
Are you synchronizing the threads to read the content from the file on the client side and write it back to the file on the server side? This adds time.
In addition, you will by default pay context-switching time between multiple threads on both client and server.
One problem may be disk caching and seeking. If you are not already doing this, try interleaving blocks transferred by different threads more finely (like, say, to 4KB, so bytes 0...4095 transferred by 1st thread, 4096...8191 by 2nd thread, etc).
Also avoid mutexes, for example by having each thread know what it's supposed to read and send, or write and receive, when the thread starts, so no inter-thread communication is needed. Aborting the whole transfer can be done with an atomic flag variable (checked by each thread after transferring a block) instead of mutexes.
Also, on the receiving end, make sure to buffer in memory so that you write to the destination file sequentially. That is, if one thread transfers blocks faster than some other, those "early" blocks are just kept in memory until all the preceding blocks have been received and written.
If buffer size becomes an issue here, you may need to implement some inter-thread synchronization at one end (it doesn't matter much whether you slow down receiving or sending) to prevent the fastest thread from getting too far ahead of the slowest thread. For file sizes on the order of tens of megabytes, on PCs, this should not become an issue.
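The three ideas above (fine interleaving, an atomic abort flag, and in-order reassembly) can be sketched roughly as follows. All names (g_abort, keep_going, block_owner, ReorderBuffer) are hypothetical, a std::string stands in for the destination file, and the buffer is not thread-safe by itself: this assumes a single thread owns it, or that the caller serializes access.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Shared abort flag: each transfer thread checks it after every block.
std::atomic<bool> g_abort{false};
bool keep_going() { return !g_abort.load(std::memory_order_relaxed); }

// Which thread transfers the block containing byte `offset`, when blocks of
// `block_size` bytes are dealt out round-robin to `thread_count` threads.
std::size_t block_owner(std::size_t offset, std::size_t block_size,
                        std::size_t thread_count) {
    assert(block_size > 0 && thread_count > 0);
    return (offset / block_size) % thread_count;
}

// Receiver-side buffer: accepts blocks in any order, emits them in order.
class ReorderBuffer {
public:
    // Store block number `index`, then flush every block that is now
    // contiguous with what has already been written to `out`.
    void add(std::size_t index, std::string data, std::string& out) {
        pending_[index] = std::move(data);
        auto it = pending_.find(next_);
        while (it != pending_.end()) {
            out += it->second;
            pending_.erase(it);
            ++next_;
            it = pending_.find(next_);
        }
    }
    // Number of "early" blocks still waiting for their predecessors.
    std::size_t buffered() const { return pending_.size(); }

private:
    std::size_t next_ = 0;                          // next block to write
    std::map<std::size_t, std::string> pending_;    // early blocks by index
};
```

With 4 KB blocks and two threads, block_owner assigns bytes 0...4095 to thread 0 and 4096...8191 to thread 1, matching the interleaving described above. The value returned by buffered() is what you would watch if memory use ever became a concern.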

Select v/s Multithreading in a Multicore Environment

I am writing an application which works on 3 tiers, each having multiple instances and forming a mesh-like relationship.
Eg. Let us consider a Matrix type situation:
L11 L12
L21 L22 L23 L24
L31 L32 L33
L11 has connections with L21,L22,L23,L24
L12 has connections with L21,L22,L23,L24
L21 has connection with L11,L12,L31,L32,L33... and so on
Presently I am using UNIX domain sockets for communication with select call(to determine where the data has come from). How does a timeout on select call affect CPU usage??
Will it be better to spawn a new thread per connection or persist with the select model. Typically the horizontal scalability at L1 may be around 5, L2 will be around 5 and L3 around 10.
Is there a penalty for select system call if the bit fields passed to select is high?
In case of Multi threaded application whether to go with a Blocking system socket or Non-Blocking?
Is there a difference between UDP/TCP in UNIX-Domain socket??
Note: I will be using systems of 12-24 cores ... Probably this is what made me think why not use separate threads and not select call ..
Note 2: Will there be buffer overflows in Unix Domain Sockets ?? As far as i read these are lossless and inorder ...
I am writing an application which works on 3 tiers Each having multiple instances and forming a mesh like relationship.
Are all of the nodes in your Matrix running on the same host, or are they running on multiple hosts across a network?
How does a timeout on select call affect CPU usage??
Only in the obvious way, in that every time your select() call times out you're going to have to go for another (otherwise maybe avoidable) spin around your event loop. So, for example always passing a timeout of zero (or near-zero) would be bad because then your thread would be wasting CPU cycles busy-looping. But the occasional timeout shouldn't be a problem. My recommendation is to be purely event-based as much as possible, and only use timeouts when they are unavoidable (e.g. because such-and-such an event has to happen at a particular time, unrelated to any I/O traffic).
Will it be better to spawn a new thread per connection or persist with the select model. Typically the horizontal scalability at L1 may be around 5, L2 will be around 5 and L3 around 10.
For this scale of operation (dozens of connections but not hundreds) select() will work fine. It's only when you get up into hundreds or thousands of connections that select() starts to fall down, and even then you could move to something like kqueue() or epoll() rather than going multithreaded.
Is there a penalty for select system call if the bit fields passed to select is high?
I'm not 100% sure what you mean by "bit fields is high", but if you mean "there are lots of sockets specified", then the main penalty is that select() will only work with file descriptors whose integer values are less than FD_SETSIZE (typically 1024) -- which means by implication that select() also can't track more than FD_SETSIZE sockets at a time. (Exception: Under Windows, select() will handle sockets with arbitrary integer values, although it still won't handle more than FD_SETSIZE sockets at a time)
In case of Multi threaded application whether to go with a Blocking system socket or Non-Blocking?
If it were me, I would still use non-blocking I/O even in a multithreaded application, because blocking I/O makes certain things, like a clean shutdown of the application, problematic. For example, if you have a thread that is blocked inside recv(), how do you safely get that thread to exit so that you can clean up any shared resources? You can't send a signal because you don't know which thread would receive it, and you can't rely on recv() returning naturally within a finite amount of time either. On the other hand, if your target thread only ever blocks inside select(), you can send the thread a byte on one of the FDs it is select()'ing on to wake it up and tell it to go away.
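A minimal sketch of that wake-up trick, assuming POSIX: pipes stand in for the real sockets, and the names serve_until_woken and demo_clean_shutdown are made up for illustration. The worker blocks only in select(), so a single byte written to the wake-up pipe is enough to make it return.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/select.h>
#include <thread>
#include <unistd.h>

// Worker loop that blocks only in select(), watching a data fd and a
// wake-up fd. Returns the number of data bytes consumed before it was told
// to quit, or -1 on a select() error.
long serve_until_woken(int data_fd, int wake_fd) {
    assert(data_fd >= 0 && data_fd < FD_SETSIZE);
    assert(wake_fd >= 0 && wake_fd < FD_SETSIZE);
    long consumed = 0;
    char buf[256];
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(data_fd, &readable);
        FD_SET(wake_fd, &readable);
        const int max_fd = data_fd > wake_fd ? data_fd : wake_fd;
        // No timeout: sleep until one of the two fds has something to read.
        if (select(max_fd + 1, &readable, nullptr, nullptr, nullptr) < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        if (FD_ISSET(wake_fd, &readable)) return consumed;  // shutdown request
        if (FD_ISSET(data_fd, &readable)) {
            const ssize_t n = read(data_fd, buf, sizeof buf);
            if (n <= 0) return consumed;                    // EOF or error
            consumed += n;
        }
    }
}

// Demo: start the worker on two pipes, then wake it by writing one byte.
long demo_clean_shutdown() {
    int data_pipe[2], wake_pipe[2];
    if (pipe(data_pipe) != 0 || pipe(wake_pipe) != 0) return -1;
    long result = -2;
    std::thread worker([&] {
        result = serve_until_woken(data_pipe[0], wake_pipe[0]);
    });
    const char byte = 'x';
    const bool sent = (write(wake_pipe[1], &byte, 1) == 1);
    close(wake_pipe[1]);  // closing the write end also wakes the reader (EOF)
    worker.join();
    close(wake_pipe[0]);
    close(data_pipe[0]);
    close(data_pipe[1]);
    return sent ? result : -1;
}
```

The same pattern works with a socketpair() instead of a pipe. The important part is that the thread never blocks anywhere except in select().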
Is there a difference between UDP/TCP in UNIX-Domain socket??
Unix-Domain sockets don't use UDP or TCP (since they never go across the network). They do use SOCK_STREAM and SOCK_DGRAM, though, which behave similarly (in most respects) to using TCP and UDP sockets to localhost.
Note: I will be using systems of 12-24 cores ... Probably this is what made me think why not use separate threads and not select call ..
The question to ask yourself is, is your application likely to be compute-bound or I/O bound? If it's compute-bound (and the computations can be parallelized) you might benefit from farming out the computations to multiple threads. If it's I/O bound, OTOH, there won't be any advantage to going multithreaded, because in that case the bottleneck is likely to be your network card (and/or the network itself), and your gigabit-Ethernet jack is still going to be sending a maximum of 1Gb/sec whether it is fed by one thread or many. If all your communication is with other processes on the same host, OTOH, there might be some benefit to multiple threads, but my guess is that it would be minimal, since it looks like you will already be using all of your cores for the dozen or so processes that you have.
Note 2: Will there be buffer overflows in Unix Domain Sockets ?? As far as i read these are lossless and inorder ...
There shouldn't be. SOCK_STREAM Unix-domain sockets work very much like TCP sockets in my experience. (I've never used SOCK_DGRAM Unix-domain sockets, but I would imagine they'd behave similarly to UDP sockets, and would even drop packets sometimes e.g. when the receiving process can't keep up with the sender)

Threads/Sockets limits in Linux

First of all: sorry for my English.
Guys, I have trouble with POSIX sockets and/or pthreads. I'm developing on an embedded device (ARM9 CPU). A multithreaded TCP server will run on the device, and it should be able to process a lot of incoming connections. The server accepts a connection from a client and increments a counter variable (unsigned int counter). Client routines run in separate threads. All clients use a single singleton class instance (the same files are opened and closed in this class). Each client works with the files, then the client thread closes the connection socket and calls pthread_exit().
However, my TCP server can't handle more than 250 threads (counter = 249, plus 1 for the server thread), and I get "Resource temporarily unavailable". What's the problem?
Whenever you hit the thread limit, or as mentioned run out of virtual process address space due to the number of threads, you're... doing it wrong. More threads don't scale, especially not in embedded programming. You can handle requests on a thread pool instead. Use poll(2) to handle many connections on fewer threads. This is pretty well-trod territory, and libraries (like ACE, asio) have been leveraging this model for good reason.
The 'thread-per-request' model is mainly popular because of its (perceived) simple design.
As long as you keep connections on a single logical thread (sometimes known as a strand) there is no real difference, though.
Also, if the handling of a request involves no blocking operations, you can never do better than polling and handling on a single thread after all: you can use the 'backlog' feature of listen/accept to let the kernel worry about pending connections for you! (Note: this assumes a single-core CPU; on a dual-core CPU this kind of processing would be optimal with one thread per core.)
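For reference, a bare-bones thread pool along those lines could look like the sketch below. It is not the ACE or asio implementation, just standard C++ with a mutex, a condition variable and a job queue; ThreadPool and run_requests are names made up for this example, and a "request" is simply a std::function.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Fixed number of worker threads pulling jobs from one shared queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    // Finishes all queued jobs, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        ready_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can proceed
        }
    }
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;  // declared last: starts after the rest
};

// Demo: handle `requests` trivial jobs on `workers` threads, return the count.
int run_requests(std::size_t workers, int requests) {
    std::atomic<int> handled{0};
    {
        ThreadPool pool(workers);
        for (int i = 0; i < requests; ++i)
            pool.submit([&handled] { handled.fetch_add(1); });
    }  // destructor drains the queue and joins the workers
    return handled.load();
}
```

The point is that run_requests(4, 1000) handles 1000 requests while only ever creating 4 threads, so the thread limit from the question is never approached.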
Edit Addition Re:
ulimit shows how much threads can OS handle, right? If yes, ulimit does not solve my problem because my app uses ~10-15 threads in same time.
If that's the case, you should really double-check that you are joining or detaching all threads properly. Also think of the synchronization objects; if you consistently forget to call the relevant pthread *_destroy functions, you'll run into the limits even though you never need that many at once. That would of course be a resource leak. Some tools may be able to help you spot them (valgrind/helgrind come to mind).
Use ulimit -n to check the limit on open file descriptors. You can increase it for your current session if the number is too low.
You can also edit /etc/security/limits.conf to set a permanent limit.
Usually, the first limit you are hitting on 32-bit systems is that you are running out of virtual address space when using default stack sizes.
Try explicitly specifying the stack size when creating threads (to less than 1 MB) or setting the default stack size with "ulimit -s".
Also note that you need to either pthread_detach or pthread_join your threads so that all resources will be freed.
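Both suggestions (a smaller explicit stack and detaching) can be combined when the thread is created. The sketch below uses the standard pthread attribute calls; start_detached, count_client and serve_clients are hypothetical names, and the 256 KiB stack is an arbitrary choice that you would tune to what your client code really needs.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <pthread.h>
#include <thread>

// Start fn(arg) on a detached thread with an explicit stack size.
// Returns 0 on success or the pthread error code; EAGAIN is the
// "Resource temporarily unavailable" case from the question.
int start_detached(void* (*fn)(void*), void* arg, std::size_t stack_bytes) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) rc = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_t tid;
    if (rc == 0) rc = pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

// Stand-in for a client routine: just records that it ran.
static void* count_client(void* arg) {
    static_cast<std::atomic<int>*>(arg)->fetch_add(1);
    return nullptr;
}

// Demo: serve `clients` short-lived "connections", one detached thread each,
// and return how many of them finished.
int serve_clients(int clients) {
    assert(clients >= 0);
    std::atomic<int> finished{0};
    int started = 0;
    for (int i = 0; i < clients; ++i)
        if (start_detached(count_client, &finished, 256 * 1024) == 0) ++started;
    while (finished.load() < started)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    return finished.load();
}
```

Because every thread is detached, its resources are released when it exits, so the total number of threads created over the lifetime of the server is no longer what hits the limit; only the number alive at the same time matters.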

Is select() Ok to implement single socket read/write timeout?

I have an application processing network communication with blocking calls. Each thread manages a single connection. I've added a timeout on the read and write operation by using select prior to read or write on the socket.
Select is known to be inefficient when dealing with a large number of sockets. But is it OK, in terms of performance, to use it with a single socket, or are there more efficient methods to add timeout support to single-socket calls? The benefit of select is that it is portable.
Yes, that's no problem, and you do want some timeout mechanism so that you don't leak resources to badly behaved clients etc.
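For a single descriptor, the select-before-read approach from the question comes down to something like this sketch (POSIX flavour; wait_readable and read_with_timeout are illustrative names, and the -2 return value for a timeout is a convention invented here).

```cpp
#include <cassert>
#include <cstddef>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Wait up to `timeout_ms` for `fd` to become readable.
// Returns 1 if readable, 0 on timeout, -1 on error (as select() does).
int wait_readable(int fd, int timeout_ms) {
    assert(fd >= 0 && fd < FD_SETSIZE);
    fd_set readable;
    FD_ZERO(&readable);
    FD_SET(fd, &readable);
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return select(fd + 1, &readable, nullptr, nullptr, &tv);
}

// Read with a timeout.
// Returns bytes read, 0 on EOF, -1 on error, -2 if nothing arrived in time.
ssize_t read_with_timeout(int fd, void* buf, std::size_t len, int timeout_ms) {
    const int ready = wait_readable(fd, timeout_ms);
    if (ready < 0) return -1;
    if (ready == 0) return -2;
    return read(fd, buf, len);
}
```

With only one descriptor in the set, the cost of select() is a single extra system call per read, which is normally negligible next to the network latency itself.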
Note that having a large number of threads is even more inefficient than having select dealing with a large number of sockets.
If you think select is inefficient with a large number of sockets, try handling a large number of sockets with one thread per socket. You are in for a world of pain. For example, you will have problems scaling to 1000 threads.
What I have done in the past is this:
Group sockets in groups of X (512, 1024).
Have one thread or two run along those groups and select(), then hand off the sockets with new data into a queue.
Have a number of worker threads work off those sockets with new data. How many depends on how much I need to max out the CPU ;)
This way I don't have a super über select() with TONS of items, and I also don't waste ridiculous amounts of memory on threads (hint: every thread needs its own stack. With ONLY 2 MB each, that is 2 GB for 1000 sockets; talk about inefficient) or waste huge amounts of CPU on useless context switches.
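One pass of such a selector thread over a single group might look like the sketch below (POSIX; ready_in_group and the demo function are hypothetical names). The returned descriptors are what would be pushed onto the queue for the worker threads; the queue itself is left out.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>
#include <vector>

// One select() pass over one group of sockets: returns the fds that have
// data waiting, ready to be handed off to worker threads.
std::vector<int> ready_in_group(const std::vector<int>& group, int timeout_ms) {
    std::vector<int> ready;
    if (group.empty()) return ready;
    fd_set readable;
    FD_ZERO(&readable);
    int max_fd = -1;
    for (int fd : group) {
        assert(fd >= 0 && fd < FD_SETSIZE);  // select() cannot watch others
        FD_SET(fd, &readable);
        if (fd > max_fd) max_fd = fd;
    }
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (select(max_fd + 1, &readable, nullptr, nullptr, &tv) <= 0) return ready;
    for (int fd : group)
        if (FD_ISSET(fd, &readable)) ready.push_back(fd);
    return ready;
}

// Demo with two pipes standing in for sockets: only the second has data.
bool demo_only_second_ready() {
    int a[2], b[2];
    if (pipe(a) != 0 || pipe(b) != 0) return false;
    const bool wrote = (write(b[1], "x", 1) == 1);
    const std::vector<int> ready = ready_in_group({a[0], b[0]}, 100);
    const bool ok = wrote && ready.size() == 1 && ready[0] == b[0];
    for (int fd : {a[0], a[1], b[0], b[1]}) close(fd);
    return ok;
}
```

Keeping each group below FD_SETSIZE is what makes the grouping necessary in the first place; the assert documents that limit rather than working around it.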
The question with threads/select is whether you want to avoid clients blocking each other. If this is not an issue, then work single-threaded. If it is, choose an appropriate threading scheme (1 thread per connection, worker threads per connection, worker thread per request,...).
When working with 1 thread per connection, a select per read/write is a decent solution, but generally speaking, it is better to work with non-blocking sockets in combination with select to avoid blocking in situations where only a part of the expected message arrives and then do a select after writing.
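A sketch of the non-blocking variant, assuming POSIX: the descriptor is switched to O_NONBLOCK, read() is tried first, and select() is used only to wait (with a bound) when no data is available yet. The names set_nonblocking and read_exact are made up; a message is taken to be a fixed number of bytes purely to keep the example short.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

// Put `fd` into non-blocking mode. Returns true on success.
bool set_nonblocking(int fd) {
    const int flags = fcntl(fd, F_GETFL, 0);
    return flags >= 0 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) == 0;
}

// Read exactly `len` bytes from a non-blocking fd, waiting in select() for
// at most `timeout_ms` each time no data is available.
// Returns the bytes read: `len` on success, fewer on timeout, EOF or error.
std::size_t read_exact(int fd, char* buf, std::size_t len, int timeout_ms) {
    assert(fd >= 0 && fd < FD_SETSIZE);
    std::size_t got = 0;
    while (got < len) {
        const ssize_t n = read(fd, buf + got, len - got);
        if (n > 0) { got += static_cast<std::size_t>(n); continue; }
        if (n == 0) break;                                   // peer closed
        if (errno == EINTR) continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK) break;  // real error
        // Only part of the message has arrived: wait, but not forever.
        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(fd, &readable);
        timeval tv;
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        if (select(fd + 1, &readable, nullptr, nullptr, &tv) <= 0) break;
    }
    return got;
}
```

If only 3 of 5 expected bytes arrive, read_exact returns 3 after the timeout instead of hanging in read(), which is exactly the partial-message situation described above.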

Multiple threads for multiple ports?

I have a program (process) which needs to listen on 3 ports... Two are TCP and the other UDP.
The two TCP ports are going to be receiving large amounts of data every so often (could be as little as every 5 minutes or as often as every 20 seconds). The third (UDP) port is receiving constant data. Now, does it make sense to have these listening on different threads?
For instance, when I receive a large amount of data from one of the TCP ports, I don't want my UDP stream interrupted... are these common concerns for network programming?
I'll be using the Boost library on Windows if that has any bearing.
I'm just looking for some thoughts/ideas/guidance on this issue and how to manage multiple connections.
In general, avoid threads unless necessary. On a modern machine you will get better performance by using a single thread and using I/O readiness/completion features. On Windows this is IO Completion Ports, on Mac OS X and FreeBSD: kqueue(2), on Solaris: Event ports, on Linux epoll, on VMS QIO. Etc.
In boost, this is abstracted by boost::asio.
The place where threads would be useful is where you must do significant processing or perform a blocking operating system call that would add unacceptable latency to the rest of the networking processing.
Using threads, one per receiving connection, will help keep the throughput high and prevent one port's data from blocking the processing of another port's data.
This is a good idea, especially since you're talking about only having 3 connections. Spawning three threads to handle the communication will make your application much easier to maintain and keep performant.
First,
...are these common concerns for network programming?
Yes, threading issues are very common concerns in network programming.
Second, your design idea of using three threads for listening on three different ports would work, allowing you to listen on all three ports simultaneously. As pointed out in the comments, this isn't the only way to do it.
Last, one design that's common in network programming is to have one thread listen for a connection, then spawn a new helper thread to handle processing the connection. The original listener thread just passes the socket connection off to the helper, then goes right back to listening for new connections.
Multiple threads do not scale. Operating system threads are far too heavy weight at the scale of thousands or tens of thousands of connections. Granted, you only have 3, so it's no big deal. In fact, I might even recommend it in your case in the name of simplicity if you are certain your application will not need to scale.
But at scale, you'd want to use select()/poll()/epoll()/libevent/etc. On modern Linux systems, epoll() is by far the most robust and is blazingly fast. One thread polls for socket readiness on all sockets simultaneously, and then sockets that signal as ready are either handled immediately by the single thread, or more often than not handed off to a thread pool. The thread pool has a limited number of threads (usually some ratio of the number of compute cores available on the local machine), wherein a free thread is grabbed whenever a socket is ready.
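The readiness-polling half of that design can be sketched with epoll as follows. This is Linux-only; epoll_create1, epoll_ctl and epoll_wait are the real system calls, while watch_for_read, wait_ready and the demo are illustrative names, and pipes stand in for sockets. Handing the ready descriptors to a thread pool is left out.

```cpp
#include <cassert>
#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// Register `fd` for read-readiness on an epoll instance.
bool watch_for_read(int epoll_fd, int fd) {
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, fd, &ev) == 0;
}

// Wait up to `timeout_ms` and return the fds that are ready to read.
std::vector<int> wait_ready(int epoll_fd, int timeout_ms) {
    assert(epoll_fd >= 0);
    epoll_event events[64];
    std::vector<int> ready;
    const int n = epoll_wait(epoll_fd, events, 64, timeout_ms);
    for (int i = 0; i < n; ++i)
        if (events[i].events & EPOLLIN) ready.push_back(events[i].data.fd);
    return ready;
}

// Demo: two pipes are watched, only the first one has data.
bool demo_epoll_one_ready() {
    const int ep = epoll_create1(0);
    int a[2], b[2];
    if (ep < 0 || pipe(a) != 0 || pipe(b) != 0) return false;
    bool ok = watch_for_read(ep, a[0]) && watch_for_read(ep, b[0]);
    ok = ok && write(a[1], "x", 1) == 1;
    const std::vector<int> ready = wait_ready(ep, 100);
    ok = ok && ready.size() == 1 && ready[0] == a[0];
    for (int fd : {a[0], a[1], b[0], b[1], ep}) close(fd);
    return ok;
}
```

Unlike select(), the set of watched descriptors lives in the kernel between calls, so each wait_ready call costs roughly in proportion to the number of ready descriptors rather than the number watched.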
Again, you can use a thread per connection, but if you're interested in learning the proper way to build scalable network systems, don't. Building a select()/poll() style server is a fantastic learning experience.
For reference, see the C10K problem:
http://www.kegel.com/c10k.html