I have an application processing network communication with blocking calls. Each thread manages a single connection. I've added a timeout on read and write operations by using select prior to the read or write on the socket.
select is known to be inefficient when dealing with a large number of sockets. But is it OK, in terms of performance, to use it with a single socket, or are there more efficient methods of adding timeout support to calls on a single socket? The benefit of select is that it is portable.
Yes, that's no problem, and you do want some timeout mechanism so you don't leak resources to badly behaving clients, etc.
Note that having a large number of threads is even more inefficient than having select deal with a large number of sockets.
If you think select is inefficient with a large number of sockets, try handling a large number of sockets with one thread per socket. You are in for a world of pain; you will have problems scaling to 1000 threads.
What I have done in the past is this:
Group the sockets into groups of X (512, 1024).
Have one or two threads run over those groups and select() - then hand off the sockets with new data into a queue.
Have a number of worker threads work off those sockets with new data. How many depends on how much I need to max out the CPU ;)
This way I don't have one super über select() with TONS of items, and I also don't waste ridiculous amounts of memory on threads (hint: every thread needs its own stack; at only 2 MB each, that is 2 GB for 1000 sockets - talk about inefficient) or waste huge amounts of CPU on useless context switches.
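Roughly, that structure might look like the sketch below (assuming POSIX sockets and standard C++ threads; the names ready_queue, select_group and worker are purely illustrative, and error handling is omitted):

    // One select() thread per group of sockets pushes readable fds into a shared
    // queue; worker threads pop them and do the actual recv()/processing.
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    std::queue<int> ready_queue;          // sockets with data waiting to be processed
    std::mutex queue_mutex;
    std::condition_variable queue_cv;

    // One of these threads per group of up to FD_SETSIZE sockets.
    void select_group(std::vector<int> sockets) {
        for (;;) {
            fd_set readfds;
            FD_ZERO(&readfds);
            int maxfd = -1;
            for (int fd : sockets) {
                FD_SET(fd, &readfds);
                if (fd > maxfd) maxfd = fd;
            }
            timeval tv{1, 0};                          // wake up periodically
            if (select(maxfd + 1, &readfds, nullptr, nullptr, &tv) <= 0)
                continue;                              // timeout or error: just retry
            std::lock_guard<std::mutex> lock(queue_mutex);
            for (int fd : sockets)
                if (FD_ISSET(fd, &readfds)) ready_queue.push(fd);
            queue_cv.notify_all();
        }
    }

    // Worker threads drain the queue; scale the count to how busy the CPU should be.
    void worker() {
        for (;;) {
            int fd;
            {
                std::unique_lock<std::mutex> lock(queue_mutex);
                queue_cv.wait(lock, [] { return !ready_queue.empty(); });
                fd = ready_queue.front();
                ready_queue.pop();
            }
            char buf[4096];
            ssize_t got = recv(fd, buf, sizeof buf, 0);
            if (got > 0) { /* parse/handle buf[0..got) here */ }
        }
    }

A real implementation also has to avoid handing the same socket to two workers at once and to deal with closed connections, but the shape is the same.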
The question with threads/select is whether you want to avoid clients blocking each other. If this is not an issue, then work single-threaded. If it is, choose an appropriate threading scheme (1 thread per connection, worker threads per connection, worker thread per request,...).
When working with one thread per connection, a select per read/write is a decent solution, but generally speaking it is better to use non-blocking sockets in combination with select, so that you do not block when only part of the expected message arrives, and to do a select after writing as well.
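For the single-socket timeout asked about above, the pattern is simply a select() with a timeval before each blocking call. A minimal sketch, assuming POSIX sockets (error handling trimmed; the -2 timeout convention is just for the example):

    #include <sys/select.h>
    #include <sys/socket.h>

    // Wait up to timeout_sec for the socket to become readable, then recv.
    // Returns bytes read, 0 on orderly close, -1 on error, -2 on timeout.
    ssize_t recv_with_timeout(int fd, void* buf, size_t len, int timeout_sec) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        timeval tv{timeout_sec, 0};
        int n = select(fd + 1, &readfds, nullptr, nullptr, &tv);
        if (n == 0) return -2;            // timed out
        if (n < 0)  return -1;            // select failed
        return recv(fd, buf, len, 0);     // readable now, so this won't block for long
    }

On many platforms you can instead set SO_RCVTIMEO/SO_SNDTIMEO with setsockopt so the blocking calls themselves time out, but support and exact semantics vary, which is one reason the select-first approach is usually preferred for portability.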
Is the curl_multi interface spawning new threads internally to handle multiple requests concurrently? Is it equivalent to spawning threads manually and just using curl_easy handles? Which is more performant? I need to make up to 1000 concurrent requests.
https://curl.haxx.se/libcurl/c/multithread.html
Is using curl_multi equivalent to the example in the link above?
From: https://curl.haxx.se/libcurl/c/libcurl-multi.html
Enable multiple simultaneous transfers in the same thread without making it complicated for the application.
What does it mean? How does it handle multiple transfers in the same thread? I could just as well create 100 threads with 100 curl_easy handles and make the requests there.
Maybe the question should be: When to use multiple threads and when to use curl_multi.
There's no easy or simple answer. libcurl allows you and your application to make the decision and supports working in either mode.
The libcurl multi interface is a single-core, single-thread way to do a large number of parallel transfers in the same thread. It allows for easy reuse of caches, connections and more. That has its clear advantages but will make it CPU-bound on that single CPU.
Doing multi-threaded transfers means each thread/handle has its own cache, connection pool, etc., which changes when those are useful, but it makes the transfers less likely to be CPU-bound since you can spread them out over a larger set of cores/CPUs.
Which design decision is right for you is not easy for us to tell.
No, all the connections on a single curl_multi handle operate on the same thread. It uses a single select/poll/epoll event loop and non-blocking sockets to process all the connections concurrently on the same thread.
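A minimal sketch of what that looks like from the application side, driving a couple of easy handles from one thread (placeholder URLs, error checking omitted):

    #include <curl/curl.h>

    int main(void) {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURLM* multi = curl_multi_init();

        const char* urls[] = { "https://example.com/a", "https://example.com/b" };
        for (int i = 0; i < 2; i++) {
            CURL* easy = curl_easy_init();
            curl_easy_setopt(easy, CURLOPT_URL, urls[i]);
            curl_multi_add_handle(multi, easy);       // both transfers share this one thread
        }

        int still_running = 0;
        do {
            curl_multi_perform(multi, &still_running);          // drive all transfers a bit
            if (still_running)
                curl_multi_wait(multi, NULL, 0, 1000, NULL);    // sleep until a socket is ready
        } while (still_running);

        // A real program would curl_multi_remove_handle() and curl_easy_cleanup()
        // each easy handle here before tearing down the multi handle.
        curl_multi_cleanup(multi);
        curl_global_cleanup();
        return 0;
    }

Nothing in this loop creates a thread; curl_multi_perform simply makes a little progress on every transfer that has work to do each time it is called.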
I'm running a fully operational IOCP TCP socket application. Today I was thinking about the critical section design and now I have one endless question in my head: a global or a per-client critical section? I came to this because, as I see it, there is no point in using multiple worker threads if every thread depends on a single lock, right? I mean... right now I don't see any performance issue with 100 simultaneous clients, but what if there were 10,000?
My shared resource is a per-client pre-allocated struct, so each client has its own I/O context, socket and so on. There is no inter-client resource sharing, so I think that is another point in favor of the per-client CS. I use one accept thread and 8 (processors * 2) worker threads. This application is basically designed for small (< 1 KB) packets, but sometimes for file streaming.
The "correct" answer probably depends on your design, the number of concurrent clients and the performance that you require from the hardware that you have available.
In general, I find it best to go with the simplest thing that works and then profile to locate hot spots.
However... You say that you have no inter-client shared resources so I assume the only synchronisation that you need to do is around 'per-connection' state.
Since it's per connection the obvious (to me) design would be for the per-connection state to contain its own critical section. What do you perceive to be the downside of this approach?
The problem with a single shared lock is that you introduce contention between connections (and threads) that have no reason to block each other. This will adversely affect performance and will likely become a hot-spot as connection numbers rise.
Once you have a per connection lock you might want to look at avoiding using it as often as possible by having the IOCP threads simply lock to place completions in a per connection queue for processing. This has the advantage of allowing a single IOCP thread to work on each connection and preventing a single connection from having additional IOCP threads blocking on it. It also works well with 'skip completion port on success' processing.
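A rough sketch of that shape, with an invented Connection struct that owns its own CRITICAL_SECTION and completion queue (illustrative only, not a drop-in design):

    #include <winsock2.h>
    #include <windows.h>
    #include <deque>

    struct Completion { DWORD bytes = 0; /* buffer, operation type, ... */ };

    struct Connection {
        SOCKET                 socket = INVALID_SOCKET;
        CRITICAL_SECTION       lock;               // protects only this connection's state
        std::deque<Completion> pending;            // completions waiting to be processed
        bool                   processing = false; // is a thread already working on this connection?

        Connection()  { InitializeCriticalSection(&lock); }
        ~Connection() { DeleteCriticalSection(&lock); }
    };

    // Called from an IOCP worker thread when a completion arrives for this connection.
    void OnCompletion(Connection* c, const Completion& done) {
        bool run = false;
        EnterCriticalSection(&c->lock);
        c->pending.push_back(done);
        if (!c->processing) {                      // only one thread processes a connection at a time
            c->processing = true;
            run = true;
        }
        LeaveCriticalSection(&c->lock);

        while (run) {
            Completion work;
            EnterCriticalSection(&c->lock);
            if (c->pending.empty()) {
                c->processing = false;             // queue drained: release ownership
                run = false;
            } else {
                work = c->pending.front();
                c->pending.pop_front();
            }
            LeaveCriticalSection(&c->lock);
            if (run) { /* parse work.bytes worth of data, issue the next WSARecv, ... */ }
        }
    }

The lock is held only long enough to touch the queue, and because of the processing flag a connection is never worked on by two IOCP threads at once.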
I am writing an application which works on 3 tiers, each tier having multiple instances and forming a mesh-like relationship.
E.g. let us consider a matrix-type situation:
L11 L12
L21 L22 L23 L24
L31 L32 L33
L11 has connections with L21, L22, L23, L24
L12 has connections with L21, L22, L23, L24
L21 has connections with L11, L12, L31, L32, L33... and so on
Presently I am using UNIX domain sockets for communication, with a select call (to determine where the data has come from). How does a timeout on the select call affect CPU usage?
Will it be better to spawn a new thread per connection, or persist with the select model? Typically the horizontal scalability at L1 may be around 5, L2 will be around 5 and L3 around 10.
Is there a penalty for select system call if the bit fields passed to select is high?
In the case of a multi-threaded application, should I go with blocking or non-blocking sockets?
Is there a difference between UDP/TCP in UNIX-domain sockets?
Note: I will be using systems with 12-24 cores... This is probably what made me think: why not use separate threads instead of the select call?
Note 2: Will there be buffer overflows in Unix domain sockets? As far as I have read, these are lossless and in-order...
I am writing an application which works on 3 tiers, each tier having multiple instances and forming a mesh-like relationship.
Are all of the nodes in your Matrix running on the same host, or are they running on multiple hosts across a network?
How does a timeout on the select call affect CPU usage?
Only in the obvious way, in that every time your select() call times out you're going to have to go for another (otherwise maybe avoidable) spin around your event loop. So, for example always passing a timeout of zero (or near-zero) would be bad because then your thread would be wasting CPU cycles busy-looping. But the occasional timeout shouldn't be a problem. My recommendation is to be purely event-based as much as possible, and only use timeouts when they are unavoidable (e.g. because such-and-such an event has to happen at a particular time, unrelated to any I/O traffic).
Will it be better to spawn a new thread per connection, or persist with the select model? Typically the horizontal scalability at L1 may be around 5, L2 will be around 5 and L3 around 10.
For this scale of operation (dozens of connections but not hundreds) select() will work fine. It's only when you get up into hundreds or thousands of connections that select() starts to fall down, and even then you could move to something like kqueue() or epoll() rather than going multithreaded.
Is there a penalty for select system call if the bit fields passed to select is high?
I'm not 100% sure what you mean by "bit fields is high", but if you mean "there are lots of sockets specified", then the main penalty is that select() will only work with file descriptors whose integer values are less than FD_SETSIZE (typically 1024) -- which means by implication that select() also can't track more than FD_SETSIZE sockets at a time. (Exception: Under Windows, select() will handle sockets with arbitrary integer values, although it still won't handle more than FD_SETSIZE sockets at a time)
In the case of a multi-threaded application, should I go with blocking or non-blocking sockets?
If it were me, I would still use non-blocking I/O even in a multithreaded application, because blocking I/O makes certain things like a clean shutdown of the application problematic. For example, if you have a thread that is blocked inside recv(), how do you safely get that thread to exit so that you can clean up any shared resources? You can't send a signal because you don't know which thread would receive it, and you can't rely on recv() returning naturally within a finite amount of time either. On the other hand, if your target thread only ever blocks inside select(), you can send the thread a byte on one of the FDs it is select()'ing on to wake it up and tell it go away.
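That wake-up byte is usually implemented with the classic self-pipe trick: keep the read end of a pipe (or socketpair) in every select() set, and have the shutdown path write one byte to the write end. A sketch assuming POSIX (error handling omitted):

    #include <sys/select.h>
    #include <unistd.h>

    int wakeup_pipe[2];                 // [0] = read end, [1] = write end; create with pipe()

    void request_shutdown() {
        char byte = 'x';
        write(wakeup_pipe[1], &byte, 1);        // makes the select() below return
    }

    void event_loop(int sockfd) {
        for (;;) {
            fd_set readfds;
            FD_ZERO(&readfds);
            FD_SET(sockfd, &readfds);
            FD_SET(wakeup_pipe[0], &readfds);
            int maxfd = sockfd > wakeup_pipe[0] ? sockfd : wakeup_pipe[0];
            if (select(maxfd + 1, &readfds, nullptr, nullptr, nullptr) < 0)
                continue;

            if (FD_ISSET(wakeup_pipe[0], &readfds)) {
                char c;
                read(wakeup_pipe[0], &c, 1);
                break;                          // asked to exit: clean up and return
            }
            if (FD_ISSET(sockfd, &readfds)) {
                /* recv() and handle data here */
            }
        }
    }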
Is there a difference between UDP/TCP in UNIX-domain sockets?
Unix-Domain sockets don't use UDP or TCP (since they never go across the network). They do use SOCK_STREAM and SOCK_DGRAM, though, which behave similarly (in most respects) to using TCP and UDP sockets to localhost.
Note: I will be using systems with 12-24 cores... This is probably what made me think: why not use separate threads instead of the select call?
The question to ask yourself is, is your application likely to be compute-bound or I/O bound? If it's compute-bound (and the computations can be parallelized) you might benefit from farming out the computations to multiple threads. If it's I/O bound, OTOH, there won't be any advantage to going multithreaded, because in that case the bottleneck is likely to be your network card (and/or the network itself), and your gigabit-Ethernet jack is still going to be sending a maximum of 1Gb/sec whether it is fed by one thread or many. If all your communication is with other processes on the same host, OTOH, there might be some benefit to multiple threads, but my guess is that it would be minimal, since it looks like you will already be using all of your cores for the dozen or so processes that you have.
Note 2: Will there be buffer overflows in Unix domain sockets? As far as I have read, these are lossless and in-order...
There shouldn't be. SOCK_STREAM Unix-domain sockets work very much like TCP sockets in my experience. (I've never used SOCK_DGRAM Unix-domain sockets, but I would imagine they'd behave similarly to UDP sockets, and would even drop packets sometimes e.g. when the receiving process can't keep up with the sender)
I'm writing a TCP server on Windows Server 2008. This server receives data, parses it and then sinks it into a database. I'm doing some tests now, but the results surprise me.
The application is written in C++ and uses Winsock and the Windows API directly. It creates a thread for each client connection. For each client it reads and parses the data, then inserts it into the database.
Clients always connect in simultaneous chunks - ie. every once in a while about 5 (I control this parameter) clients will connect simultaneously and feed the server with data.
I've been timing both the reading-and-parsing and the database-related stages of each thread.
The first stage (reading-and-parsing) has a curious behavior. The amount of time each thread takes is roughly the same for every thread, but it is also proportional to the number of threads connected. The server is not CPU-starved: it has 8 cores and there are always fewer than 8 threads connected to it.
For example, with 3 simultaneous threads, 100k rows (for each thread) will be read-and-parsed in about 4.5 s. But with 5 threads it takes 9.1 s on average!
A friend of mine suggested this scaling behavior might be related to the fact that I'm using blocking sockets. Is this right? If not, what might be the reason for this behavior?
If it is, I'd be glad if someone can point me out good resources for understanding non blocking sockets on Windows.
Edit:
Each client thread reads a line (i.e., all characters until a '\n') from the socket, then parses it, then reads again, until the parse fails or a terminator character is found. My readline routine is based on this:
http://www.cis.temple.edu/~ingargio/cis307/readings/snaderlib/readline.c
With static variables being declared as __declspec(thread).
The parsing, judging from the non-networking version, is efficient (about 2 s for 100k rows). I therefore assume the problem is in the multithreaded/network version.
If your lines are ~120–150 characters long, you are actually saturating the network!
There's no issue with sockets. Simply transferring 3 times 100k lines of 150 bytes each over a 100 Mbps line (I take 10 bits per byte on the wire to account for headers) will take... 4.5 s: 3 × 100,000 × 150 bytes ≈ 45 MB, which is 450 Mbit at 10 bits per byte, or 4.5 s at 100 Mbps. There is no problem with the sockets, blocking or otherwise. You've simply hit the limit of how much data you can feed the server.
Non-blocking sockets are only useful if you want one thread to service multiple connections. Using non-blocking sockets and a poll/select loop means that your thread does not sit idle while waiting for new connections.
In your case this is not an issue since there is only one connection per thread so there is no issue if your thread is waiting for input.
Which leads to your original questions of why things slow down when you have more connections. Without further information, the most likely culprit is that you are network limited: ie your network cannot feed your server data fast enough.
If you are interested in non-blocking sockets on Windows do a search on MSDN for OVERLAPPED APIs
You could be running into other threading related issues, like deadlocks/race conditions/false sharing, which could be destroying the performance.
One thing to keep in mind is that although you have one thread per client, Windows will not automatically ensure they all run on different cores. If some of them are running on the same core, it is possible (although unlikely) to have your server in a sort of CPU-starved state, with some cores at 100% load and others idle. There are simply no guarantees as to how the OS spreads the load (in the default case).
To explicitly assign threads to particular cores, you can use SetThreadAffinityMask. It may or may not be worth having a bit of a play around with this to see if it helps.
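For reference, a minimal sketch of pinning the calling thread to a single core with SetThreadAffinityMask (the core index is just an example):

    #include <windows.h>

    // The mask is a bit field: bit N set means the thread may run on processor N.
    void PinCurrentThreadToCore(unsigned core) {
        DWORD_PTR mask = static_cast<DWORD_PTR>(1) << core;
        if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
            // call failed; the thread keeps its previous affinity
        }
    }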
On the other hand, this may not have anything to do with it at all. YMMV.
I have a program (process) which needs to listen on 3 ports... Two are TCP and the other is UDP.
The two TCP ports are going to be receiving large amounts of data every so often (could be as little as every 5 minutes or as often as every 20 seconds). The third (UDP) port is receiving constant data. Now, does it make sense to have these listening on different threads?
For instance, when I receive a large amount of data from one of the TCP ports, I don't want my UDP stream interrupted... are these common concerns for network programming?
I'll be using the Boost library on Windows if that has any bearing.
I'm just looking for some thoughts/ideas/guidance on this issue and how to manage multiple connections.
In general, avoid threads unless necessary. On a modern machine you will get better performance by using a single thread and using I/O readiness/completion features. On Windows this is I/O Completion Ports; on Mac OS X and FreeBSD, kqueue(2); on Solaris, event ports; on Linux, epoll; on VMS, QIO; etc.
In boost, this is abstracted by boost::asio.
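As an illustration of that single-threaded model, here is a minimal Asio sketch: one io_context (called io_service in older Boost releases) services every connection from whichever thread calls run(). The port number and buffer size are arbitrary:

    #include <boost/asio.hpp>
    #include <array>
    #include <functional>
    #include <memory>

    using boost::asio::ip::tcp;

    // Start an asynchronous read chain on one connection; the handler re-arms itself.
    void start_read(std::shared_ptr<tcp::socket> sock,
                    std::shared_ptr<std::array<char, 4096>> buf) {
        sock->async_read_some(boost::asio::buffer(*buf),
            [sock, buf](boost::system::error_code ec, std::size_t n) {
                if (ec) return;              // connection closed or error
                /* process n bytes from buf->data() */
                start_read(sock, buf);       // queue the next read
            });
    }

    int main() {
        boost::asio::io_context io;
        tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 9000));

        // Accept loop: every accepted socket gets its own read chain, all of them
        // serviced by the single thread that runs io.run().
        std::function<void()> do_accept = [&] {
            auto sock = std::make_shared<tcp::socket>(io);
            acceptor.async_accept(*sock, [&, sock](boost::system::error_code ec) {
                if (!ec) start_read(sock, std::make_shared<std::array<char, 4096>>());
                do_accept();
            });
        };
        do_accept();

        io.run();                            // one thread handles every connection
        return 0;
    }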
The place where threads would be useful is where you must do significant processing or perform a blocking operating system call that would add unacceptable latency to the rest of the networking processing.
Using threads, one per receiving connection, will help keep the throughput high and prevent one port's data from blocking the processing of another port's.
This is a good idea, especially since you're talking about only having 3 connections. Spawning three threads to handle the communication will make your application much easier to maintain and keep performant.
First,
...are these common concerns for network programming?
Yes, threading issues are very common concerns in network programming.
Second, your design idea of using three threads for listening on three different ports would work, allowing you to listen on all three ports simultaneously. As pointed out in the comments, this isn't the only way to do it.
Last, one design that's common in network programming is to have one thread listen for a connection, then spawn a new helper thread to handle processing the connection. The original listener thread just passes the socket connection off to the helper, then goes right back to listening for new connections.
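A minimal sketch of that listener/helper pattern with BSD sockets and std::thread (the port and handler are placeholders, error checking omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <thread>

    // Helper thread: owns the client socket and processes it until it closes.
    void handle_client(int clientfd) {
        char buf[4096];
        ssize_t n;
        while ((n = recv(clientfd, buf, sizeof buf, 0)) > 0) {
            /* process n bytes */
        }
        close(clientfd);
    }

    // Listener thread: accept connections and hand each one off immediately.
    void listener(int port) {
        int listenfd = socket(AF_INET, SOCK_STREAM, 0);
        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons(port);
        bind(listenfd, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
        listen(listenfd, SOMAXCONN);

        for (;;) {
            int clientfd = accept(listenfd, nullptr, nullptr);
            if (clientfd < 0) continue;
            std::thread(handle_client, clientfd).detach();   // helper owns the socket now
        }
    }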
Multiple threads do not scale. Operating system threads are far too heavyweight at the scale of thousands or tens of thousands of connections. Granted, you only have 3, so it's no big deal. In fact, I might even recommend it in your case in the name of simplicity, if you are certain your application will not need to scale.
But at scale, you'd want to use select()/poll()/epoll()/libevent/etc. On modern Linux systems, epoll() is by far the most robust and is blazingly fast. One thread polls for socket readiness on all sockets simultaneously, and then sockets that signal as ready are either handled immediately by the single thread, or more often than not handed off to a thread pool. The thread pool has a limited number of threads (usually some ratio of the number of compute cores available on the local machine), wherein a free thread is grabbed whenever a socket is ready.
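For illustration, the readiness part of such a design on Linux might look roughly like this (the hand-off to a worker pool is left as a comment; error handling omitted):

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    // One thread waits on epoll for readiness across every socket.
    void epoll_loop(int listenfd) {
        int epfd = epoll_create1(0);

        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.fd = listenfd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);

        epoll_event events[64];
        for (;;) {
            int n = epoll_wait(epfd, events, 64, -1);    // block until something is ready
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == listenfd) {
                    // New connection: accept it and start watching it too.
                    int clientfd = accept(listenfd, nullptr, nullptr);
                    epoll_event cev{};
                    cev.events = EPOLLIN;
                    cev.data.fd = clientfd;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, clientfd, &cev);
                } else {
                    // fd has data: read it here, or hand it to a worker-pool thread.
                }
            }
        }
    }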
Again, you can use a thread per connection, but if you're interested in learning the proper way to build scalable network systems, don't. Building a select()/poll() style server is a fantastic learning experience.
For reference, see the C10K problem:
http://www.kegel.com/c10k.html