First of all: sorry for my english.
Guys, I have trouble with POSIX sockets and/or pthreads. I'm developing on an embedded device (ARM9 CPU). A multithreaded TCP server will run on the device, and it has to be able to process a lot of incoming connections. The server accepts a connection from a client and increments a counter variable (unsigned int counter). Client routines run in separate threads. All clients use one singleton class instance (in this class the same files are opened and closed). A client works with the files, then the client thread closes the connection socket and calls pthread_exit().
So, my TCP server can't handle more than 250 threads (counter = 249, plus 1 for the server thread), and I get "Resource temporarily unavailable". What's the problem?
Whenever you hit the thread limit - or, as mentioned, run out of virtual process address space due to the number of threads - you're... doing it wrong. More threads don't scale, especially not in embedded programming. You can handle requests on a thread pool instead, and use poll(2) to handle many connections on fewer threads. This is pretty well-trodden territory, and libraries (like ACE and Asio) have been leveraging this model for good reason.
The 'thread-per-request' model is mainly popular because of its (perceived) simple design.
As long as you keep connections on a single logical thread (sometimes known as a strand) there is no real difference, though.
Also, if handling a request involves no blocking operations, you can never do better than polling and handling on a single thread after all: you can use the 'backlog' feature of listen/accept to let the kernel worry about pending connections for you! (Note: this assumes a single-core CPU; on a dual-core CPU this kind of processing would be optimal with one thread per CPU.)
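For illustration, a minimal sketch of that single-threaded poll(2) approach, assuming the listening socket has already been set up elsewhere with socket()/bind()/listen() (the commented-out handle_client() is a hypothetical handler, not a real API):

// Single-threaded poll(2) loop: one pollfd for the listening socket plus
// one per accepted client. The kernel's listen() backlog queues pending
// connections between iterations.
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

void serve(int listen_fd)
{
    std::vector<pollfd> fds{{listen_fd, POLLIN, 0}};

    for (;;) {
        if (poll(fds.data(), fds.size(), -1) < 0)
            continue;                               // interrupted; retry

        if (fds[0].revents & POLLIN) {              // new connection pending
            int client = accept(listen_fd, nullptr, nullptr);
            if (client >= 0)
                fds.push_back({client, POLLIN, 0});
        }

        for (size_t i = 1; i < fds.size(); ) {
            if (fds[i].revents & (POLLIN | POLLHUP)) {
                char buf[512];
                ssize_t n = recv(fds[i].fd, buf, sizeof buf, 0);
                if (n <= 0) {                       // closed or error: drop it
                    close(fds[i].fd);
                    fds.erase(fds.begin() + i);
                    continue;
                }
                // handle_client(fds[i].fd, buf, n);  // hypothetical handler
            }
            ++i;
        }
    }
}

A real server would also make the sockets non-blocking and cap the number of accepted connections, but the shape is the same: one thread, many descriptors.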
Edit - in addition, re:
"ulimit shows how many threads the OS can handle, right? If so, ulimit does not solve my problem because my app only uses ~10-15 threads at the same time."
If that's the case, you should really double-check that you are joining or detaching all threads properly. Also think of the synchronization objects: if you consistently forget to call the relevant pthread_*_destroy functions, you'll run into the limits even without needing to. That would of course be a resource leak. Some tools may be able to help you spot them (valgrind/helgrind come to mind).
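As a tiny sketch of the cleanup being referred to (the helper is illustrative, not from the asker's code): every thread must be either joined or detached, and every pthread synchronization object must eventually be destroyed.

// Sketch: release per-thread and per-object resources so they don't leak.
#include <pthread.h>

void cleanup_example(pthread_t worker, pthread_mutex_t* m, pthread_cond_t* cv)
{
    pthread_join(worker, nullptr);   // or pthread_detach(worker) right after creation
    pthread_mutex_destroy(m);        // otherwise kernel/libc resources leak over time
    pthread_cond_destroy(cv);
}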
Use ulimit -n to check the number of file system handles. You can increase it for your current session if the number is too low.
You can also edit /etc/security/limits.conf to set a permanent limit.
Usually, the first limit you are hitting on 32-bit systems is that you are running out of virtual address space when using default stack sizes.
Try explicitly specifying the stack size when creating threads (to less than 1 MB) or setting the default stack size with "ulimit -s".
Also note that you need to either pthread_detach or pthread_join your threads so that all resources will be freed.
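For illustration, a minimal sketch of spawning a detached client thread with an explicit, small stack; thread_main is a hypothetical per-client routine, and error handling is trimmed:

// Create a detached thread with a 256 KB stack so that hundreds of threads
// do not exhaust a 32-bit address space (the default stack can be 8 MB).
#include <pthread.h>

void* thread_main(void* arg);   // hypothetical per-client routine

int spawn_client_thread(void* client_ctx)
{
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);                 // 256 KB stack
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);  // no join needed

    pthread_t tid;
    int rc = pthread_create(&tid, &attr, thread_main, client_ctx);

    pthread_attr_destroy(&attr);
    return rc;   // 0 on success, else an errno-style error code
}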
Related
I'm running a fully operational IOCP TCP socket application. Today I was thinking about the critical-section design, and now I have one endless question in my head: a global or a per-client critical section? I came to this because, as I see it, there is no point in using multiple worker threads if every thread depends on a single lock, right? I mean... right now I don't see any performance issue with 100 simultaneous clients, but what if there were 10000?
My shared resource is a per-client pre-allocated struct, so each client has its own IO context, socket and so on. There is no inter-client resource sharing, so I think that is another point in favour of the per-client CS. I use one accept thread and 8 (processors * 2) worker threads. This application is basically designed for small (< 1KB) packets, but sometimes for file streaming.
The "correct" answer probably depends on your design, the number of concurrent clients and the performance that you require from the hardware that you have available.
In general, I find it best to go with the simplest thing that works and then profile to locate hot spots.
However... You say that you have no inter-client shared resources so I assume the only synchronisation that you need to do is around 'per-connection' state.
Since it's per connection the obvious (to me) design would be for the per-connection state to contain its own critical section. What do you perceive to be the downside of this approach?
The problem with a single shared lock is that you introduce contention between connections (and threads) that have no reason to block each other. This will adversely affect performance and will likely become a hot-spot as connection numbers rise.
Once you have a per connection lock you might want to look at avoiding using it as often as possible by having the IOCP threads simply lock to place completions in a per connection queue for processing. This has the advantage of allowing a single IOCP thread to work on each connection and preventing a single connection from having additional IOCP threads blocking on it. It also works well with 'skip completion port on success' processing.
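A rough sketch of that shape, using std::mutex as a stand-in for a Win32 CRITICAL_SECTION (Completion, the field names and enqueue_completion() are illustrative, not part of any real IOCP framework):

// Per-connection state: its own lock plus a queue of completions. IOCP
// worker threads only take the lock long enough to enqueue; whichever
// thread flips 'busy' becomes the single worker draining that connection.
#include <deque>
#include <mutex>

struct Completion { /* bytes transferred, buffer, operation type, ... */ };

struct Connection {
    std::mutex             lock;          // per connection, so unrelated clients never contend
    std::deque<Completion> pending;       // completions queued by IOCP threads
    bool                   busy = false;  // true while one thread is draining 'pending'
};

// Called from an IOCP worker when a completion arrives for this connection.
// Returns true if the caller should drain the queue; false if another thread
// is already working on this connection.
bool enqueue_completion(Connection& c, Completion done)
{
    std::lock_guard<std::mutex> guard(c.lock);
    c.pending.push_back(done);
    if (c.busy) return false;   // the current worker will pick it up
    c.busy = true;
    return true;                // caller becomes the sole worker for this connection
}

The draining thread would pop completions under the lock, process them outside it, and clear 'busy' once the queue is empty; that keeps at most one thread active per connection.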
My coworkers and I have a good feeling that OpenSSL more or less needs to get pitched from our application but I want some opinions on whether it really is this bad or whether there are issues in our use of this library that could be causing us trouble.
The setting: A multi-threaded C++ application that maintains a persistent SSL connection for each user.
At 500 users it has worked fine. I'm trying to increase the limit to 1000, and at around 960 I got a segfault in SSL_read. This read is the first I/O operation for this particular connection. I had to increase the file limit in ulimit from 1024 to 4096 to get up this high. So my questions are:
1) Is it possible the library needs to be configured to know to accept this many connections?
2) Is it a threading issue that may be solved with light use of mutexes? I can't afford to turn the entire SSL_read into a critical section, though.
3) Just a bad buggy library and needs to be thrown out?
Based on your comments, one thread per connection doesn't seem like an efficient use of threads.
I would suggest a thread pool, using worker threads to handle received packets. The received packets could be enqueued in a queue, and the worker threads would process packets from that queue. The OpenSSL connections could be stored in a container common to all threads. Care will have to be taken to handle the packets in order. And yes, synchronization (a mutex) will be needed.
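To illustrate the shape of that design, here is a minimal sketch; Packet, process_packet() and the class name are placeholders, not OpenSSL types or calls:

// A shared queue of received packets drained by a fixed set of workers.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Packet { int connection_id; std::string payload; };

void process_packet(const Packet&);   // hypothetical: decrypt, parse, reply

class PacketPool {
public:
    explicit PacketPool(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~PacketPool() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
    void enqueue(Packet p) {
        { std::lock_guard<std::mutex> g(m_); q_.push(std::move(p)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            Packet p;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !q_.empty(); });
                if (done_ && q_.empty()) return;
                p = std::move(q_.front());
                q_.pop();
            }
            process_packet(p);   // per-connection ordering still needs care here
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Packet> q_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};

Keeping all packets for one connection on one worker (or tagging them with a sequence number) is one way to address the ordering concern mentioned above.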
I'm writing a TCP server on Windows Server 2k8. This servers receives data, parses it and then sinks it on a database. I'm doing now some tests but the results surprises me.
The application is written in C++ and uses Winsock and the Windows API directly. It creates a thread for each client connection. For each client it reads and parses the data, then inserts it into the database.
Clients always connect in simultaneous chunks - ie. every once in a while about 5 (I control this parameter) clients will connect simultaneously and feed the server with data.
I've been timing both the reading-and-parsing and the database-related stages of each thread.
The first stage (reading-and-parsing) shows curious behavior. The amount of time each thread takes is roughly equal across threads, but is also proportional to the number of threads connecting. The server is not CPU-starved: it has 8 cores and there are always fewer than 8 threads connected to it.
For example, with 3 simultaneous threads, 100k rows (per thread) will be read and parsed in about 4.5 s. But with 5 threads it takes 9.1 s on average!
A friend of mine suggested this scaling behavior might be related to the fact that I'm using blocking sockets. Is this right? If not, what might be the reason for this behavior?
If it is, I'd be glad if someone could point me to good resources for understanding non-blocking sockets on Windows.
Edit:
Each client thread reads a line (i.e., all chars until a '\n') from the socket, then parses it, then reads again, until the parse fails or a terminator character is found. My readline routine is based on this:
http://www.cis.temple.edu/~ingargio/cis307/readings/snaderlib/readline.c
With static variables being declared as __declspec(thread).
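For illustration, a simplified sketch of that kind of per-thread buffered readline (not the snaderlib code verbatim; the names are illustrative, and it links against ws2_32):

// Per-thread buffered readline over a Winsock socket. The buffer, count and
// pointer are thread-local, mirroring the __declspec(thread) statics above.
#include <winsock2.h>

namespace {
    __declspec(thread) char  rl_buf[4096];
    __declspec(thread) int   rl_cnt = 0;        // bytes left in rl_buf
    __declspec(thread) char* rl_ptr = nullptr;  // next unread byte
}

// Returns the number of bytes placed in 'line' (excluding the '\n'),
// 0 on orderly shutdown with nothing read, or -1 on a recv error.
int readline(SOCKET s, char* line, int maxlen)
{
    int n = 0;
    while (n < maxlen - 1) {
        if (rl_cnt <= 0) {                          // refill the per-thread buffer
            rl_cnt = recv(s, rl_buf, sizeof(rl_buf), 0);
            if (rl_cnt == 0) break;                 // peer closed the connection
            if (rl_cnt < 0) return -1;              // recv error
            rl_ptr = rl_buf;
        }
        char c = *rl_ptr++;
        --rl_cnt;
        if (c == '\n') break;                       // end of line reached
        line[n++] = c;
    }
    line[n] = '\0';
    return n;
}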
The parsing, judging from the non-networking version, is efficient (about 2 s for 100k rows). I therefore assume the problem is in the multithreaded/networked version.
If your lines are ~120–150 characters long, you are actually saturating the network!
There's no issue with the sockets. Simply transferring 3 x 100k lines of 150 bytes each over a 100 Mbps line (taking 10 bits per byte to account for protocol overhead) will take: 3 x 100,000 x 150 bytes x 10 bits/byte = 450 Mbit, which is 4.5 s at 100 Mbps. There is no problem with the sockets, blocking or otherwise; you've simply hit the limit of how much data you can feed the server.
Non-blocking sockets are only useful if you want one thread to service multiple connections. Using non-blocking sockets and a poll/select loop means that your thread does not sit idle while waiting for new connections.
In your case this is not an issue since there is only one connection per thread so there is no issue if your thread is waiting for input.
Which leads to your original questions of why things slow down when you have more connections. Without further information, the most likely culprit is that you are network limited: ie your network cannot feed your server data fast enough.
If you are interested in non-blocking sockets on Windows, do a search on MSDN for the OVERLAPPED APIs.
You could be running into other threading related issues, like deadlocks/race conditions/false sharing, which could be destroying the performance.
One thing to keep in mind is that although you have one thread per client, Windows will not automatically ensure they all run on different cores. If some of them are running on the same core, it is possible (although unlikely) to have your server in a sort of CPU-starved state, with some cores at 100% load and others idle. There are simply no guarantees as to how the OS spreads the load (in the default case).
To explicitly assign threads to particular cores, you can use SetThreadAffinityMask. It may or may not be worth having a bit of a play around with this to see if it helps.
On the other hand, this may not have anything to do with it at all. YMMV.
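For completeness, a minimal sketch of pinning the current thread to one core with SetThreadAffinityMask (the helper name and core index are just illustrative):

// Pin the calling thread to a single logical CPU; error handling omitted.
#include <windows.h>

bool pin_current_thread_to_core(unsigned core)
{
    DWORD_PTR mask = DWORD_PTR(1) << core;              // one bit per logical CPU
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}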
I just reviewed some really terrible code - code that sends messages on a serial port by creating a new thread to package and assemble the message for every single message sent. Yes, for every message a pthread is created, bits are properly set up, then the thread terminates. I haven't a clue why anyone would do such a thing, but it raises the question: how much overhead is there when actually creating a thread?
To resurrect this old thread, I just did some simple test code:
#include <thread>
int main(int argc, char** argv)
{
for (volatile int i = 0; i < 500000; i++)
std::thread([](){}).detach();
return 0;
}
I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18) heavily loaded (doing a database rebuild) slow laptop (Intel core i5-2540M). Results from three consecutive runs: 5.647s, 5.515s, and 5.561s. So we're looking at a tad over 10 microseconds per thread on this machine, probably much less on yours.
That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds. Now, of course there are various additional thread costs one can incur involving passed/captured arguments (although function calls themselves can impose some), cache slowdowns between cores (if multiple threads on different cores are fighting over the same memory at the same time), etc. But in general I highly doubt the use case you presented will adversely impact performance at all (and could provide benefits, depending), despite your having already preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.
Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (with what sort of distribution? uniform, clustered, etc...?) and what's their structure like? How many cores does the system have? Etc. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".
Note that thread pools aren't magic and can in some cases be a slowdown versus unique threads, since one of the biggest slowdowns with threads is synchronizing cached memory used by multiple threads at the same time, and thread pools, by their very nature of having to look for and process updates from a different thread, have to do this. So either your primary thread or the child processing thread can get stuck waiting if the processor isn't sure whether the other thread has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling task once (when it's launched) and then they never interfere with each other again.
I have always been told that thread creation is cheap, especially when compared to the alternative of creating a process. If the program you are talking about does not have a lot of operations that need to run concurrently then threading might not be necessary, and judging by what you wrote this might well be the case. Some literature to back me up:
http://www.personal.kent.edu/~rmuhamma/OpSystems/Myos/threads.htm
Threads are cheap in the sense that:
- They only need a stack and storage for registers, so threads are cheap to create.
- Threads use very few resources of the operating system they are working in. That is, threads do not need a new address space, global data, program code or operating system resources.
- Context switching is fast when working with threads, because we only have to save and/or restore the PC, SP and registers.
More of the same here.
In Operating System Concepts 8th Edition (page 155) the authors write about the benefits of threading:
Allocating memory and resources for process creation is costly. Because threads share the resources of the process to which they belong, it is more economical to create and context-switch threads. Empirically gauging the difference in overhead can be difficult, but in general it is much more time consuming to create and manage processes than threads. In Solaris, for example, creating a process is about thirty times slower than creating a thread, and context switching is about five times slower.
There is some overhead in thread creation, but compared with the usually slow baud rates of the serial port (19200 bits/sec being the most common), it just doesn't matter.
...sends Messages on a serial port ... for every message a pthread is created, bits are properly set up, then the thread terminates. ...how much overhead is there when actually creating a thread?
This is highly system-specific. For example, the last time I used VMS, threading was nightmarishly slow (it's been years, but from memory one thread could create something like 10 more per second, and if you kept that up for a few seconds without threads exiting you'd core dump), whereas on Linux you can probably create thousands. If you want to know exactly, benchmark it on your system. But it's not much use knowing that without knowing more about the messages: whether they average 5 bytes or 100k, whether they're sent contiguously or the line idles in between, and what the latency requirements of the app are, are all as relevant to the appropriateness of the code's thread use as any absolute measurement of thread-creation overhead. And performance may not have needed to be the dominant design consideration.
You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.
In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive - somewhere on the order of tens of microseconds. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.
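A minimal sketch of that alternative: one long-lived worker signalled through a condition variable whenever a message is queued. write_to_serial_port() is a placeholder for the real output routine, and there is deliberately no shutdown path since the worker is meant to live for the life of the app.

// Single eternal worker thread draining a message queue, instead of one
// thread per message.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

void write_to_serial_port(const std::string& msg);   // hypothetical output routine

class SerialWriter {
public:
    SerialWriter() : worker_([this] { run(); }) {}
    void send(std::string msg) {
        { std::lock_guard<std::mutex> g(m_); queue_.push(std::move(msg)); }
        ready_.notify_one();                 // wake the single worker
    }
private:
    void run() {
        for (;;) {
            std::unique_lock<std::mutex> lk(m_);
            ready_.wait(lk, [this] { return !queue_.empty(); });
            std::string msg = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            write_to_serial_port(msg);       // only this thread touches the port
        }
    }
    std::mutex m_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    std::thread worker_;                     // created once, lives for the app's lifetime
};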
I used the above "terrible" design in a VoIP app I made. It worked very well: absolutely no latency or missed/dropped packets for locally connected computers. Each time a data packet arrived, a thread was created and handed that data to process it to the output devices. Of course the packets were large, so it caused no bottleneck. Meanwhile the main thread could loop back to wait and receive another incoming packet.
I have tried other designs where the threads I need are created in advance, but this creates its own problems. First, you need to design your code properly for threads to retrieve the incoming packets and process them in a deterministic fashion. If you use multiple (pre-allocated) threads, it's possible that the packets may be processed 'out of order'. If you use a single (pre-allocated) thread to loop and pick up the incoming packets, there is a chance that thread might encounter a problem and terminate, leaving no threads to process any data.
Creating a thread to process each incoming data packet works very cleanly, especially on multi-core systems and where incoming packets are large. Also, to answer your question more directly, the alternative to thread creation is to create a run-time process that manages the pre-allocated threads. Being able to synchronize data hand-off and processing, as well as detecting errors, may add just as much, if not more, overhead than simply creating a new thread. It all depends on your design and requirements.
Thread creation and computing in a thread is pretty expensive.
All data structures need to be set up, the thread must be registered with the kernel, and a thread switch must occur so that the new thread actually gets executed (at an unspecified and unpredictable time).
Executing thread.start does not mean that the thread main function is called immediately.
As the article (mentioned by typoking) points out creation of a thread is cheap only compared to the creation of a process. Overall, it is pretty expensive.
I would never use a thread:
- for a short computation
- for a computation where I need the result in my flow of code (that means I am starting the thread and waiting for it to return the result of its computation)
In your example, it would make sense (as has already been pointed out) to create a thread that handles all of the serial communication and is eternal.
hth
Mario
For comparison, take a look at OS X: Link
Kernel data structures: approximately 1 KB
Stack space: 512 KB (secondary threads); 8 MB (OS X main thread); 1 MB (iOS main thread)
Creation time: approximately 90 microseconds
I would guess that POSIX thread creation is in the same ballpark (not a far-away figure).
On any sane implementation, the cost of thread creation should be proportional to the number of system calls it involves, and on the same order of magnitude as familiar system calls like open and read. Some casual measurements on my system showed pthread_create taking about twice as much time as open("/dev/null", O_RDWR), which is very expensive relative to pure computation but very cheap relative to any IO or other operations which would involve switching between user and kernel space.
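A rough sketch of such a measurement (not the original benchmark; this version also pays for the join, so treat the numbers as order-of-magnitude only). Build with something like g++ bench.cpp -O2 -pthread:

// Compare the cost of creating/joining a no-op thread against opening
// /dev/null, as a crude order-of-magnitude check.
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

static void* noop(void*) { return nullptr; }

int main()
{
    const int N = 10000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) {
        pthread_t t;
        pthread_create(&t, nullptr, noop, nullptr);
        pthread_join(t, nullptr);
    }
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)
        close(open("/dev/null", O_RDWR));
    auto t2 = std::chrono::steady_clock::now();

    std::printf("pthread_create+join: %lld us total\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    std::printf("open(/dev/null):     %lld us total\n",
        (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());
}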
It is indeed very system-dependent. I tested @Nafnlaus's code:
#include <thread>
int main(int argc, char** argv)
{
for (volatile int i = 0; i < 500000; i++)
std::thread([](){}).detach();
return 0;
}
On my Desktop Ryzen 5 2600:
Windows 10, compiled with MSVC 2019 in Release mode, with std::chrono calls added around the loop to time it. Idle (only Firefox with 217 tabs open):
It took around 20 seconds (20.274, 19.910, 20.608), and also ~20 seconds with Firefox closed.
Ubuntu 18.04 compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
It took around 5 seconds (5.595, 5.230, 5.297)
The same code on my Raspberry Pi 3B compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
took around 15 seconds (16.225, 14.689, 16.235)
Interesting.
I tested with my FreeBSD PCs and got the following results:
FreeBSD 12-STABLE, Core i3-8100T, 8 GB RAM: 9.523 s
FreeBSD 12.1-RELEASE, Core i5-6600K, 16 GB RAM: 8.045 s
You need to do
sysctl kern.threads.max_threads_per_proc=500100
though.
The Core i3-8100T is pretty slow, but the results are not very different. Rather, the CPU clocks seem to be more relevant: i3-8100T at 3.1 GHz vs. i5-6600K at 3.5 GHz.
As others have mentioned, this seems to be very OS-dependent. On my Core i5-8350U running Windows 10, it took 118 seconds, which indicates an overhead of around 237 µs per thread (I suspect that the virus scanner and all the other rubbish IT installed is really slowing it down too). A dual Xeon E5-2667 v4 running Windows Server 2016 took 41.4 seconds (82 µs per thread), but it is also running a lot of IT garbage in the background, including the virus scanner. I think a better approach is to implement a queue with a thread that continuously processes whatever is in the queue, to avoid the overhead of creating and destroying a thread every time.
I am considering the use of potentially hundreds of threads to implement tasks that manage devices over a network.
This is a C++ application running on a PowerPC processor with a Linux kernel.
After an initial phase when each task does synchronization to copy data from the device into the task, the task becomes idle, and only wakes up when it receives an alarm, or needs to change some data (configuration), which is rare after the start phase. Once all tasks reach the "idle" phase, I expect that only a few per second will need to wake.
So, my main concern is, if I have hundreds of threads will they have a negative impact on the system once they become idle?
Thanks.
amso
edit:
I'm updating the question based on the answers that I got. Thanks guys.
So it seems that having a ton of threads idling (IO-blocked, waiting, sleeping, etc.) will not, per se, have an impact on the system in terms of responsiveness.
Of course, each thread still costs memory for its stack and TLS data, but that's okay as long as we throw more memory at the thing (making it more €€€).
But then, other issues have to be accounted for. Having hundreds of threads waiting will likely increase memory usage in the kernel, due to the need for wait queues or other similar resources. There's also a latency issue, which looks non-deterministic. To check the responsiveness and memory usage of each solution, one should measure and compare them.
Finally, the whole idea of hundreds of threads that will be mostly idling may be remodeled as a thread pool. This reduces code linearity a bit but dramatically increases the scalability of the solution, and with proper care it can be easily tuned to adjust the compromise between performance and resource usage.
I think that's all. Thanks everyone for their input.
--
amso
Each thread has overhead - most importantly each one has its own stack and TLS. Performance is not that much of a problem since they will not get any time slices unless they actually do anything. You may still want to consider using thread pools.
Chiefly they will use up address space and memory for stacks; once you get to, say, 1000 threads, this gets quite significant, as I've seen that 10 MB per thread is typical for stacks (on x86_64). It is changeable, but only with care.
If you have a 32-bit processor, address space will be the main limitation; once you hit 1000s of threads, you can easily exhaust the AS.
They use up some kernel memory, but probably not as much as userspace.
Edit: of course threads share address space with each other only if they are in the same process; I am assuming that they are.
I'm not a Linux hacker, but assuming that Linux's thread scheduling is similar to Windows'...
Yes, of course there will be some impact. Every bit of memory you consume will potentially have some impact.
However, in a time-sliced environment, threads that are in a Wait/Sleep/Join state will not consume CPU cycles until they are awoken.
I would be worried about offering 1:1 thread-to-connection mappings, if nothing else because it leaves you rather exposed to denial-of-service attacks (pthread_create() is a fairly expensive operation compared to just a call to accept()).
EboMike has already answered the question directly - provided threads are blocked and not busy-waiting then they won't consume much in the way of resources although they will occupy memory and swap for all the per-thread state.
I'm learning the basics of the kernel now. I can't give you a specific answer yet; I'm still a noob... but here are some things for you to chew on.
Linux implements each POSIX thread as a unique process. This will create overhead, as others have mentioned. In addition to this, your waiting model appears flawed any way you do it. If you create one condition variable for each thread, then I think (based on my interpretation of the website below) that you'll actually be expending a lot of kernel memory, as each thread would be placed into its own wait queue. If instead you break your threads up so that each group of X threads shares a condition variable, then you have problems as well, because every time the variable signals you must wake up _EVERY_DARN_PROCESS_ in that variable's wait queue.
I also assume that you will need to do some object sharing and synchronization. In that case, your code may get slower because of the need to wake up all processes waiting on a resource, as I mentioned earlier.
I know this wasn't much help, but as I said, I'm a kernel noob. Hope it helped a little.
http://book.chinaunix.net/special/ebook/PrenticeHall/PrenticeHallPTRTheLinuxKernelPrimer/0131181637/ch03lev1sec7.html
I'm not sure what "device" you are talking about, but if it's a file descriptor, I'd suggest that you look at starting to migrate to either poll or epoll (I'd suggest the latter, given the description of how active you expect each file descriptor to be). That way, you could use one process which would be responsible for all the fds.
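A minimal epoll sketch of that idea, with one thread watching many descriptors (the commented-out handler name is hypothetical):

// One epoll instance, many registered file descriptors; the loop only wakes
// when a descriptor becomes readable.
#include <sys/epoll.h>
#include <unistd.h>

int make_epoll() { return epoll_create1(0); }

bool add_fd(int ep, int fd)
{
    epoll_event ev{};
    ev.events  = EPOLLIN;               // wake when the fd becomes readable
    ev.data.fd = fd;
    return epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) == 0;
}

void event_loop(int ep)
{
    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);   // block until activity
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            // handle_device_event(fd);            // hypothetical per-device handler
            (void)fd;
        }
    }
}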