How much overhead is there when creating a thread? - c++

I just reviewed some really terrible code - code that sends messages on a serial port by creating a new thread to package and assemble each message. Yes, for every message a pthread is created, the bits are properly set up, then the thread terminates. I haven't a clue why anyone would do such a thing, but it raises the question - how much overhead is there when actually creating a thread?

To resurrect this old thread, I just did some simple test code:
#include <thread>
int main(int argc, char** argv)
{
for (volatile int i = 0; i < 500000; i++)
std::thread([](){}).detach();
return 0;
}
I compiled it with g++ test.cpp -std=c++11 -lpthread -O3 -o test. I then ran it three times in a row on an old (kernel 2.6.18) heavily loaded (doing a database rebuild) slow laptop (Intel core i5-2540M). Results from three consecutive runs: 5.647s, 5.515s, and 5.561s. So we're looking at a tad over 10 microseconds per thread on this machine, probably much less on yours.
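If you would rather have the program time itself than rely on the shell's time command, something like the following sketch could work. Note that it is not the same test as above: it joins each thread instead of detaching it, so it measures create-plus-join and never has more than one extra thread alive. The function name average_thread_cost_us is made up for this example.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average cost, in microseconds, of creating and joining one empty thread.
// Joining (rather than detaching) keeps at most one extra thread alive at a
// time, so the loop cannot run into per-process thread limits.
double average_thread_cost_us(int count)
{
    assert(count > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < count; ++i) {
        std::thread worker([] {});
        worker.join();
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / count;
}
```

Calling average_thread_cost_us(500000) from main and printing the result gives a per-thread figure directly. For reference, the arithmetic behind the number quoted above is 5.6 s / 500000 threads, which is about 11.2 microseconds per thread.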
That's not much overhead at all, given that serial ports max out at around 1 bit per 10 microseconds. Of course, there are various additional costs that threads can incur: passing or capturing arguments (although plain function calls can impose some of these too), cache slowdowns between cores (if multiple threads on different cores are battling over the same memory at the same time), etc. But in general I highly doubt the use case you presented will adversely impact performance at all (and it could provide benefits, depending), despite your having already preemptively labeled the concept "really terrible code" without even knowing how much time it takes to launch a thread.
Whether it's a good idea or not depends a lot on the details of your situation. What else is the calling thread responsible for? What precisely is involved in preparing and writing out the packets? How frequently are they written out (with what sort of distribution? uniform, clustered, etc...?) and what's their structure like? How many cores does the system have? Etc. Depending on the details, the optimal solution could be anywhere from "no threads at all" to "shared thread pool" to "thread for each packet".
Note that thread pools aren't magic and can in some cases be a slowdown versus unique threads, since one of the biggest slowdowns with threads is synchronizing cached memory used by multiple threads at the same time, and thread pools by their very nature of having to look for and process updates from a different thread have to do this. So either your primary thread or child processing thread can get stuck having to wait if the processor isn't sure whether the other process has altered a section of memory. By contrast, in an ideal situation, a unique processing thread for a given task only has to share memory with its calling task once (when it's launched) and then they never interfere with each other again.

I have always been told that thread creation is cheap, especially when compared to the alternative of creating a process. If the program you are talking about does not have a lot of operations that need to run concurrently then threading might not be necessary, and judging by what you wrote this might well be the case. Some literature to back me up:
http://www.personal.kent.edu/~rmuhamma/OpSystems/Myos/threads.htm
Threads are cheap in the sense that
They only need a stack and storage for registers therefore, threads are cheap to create.
Threads use very little resources of an operating system in
which they are working. That is,
threads do not need new address space,
global data, program code or operating
system resources.
Context switching are fast when working with threads. The reason is
that we only have to save and/or
restore PC, SP and registers.
More of the same here.
In Operating System Concepts 8th Edition (page 155) the authors write about the benefits of threading:
Allocating memory and resources for process creation is costly. Because
threads share the resource of the
process to which they belong, it is
more economical to create and
context-switch threads. Empirically
gauging the difference in overhead can
be difficult, but in general it is
much more time consuming to create and
manage processes than threads. In
Solaris, for example, creating a
process is about thirty times slower
than is creating a thread, and context
switching is about five times slower.

There is some overhead in thread creation, but comparing it with usually slow baud rates of the serial port (19200 bits/sec being the most common), it just doesn't matter.
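To put rough numbers on that comparison, here is a small sketch of the arithmetic. It assumes 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per byte), which is an assumption on my part; the original code's serial settings are not known. The function name serial_tx_time_us is made up for this example.

```cpp
#include <cassert>

// Microseconds needed to clock `bytes` bytes out of a UART at `baud` bits/s.
// Assumes 8N1 framing: 1 start bit + 8 data bits + 1 stop bit = 10 bits/byte.
double serial_tx_time_us(unsigned bytes, unsigned baud)
{
    assert(baud > 0);
    const double bits_per_byte = 10.0;
    return bytes * bits_per_byte * 1e6 / baud;
}
```

Under that assumption a single byte at 19200 baud occupies the line for about 521 microseconds, and a 16-byte message for about 8.3 milliseconds. The per-thread creation costs measured in this thread (roughly 10 to 240 microseconds) are small next to that.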

...sends Messages on a serial port ... for every message a pthread is created, bits are properly set up, then the thread terminates. ...how much overhead is there when actually creating a thread?
This is highly system specific. For example, the last time I used VMS, threading was nightmarishly slow (it's been years, but from memory one thread could create something like 10 more per second, and if you kept that up for a few seconds without threads exiting you'd core), whereas on Linux you can probably create thousands. If you want to know exactly, benchmark it on your system. But it's not much use just knowing that without knowing more about the messages: whether they average 5 bytes or 100k, whether they're sent contiguously or the line idles in between, and what the latency requirements for the app are, are all as relevant to the appropriateness of the code's thread use as any absolute measurement of thread creation overhead. And performance may not have needed to be the dominant design consideration.

You definitely do not want to do this. Create a single thread or a pool of threads and just signal when messages are available. Upon receiving the signal, the thread can perform any necessary message processing.
In terms of overhead, thread creation/destruction, especially on Windows, is fairly expensive. Somewhere on the order of tens of microseconds, to be specific. It should, for the most part, only be done at the start/end of an app, with the possible exception of dynamically resized thread pools.
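A minimal sketch of the "single thread, signal when messages are available" idea, in portable C++11 rather than raw pthreads. Everything here is illustrative: the class name MessageWorker is invented, and the actual serial write is replaced by appending to a vector so the example stays self-contained.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// One long-lived thread that sleeps until a message is posted.
// "Sending" is simulated by appending to sent_; a real program would
// write to the serial port there (ideally with the lock released).
class MessageWorker {
public:
    MessageWorker() : thread_([this] { run(); }) {}
    ~MessageWorker() { stop(); }

    // Hand a message to the worker and wake it up.
    void post(std::string message)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(!stopping_ && "post() called after stop()");
            queue_.push(std::move(message));
        }
        ready_.notify_one();
    }

    // Let the worker drain the queue, then join it. Safe to call twice.
    void stop()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_one();
        if (thread_.joinable()) thread_.join();
    }

    // Messages handled so far, in order.
    std::vector<std::string> sent() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return sent_;
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            ready_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // only reached once stopping_ is set
            sent_.push_back(std::move(queue_.front()));  // placeholder for the real write
            queue_.pop();
        }
    }

    mutable std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    std::vector<std::string> sent_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so the other members exist before run() starts
};
```

The producer just calls post("...") for each message; the worker thread is created once and blocks in wait() between messages, so it costs nothing while the line is idle. With a single worker, messages are also handled strictly in the order they were posted.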

I used the above "terrible" design in a VOIP app I made. It worked very well ... absolutely no latency or missed/dropped packets for locally connected computers. Each time a data packet arrived in, a thread was created and handed that data to process it to the output devices. Of course the packets were large so it caused no bottleneck. Meanwhile the main thread could loop back to wait and receive another incoming packet.
I have tried other designs where the threads I need are created in advance, but this creates its own problems. First, you need to design your code properly for threads to retrieve the incoming packets and process them in a deterministic fashion. If you use multiple (pre-allocated) threads, it's possible that the packets may be processed 'out of order'. If you use a single (pre-allocated) thread to loop and pick up the incoming packets, there is a chance that the thread might encounter a problem and terminate, leaving no threads to process any data.
Creating a thread to process each incoming data packet works very cleanly, especially on multi-core systems and where incoming packets are large. Also to answer your question more directly, the alternative to thread creation is to create a run-time process that manages the pre-allocated threads. Being able to synchronize data hand-off and processing as well as detecting errors may add just as much, if not more overhead as just simply creating a new thread. It all depends on your design and requirements.

Thread creation and computing in a thread is pretty expensive.
All data structures need to be set up, the thread has to be registered with the kernel, and a thread switch must occur so that the new thread actually gets executed (at an unspecified and unpredictable time).
Executing thread.start does not mean that the thread main function is called immediately.
As the article (mentioned by typoking) points out creation of a thread is cheap only compared to the creation of a process. Overall, it is pretty expensive.
I would never use a thread:
for a short computation
for a computation where I need the result in my flow of code (that means I am starting the thread and waiting for it to return the result of its computation)
In your example, it would make sense (as has already been pointed out) to create a thread that handles all of the serial communication and is eternal.
hth
Mario

For comparison, take a look at the figures for OS X (Link):
Kernel data structures: approximately 1 KB
Stack space: 512 KB (secondary threads), 8 MB (OS X main thread), 1 MB (iOS main thread)
Creation time: approximately 90 microseconds
POSIX thread creation on other systems should be in the same ballpark, I guess.

On any sane implementation, the cost of thread creation should be proportional to the number of system calls it involves, and on the same order of magnitude as familiar system calls like open and read. Some casual measurements on my system showed pthread_create taking about twice as much time as open("/dev/null", O_RDWR), which is very expensive relative to pure computation but very cheap relative to any IO or other operations which would involve switching between user and kernel space.

It is indeed very system dependent. I tested @Nafnlaus's code:
#include <thread>
int main(int argc, char** argv)
{
for (volatile int i = 0; i < 500000; i++)
std::thread([](){}).detach();
return 0;
}
On my Desktop Ryzen 5 2600:
Windows 10, compiled with MSVC 2019 in release mode, with std::chrono calls added around the loop to time it. Machine idle (only Firefox with 217 tabs open):
It took around 20 seconds (20.274, 19.910, 20.608) (also ~20 seconds with Firefox closed)
Ubuntu 18.04 compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
It took around 5 seconds (5.595, 5.230, 5.297)
The same code on my Raspberry Pi 3B compiled with:
g++ main.cpp -std=c++11 -lpthread -O3 -o thread
timed with:
time ./thread
took around 15 seconds (16.225, 14.689, 16.235)

Interesting.
I tested with my FreeBSD PCs and got the following results:
FreeBSD 12-STABLE, Core i3-8100T, 8GB RAM: 9.523 sec
FreeBSD 12.1-RELEASE, Core i5-6600K, 16GB RAM: 8.045 sec
You need to do
sysctl kern.threads.max_threads_per_proc=500100
though.
Core i3-8100T is pretty slow but the results are not very different. Rather the CPU clocks seem to be more relevant: i3-8100T 3.1GHz vs i5-6600k 3.50GHz.

As others have mentioned, this seems to be very OS dependent. On my Core i5-8350U running Win10, it took 118 seconds, which indicates an overhead of around 237 µs per thread (I suspect that the virus scanner and all the other rubbish IT installed is really slowing it down too). A dual core Xeon E5-2667 v4 running Windows Server 2016 took 41.4 seconds (82 µs per thread), but it's also running a lot of IT garbage in the background, including the virus scanner. I think a better approach is to implement a queue with a thread that continuously processes whatever is in the queue, to avoid the overhead of creating and destroying a thread every time.

Related

Futex throughput on Linux

I have an async API which wraps some IO library. The library uses C style callbacks and the API is C++, so the natural choice (IMHO) was to use std::future/std::promise to build this API. Something like std::future<void> Read(uint64_t addr, byte* buff, uint64_t buffSize). However, when I was testing the implementation I saw that the bottleneck is the future/promise, or more precisely, the futex used to implement promise/future. Since the futex, AFAIK, is user space and the fastest mechanism I know of to sync two threads, I switched to using raw futexes, which improved the situation somewhat, but not drastically. Performance hovers somewhere around 200k futex WAKEs per second. Then I stumbled upon this article - Futex Scaling for Multi-core Systems - which closely matches the effect I observe with futexes. My question is: since the futex is too slow for me, what is the fastest mechanism on Linux I can use to wake the waiting side? I don't need anything more sophisticated than a binary semaphore, just to signal IO operation completion. Since IO operations are very fast (tens of microseconds), switching to kernel mode is not an option. Busy waiting is not an option either, since CPU time is precious in my case.
Bottom line, user space, simple synchronization primitive, shared between two threads only, only one thread sets the completion, only one thread waits for completion.
EDIT001:
What if... Previously I said no spinning in a busy wait. But the futex already spins in a busy wait, right? Its implementation covers the more general case, though, which requires a global hash table to hold the futexes, queues for all subscribers, etc. Is it a good idea to mimic the same behavior on some simple entity (like an int), with no locks, no atomics and no global data structures, and busy wait on it like the futex already does?
In my experience, the bottleneck is due to linux's poor support for IPC. This probably isn't a multicore scaling issue, unless you have a large number of threads.
When one thread wakes another (by futex or any other mechanism), the system tries to run the 'wakee' thread immediately. But the waker thread is still running and using a core, so the system will usually put the wakee thread on a different core. If that core was previously idle, then the system will have to wake the core up from a power-down state, which takes some time. Any data shared between the threads must now be transferred between the cores.
Then, the waker thread will usually wait for a response from the wakee (it sounds like this is what you are doing). So it immediately goes to sleep, and puts its core to idle.
Then a similar thing happens again when the response comes. The continuous CPU wakes and migrations cause the slowdown. You may well discover that if you launch many instances of your process simultaneously, so that all your cores are busy, you see increased performance as the CPUs no longer have to wake up, and the threads may stop migrating between cores. You can get a similar performance increase if you pin the two threads to one core - it will do more than 1 million 'pings'/sec in this case.
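For anyone who wants to try the pinning experiment, here is a hedged sketch of how a thread can be restricted to one core on Linux. It uses pthread_setaffinity_np and sched_getcpu, which are GNU extensions and not portable; the helper name pin_to_cpu is invented for this example, and this is not the benchmark code behind the numbers above.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for pthread_setaffinity_np and sched_getcpu
#endif
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Restrict the thread behind `handle` to logical CPU `cpu`.
// Returns true on success. Linux/glibc only.
bool pin_to_cpu(pthread_t handle, int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);      // start with an empty CPU mask
    CPU_SET(cpu, &set);  // allow exactly one CPU
    return pthread_setaffinity_np(handle, sizeof(set), &set) == 0;
}
```

For a std::thread t, the handle on Linux is t.native_handle(). Pinning both the waker and the wakee to the same CPU number (for example the value returned by sched_getcpu() in the waker) keeps the ping-pong on one core, at the price of the two threads never running in parallel.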
So isn't there a way of saying 'put this thread to sleep and then wake that one'? Then the OS could run the wakee on the same core as the waker? Well, Google proposed a solution to this with a FUTEX_SWAP API that does exactly this, but it has yet to be accepted into the Linux kernel. The focus now seems to be on user-space thread control via User Managed Concurrency Groups, which will hopefully be able to do something similar. However, at the time of writing this is yet to be merged into the kernel.
Without these changes to the kernel, as far as I can tell there is no way around this problem. 'You are on the fastest route'! UNIX sockets, TCP loopback, pipes all suffer from the same issue. Futexes have the lowest overhead, which is why they go faster than the others. (with TCP you get about 100k pings per sec, about half the speed of a futex impl). Fixing this issue in a general way would benefit a lot of applications/deployments - anything that uses connections to localhost could benefit.
(I did try a DIY approach where the waker thread pins the wakee thread to the same core that the waker is on, but if you don't want to pin the waker, then every time you post the futex you need to pin the wakee to the current core, and the system call to do this has too much overhead.)

My multithread program works slowly or appear deadlock on dual core machine, please help

I have a program with several threads: one thread changes a global when it exits, and another thread repeatedly polls the global. There is no protection on the globals.
The program works fine on a uniprocessor. On a dual core machine, it works for a while and then halts on either Sleep(0) or SuspendThread(). Would anyone be able to help me out on this?
The code would be like this:
Thread 1:
do something...
while(1)
{
.....
flag_thread1_running=false;
SuspendThread(GetCurrentThread());
continue;
}
Thread 2
flag_thread1_running=true;
ResumeThread(thread1);
.....do some other work here....
while(flag_thread1_running) Sleep(0);
....
The fact that you don't see any problem on a uniprocessor machine, but see problems on a multiproc machine is an artifact of the relatively large granularity of thread context switching on a uniprocessor machine. A thread will execute for N amount of time (milliseconds, nanoseconds, whatever) before the thread scheduler switches execution to a different thread. A lot of CPU instructions can execute in the typical thread timeslice. You can think of it as having a fairly large chunk of "free play" exclusive processor time during which you probably won't run into resource collisions because nothing else is executing on the processor.
When running on a multiproc machine, though, CPU instructions in two threads execute exactly at the same time. The size of the "free play" chunk of time is near zero.
To reproduce a resource contention issue between two threads, you need to get thread 1 to be accessing the resource and thread 2 to be accessing the resource at the same time, or very nearly the same time.
In the large-granularity thread switching that takes place on a uniprocessor machine, the chances that a thread switch will happen exactly in the right spot are slim, so the program may never exhibit a failure under normal use on a uniproc machine.
In a multiproc machine, the instructions are executing at the same time in the two threads, so the chances of thread 1 and thread 2 accessing the same resource at the same time are much, much greater - thousands of times more likely than the uniprocessor scenario.
I've seen it happen many times: an app that has been running fine for years on uniproc machines suddenly starts failing all over the place when executed on a new multiproc machine. The cause is a latent threading bug in the original code that simply never hit the right coincidence of timeslicing to repro on the uniproc machines.
When working with multithreaded code, it is absolutely imperative to test the code on multiproc hardware. If you have thread collision issues in your code, they will quickly present themselves on a multiproc machine.
As others have noted, don't use SuspendThread() unless you are a debugger. Use mutexes or other synchronization objects to coordinate between threads.
Try using something more like WaitForSingleObjectEx instead of SuspendThread.
You are hitting a race condition. Thread 2 may execute flag_thread1_running=true;
before thread 1 executes flag_thread1_running=false.
This is not likely to happen on a single CPU, because with the usual scheduling quantum of 10-20 ms you are not likely to hit the problem. It can happen there as well, but very rarely.
Using proper synchronization primitives is a must here. Instead of bool, use event. Instead of checking the bool in a loop, use WaitForSingleObject (or WaitForMultipleObjects for more elaborate stuff later).
It is possible to perform synchronization between threads using plain variables, but it is rarely a good idea and it is quite hard to do it right - cf. How can I write a lock free structure?. It is definitely not a good idea to perform scheduling using Sleep, Suspend or Resume.
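To illustrate the "use an event instead of a bool" advice without tying it to the Win32 API, here is a rough portable stand-in built from std::mutex and std::condition_variable. It is not CreateEvent/WaitForSingleObject; the Event class and run_and_wait are made up for this sketch and only mimic a manual-reset event.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Portable stand-in for a manual-reset event: set() releases every waiter,
// and wait() blocks without polling until set() has been called.
class Event {
public:
    void set()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        changed_.notify_all();
    }

    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    bool signaled_ = false;
};

// Thread 2's side of the hand-off: start the work, then block until it is done.
int run_and_wait()
{
    Event finished;
    int result = 0;
    std::thread worker([&] {
        result = 42;     // stands in for thread 1's real work
        finished.set();  // replaces flag_thread1_running = false
    });
    finished.wait();     // replaces while (flag_thread1_running) Sleep(0);
    assert(result == 42);  // the write is visible once wait() has returned
    worker.join();
    return result;
}
```

The waiting thread sleeps inside wait() instead of spinning on Sleep(0), and because the flag is only touched under the mutex there is no window in which the two threads can disagree about its value.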
I guess that you already know that polling a global flag is a "Bad Idea™" so I'll skip that little speech. Try adding volatile to the flag declaration. That should force each read of it to read from memory. Without volatile, the implementation could be reading the flag into a register and not fetching it from memory.

Best way to slow down a thread? Is using Sleep() OK?

I've written a C++ library that does some seriously heavy CPU work (all of it math and calculations) and if left to its own devices, will easily consume 100% of all available CPU resources (it's also multithreaded to the number of available logical cores on the machine).
As such, I have a callback inside the main calculation loop that software using the library is supposed to call:
while(true)
{
//do math here
callback(percent_complete);
}
In the callback, the client calls Sleep(x) to slow down the thread.
Originally, the clientside code was a fixed Sleep(100) call, but this led to bad unreliable performance because some machines finish the math faster than others, but the sleep is the same on all machines. So now the client checks the system time, and if more than 1 second has passed (which == several iterations), it will sleep for half a second.
Is this an acceptable way of slowing down a thread? Should I be using a semaphore/mutex instead of Sleep() in order to maximize performance? Is sleeping x milliseconds for each 1 second of processing work fine or is there something wrong that I'm not noticing?
The reason I ask is that the machine still gets heavily bogged down even though taskman shows the process taking up ~10% of the CPU. I've already explored hard disk and memory contention to no avail, so now I'm wondering if the way I'm slowing down the thread is causing this problem.
Thanks!
Why don't you use a lower priority for the calculation threads? That will ensure other threads are scheduled while allowing your calculation threads to run as fast as possible if no other threads need to run.
What is wrong with the CPU at 100%? That's what you should strive for, not try to avoid. These math calculations are important, no? Unless you're trying to avoid hogging some other resource not explicitly managed by the OS (a mutex, the disk, etc) and used by the main thread, generally trying to slow your thread down is a bad idea. What about on multicore systems (which almost all systems will be, going forward)? You'd be slowing down a thread for absolutely no reason.
The OS has a concept of a thread quantum. It will take care of ensuring that no important thread on your system is starved. And, as I mentioned, on multicore systems spiking one thread on one CPU does not hurt performance for other threads on other cores at all.
I also see in another comment that this thread is also doing a lot of disk I/O - these operations will already cause your thread to yield while it's waiting for the results, so the sleeps will do nothing.
In general, if you're calling Sleep(x), there is something wrong/lazy with your design, and if x==0, you're opening yourself up to live locks (the thread calling Sleep(0) can actually be rescheduled immediately, making it a noop).
Sleep should be fine for throttling an app, which from your comments is what you're after. Perhaps you just need to be more precise how long you sleep for.
The only software in which I use a feature like this is the BOINC client. I don't know what mechanism it uses, but it's open-source and multi-platform, so help yourself.
It has a configuration option ("limit CPU use to X%"). The way I'd expect to implement that is to use platform-dependent APIs like clock() or GetSystemTimes(), and compare processor time against elapsed wall clock time. Do a bit of real work, check whether you're over or under par, and if you're over par sleep for a while to get back under.
The BOINC client plays nicely with priorities, and doesn't cause any performance issues for other apps even at 100% max CPU. The reason I use the throttle is that otherwise the client runs the CPU flat-out all the time, and drives up the fan speed and noise. So I run it at the level where the fan stays quiet. With better cooling maybe I wouldn't need it :-)
Another, not so elaborate, method could be to time one iteration and let the thread sleep for (x * t) milliseconds before the next iteration, where t is the time in milliseconds for one iteration and x is the chosen sleep time fraction (between 0 and 1).
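A sketch of that last scheme, under the assumption that one "iteration" is a callable you can time. The names duty_cycle and throttled_step are invented here. If every iteration of length t is followed by a sleep of x * t, the thread works for t out of every t + x*t, so its share of the CPU is 1 / (1 + x).

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Fraction of wall-clock time spent working when each iteration of length t
// is followed by a sleep of x * t:  t / (t + x*t) = 1 / (1 + x).
double duty_cycle(double x)
{
    assert(x >= 0.0);
    return 1.0 / (1.0 + x);
}

// Run one iteration of `work`, time it, then sleep for x times that long.
// Returns the sleep duration that was requested.
template <typename Work>
std::chrono::steady_clock::duration throttled_step(Work&& work, double x)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto t = std::chrono::steady_clock::now() - start;
    const auto pause =
        std::chrono::duration_cast<std::chrono::steady_clock::duration>(t * x);
    std::this_thread::sleep_for(pause);
    return pause;
}
```

One consequence of the formula: with x limited to between 0 and 1 the thread never drops below 50% of a core, so getting to, say, 10% would need x = 9. The "work for 1 second, sleep for half a second" scheme from the question corresponds to x = 0.5, i.e. about 67%.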
Have a look at cpulimit. It sends SIGSTOP and SIGCONT as required to keep a process below a given CPU usage percentage.
Even still, WTF at "crazy complaints and outlandish reviews about your software killing PC performance". I'd be more likely to complain that your software was slow and not making the best use of my hardware, but I'm not your customer.
Edit: on Windows, SuspendThread() and ResumeThread() can probably produce similar behaviour.

Allocate more processor cycles to my program

I've been working on win32, c,c++ for a while. I code on visual studio. Most of the time I see system idle process uses more cpu utilization. Is there a way to allocate more processor cycles to my program to run it faster? I understand there might be limitations from i/o, in those cases this question doesn't make any sense.
OR
Did I misunderstand the task manager numbers? I'm confused, please help me out.
And I want to do something in program itself, btw I will be happy if answers are specific to windows.
Thanks in advance
~calvin
If your program is the only program that has something to do (not waiting for IO), its thread will always be assigned to a processor core.
However, if you have a multi-core processor, and a single-threaded program, the CPU usage of your process displayed in the task manager will always be limited by 100/Ncores.
For example, if you have a quad-core machine, your process will be at 25% (using one core), and the idle process at around 75%. You can only use the additional CPU power by dividing your tasks into chunks that can be worked on by separate threads, which will then be run on the idle cores.
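As a rough illustration of "dividing your tasks into chunks", here is a sketch that sums a vector with several threads, each working on its own slice. It uses portable std::thread rather than the Win32 API the question asks about, and parallel_sum is a made-up name; real workloads will need more thought about how to split the work.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `data` by giving each thread its own contiguous chunk and its own
// output slot, so the threads never write to the same variable.
long long parallel_sum(const std::vector<int>& data, unsigned threads)
{
    assert(threads > 0);
    std::vector<long long> partial(threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + threads - 1) / threads;  // round up

    for (unsigned i = 0; i < threads; ++i) {
        const std::size_t begin = std::min(data.size(), i * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &partial, i, begin, end] {
            partial[i] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0LL);
        });
    }
    for (auto& w : workers) w.join();  // wait for every chunk to finish
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

A reasonable thread count is std::thread::hardware_concurrency(), clamped to at least 1 because that function is allowed to return 0. On the quad-core machine from the example, four busy workers is what moves the task manager figure from about 25% towards 100%.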
The idle process only "runs" when no other process needs to. If you want to use more CPU cycles, then use them.
If your program is idling, it doesn't do anything, i.e. there is nothing that could be done any faster. So the CPU is probably not the bottle-neck in your case.
Are you maybe waiting for data coming from the disk or network?
In case your processor has multiple cores and your program uses only one core to its full extent, making your program multi-threaded could work.
In a multitasking / multithreaded OS, the processor time is split among threads.
If you want a specific thread to get bigger time chunk you can set its priority with the SetThreadPriority function, not wise to do it though.
Only special software (should) mess with those settings.
It's common for window applications to have a low cpu usage percent (which we see in the task manager)
because most of the time they just wait for messages.
Use threads to:
abstract away all the I/O waits.
assign work to all cores.
also, remove all sleep-wait states from main thread.
Defer all I/O to a thread, so that wait states are confined within it. Keep the actual computations in the foreground thread, and use synchronization mechanisms that make the I/O slave thread to wait for your main thread when communicating.
If your CPU is multi-core and your problem is parallelizable, create as many threads as you have cores, research "set affinity" functions to assign them between the cores, and still keep a separate thread for all I/O.
Also pay attention not to wait in your main thread - usleep(1) doesn't send you into background for 1 microsecond, but for "no less than..." and that may mean anything between 1ms and 100ms but hardly ever less than that, and never anything close to a microsecond.

My threadspool just make 4~5threads. why?

I use QueueUserWorkItem() function to invoke threadpool.
And I queued a lot of work items with it (about 30000).
But according to the task manager, my application only creates 4~5 threads after I push the start button.
I read in MSDN that the default limit on the number of threads is about 500.
Why are only a few threads created in my application?
I'm trying to speed up my application, and I suspect this threadpool is one of the reasons it is slow.
thanks
It is important to understand how the threadpool scheduler works. It was designed to fine-tune the number of running threads against the capabilities of your machine. Your machine probably can run only two threads at the same time, dual-core CPUs are the current standard. Maybe four.
So when you dump a bunch of threads in its lap, it starts out by activating only two threads. The rest of them are in a queue, waiting for CPU cores to become available. As soon as one of those two threads completes, it activates another one. Twice a second, it evaluates what's going on with active threads that didn't complete. It makes the rough assumption that those threads are blocking and thus not making progress, and allows another thread to activate. You've now got three running threads. Getting up to the 500 threads, the default max number of threads, will take 249 seconds.
Clearly, this behavior spells out what a thread should do to be suitable to run as a threadpool thread. It should complete quickly and not block often. Note that blocking on I/O requests is dealt with separately.
If this behavior doesn't suit you then you can use a regular Thread. It will start running right away and compete with other threads in your program (and the operating system) for CPU time. Creating 30,000 of such threads is not possible, there isn't enough virtual memory available for that. A 32-bit operating system poops out somewhere south of 2000 threads, consuming all available virtual memory. You can get about 50,000 threads on a 64-bit operating system before the paging file runs out. Testing these limits in a production program is not recommended.
I think you may have misunderstood the use of the threadpool. Spawning threads and killing threads involves the Windows kernel and is an expensive operation. If you continuously need threads to perform an asynchronous operation and then throw them away, that results in many system calls.
So the threadpool is actually a group of threads which are created once and which, instead of exiting when they complete their task, wait for another item from QueueUserWorkItem. The threadpool will then tune itself based on how many threads are required concurrently for your process. If you wish to test this write this code:
for(int i = 0; i < 30000; i++)
{
ThreadPool.QueueUserWorkItem(myMethod);
}
You will see this will create a whole bunch of threads. Maybe not 30000 as some of the threads that are created will be reused as the ThreadPool starts to work through your function calls.
The threadpool is there so you can avoid creating a thread for every asynchronous operation for the very reason that threads are expensive. If you want 30,000 threads you're going to use a lot of memory for the thread stacks plus waste a lot of CPU time doing context switches. Now creating that many threads would be justified if you had 30,000 CPU cores...
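To make the reuse concrete, here is a small fixed-size pool sketched in portable C++. It is not how the Windows or .NET threadpool is implemented (those resize themselves as described above); ThreadPool and distinct_threads_used are invented names, and the point is only that thousands of queued work items end up running on a handful of threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Fixed-size pool: `workers` threads are created once and reused for every task.
class ThreadPool {
public:
    explicit ThreadPool(unsigned workers)
    {
        assert(workers > 0);
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Finish all queued tasks, then join the workers.
    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        ready_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

private:
    void run()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Queue `tasks` trivial work items on `workers` threads and report how many
// distinct threads actually ran them.
std::size_t distinct_threads_used(unsigned workers, unsigned tasks)
{
    std::mutex seen_mutex;
    std::set<std::thread::id> seen;
    {
        ThreadPool pool(workers);
        for (unsigned i = 0; i < tasks; ++i)
            pool.submit([&] {
                std::lock_guard<std::mutex> lock(seen_mutex);
                seen.insert(std::this_thread::get_id());
            });
    }  // the pool destructor waits for every task
    return seen.size();
}
```

Calling distinct_threads_used(4, 30000) runs all 30000 items but can never report more than 4 threads, which is the same effect the question observed in the task manager.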