I'm using C++11 to develop a project.
In some function, I got some parallel tasks as below:
void func() {
    auto res1 = task1();
    auto res2 = task2();
    auto res3 = task3();
    ...
    std::cout << res1 + res2 + res3 + ...;
}
Well, each task is a little heavy; let's say each one takes about 300ms.
Now I'm thinking that running each task on its own std::thread should improve the performance.
But as I understand it, it's the OS that schedules threads. I'm not sure whether the OS guarantees it will execute these threads immediately, or whether they may have to wait for something else.
So my question is whether making the tasks multi-threaded will definitely improve performance, or whether in some cases this approach could actually perform worse.
BTW, I know that too many threads can cause very bad performance because of context switching; in my real case, the number of tasks is fewer than 10 and they don't share any data.
I'm not sure whether the OS guarantees it will execute these threads immediately, or whether they may have to wait for something else?
The OS will try to start up the threads as quickly as it can. They aren't guaranteed to already be running at the exact instant your thread object's constructor returns, but OTOH the OS won't deliberately wait around before starting them up, either. (e.g. on a system that isn't terribly loaded down, you can generally expect them to be running within a small number of milliseconds)
So my question is whether making the tasks multi-threaded will definitely improve performance, or whether in some cases this approach could actually perform worse?
Adding multithreading to a program and finding that it actually takes longer to complete than the single-threaded version is a fairly common experience, especially for programmers who are new to multithreaded programming.
It's similar to adding more cooks to a kitchen -- if the cooks work together well and stay out of each other's way, they can get more food cooked in less time, but if they don't, they may well spend most of their time waiting for each other to finish using the various tools/ingredients, or talking about who is supposed to be doing what, and end up being slower than a single cook would be.
In general, multithreading can speed things up, if you are running on a multi-core system, and your various threads don't need to communicate with each other too much, and the threads don't need to obtain exclusive access to shared resources too often, and your threads aren't doing anything grossly inefficient (like intensive polling or busy-waiting). Whether these conditions are easy to achieve or difficult depends a lot on the task you are trying to accomplish.
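For the handful of independent, ~300ms tasks in the question, C++11's std::async is often the simplest way to try this out. A minimal sketch, assuming each taskN() is self-contained and returns a summable value such as int:

#include <future>
#include <iostream>

int task1(); int task2(); int task3(); // heavy, ~300ms each

void func() {
    // std::launch::async asks for each task to run on its own thread.
    auto f1 = std::async(std::launch::async, task1);
    auto f2 = std::async(std::launch::async, task2);
    auto f3 = std::async(std::launch::async, task3);
    // get() blocks until the corresponding task has finished.
    std::cout << f1.get() + f2.get() + f3.get();
}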
Related
I am currently learning basic C++ multithreading and I implemented a very small piece of code to learn the concepts. I keep hearing that multithreading is faster, so I tried the below:
#include <thread>

int main()
{
    //---- SECTION 1 (run with Section 2 commented out)
    Timer timer;
    Func();
    Func();

    //---- SECTION 2 (run with Section 1 commented out)
    Timer timer;
    std::thread t(Func);
    Func();
    t.join();
}
And below is the Timer,
#include <chrono>
#include <iostream>

struct Timer
{
    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    std::chrono::duration<float> duration;

    Timer()
    {
        start = std::chrono::high_resolution_clock::now();
    }

    ~Timer()
    {
        end = std::chrono::high_resolution_clock::now();
        duration = end - start;
        // To get the duration in milliseconds
        float ms = duration.count() * 1000.0f;
        std::cout << "Timer : " << ms << " milliseconds\n";
    }
};
When I run Section 1 (with the other section commented out), I get times of 0.1ms, 0.2ms and in that range, but when I run Section 2, I get 1ms and above. So Section 1 appears to be faster even though it runs entirely on the main thread, while Section 2 is slower despite using two threads.
Your answers would be much appreciated. If I have misunderstood any concepts, corrections would be welcome.
Thanks in advance.
Multithreading can mean faster, but it does not always mean faster. There are many things you can do in multithreaded code which can actually slow things down!
This example shows one of them. In this case, your Func() is too short to benefit from this simplistic multi threading example. Standing up a new thread involves calls to the operating system to manage these new resources. These calls are quite expensive when compared with the 100-200us of your Func. It adds what are called "context switches," which are how the OS changes from one thread to another. If you used a much longer Func (like 20x or 50x longer), you would start to see the benefits.
How big of a deal are these context switches? Well, if you are CPU bound, doing computations as fast as you can, on every core of the CPU, most OSs like to switch threads every 4 milliseconds. That seems to be a decent tradeoff between responsiveness and minimizing overhead. If you aren't CPU bound (like when you finish your Func calls and have nothing else to do), it will obviously switch faster than that, but it's a useful number to keep in the back of your head when considering the time-scales threading is done at.
If you need to run a large number of things in a multi-threaded way, you are probably looking at a dispatch-queue sort of pattern. In this pattern, you stand up the "worker" thread once, and then use mutexes/condition-variables to shuffle work to the worker. This decreases the overhead substantially, especially if you can queue up enough work such that the worker can do several tasks before context switching over to the threads that are consuming the results.
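A minimal sketch of that pattern in C++11, with a single worker and a queue of std::function tasks (a real implementation would add multiple workers and error handling):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// One worker thread that sleeps until work arrives.
class DispatchQueue {
public:
    DispatchQueue() : done_(false), worker_(&DispatchQueue::run, this) {}

    ~DispatchQueue() {
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one(); // wake the worker
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (done_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task(); // run outside the lock so posters aren't blocked
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    bool done_;
    std::thread worker_; // declared last so the other members are ready first
};

The worker is stood up once in the constructor and reused for every posted task, which is exactly what avoids the per-task thread start-up cost measured above.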
Another thing to watch out for, when starting on multi threading, is managing the granularity of the locking mechanisms you use. If you lock too coarsely (protecting large swaths of data with a single lock), you can't be concurrent. But if you lock too finely, you spend all of your time managing the synchronization tools rather than doing the actual computations. You don't get benefits from multi threading for free. You have to look at your problem and find the right places to do it.
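To illustrate the two extremes with a toy table layout (a sketch, not a recommendation for any particular data structure):

#include <mutex>
#include <vector>

// Coarse-grained: a single mutex guards the whole table -- simple,
// but every thread serializes on it, even when touching different rows.
std::mutex table_mutex;
std::vector<int> table;

// Fine-grained: one mutex per bucket ("striped" locking) lets threads
// work on different buckets concurrently, at the cost of managing many
// more synchronization objects.
struct Bucket {
    std::mutex m;
    std::vector<int> rows;
};
std::vector<Bucket> striped_table(64);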
Your test code is timing the starting of a thread (which is a system call and relatively expensive). Also, 0.1ms is too small to get accurate answers. You should try to get your test code to run for at least 5 seconds (even longer if you want accurate results); that will make the thread start-up time less significant.
There are two reasons to run threads: one is to perform work in parallel with other threads, thereby minimizing the time to compute; the other is to perform some I/O where the thread will wait for the kernel to respond. More modern approaches use asynchronous system calls so you don't need to wait.
You might want to use condition variables (google std::condition_variable) or some thread pool library. These will be much faster than spinning up a new thread.
I've been experimenting with multi-threading a game engine. For the sake of an example, here is the basic structure of my code.
#include <thread>

int main() {
    while (!shouldClose) {
        // Spawn a fresh thread for each subsystem every frame...
        std::thread physics_thread(doPhysics, &someData1);
        std::thread particles_thread(doParticles, &someData2);
        std::thread graphics_thread(doGraphics, &someData3);
        // ...and wait for all three before starting the next frame.
        physics_thread.join();
        particles_thread.join();
        graphics_thread.join();
    }
}
Is this incredibly bad practice? I'm running Linux on a pretty low-power system, and creating these threads every update is quite cheap. Across all platforms, though, is this just a bad idea?
On a side note, I've tried constantly running worker threads with mutexes and condition variables out the wazoo, and sometimes I can get it to work, but it seems absurdly complicated for what I'm trying to achieve.
Although this seems inexpensive, this is not the case.
Threads require some heavy lifting to create, and if you know you are going to need them again (let alone reuse them multiple times), you should try to avoid recreating them.
The classic solution is to use a thread pool to execute your tasks right when you need them, and then let the workers sleep until the next time.
The cost of having N additional sleeping threads is minimal and will not affect the application's performance.
Here is a simple yet very powerful single header thread pool library I wrote.
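With any such pool, the game loop above might look roughly like this sketch. ThreadPool and enqueue() here are hypothetical names, not necessarily that library's actual API; enqueue() is assumed to return a std::future tied to the task:

#include <future>

// Hypothetical pool: workers are created once and reused every frame.
ThreadPool pool(3);

while (!shouldClose) {
    auto physics   = pool.enqueue(doPhysics, &someData1);
    auto particles = pool.enqueue(doParticles, &someData2);
    auto graphics  = pool.enqueue(doGraphics, &someData3);
    physics.get();   // wait for all three, as the join()s did before
    particles.get();
    graphics.get();
}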
I am trying to speed up a piece of code by having background threads already setup to solve one specific task. When it is time to solve my task I would like to wake up these threads, do the job and block them again waiting for the next task. The task is always the same.
I tried using condition variables (and mutex that need to go with them), but I ended up slowing my code down instead of speeding it up; mostly it happened because the calls to all needed functions are very expensive (pthread_cond_wait/pthread_cond_signal/pthread_mutex_lock/pthread_mutex_unlock).
There is no point in using a thread pool (which I don't have anyway) because it is too generic a construct; here I want to address only my specific task. Depending on the implementation, I would also pay a performance penalty for the queue.
Do you have any suggestion for a quick wake-up without using mutex or con_var?
I was thinking of setting up the threads like timers that read an atomic variable: if the variable is set to 1, the threads do the job; if it is set to 0, they sleep for a few microseconds (I would start with microsecond sleeps, since I would like to avoid spinlocks, which might be too expensive for the CPU). What do you think about it? Any suggestion is very appreciated.
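In code, the scheme I have in mind would look roughly like this sketch (do_the_task() stands in for my real task):

#include <atomic>
#include <chrono>
#include <thread>

std::atomic<int> go(0);

void worker() {
    for (;;) {
        if (go.load(std::memory_order_acquire) == 1) {
            do_the_task();                           // the fixed task
            go.store(0, std::memory_order_release);  // signal completion
        } else {
            // back off briefly instead of spinning at full speed
            std::this_thread::sleep_for(std::chrono::microseconds(10));
        }
    }
}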
I am using Linux, gcc, C and C++.
These functions should be fast. If they are taking a large fraction of your time, it is quite possible that you are trying to switch threads too often.
Try buffering up a work queue, and send the signal once a significant amount of work has accumulated.
If this is impossible due to dependencies between the tasks, then your application is not amenable to multithreading at all.
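A minimal sketch of that buffering idea (Task is a hypothetical stand-in for whatever the work items are): the producer pushes a whole batch under one lock and signals once per batch rather than once per item, so the expensive calls are amortized.

#include <condition_variable>
#include <deque>
#include <mutex>
#include <vector>

struct Task { /* fields of the real work item */ };

std::mutex m;
std::condition_variable cv;
std::deque<Task> shared_queue;

void produce(std::vector<Task>& local_batch) {
    // Push the whole batch under one lock...
    {
        std::lock_guard<std::mutex> lock(m);
        for (auto& t : local_batch)
            shared_queue.push_back(std::move(t));
    }
    // ...then pay for one wake-up, amortized over the whole batch.
    cv.notify_one();
    local_batch.clear();
}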
In order to gain performance in a multithreaded application, spawn as many threads as there are CPUs, not a separate thread for each task. Otherwise you end up with a lot of overhead from context switching.
You may also consider making your algorithm more linear (i.e. by using non-blocking calls).
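In C++11 the CPU count can be queried portably with std::thread::hardware_concurrency(), which returns a hint that may be 0 when unknown, e.g.:

#include <thread>

unsigned pool_size() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 4; // fall back to a guess if the hint is unavailable
}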
I've written a C++ library that does some seriously heavy CPU work (all of it math and calculations) and if left to its own devices, will easily consume 100% of all available CPU resources (it's also multithreaded to the number of available logical cores on the machine).
As such, I have a callback inside the main calculation loop that software using the library is supposed to call:
while(true)
{
    // do math here
    callback(percent_complete);
}
In the callback, the client calls Sleep(x) to slow down the thread.
Originally, the client-side code was a fixed Sleep(100) call, but this led to bad, unreliable performance because some machines finish the math faster than others while the sleep is the same on all of them. So now the client checks the system time, and if more than 1 second has passed (which == several iterations), it sleeps for half a second.
Is this an acceptable way of slowing down a thread? Should I be using a semaphore/mutex instead of Sleep() in order to maximize performance? Is sleeping x milliseconds for each 1 second of processing work fine or is there something wrong that I'm not noticing?
The reason I ask is that the machine still gets heavily bogged down even though taskman shows the process taking up ~10% of the CPU. I've already explored hard disk and memory contention to no avail, so now I'm wondering if the way I'm slowing down the thread is causing this problem.
Thanks!
Why don't you use a lower priority for the calculation threads? That will ensure other threads are scheduled while allowing your calculation threads to run as fast as possible if no other threads need to run.
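Since the question's code uses Sleep(), this is presumably Windows; there, a calculation thread can drop its own priority with the Win32 call below (a sketch; THREAD_PRIORITY_LOWEST or THREAD_PRIORITY_IDLE are more aggressive options):

#include <windows.h>

void lower_calculation_priority() {
    // Interactive threads will now win the scheduler's attention, but the
    // math still runs flat-out whenever nothing else needs the CPU.
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_BELOW_NORMAL);
}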
What is wrong with the CPU at 100%? That's what you should strive for, not try to avoid. These math calculations are important, no? Unless you're trying to avoid hogging some other resource not explicitly managed by the OS (a mutex, the disk, etc) and used by the main thread, generally trying to slow your thread down is a bad idea. What about on multicore systems (which almost all systems will be, going forward)? You'd be slowing down a thread for absolutely no reason.
The OS has a concept of a thread quantum. It will take care of ensuring that no important thread on your system is starved. And, as I mentioned, on multicore systems spiking one thread on one CPU does not hurt performance for other threads on other cores at all.
I also see in another comment that this thread is also doing a lot of disk I/O - these operations will already cause your thread to yield while it's waiting for the results, so the sleeps will do nothing.
In general, if you're calling Sleep(x), there is something wrong/lazy with your design, and if x==0, you're opening yourself up to livelocks (a thread calling Sleep(0) can actually be rescheduled immediately, making it a no-op).
Sleep should be fine for throttling an app, which from your comments is what you're after. Perhaps you just need to be more precise how long you sleep for.
The only software in which I use a feature like this is the BOINC client. I don't know what mechanism it uses, but it's open-source and multi-platform, so help yourself.
It has a configuration option ("limit CPU use to X%"). The way I'd expect to implement that is to use platform-dependent APIs like clock() or GetSystemTimes(), and compare processor time against elapsed wall clock time. Do a bit of real work, check whether you're over or under par, and if you're over par sleep for a while to get back under.
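A rough sketch of that over/under-par loop using std::clock() for processor time. Note that clock()'s semantics vary by platform (on Windows it may track wall time instead, which is where GetSystemTimes() comes in); more_work_to_do() and do_a_chunk_of_work() are hypothetical stand-ins:

#include <chrono>
#include <ctime>
#include <thread>

// Keep CPU usage near `target` (a fraction between 0 and 1).
void throttled_work(double target) {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();
    while (more_work_to_do()) {       // hypothetical work predicate
        do_a_chunk_of_work();         // hypothetical unit of real work
        double cpu = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
        double wall = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - wall_start).count();
        if (cpu > target * wall)      // over par: sleep to fall back under
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}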
The BOINC client plays nicely with priorities, and doesn't cause any performance issues for other apps even at 100% max CPU. The reason I use the throttle is that otherwise, the client runs the CPU flat-out all the time, and drives up the fan speed and noise. So I run it at the level where the fan stays quiet. With better cooling maybe I wouldn't need it :-)
Another, less elaborate, method could be to time one iteration and let the thread sleep for (x * t) milliseconds before the next iteration, where t is the time in milliseconds for one iteration and x is the chosen sleep-time fraction (between 0 and 1).
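A sketch of that pacing loop (do_one_iteration() and more_work_to_do() are hypothetical stand-ins); since each sleep lasts x times the work time, CPU usage settles near 1/(1+x):

#include <chrono>
#include <thread>

void paced_loop() {
    const double x = 0.5; // sleep fraction: CPU use settles near 1/(1+x)
    while (more_work_to_do()) {
        auto t0 = std::chrono::steady_clock::now();
        do_one_iteration();
        auto t = std::chrono::steady_clock::now() - t0;
        std::this_thread::sleep_for(t * x); // sleep a fraction of the work time
    }
}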
Have a look at cpulimit. It sends SIGSTOP and SIGCONT as required to keep a process below a given CPU usage percentage.
Still, WTF at "crazy complaints and outlandish reviews about your software killing PC performance". I'd be more likely to complain that your software was slow and not making the best use of my hardware, but I'm not your customer.
Edit: on Windows, SuspendThread() and ResumeThread() can probably produce similar behaviour.
I have some long-running operations that number in the hundreds. At the moment they are each on their own thread. My main goal in using threads is not to speed these operations up. The more important thing in this case is that they appear to run simultaneously.
I'm aware of cooperative multitasking and fibers. However, I'm trying to avoid anything that would require touching the code in the operations, e.g. peppering them with things like yieldToScheduler(). I also don't want to prescribe that these routines be rewritten to emit queues of bite-sized task items...I want to treat them as black boxes.
For the moment I can live with these downsides:
Maximum # of threads tends to be O(1000)
Cost per thread is O(1MB)
To address the bad cache performance due to context-switches, I did have the idea of a timer which would juggle the priorities such that only idealThreadCount() threads were ever at Normal priority, with all the rest set to Idle. This would let me widen the timeslices, which would mean fewer context switches and still be okay for my purposes.
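A sketch of that juggling in Qt (workers is a hypothetical list of the operation threads; as noted in Question #1 below, setPriority() does nothing on Linux):

#include <QList>
#include <QThread>

// Keep a rotating window of idealThreadCount() threads at Normal
// priority; park everything else at Idle to widen the timeslices.
void juggle(const QList<QThread*>& workers, int next) {
    const int active = QThread::idealThreadCount();
    for (int i = 0; i < workers.size(); ++i) {
        const bool runNow = (i >= next && i < next + active);
        workers[i]->setPriority(runNow ? QThread::NormalPriority
                                       : QThread::IdlePriority);
    }
}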
Question #1: Is that a good idea at all? One certain downside is it won't work on Linux (docs say no QThread::setPriority() there).
Question #2: Any other ideas or approaches? Is QtConcurrent thinking about this scenario?
(Some related reading: how-many-threads-does-it-take-to-make-them-a-bad-choice, many-threads-or-as-few-threads-as-possible, maximum-number-of-threads-per-process-in-linux)
IMHO, this is a very bad idea. If I were you, I would try really, really hard to find another way to do this. You're combining two really bad ideas: creating a truck load of threads, and messing with thread priorities.
You mention that these operations only need to appear to run simultaneously. So why not try to find a way to make them appear to run simultaneously, without literally running them simultaneously?
It's been 6 months, so I'm going to close this.
Firstly I'll say that threads serve more than one purpose. One is speedup...and a lot of people are focusing on that in the era of multi-core machines. But another is concurrency, which can be desirable even if it slows the system down when taken as a whole. Yet concurrency can be achieved using mechanisms more lightweight than threads, although it may complicate the code.
So this is just one of those situations where the tradeoff of programmer convenience against user experience must be tuned to fit the target environment. It's like how Google's process-per-tab approach with Chrome would have been ill-advised in the era of Mosaic (even if process isolation was preferable, all else being equal). If the OS, memory, and CPU couldn't deliver a good browsing experience, they wouldn't do it that way now.
Similarly, creating a lot of threads when there are independent operations you want to be concurrent saves you the trouble of sticking in your own scheduler and yield() operations. It may be the cleanest way to express the code, but if it chokes the target environment then something different needs to be done.
So I think I'll settle on the idea that in the future, when our hardware is better than it is today, we probably won't have to worry about how many threads we make. But for now I'll take it on a case-by-case basis. i.e. If I have 100 concurrent tasks of class A, 10 concurrent tasks of class B, and 3 concurrent tasks of class C...then switching A to a fiber-based solution and giving it a pool of a few threads is probably worth the extra complication.