multithreading in a for loop c++

I am trying to create a program that lasts 100 seconds. The program should spawn a thread every 2 milliseconds, and each thread will do a job that takes, say, 20 ms to complete.
So ideally there will be around 10 threads running at any point in time. How should I approach this problem?
#include <chrono>
#include <thread>

void runJob(); // takes ~20 ms to complete

int main() {
    for (int i = 0; i < 50000; i++) { // 50000 * 2 ms = 100 s
        // create a thread for this job
        std::thread threadObj1(runJob);
        threadObj1.detach(); // a std::thread must be joined or detached before destruction
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
}

The problem with this approach is that, with only 20 ms worth of computation per thread, you are very likely to spend considerably more CPU time spawning and shutting down threads than doing the actual work in runJob.
On most operating systems, spawning a thread is quite an expensive operation and can easily take several dozen milliseconds on its own. So for relatively short-lived jobs like yours, it is much more desirable to reuse the same thread for multiple jobs. This has the additional benefit that you don't create more threads than your system can handle, that is, you avoid thread oversubscription.
So a better approach would be to create an appropriate number of threads upfront and then schedule the different runJob instances onto those existing threads. Since this can be quite challenging to do by hand with just the bare-bones C++11 thread library facilities, you should consider using a higher-level facility. OpenMP, the C++17 parallel algorithms, and task-based parallelization libraries like Intel Threading Building Blocks are all good options to get something like this off the ground quickly.
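For a taste of the parallel-algorithms route, here is a minimal C++17 sketch. It runs the jobs as fast as the pool allows rather than pacing them every 2 ms, and depending on your toolchain you may need to link against a backend such as TBB:

#include <algorithm>
#include <execution>
#include <vector>

void runJob(); // the ~20 ms task from the question

int main() {
    std::vector<int> jobs(50000); // one dummy element per job
    // The implementation schedules iterations onto its own pool of
    // worker threads instead of spawning one thread per job.
    std::for_each(std::execution::par, jobs.begin(), jobs.end(),
                  [](int&) { runJob(); });
}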

Simple C++ multithreading example is slower

I am currently learning basic C++ multithreading and I implemented a very small piece of code to learn the concepts. I keep hearing that multithreading is faster, so I tried the below:
#include <thread>

void Func(); // some short task, ~0.1 ms of work

int main()
{
    //---- SECTION 1 (run with Section 2 commented out)
    {
        Timer timer;
        Func();
        Func();
    }

    //---- SECTION 2 (run with Section 1 commented out)
    {
        Timer timer;
        std::thread t(Func);
        Func();
        t.join();
    }
}
And below is the Timer,
#include <chrono>
#include <iostream>

struct Timer
{
    std::chrono::time_point<std::chrono::high_resolution_clock> start, end;
    std::chrono::duration<float> duration;

    Timer()
    {
        start = std::chrono::high_resolution_clock::now();
    }

    ~Timer()
    {
        end = std::chrono::high_resolution_clock::now();
        duration = end - start;
        // To get duration in milliseconds
        float ms = duration.count() * 1000.0f;
        std::cout << "Timer : " << ms << " milliseconds\n";
    }
};
When I run Section 1 (with the other section commented out), I get times of 0.1 ms, 0.2 ms, and in that range, but when I run Section 2, I get 1 ms and above. So Section 1 is faster even though it runs everything on the main thread, while Section 2, which splits the work across two threads, seems to be slower.
Your answers would be much appreciated. If I am wrong in regard to any concepts, your help would be welcome.
Thanks in advance.
Multithreading can mean faster, but it does not always mean faster. There are many things you can do in multithreaded code which can actually slow things down!
This example shows one of them. In this case, your Func() is too short to benefit from this simplistic multithreading example. Standing up a new thread involves calls to the operating system to manage these new resources. These calls are quite expensive when compared with the 100-200 µs of your Func. Threading also adds what are called "context switches," which are how the OS changes from one thread to another. If you used a much longer Func (like 20x or 50x longer), you would start to see the benefits.
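To make that concrete, here is a minimal sketch reusing the asker's Timer from above, with a hypothetical LongFunc that busy-waits for about 10 ms:

#include <chrono>
#include <thread>

// Deliberately longer, CPU-bound work (~10 ms) so the roughly 1 ms
// thread start-up cost becomes small relative to the task itself.
void LongFunc()
{
    auto until = std::chrono::high_resolution_clock::now()
               + std::chrono::milliseconds(10);
    while (std::chrono::high_resolution_clock::now() < until)
        ; // busy-wait to simulate computation
}

int main()
{
    Timer timer;             // the asker's Timer from above
    std::thread t(LongFunc); // ~10 ms of work on the new thread...
    LongFunc();              // ...overlapped with ~10 ms here
    t.join();                // total ≈ 10-11 ms instead of ~20 ms
}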
How big of a deal are these context switches? Well, if you are CPU bound, doing computations as fast as you can, on every core of the CPU, most OSs like to switch threads every 4 milliseconds. That seems to be a decent tradeoff between responsiveness and minimizing overhead. If you aren't CPU bound (like when you finish your Func calls and have nothing else to do), it will obviously switch faster than that, but it's a useful number to keep in the back of your head when considering the time-scales threading is done at.
If you need to run a large number of things in a multi-threaded way, you are probably looking at a dispatch-queue sort of pattern. In this pattern, you stand up the "worker" thread once and then use mutexes/condition variables to shuffle work to the worker. This decreases the overhead substantially, especially if you can queue up enough work that the worker can do several tasks before context switching over to the threads that are consuming the results.
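A minimal C++11 sketch of such a dispatch queue (DispatchQueue and post are made-up names for illustration, not a standard API):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// A single long-lived worker pulls jobs from a queue: the thread is
// created once, so each job only pays for a lock and a wake-up.
class DispatchQueue {
public:
    DispatchQueue() : worker_([this] { run(); }) {}
    ~DispatchQueue() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_one();
        worker_.join(); // drains any queued jobs before exiting
    }
    void post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return; // done_ set and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(); // run outside the lock so posters aren't blocked
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::thread worker_; // declared last: the members above are ready first
};

Posting work with q.post([]{ /* ... */ }); then costs roughly a lock and a wake-up instead of a full thread launch.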
Another thing to watch out for when starting on multithreading is managing the granularity of the locking mechanisms you use. If you lock too coarsely (protecting large swaths of data with a single lock), you can't be concurrent. But if you lock too finely, you spend all of your time managing the synchronization tools rather than doing the actual computations. You don't get the benefits of multithreading for free; you have to look at your problem and find the right places to apply it.
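As a rough illustration of that tradeoff, here is a sketch contrasting the two extremes (CoarseCounters and ShardedCounters are made-up names for illustration, not library types):

#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Coarse: one lock serializes every access to the whole map, so
// threads bumping unrelated keys still wait on each other.
struct CoarseCounters {
    std::mutex m;
    std::unordered_map<std::string, long> counts;
    void bump(const std::string& key) {
        std::lock_guard<std::mutex> lk(m);
        ++counts[key];
    }
};

// Finer: hash each key onto one of N independent shards, so unrelated
// keys usually contend on different mutexes. Push this too far, though,
// and the bookkeeping outweighs the extra concurrency.
struct ShardedCounters {
    static const size_t kShards = 16;
    struct Shard {
        std::mutex m;
        std::unordered_map<std::string, long> counts;
    };
    Shard shards[kShards];
    void bump(const std::string& key) {
        Shard& s = shards[std::hash<std::string>()(key) % kShards];
        std::lock_guard<std::mutex> lk(s.m);
        ++s.counts[key];
    }
};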
Your test code is timing the start of a thread (which is a system call and relatively expensive). Also, 0.1 ms is too small to get accurate answers. You should try to get your test code to run for at least 5 seconds (even longer if you want accurate results); that will also make the thread start-up time less significant.
There are two reasons to run threads: one is to perform work in parallel with other threads, thereby minimizing the time to compute; the other is to perform some I/O where the thread would otherwise wait for the kernel to respond. More modern approaches use asynchronous system calls so you don't need to wait.
You might want to use condition variables (google std::condition_variable) or some thread pool library. These will be much faster than spinning up a new thread.

Handling of Creation / Reuse of threads in C++

I am working on a C++ hobby project that requires lots of processing several times a second. Splitting the work across multiple threads could improve the completion speed. When the threads are done, should I keep them until I have more work for them, or should I throw the threads away and just make new ones when I need them again?
If it's just several times a second (e.g. 10 times a second) then keep it simple and throw the thread away when it's done.
When you get to hundreds or thousands of threads, then you should start thinking about a thread pool.
All that is assuming you're working on a typical machine and not a weak CPU like a microcontroller.
When the Threads are done should I keep the Threads until I have more work for them or should I throw the threads away and just make new ones when I need them again?
It makes little sense to pay for thread creation many times if you can pay the price just once (Greta would tell you "how dare you?!"). Idle threads in a (blocking) thread pool do not consume any CPU time and are ready to execute your tasks with the shortest possible delay and overhead, because all the necessary resources for the thread were allocated when the thread was spawned.
I would recommend using the Intel TBB task scheduler, see a tutorial. It allows for an efficient modern programming paradigm where you partition your problem into stages/tasks, some of which can be executed in parallel. I cannot recommend enough watching Plain Threads are the GOTO of today's computing - Hartmut Kaiser - Keynote Meeting C++ 2014.
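For a taste of the task-based style, here is a minimal task_group sketch (processChunk is a hypothetical stand-in for your per-task work):

#include <tbb/task_group.h>

void processChunk(int i); // hypothetical per-task work

void processAll(int chunks) {
    tbb::task_group g;
    for (int i = 0; i < chunks; ++i)
        g.run([i] { processChunk(i); }); // scheduled onto TBB's worker pool
    g.wait(); // block until every submitted task has finished
}

The workers behind task_group are created once by the library and reused across calls, which is exactly the "pay the price just once" behavior described above.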

What is the best way to read a lot of files simultaneously in C++?

I have a multicore processor (for instance, 8 cores) and I want to read a lot of files with a function int read_file(...), using all cores effectively. Also, after executing read_file, the returned value should be placed somewhere (perhaps in a vector or queue).
I'm thinking about using async (from C++11) and future (for getting the result from read_file) with launch policy launch::async in a for loop over all files. But this creates a lot of threads during execution, and reading some files can fail. Maybe I should use some guard on the number of threads created during this execution?
Reading files isn't CPU intensive. So you're focusing on the wrong thing. It's like asking how to use all the power of your car's engine effectively if you're going across the street.
I've written code and done a benchmark study to do exactly that. Storage sub-system configurations vary: e.g., someone may have files spread over multiple disks, or on the same RAID device consisting of multiple disks. The best solution, in my opinion, is a combination of a powerful thread pool together with async I/O, tailored to the system configuration. For instance, the number of threads in the thread pool can be equal to the number of hardware threads, and the number of boost::asio::io_service objects can be equal to the number of disks.
Async IO is usually done through an event based solution. You can use boost::asio, libevent, libuv etc.
I'm tempted to argue a Boost.Asio solution might be ideal.
The basic idea involves creating a thread pool that waits for tasks to arrive and queuing all your file reads into that pool.
#include <boost/asio.hpp>
#include <memory>
#include <thread>
#include <vector>

boost::asio::io_service service;
// The work object keeps the calls to io_service::run() from returning
// immediately. We could get rid of it by queuing the tasks before we
// construct the threads; the method presented here is (probably) faster, however.
std::unique_ptr<boost::asio::io_service::work> work_ptr =
    std::make_unique<boost::asio::io_service::work>(service);

std::vector<YOUR_FILE_TYPE> files = /*...*/;

// Our thread pool.
std::vector<std::thread> threads;
// std::thread::hardware_concurrency() gets us the number of logical CPU
// cores; this may be twice the number of physical cores, due to
// Hyper-Threading or similar technology.
for (unsigned int thread = 0; thread < std::thread::hardware_concurrency(); thread++) {
    threads.emplace_back([&]{ service.run(); });
}

// The basic functionality: we "post" tasks to the io_service.
std::vector<int> ret_vals;
ret_vals.resize(files.size());
for (size_t index = 0; index < files.size(); index++) {
    service.post([&files, &ret_vals, index]{
        ret_vals[index] = read_file(files[index], /*...*/);
    });
}

// Let run() return once the queue drains, then wait for the pool.
work_ptr.reset();
for (auto & thread : threads) {
    thread.join();
}
// At this time, all ret_vals values have been filled.
/*...*/
One important caveat though: Reading from the disk is orders of magnitude slower than reading from memory. The solution I've provided will scale to virtually any number of threads, but there's very little reason to believe that multithreading will improve the performance of this task, since you'll almost certainly be I/O-bottlenecked, especially if your storage medium is a traditional hard disk, rather than a Solid State Drive.
That isn't to say this is automatically a bad idea; after all, if your read_file function involves a lot of processing of the data (not just reading it), then the performance gains could be quite real. But I do suspect that your use case is a "premature optimization" situation, which is the death knell of programming productivity.

C++ - Questions about multithreading

I am having trouble understanding some concepts of multithreading. I know the basic principles, but I am having trouble understanding when individual threads are sent to and used by cores.
I know that having multiple threads allow code to run in parallel. I think this would be a good addition to my archive extraction program which could decompress blocks using multiple cores. It decompresses all of the files in a for loop and I am hoping that each available core will work on a file.
Here are my questions:
Do I need to query or even consider the number of cores on a machine, or are running threads automatically sent to free cores?
Can anyone show me an example of a for loop using threads? Say in each loop iteration it would call a function on a different thread. I read that the ideal number of active threads is the number of cores. How do I know when a core is free? Should I check whether a thread has joined the main thread, and create a new one when it has, to keep a certain number of threads running?
Am I overcomplicating things or are my questions indicative that I am not grasping the concepts?
If you're decompressing files then you'll probably want a bounded number of threads rather than one thread per file. Otherwise, if you're processing 1000 files you're going to create 1000 threads, which won't make efficient use of the CPU.
As you've mentioned, one approach is to create as many threads as there are cores, and this is a reasonable approach in your case, as decompression is reasonably CPU-bound and therefore any threads you create will be active for most of their time slice. If your problem were I/O-bound, then your threads would spend a lot of time waiting for I/O to complete, and therefore you could spin up more threads than you've got cores, within bounds.
For your application I'd probably look at spinning up one thread per core and having each thread process one file at a time. This will help keep your algorithm simple. If you had multiple threads working on one file, then you'd have to synchronize between them to ensure that the blocks they processed were written out to the correct locations in the uncompressed file, which would cause needless headaches.
C++11 includes a thread library which you can use to simplify working with threads.
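For instance, a minimal C++11 sketch of that one-thread-per-core scheme (decompressFile is a hypothetical stand-in for your per-file work):

#include <atomic>
#include <string>
#include <thread>
#include <vector>

void decompressFile(const std::string& path); // hypothetical per-file work

void decompressAll(const std::vector<std::string>& files) {
    std::atomic<size_t> next{0};
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1; // hardware_concurrency() may be unable to tell
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < cores; ++t) {
        pool.emplace_back([&] {
            // Each thread atomically claims the next unprocessed file,
            // so no two threads ever touch the same file.
            for (size_t i = next++; i < files.size(); i = next++)
                decompressFile(files[i]);
        });
    }
    for (auto& t : pool) t.join();
}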
No, you can use an API that keeps that transparent, for example POSIX threads on Linux (pthread library).
This answer probably depends on what API you use, though many APIs share threading basics like mutexes. Here, however, is a pthreads example (since that's the only C/C++ threading API I know).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
// Whatever other headers you need for your code.

#define MAX_NUM_THREADS 12

// Each thread will run this function.
void *worker( void *arg )
{
    // Do stuff here and it will be 'in parallel'.
    // Note: Threads can read from the same location concurrently
    // without issue, but writing to any shared resource that has not been
    // locked with, for example, a mutex, can cause pernicious bugs.

    // Call this when you're done.
    pthread_exit( NULL );
}

int main()
{
    // Each is a handle for one thread, with 12 in total.
    pthread_t myThreads[MAX_NUM_THREADS];

    // Create the worker threads.
    for( unsigned long i = 0; i < MAX_NUM_THREADS; i++ )
    {
        // NULL thread attributes struct.
        // This initializes the threads with the default PTHREAD_CREATE_JOINABLE
        // attribute; we know a thread is finished when it joins, see below.
        pthread_create( &myThreads[i], NULL, worker, (void *)i );
    }

    void *status;
    // Wait for the threads to finish.
    for( unsigned long i = 0; i < MAX_NUM_THREADS; i++ )
    {
        pthread_join( myThreads[i], &status );
    }

    // That's all, folks.
    pthread_exit( NULL );
}
Without too much detail, that's a pretty basic skeleton for a simple threaded application using pthreads.
Regarding your questions on the best way to go about applying this to your program:
I suggest one thread per file, using a Threadpool Pattern, and here's why:
Single thread per file is much simpler because there's no sharing, hence no synchronization. You can change the worker function to a decompressFile function, passing a filename each time you call pthread_create. That's basically it. Your threadpool pattern sort of falls into place here.
Multiple threads per file means synchronization, which means complexity because you have to manage access to shared resources. In order to speed up your algorithm, you'd have to isolate portions of it that can run in parallel. However, I would actually expect this method to run slower:
Imagine Thread A has File A open, and Thread B has File B open, but File A and File B are in completely different sectors of your disk. As your OS's scheduling algorithm switches between Thread A and Thread B, your hard drive has to spin like mad to keep up, making the CPU (hence your program) wait.
Since you are seemingly new to threading/parallelism and just want to get more performance out of multiple processors/cores, I suggest you look for libraries that deal with threading and allow you to enable parallelism without getting into thread management, work distribution, etc.
It sounds like all you need now is parallel loop execution. Nowadays there are plenty of C++ libraries that can ease this task for you, e.g. Intel's TBB, Microsoft's PPL, AMD's Bolt, and Qualcomm's MARE, to name a few. You may compare licensing terms, supported platforms, and functionality, and make the choice that best fits your needs.
To be more specific and answer your questions:
1) Generally, you should have no need to know or consider the number of processors or cores. Choose a library that abstracts this detail away from you and your program. On the other hand, if you see that with the default settings the CPU is not fully utilized (e.g. due to a significant number of I/O operations), you may find it useful to ask for more threads, e.g. by multiplying the default by a certain factor.
2) A sketch of a for loop made parallel with tbb::parallel_for and C++11 lambda functions:
#include <tbb/tbb.h>

void ParallelFoo( std::vector<MyDataType>& v ) {
    tbb::parallel_for( size_t(0), v.size(), [&]( size_t i ) {
        Foo( v[i] );
    } );
}
Note that it is not guaranteed that each iteration is executed by a separate thread; but you should not actually worry about such details; all you need is for the available cores to be kept busy with useful work.
Disclaimer: I'm a developer of Intel's TBB library.
If you're on Windows, you could take a look at Thread Pools, a good description can be found here: http://msdn.microsoft.com/en-us/magazine/cc163327.aspx. An interesting feature of this facility is that it promises to manage the threads for you. It also selects the optimal number of threads depending on demand as well as on the available cores.

My threadpool just makes 4~5 threads. Why?

I use the QueueUserWorkItem() function to invoke the threadpool.
And I tried lots of work with it (about 30000 items).
But according to the task manager, my application only makes 4~5 threads after I push the start button.
I read on MSDN that the default limit on the number of threads is about 500.
Why are only a few threads created in my application?
I'm trying to speed up my application, and I suspect this threadpool is one of the reasons it is slow.
Thanks
It is important to understand how the threadpool scheduler works. It was designed to fine-tune the number of running threads against the capabilities of your machine. Your machine probably can run only two threads at the same time; dual-core CPUs are the current standard. Maybe four.
So when you dump a bunch of threads in its lap, it starts out by activating only two threads. The rest of them are in a queue, waiting for CPU cores to become available. As soon as one of those two threads completes, it activates another one. Twice a second, it evaluates what's going on with active threads that didn't complete. It makes the rough assumption that those threads are blocking and thus not making progress, and allows another thread to activate. You've now got three running threads. Getting up to 500 threads, the default maximum, would take 249 seconds.
Clearly, this behavior spells out what a thread should do to be suitable to run as a threadpool thread: it should complete quickly and not block often. Note that blocking on I/O requests is dealt with separately.
If this behavior doesn't suit you, then you can use a regular Thread. It will start running right away and compete with other threads in your program (and the operating system) for CPU time. Creating 30,000 of such threads is not possible; there isn't enough virtual memory available for that. A 32-bit operating system poops out somewhere south of 2000 threads, consuming all available virtual memory. You can get about 50,000 threads on a 64-bit operating system before the paging file runs out. Testing these limits in a production program is not recommended.
I think you may have misunderstood the use of the threadpool. Spawning threads and killing threads involves the Windows kernel and is an expensive operation. If you continuously needed threads to perform an asynchronous operation and then threw them away, you would perform many system calls.
So the threadpool is actually a group of threads which are created once and which, instead of exiting when they complete their task, go back to waiting for another QueueUserWorkItem. The threadpool will then tune itself based on how many threads are required concurrently by your process. If you wish to test this, write this code (the .NET ThreadPool is shown here, but the Win32 pool behaves analogously):
for(int i = 0; i < 30000; i++)
{
ThreadPool.QueueUserWorkItem(myMethod);
}
You will see that this creates a whole bunch of threads. Maybe not 30000, as some of the threads that are created will be reused as the ThreadPool starts to work through your function calls.
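The same experiment against the Win32 pool that QueueUserWorkItem uses might look like this in C++ (a minimal sketch; the Sleep at the end is just a crude way to let the queue drain before the process exits):

#include <windows.h>
#include <stdio.h>

// Runs on one of the pool's reused worker threads; printing the thread
// id shows how few distinct threads actually service the 30000 items.
DWORD WINAPI MyWork( LPVOID context )
{
    printf( "item %d on thread %lu\n", (int)(INT_PTR)context, GetCurrentThreadId() );
    return 0;
}

int main()
{
    for( INT_PTR i = 0; i < 30000; i++ )
        QueueUserWorkItem( MyWork, (PVOID)i, WT_EXECUTEDEFAULT );
    Sleep( 5000 ); // crude: give the pool time to drain (sketch only)
}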
The threadpool is there so you can avoid creating a thread for every asynchronous operation, for the very reason that threads are expensive. If you wanted 30,000 threads, you'd use a lot of memory for the thread stacks and waste a lot of CPU time doing context switches. Now, creating that many threads would be justified if you had 30,000 CPU cores...