Suppose I have code like this:
for(i = 0; i < i_max; i++)
for(j = 0; j < j_max; j++)
// do something
and I want to run it on several threads (assuming the // do something tasks are independent of each other; think of Monte Carlo simulations, for instance). My question is this: is it necessarily better to create a thread for each value of i than to create a thread for each value of j? Something like this:
for(i = 0; i < i_max; i++)
create_thread(j_max);
Additionally, what would be a suitable number of threads? Should I just create i_max threads or, perhaps, use a semaphore so that only k < i_max threads run concurrently at any given time?
Thank you,
The best way to apportion the workload is workload-dependent.
Broadly - for parallelizable workload, use OpenMP; for heterogeneous workload, use a thread pool. Avoid managing your own threads if you can.
Monte Carlo simulation should be a good candidate for truly parallel code rather than a thread pool.
By the way, in case you are on Visual C++: Visual C++ v10 has an interesting new Concurrency Runtime for precisely this type of problem. It is somewhat analogous to the Task Parallel Library that was added to .NET Framework 4 to ease the implementation of multicore/multi-CPU code.
Avoid creating threads unless you can keep them busy!
If your scenario is compute-bound, then you should limit the number of threads you spawn to the number of cores you expect your code to run on. If you create more threads than you have cores, then the OS has to waste time and resources scheduling the threads to execute on the available cores.
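For the compute-bound case, the thread count can be derived from the machine at run time. The following is a minimal C++11 sketch; the helper name worker_count is hypothetical and not part of any library, and std::thread::hardware_concurrency() is only a hint that may return 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper: number of worker threads to start for n_tasks
// independent compute-bound tasks.
std::size_t worker_count(std::size_t n_tasks) {
    assert(n_tasks > 0);
    // hardware_concurrency() is only a hint and may return 0 when unknown.
    std::size_t cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;
    // Never start more threads than there are tasks to hand out.
    return std::min(cores, n_tasks);
}
```

With i_max independent outer iterations, worker_count(i_max) would give the number of threads to create once and keep busy.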
If your scenario is IO-bound, then you should consider using async IO operations: queue the operations and check each one's result code once its completion is reported. Again, in this case, spawning a thread per IO operation is hugely wasteful, as the OS has to spend time scheduling threads that are stalled.
Everyone here is basically right, but here's a quick-and-dirty way to split up the work and keep all of the processors busy. This works best when 1) creating threads is expensive compared to the work done in an iteration, and 2) most iterations take about the same amount of time to complete.
First, create 1 thread per processor/core. These are your worker threads. They sit idle until they're told to do something.
Now, split up your work such that data that is needed at the same time is close together. What I mean by that is that if you were processing a ten-element array on a two-processor machine, you'd split it up so that one group is elements 1,2,3,4,5 and the other is 6,7,8,9,10. You may be tempted to split it up 1,3,5,7,9 and 2,4,6,8,10, but then you're going to cause more false sharing (http://en.wikipedia.org/wiki/False_sharing) in your cache.
So now that you have a thread per processor and a group of data for each thread, you just tell each thread to work on an independent group of that data.
So in your case I'd do something like this.
for (int t=0;t<n_processors;++t)
{
thread[t]=create_thread();
datamin[t]=t*(i_max/n_processors);
datamax[t]=(t+1)*(i_max/n_processors);
}
for (int t=0;t<n_processors;++t)
do_work(thread[t], datamin[t], datamax[t], j_max)
//wait for all threads to be done
//continue with rest of the program.
Of course I left out things like dealing with your data not being an integer multiple of the number of processors, but those are easily fixed.
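The remainder case mentioned above can be handled by giving the first i_max % n_workers blocks one extra iteration. Here is a minimal C++11 sketch of that idea; block_range and run_blocks are hypothetical names, and the loop body is only a placeholder for the real work.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: half-open range [first, last) of outer-loop indices
// owned by worker t when i_max iterations are split over n_workers blocks.
// The first (i_max % n_workers) workers get one extra iteration.
std::pair<std::size_t, std::size_t> block_range(std::size_t t, std::size_t n_workers,
                                                std::size_t i_max) {
    assert(n_workers > 0 && t < n_workers);
    const std::size_t base = i_max / n_workers;
    const std::size_t extra = i_max % n_workers;
    const std::size_t first = t * base + (t < extra ? t : extra);
    const std::size_t last = first + base + (t < extra ? 1 : 0);
    return {first, last};
}

// Runs the nested loop with one thread per block and returns how many
// (i, j) pairs each worker handled.
std::vector<std::size_t> run_blocks(std::size_t n_workers, std::size_t i_max,
                                    std::size_t j_max) {
    std::vector<std::size_t> handled(n_workers, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_workers; ++t) {
        workers.emplace_back([&handled, t, n_workers, i_max, j_max] {
            const auto range = block_range(t, n_workers, i_max);
            std::size_t count = 0;  // local counter, written back once at the end
            for (std::size_t i = range.first; i < range.second; ++i)
                for (std::size_t j = 0; j < j_max; ++j)
                    ++count;  // placeholder for "do something"
            handled[t] = count;
        });
    }
    for (auto& w : workers) w.join();
    return handled;
}
```

Each worker accumulates into a local variable and writes its slot of handled only once, which keeps writes to neighbouring array elements to a minimum.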
Also, if you're not averse to third-party libraries, Intel's TBB (Threading Building Blocks) does a great job of abstracting this from you and letting you get to the real work you want to do.
Everything around creating and calling threads is relatively expensive so you want to do that as little as possible.
If you parallelize your inner loop instead of the outer loop, then j_max threads are created for each iteration of the outer loop: i_max * j_max thread creations in total, against only i_max if you parallelize the outer loop instead.
That said, the best parallelization depends on your actual problem. Depending on that, it can actually make sense to parallelize the inner loop instead.
Depends on the tasks and on what platform you're about to simulate on. For example, on CUDA's architecture you can split the tasks up so that each i,j,1 is done individually.
You still have the time to load data onto the card to consider.
Using for loops and something like OpenMP/MPI/your own threading mechanism, you can basically choose. In one scenario, parallel threads are broken out over i and j is looped sequentially on each thread. In the other, the i loop is processed sequentially and the j loop is broken out into threads on each iteration.
Parallelisation (breaking out threads) is costly. Remember that you have the cost of setting up n threads, then synchronising n threads. This represents a cost c over and above the runtime of the routines which in and of itself can make the total time greater for parallel processing than in single threaded mode. It depends on the problem in question; often, there's a critical size beyond which parallel is faster.
I would suggest breaking out into the parallel zone in the first for loop would be faster. If you do so on the inner loop, you must fork/join each time the loop runs, adding a large overhead to the speed of the code. You want, ideally, to have to create threads only once.
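As an illustration of forking only once, here is a small sketch that puts a single OpenMP parallel region around the outer loop. The counting kernel and the name count_inside are placeholders chosen for this example, not the asker's workload. It assumes a compiler with OpenMP enabled (for example -fopenmp with GCC or Clang); without that option the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <cstdint>

// Counts the grid points (i, j) with i*i + j*j < r*r. The thread team is
// created once for the whole loop nest; each thread runs the inner j loop
// sequentially for the outer iterations assigned to it.
std::int64_t count_inside(std::int64_t i_max, std::int64_t j_max, std::int64_t r) {
    assert(i_max >= 0 && j_max >= 0);
    std::int64_t total = 0;
    // reduction(+ : total) gives every thread a private counter and sums
    // the private counters when the loop ends.
    #pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i < i_max; ++i) {
        for (std::int64_t j = 0; j < j_max; ++j) {
            if (i * i + j * j < r * r) ++total;
        }
    }
    return total;
}
```

Moving the pragma to the inner loop would compile as well, but the fork/join would then happen i_max times instead of once.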
I have n threads that access two shared matrices in the following way:
if (matrix2[i][j] <= d){
matrix1[i][j] = v;
matrix2[i][j] = d;
}
I tried a single mutex around the critical section, but the performance is very poor.
What is the best way to synchronize this code? A matrix of mutexes? Alternatives?
A matrix of mutexes?
It's rare for very fine-grained locking to perform better than a smaller number of locks, unless keeping them locked for a long time is unavoidable. That seems unlikely here. It also opens the door to deadlocks (for example, what if one thread runs with i=1, j=2 at the same time as another thread with i=2, j=1?).
I tried a single mutex around the critical section, but the performance is very poor.
What synchronization you need depends on your access pattern. Are multiple threads all performing the operation shown?
If so, do you really need to do that in parallel? It doesn't seem expensive enough to be worthwhile. Can you partition i,j regions between threads so they don't collide? Can you do some other long-running work in your threads, and batch the matrix updates for single-threaded application?
If not, you need to show what other access is causing a data race with the code shown.
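One possible reading of the "partition i,j regions" suggestion is to give each thread its own band of rows, so that no cell is ever shared and no mutex is needed. The sketch below assumes the updates are available as a list; the Update struct and the apply_partitioned function are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical candidate update: write value v with key d into cell (i, j).
struct Update { std::size_t i, j; double v, d; };

// Each thread owns a contiguous band of rows and applies only the updates
// that fall into its band, so two threads never touch the same cell.
void apply_partitioned(Matrix& matrix1, Matrix& matrix2,
                       const std::vector<Update>& updates, std::size_t n_threads) {
    assert(n_threads > 0 && matrix1.size() == matrix2.size());
    const std::size_t rows = matrix2.size();
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t first = rows * t / n_threads;
        const std::size_t last = rows * (t + 1) / n_threads;
        pool.emplace_back([&matrix1, &matrix2, &updates, first, last] {
            for (const Update& u : updates) {
                if (u.i < first || u.i >= last) continue;  // not this thread's band
                if (matrix2[u.i][u.j] <= u.d) {            // same test as in the question
                    matrix1[u.i][u.j] = u.v;
                    matrix2[u.i][u.j] = u.d;
                }
            }
        });
    }
    for (auto& th : pool) th.join();
}
```

Every thread scans the whole update list here, which is simple but wasteful; bucketing the updates by row band beforehand would avoid the repeated scan.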
I am trying to create a program that lasts 100 seconds. The program will create a thread every 2 milliseconds, and each thread will do a job that takes, say, 20 ms to complete.
So ideally there will be around 10 threads running at any point in time. How should I approach this problem?
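#include <chrono>
#include <thread>
void runJob(); // takes ~20 ms to complete
int main() {
    for (int i = 0; i < 50000; i++) {
        // create a thread and let it run on its own; destroying a joinable
        // std::thread without join() or detach() calls std::terminate
        std::thread(runJob).detach();
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
    }
}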
The problem with this approach is that with only 20ms worth of computation for each thread, you are very likely to spend considerably more CPU time spawning and shutting down threads than doing the actual work in runJob.
On most operating systems, spawning a thread is a comparatively expensive operation; the exact cost varies by platform and is worth measuring. So for relatively short-lived jobs like yours, it is much more desirable to reuse the same thread for multiple jobs. This has the additional benefit that you don't create more threads than your system can handle; that is, you avoid thread oversubscription.
So a better approach would be to create an appropriate number of threads upfront and then schedule the different runJob instances to those existing threads. Since this can be quite challenging to do by hand with just the barebones C++11 thread library facilities, you should consider using a higher level facility for this. OpenMP, C++17 parallel algorithms and task-based parallelization libraries like Intel Threading Building Blocks are all good options to get something like this off the ground quickly.
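To make the "create the threads upfront and reuse them" idea concrete, here is a bare-bones fixed-size pool written with C++11 primitives only. It is a sketch, not a substitute for the libraries named above; WorkerPool and run_jobs are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool: threads are created once and reused for every job.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t n_threads) {
        assert(n_threads > 0);
        for (std::size_t t = 0; t < n_threads; ++t)
            workers_.emplace_back([this] { run(); });
    }
    // Lets the workers finish all queued jobs, then joins them.
    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can dequeue
        }
    }
    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Usage sketch: submits n_jobs trivial jobs and returns how many ran.
int run_jobs(std::size_t n_threads, int n_jobs) {
    std::atomic<int> done{0};
    {
        WorkerPool pool(n_threads);
        for (int i = 0; i < n_jobs; ++i) pool.submit([&done] { ++done; });
    }  // the destructor drains the queue and joins the workers
    return done.load();
}
```

In the scenario from the question, the main loop would call submit(runJob) every 2 ms instead of constructing a new std::thread each time.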
I've got an array of chunks that first need to be processed and then combined. Chunks can be processed in arbitrary order, but they need to be combined in the order in which they appear in the array.
The following pseudo-code shows my first approach:
array chunks;
def worker_process(chunks):
while not all_chunks_processed:
// get the next chunk that isn't processed or in processing yet
get_next_chunk()
process_chunk()
def worker_combine(chunks):
for i=0 to num_chunks:
// wait until chunk i is processed
wait_until(chunks[i].is_processed())
combine_chunk(i)
def main():
array chunks;
// execute worker_process in n threads where n is the number of (logical) cores
start_n_threads(worker_process, chunks)
// wait until all processing threads are finished
join_threads()
// combine chunks in a single thread
start_1_thread(worker_combine, chunks)
// wait until combine thread is finished
join_threads()
Measurements of the above algorithm show that processing all chunks in parallel and combining the processed chunks sequentially each take about 3 seconds, which leads to a total runtime of roughly 6 seconds.
I'm using a 24 core CPU which means that during the combine stage, only one of those cores is used. My idea then was to use a pipeline to reduce the execution time. My expectation is that the execution time using a pipeline should be somewhere between 3 and 4 seconds. This is how the main function changes when using a pipeline:
def main():
array chunks;
// execute worker_process in n threads where n is the number of (logical) cores
start_n_threads(worker_process, chunks)
// combine chunks in a single thread
start_1_thread(worker_combine, chunks)
// wait until ALL threads are finished
join_threads()
This change decreases the runtime slightly, but not nearly as much as I would have expected. I found out that all processing threads were finished after 3 to 4 seconds and the combine thread needed about 2 seconds more.
The problem is that all threads are treated equally by the scheduler, which leads to the combine thread being paused.
Now the question:
How can I change the pipeline so that the combine thread executes faster while still taking advantage of all cores?
I already tried reducing the number of processing threads which helped a bit, but this leads to some cores not being used at all which isn't good either.
EDIT:
While this question wasn't language-specific until here, I actually need to implement this in C++14.
You could make your worker threads less specialized. So each time a worker thread is free, it could look for work to do; if a chunk is processed but not combined and no thread is currently combining, then the thread can run the combine for that chunk. Otherwise it can check the unprocessed queue for the next chunk to process.
UPDATE
First, I've thought a little more about why this might (or might not) help; and second, from the comments it's clear that some additional clarification is required.
But before I get into it: have you actually tried this approach to see if it helps? Because the fact is, reasoning about parallel processing is hard, which is why frameworks for parallel processing do everything they can to make simplifying assumptions. So if you want to know if it helps, try doing it and let the results direct the conversation. In truth neither of us can say for sure if it's going to help.
So, what this approach gives you is a more fluid acceptance of work onto the cores. Instead of having one worker who will do combining work if it is available when his turn comes up but won't do anything else, plus X (say 24) workers who will never do that one task even if it's ready to do, you have a pool of workers doing whatever needs doing.
Now it's a simple reality that at any time when one core is being used to combine, one less core will be available for processing than would otherwise. And the total aggregate processor time that will be spent on each kind of work is constant. So those aren't variables to optimize. What we'd like is for the allocation of resources at any time to approximate the ratio of total work to be done.
Now to analyze in any detail we'd need to know whether each task (processing task, combining task) is processor-bound or not (and a million follow-up questions depending on the answer). I don't know that, so the following is only generally accurate...
Your initial data suggests that you spent 3 seconds of single-threaded time on combining; let's just call that 3 units of work.
You spent 3 seconds of parallel processing time across 24 cores to do the processing task. Let's swag that out as 72 units of work.
So we'd guess that having roughly 1/25 of your resources on combining wouldn't be bad, but serialization constraints may keep you from realizing that ratio. (And again, if some other resource than processor is the real bottleneck in one or both tasks, then that ratio could be completely wrong anyway.)
Your pipeline approach should get close to that if you could ensure 100% utilization without the combiner thread ever falling asleep. But it can fall asleep, either because work isn't ready for it or because it loses the scheduler lottery some of the time. You can't do much about the former, and maybe the question is can you fix the latter...?
There may be architecture-specific games you can play with thread priority or affinity, but you specified portability and I would at best expect you to have to re-tune parameters to each specific hardware if you play those games. Again, my question is can we get by with something simpler?
The intuition behind my suggestion is simply this: Run enough workers to keep the system busy, and let them do whatever work is ready to do.
The limitation of this approach is that if a worker is put to sleep while it is doing combine work, then the combine pipeline is stalled. But you can't help that unless you can inoculate "thread that's doing combine work" - be that a specialized thread or a worker that happens to be doing a combine unit of work - against being put aside to let other threads run. Again, I'm not saying there's no way - but I'm saying there's no portable way, especially that would run optimally without machine-specific tuning.
And even if you are on a system where you can just outright assign one core to your combiner thread, that still might not be optimal because (a) combining work can still only be done as processing tasks finish, and (b) any time combining work isn't ready that reserved core would sit idle.
Anyway, my guess is that cases where a generic worker gets put to sleep while he happens to be combining would be less frequent than the cases where a dedicated combiner thread is unable to move forward. That's what would drive gains with this approach.
Sometimes it's better to let the incoming workload determine your task allocations, than to try to anticipate and outmaneuver the incoming workload.
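The generic-worker idea can be sketched with standard C++ threads, a mutex and a condition variable. In the sketch below, "processing" (squaring a number) and "combining" (appending to a vector) are placeholders for the real work, and the name process_and_combine is hypothetical. Whether this layout actually beats a dedicated combiner thread would still have to be measured.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Processes every chunk in parallel and combines the results in array order,
// using generic workers that pick whichever kind of work is ready.
std::vector<long> process_and_combine(const std::vector<long>& chunks,
                                      std::size_t n_workers) {
    assert(n_workers > 0);
    const std::size_t n = chunks.size();
    std::vector<long> processed(n);
    std::vector<char> ready(n, 0);  // ready[i] != 0 once chunk i is processed
    std::vector<long> combined;     // only touched by the current combiner
    std::size_t next_process = 0, next_combine = 0;
    bool combining = false;
    std::mutex m;
    std::condition_variable changed;

    auto worker = [&] {
        std::unique_lock<std::mutex> lock(m);
        while (next_combine < n) {
            if (!combining && ready[next_combine]) {
                // The next chunk in order is ready and nobody is combining.
                combining = true;
                const std::size_t i = next_combine;
                lock.unlock();
                combined.push_back(processed[i]);  // combine chunk i
                lock.lock();
                ++next_combine;
                combining = false;
                changed.notify_all();
            } else if (next_process < n) {
                const std::size_t i = next_process++;
                lock.unlock();
                processed[i] = chunks[i] * chunks[i];  // process chunk i
                lock.lock();
                ready[i] = 1;
                changed.notify_all();
            } else {
                changed.wait(lock);  // nothing ready: sleep until state changes
            }
        }
    };

    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < n_workers; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return combined;
}
```

The mutex protects only the bookkeeping; the processing and combining steps themselves run with the lock released, so at most one worker combines while the others keep processing.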
One limitation of the approach the OP asked about is the wait on all threads. You need to pipeline the passing of finished jobs from the workers to the one combining as soon as they are ready to make maximum use of all cores, unless the combining operation really takes very little time compared to the actual computation in the workers (as in almost zero in comparison).
Using a simple threading framework like TBB or OpenMP would enable parallelization of the workers, but tuning the reduce phase (the chunk joining) will be critical. If each join takes a while, doing it at a coarse granularity will be needed. In OpenMP you could do something like:
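// join_arr: shared container that collects the joined results
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < N; i++) {
        double local_result = do_work(i);
        // only one thread at a time may join; note that a critical section
        // alone does not preserve the array order of the chunks
        #pragma omp critical
        join(local_result);
    } // end of for loop
}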
A more explicit and simpler way to do it would be to use something like RaftLib (http://raftlib.io; full disclosure, I'm one of the maintainers, but this is what I designed it for):
int main()
{
arr some_arr;
using foreach = raft::for_each< type_t >;
foreach fe( some_arr, arr_size, thread_count );
raft::map m;
/**
* send data, zero copy, from fe to join as soon as items are
* ready to be joined, so everything is done fully in parallel
* where fe is duplicated on fibers/threads up to thread_count
* wide and the join is on a separate fiber/thread and gathering
*/
m += fe >> join;
m.exe();
}
I have a multicore processor (for instance, 8 cores) and I want to read a lot of files with a function int read_file(...), using all cores effectively. Also, after read_file has executed, the returned value should be stored somewhere (maybe in a vector or a queue).
I'm thinking about using async (from C++11) and future (for getting the result from read_file) with launch policy launch::async in a for loop over all files. But that creates a lot of threads during execution, and reading some files can fail. Maybe I should put some guard on the number of threads created during this execution?
Reading files isn't CPU intensive. So you're focusing on the wrong thing. It's like asking how to use all the power of your car's engine effectively if you're going across the street.
I've written code and run benchmark studies to do exactly that. Storage subsystem configurations vary: for example, someone may have files spread across multiple disks, or on a single RAID device consisting of multiple disks. The best solution in my opinion is a combination of a powerful thread pool and async I/O, both tailored to the system configuration. For instance, the number of threads in the thread pool can be equal to the number of hardware threads; the number of boost::io_service objects can be equal to the number of disks.
Async IO is usually done through an event based solution. You can use boost::asio, libevent, libuv etc.
I'm tempted to argue a Boost.Asio solution might be ideal.
The basic idea involves creating a thread pool that waits for tasks to arrive and queuing all your file reads into that pool.
boost::asio::io_service service;
//The work_ptr object keeps the calls to io_service.run() from returning immediately.
//We could get rid of the object by queuing the tasks before we construct the threads.
//The method presented here is (probably) faster, however.
std::unique_ptr<boost::asio::io_service::work> work_ptr = std::make_unique<boost::asio::io_service::work>(service);
std::vector<YOUR_FILE_TYPE> files = /*...*/;
//Our Thread Pool
std::vector<std::thread> threads;
//std::thread::hardware_concurrency() gets us the number of logical CPU cores.
//May be twice the number of physical cores, due to Hyperthreading/similar tech
for(unsigned int thread = 0; thread < std::thread::hardware_concurrency(); thread++) {
threads.emplace_back([&]{service.run();});
}
//The basic functionality: We "post" tasks to the io_service.
std::vector<int> ret_vals;
ret_vals.resize(files.size());
for(size_t index = 0; index < files.size(); index++) {
service.post([&files, &ret_vals, index]{ret_vals[index] = read_file(files[index], /*...*/);});
}
work_ptr.reset();
for(auto & thread : threads) {
thread.join();
}
//At this time, all ret_vals values have been filled.
/*...*/
One important caveat though: Reading from the disk is orders of magnitude slower than reading from memory. The solution I've provided will scale to virtually any number of threads, but there's very little reason to believe that multithreading will improve the performance of this task, since you'll almost certainly be I/O-bottlenecked, especially if your storage medium is a traditional hard disk, rather than a Solid State Drive.
That isn't to say this is automatically a bad idea; after all, if your read_file function involves a lot of processing of the data (not just reading it) then the performance gains could be quite real. But I do suspect that your use case is a "Premature Optimization" situation, which is the death knell of programming productivity.
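For the plain std::async route raised in the question, one way to cap the number of threads is to launch a fixed number of tasks that each keep claiming the next file index from an atomic counter. This is a sketch under assumptions: read_all is a hypothetical name, and read_one stands in for the asker's read_file, whose real signature is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Runs read_one(path) for every path using at most max_tasks concurrent
// std::async tasks. Each result is stored at the same index as its path.
std::vector<int> read_all(const std::vector<std::string>& paths,
                          const std::function<int(const std::string&)>& read_one,
                          std::size_t max_tasks) {
    assert(max_tasks > 0);
    std::vector<int> results(paths.size());
    std::atomic<std::size_t> next{0};  // index of the next unclaimed file

    std::vector<std::future<void>> tasks;
    for (std::size_t t = 0; t < max_tasks; ++t) {
        tasks.push_back(std::async(std::launch::async, [&] {
            // Each task keeps claiming files until none are left.
            for (std::size_t i = next.fetch_add(1); i < paths.size();
                 i = next.fetch_add(1))
                results[i] = read_one(paths[i]);
        }));
    }
    for (auto& task : tasks) task.get();  // waits; rethrows if read_one threw
    return results;
}
```

Because every index is claimed by exactly one task, the writes to results need no further locking, and a failure inside read_one surfaces as an exception from get().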
I am having trouble understanding some concepts of multithreading. I know the basic principles but am having trouble with the understanding of when individual threads are sent and used by cores.
I know that having multiple threads allows code to run in parallel. I think this would be a good addition to my archive extraction program, which could decompress blocks using multiple cores. It decompresses all of the files in a for loop, and I am hoping that each available core will work on a file.
Here are my questions:
Do I need to query or even consider the number of cores on a machine, or are threads automatically sent to free cores when they are running?
Can anyone show me an example of a for loop using threads? Say that in each loop iteration it would call a function using a different thread. I read that the ideal number of active threads is the number of cores. How do I know when a core is free? Or should I check whether a thread has joined the main thread, and create a new thread when it has, to keep a certain number of threads running?
Am I overcomplicating things or are my questions indicative that I am not grasping the concepts?
If you're decompressing files then you'll probably want a bounded number of threads rather than one thread per file. Otherwise, if you're processing 1000 files you're going to create 1000 threads, which won't make efficient use of the CPU.
As you've mentioned, one approach is to create as many threads as there are cores, and this is a reasonable approach in your case, as decompression is reasonably CPU-bound and therefore any threads you create will be active for most of their time slice. If your problem were IO-bound, then your threads would be spending a lot of time waiting for IO to complete, and therefore you could spin up more threads than you've got cores, within bounds.
For your application I'd probably look at spinning up one thread per core, and have each thread process one file at a time. This will help keep your algorithm simple. If you had multiple threads working on one file then you'd have to synchronize between them in order to ensure that the blocks they processed were written out to the correct location in the uncompressed file, which will cause needless headaches.
C++11 includes a thread library which you can use to simplify working with threads.
No, you can use an API that keeps that transparent, for example POSIX threads on Linux (pthread library).
This answer probably depends on what API you use, though many APIs share threading basics like mutexes. Here, however, is a pthreads example (since that's the only C/C++ threading API I know).
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
// Whatever other headers you need for your code.
#define MAX_NUM_THREADS 12
// Each thread will run this function.
void *worker( void *arg )
{
// Do stuff here and it will be 'in parallel'.
// Note: Threads can read from the same location concurrently
// without issue, but writing to any shared resource that has not been
// locked with, for example, a mutex, can cause pernicious bugs.
// Call this when you're done.
pthread_exit( NULL );
}
int main()
{
// Each is a handle for one thread, with 12 in total.
pthread_t myThreads[MAX_NUM_THREADS];
// Create the worker threads.
for(unsigned long i = 0; i < MAX_NUM_THREADS; i++)
{
// NULL thread attributes struct.
// This initializes the threads with the default PTHREAD_CREATE_JOINABLE
// attribute; we know a thread is finished when it joins, see below.
pthread_create(&myThreads[i], NULL, worker, (void *)i);
}
void *status;
// Wait for the threads to finish.
for(unsigned int i = 0; i < MAX_NUM_THREADS; i++)
{
pthread_join(myThreads[i], &status);
}
// That's all, folks.
pthread_exit(NULL);
}
Without too much detail, that's a pretty basic skeleton for a simple threaded application using pthreads.
Regarding your questions on the best way to go about applying this to your program:
I suggest one thread per file, using a Threadpool Pattern, and here's why:
Single thread per file is much simpler because there's no sharing, hence no synchronization. You can change the worker function to a decompressFile function, passing a filename each time you call pthread_create. That's basically it. Your threadpool pattern sort of falls into place here.
Multiple threads per file means synchronization, which means complexity because you have to manage access to shared resources. In order to speed up your algorithm, you'd have to isolate portions of it that can run in parallel. However, I would actually expect this method to run slower:
Imagine Thread A has File A open, and Thread B has File B open, but File A and File B are in completely different sectors of your disk. As your OS's scheduling algorithm switches between Thread A and Thread B, your hard drive has to spin like mad to keep up, making the CPU (hence your program) wait.
Since you are seemingly new to threading/parallelism, and you just want to get more performance out of multiple processors/cores, I suggest you look for libraries that deal with threading and allow you to enable parallelism without getting into thread management, work distribution etc.
It sounds like all you need now is parallel loop execution. Nowadays there are plenty of C++ libraries that can ease this task for you, e.g. Intel's TBB, Microsoft's PPL, AMD's Bolt, and Qualcomm's MARE, to name a few. You may compare licensing terms, supported platforms, and functionality, and make the choice that best fits your needs.
To be more specific and answer your questions:
1) Generally, you should have no need to know/consider the number of processors or cores. Choose a library that abstracts this detail away from you and your program. On the other hand, if you see that with default settings CPU is not fully utilized (e.g. due to a significant number of I/O operations), you may find it useful to ask for more threads, e.g. by multiplying the default by a certain factor.
2) A sketch of a for loop made parallel with tbb::parallel_for and C++11 lambda functions:
#include <cstddef>
#include <vector>
#include <tbb/tbb.h>
void ParallelFoo( std::vector<MyDataType>& v ) {
tbb::parallel_for( std::size_t(0), v.size(), [&](std::size_t i){
Foo( v[i] );
} );
}
Note that it is not guaranteed that each iteration is executed by a separate thread; but you should not actually worry about such details; all you need is available cores being busy with useful work.
Disclaimer: I'm a developer of Intel's TBB library.
If you're on Windows, you could take a look at Thread Pools; a good description can be found here: http://msdn.microsoft.com/en-us/magazine/cc163327.aspx. An interesting feature of this facility is that it promises to manage the threads for you. It also selects the optimal number of threads depending on demand as well as on the available cores.