I am working on a recursive algorithm that we want to parallelize to improve performance.
I implemented multithreading using Visual C++ 12.0 and the <thread> library. However, I don't see any performance improvement. The time taken is either less by a few milliseconds or is more than the time with a single thread.
Kindly let me know if I am doing something wrong and what corrections I should make to the code.
Here is my code:
void nonRecursiveFoo(<className> &data, int first, int last)
{
//process the data between first and last index and set its value to true based on some condition
//no threads are created here
}
void recursiveFoo(<className> &data, int first, int last)
{
int partitionIndex = -1;
data[first]=true;
data[last]=true;
for (int i = first + 1; i < last; i++)
{
//some logic setting the index
if (/* some condition is true */)
partitionIndex = i;
}
//no dependency of partitions on one another and so can be parallelized
if( partitionIndex != -1)
{
data[partitionIndex]=true;
//assume some threadlimit
if (Commons::GetCurrentThreadCount() < Commons::GetThreadLimit())
{
std::thread t1(recursiveFoo, std::ref(data), first, partitionIndex);
Commons::IncrementCurrentThreadCount();
recursiveFoo(data, partitionIndex , last);
t1.join();
}
else
{
nonRecursiveFoo(data, first, partitionIndex );
nonRecursiveFoo(data, partitionIndex , last);
}
}
}
//main
int main()
{
recursiveFoo(data, 0, data.size() - 1);
}
//commons
std::mutex threadCountMutex;
void Commons::IncrementCurrentThreadCount()
{
threadCountMutex.lock();
CurrentThreadCount++;
threadCountMutex.unlock();
}
int Commons::GetCurrentThreadCount()
{
return CurrentThreadCount;
}
void Commons::SetThreadLimit(int count)
{
ThreadLimit = count;
}
int Commons::GetThreadLimit()
{
return ThreadLimit;
}
int Commons::GetMinPointsPerThread()
{
return MinimumPointsPerThread;
}
Without further information (see comments) this is mostly guesswork, but there are a few things you should watch out for:
First of all, make sure that your partitioning logic is very short and fast compared to the processing. Otherwise, the partitioning itself creates more work than the parallelism gains you.
Make sure there is enough work to begin with, or the speedup might not be enough to pay for the additional overhead of thread creation.
Check that your work gets evenly distributed among the different threads, and don't spawn more threads than you have cores on your computer (print the total number of threads at the end - don't rely on your ThreadLimit).
Don't let your partitions get too small (especially not smaller than 64 bytes), or you end up with false sharing.
It would be MUCH more efficient to implement CurrentThreadCount as a std::atomic<int>, in which case you don't need a mutex.
Put the increment of the counter before the creation of the thread. Otherwise, the newly created thread might read the counter before it is incremented and spawn a new thread in turn, even if the maximum number of threads is already reached. (This is still not a perfect solution, but I would only invest more time in this once you have verified that overcommitting is your actual problem.)
If you really must use a mutex (for reasons outside of the example code), you have to use it for every access to CurrentThreadCount (read and write access). Otherwise this is - strictly speaking - a race condition and thus UB. A sketch of the atomic approach follows.
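For illustration, a minimal sketch of that atomic approach (the helper name TryReserveThreadSlot is invented for this example): reserve a slot before spawning and roll back on failure, so a racing reader can never overshoot the limit.
#include <atomic>

std::atomic<int> currentThreadCount{0};

// Hypothetical helper: returns true if the caller may spawn a thread.
bool TryReserveThreadSlot(int threadLimit)
{
    // Reserve first, then check; roll back if we went over the limit.
    if (currentThreadCount.fetch_add(1, std::memory_order_relaxed) >= threadLimit)
    {
        currentThreadCount.fetch_sub(1, std::memory_order_relaxed);
        return false; // limit reached; caller should recurse sequentially
    }
    return true; // slot reserved; caller may create a std::thread
}
recursiveFoo would then spawn a std::thread only when TryReserveThreadSlot(Commons::GetThreadLimit()) returns true.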
By using t1.join you're basically waiting for the other thread to finish - i.e. not doing anything in parallel.
Looking at your algorithm, I don't see how it can be parallelized (and thus improved) by using threads - you have to wait for each single recursive call to end.
First of all, you are not doing anything in parallel, as every thread creation blocks until the created thread has finished. Hence, your multithreaded code will always be slower than the non-multithreaded version.
In order to parallelize, you could spawn threads for the part where the non-recursive function is called, put the thread handles into a vector, and join at the highest level of the recursion by walking through the vector (there are more elegant ways to do that, but for a first shot this would be OK, I think).
Thus, all non-recursive calls will run in parallel. But you should use the size of the problem as the condition rather than the maximum number of threads, e.g. last - first < threshold. A sketch of the vector-of-threads idea follows.
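A rough sketch of that idea, assuming a shared vector guarded by a mutex (the names workers, workersMutex, and spawnNonRecursive are invented here, and ClassName stands in for the question's <className>):
#include <mutex>
#include <thread>
#include <vector>

std::vector<std::thread> workers;   // all spawned threads end up here
std::mutex workersMutex;

// Instead of joining inside the recursion, remember the thread...
void spawnNonRecursive(ClassName &data, int first, int last)
{
    std::lock_guard<std::mutex> lock(workersMutex);
    workers.emplace_back(nonRecursiveFoo, std::ref(data), first, last);
}

// ...and join everything once, at the highest level:
int main()
{
    ClassName data; // fill data here
    recursiveFoo(data, 0, static_cast<int>(data.size()) - 1);
    for (std::thread &t : workers)
        t.join();
}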
Wholly new to multithreading.
I am writing a program which takes as input a vector of objects and an integer for the number of threads to dedicate. The nature of the objects isn't important, only that each has several members that are file paths to large text files. Here's a simplified version:
// Not very important. Reads file, writes new version omitting
// some lines
void proc_file(OBJ obj) {
std::string inFileStr(obj.get_path().c_str());
std::string outFileStr(std::string(obj.get_path().replace_extension("new.txt").c_str()));
std::ifstream inFile(inFileStr);
std::ofstream outFile(outFileStr);
std::string currLine;
while (getline(inFile, currLine)) {
if (currLine.size() == 1 ||
currLine.compare(currLine.length()-5, 5, "thing") != 0) {
outFile << currLine << '\n';
}
else {
for (int i = 0; i < 3; i++) {
getline(inFile, currLine);
}
}
}
inFile.close();
outFile.close();
}
// Processes n files concurrently, working through
// all OBJ in objs
void multi_file_proc(std::vector<OBJ> objs, int n) {
std::vector<std::thread> procVec;
for (int i = 0; i < objs.size(); i++) {
/*
Ensure that n files are always being processed.
Upon completion of one, initiate another, until
all OBJ in objs have had their text files changed.
*/
}
}
I want to loop through each OBJ and write altered versions of their text files concurrently, the limit on simultaneous file reads/writes being the thread value (n). Ultimately, all the objects' text files must be changed, but in such a way that there are always n files being processed, to maximize concurrency.
Note the vector of threads, procVec. I originally approached this by managing a vector of threads, with a file being processed for each thread in procVec. From my reading, it seems a vector for managing these tasks is logical. But how do I always ensure there are n files open until all have been processed, without exiting with an open thread?
Edit: Apologies, my intention was not to ask others to write code for me. I just didn't want my approach to bias anyone's answer if the approach was bad to begin with.
These are some things I've tried (this code would go into the block comment in my function):
1. First approach. The idea is to add to procVec until the thread limit n is reached, then join and remove the thread at the front of the vector upon its completion. This is a summary of several similar iterations, none of which worked:
if (i >= n) {
procVec.front().join();
procVec.erase(procVec.begin());
}
procVec.push_back(std::thread(proc_file, objs[i]));
Problems with this:
Incorrectly assumes front of vector will always finish first
(Possibly?) Invalidates all iterators in procVec after first is erased
2. Using mutexes, I attempted writing a lambda function where the thread would be removed upon its completion. This is my current approach. I am unsure why it isn't working, or if it even suits my needs:
// remThread() and lamb() are defined above the main function; procVec and
// threadMutex are global variables
void remThread(std::thread::id id) {
std::lock_guard<std::mutex> lock(threadMutex);
auto iter = std::find_if(procVec.begin(), procVec.end(), [=](std::thread &t)
{return (t.get_id() == id); });
if (iter != procVec.end()) {
iter->join();
procVec.erase(iter);
}
}
void lamb(OBJ obj, std::thread::id id) {
proc_file(obj);
remThread(id);
}
// This is the code contained in the main for loop. It calls the lambda to
// process the file and then remove the thread
std::lock_guard<std::mutex> lock(threadMutex);
procVec.push_back(std::thread([objs, i]() {
std::thread(lamb, objs[i], std::this_thread::get_id()).detach();
}));
Problems with this:
Program terminates; likely a joinable thread is still active when it leaves scope.
Given that the example you show is fairly simple - a for loop of fixed size, no strange dependencies - a very simple solution could be to use OpenMP, which would allow you to do what you describe (provided I understood correctly) by adding a single line
void multi_file_proc(std::vector<OBJ> objs, int n) {
std::vector<std::thread> procVec;
#pragma omp parallel for num_threads(n) schedule(dynamic, 1)
for (int i = 0; i < objs.size(); i++) {
/*
...
*/
}
}
in front of the for loop. Of course you then have to modify your compile command to add OpenMP support, the precise flag naturally being different from compiler to compiler, e.g. -fopenmp for g++, -qopenmp for icpc, etc.
The line above basically instructs the compiler to create code to execute the for loop below in parallel. The important bit here is the last clause, where we set the schedule. Dynamic simply means that the order is not predetermined; instead, threads will get their next iteration when they finish their last. The integer 1 there defines the chunk size, i.e. how many iterations a thread grabs at a time; given that each file is large, we want something fine-grained, since we don't expect too much overhead from the scheduling.
A word of caution, OpenMP, like most of C++, will not even try to stop you from shooting yourself in the foot. And with concurrency there are whole new ways to do just that.
Finally, this is by no means guaranteed to be the absolute best solution outright. For instance, if your files are of varying lengths, then you would probably want to sort the objects from longest to shortest before the loop (see the sketch below). This way, once the last object is being processed (at some point only a single thread will be working on the final object), that remaining work won't take too long.
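As a sketch of that last point, assuming get_path() returns a std::filesystem::path (which the question's use of replace_extension suggests) and that file size is a reasonable proxy for processing time:
#include <algorithm>
#include <filesystem>

// Schedule the largest files first, so the dynamic schedule can fill
// the tail of the run with the short ones.
std::sort(objs.begin(), objs.end(), [](const OBJ &a, const OBJ &b) {
    return std::filesystem::file_size(a.get_path()) >
           std::filesystem::file_size(b.get_path());
});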
We're studying for our test next week and have been given an exercise by our teacher, but we just don't see the solution:
How to synchronize n threads, so that all n threads wait at a specific location and only continue with their "work" together when all n threads have reached that location?
We're allowed to use mutex and semaphore constructs. The solution should be easy, but we just can't find the answer.
Here's a big hint. You need 2 semaphores, both with N flags. You can solve this with an extra thread. The key is that you can call down() on a semaphore multiple times: e.g. if you call down() on a semaphore 8 times, you need all 8 up()s before you can continue.
// an additional thread (not one of the N)
void trigger(Semaphore* workersCollect, Semaphore* workersRelease, int n)
{
while(true)
{
for (int i = 0; i < n; ++i)
workersCollect->down();
for (int i = 0; i < n; ++i)
workersRelease->up();
}
}
// Prototype for the "checkpoint" function (exercise for the reader)
void await(Semaphore* workersCollect, Semaphore* workersRelease);
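If you want to check your attempt against something, one possible shape for that function, assuming both semaphores start with zero available flags (a sketch, not necessarily the intended solution):
// Each of the N workers calls this at the synchronization point.
void await(Semaphore* workersCollect, Semaphore* workersRelease)
{
    workersCollect->up();    // signal arrival; the trigger collects N of these
    workersRelease->down();  // block until the trigger has released all N
}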
You can also solve it without the extra thread, by using more complicated state checking.
This design has a drawback. If a worker finishes its work extremely quickly, it can grab more than one task (while another thread ends up not running at all). This is fine if you have a thread-pool kind of design, but bad if, say, each thread is supposed to work on its own distinct section of a dataset.
To fix that, you need a semaphore per thread. Something akin to
Semaphore workerRelease[N];
but being careful to avoid false sharing. (You don't want more than 1 semaphore on a cache line.)
I was trying to run a function on multiple pthreads in order to improve efficiency and reduce runtime. This function performs a lot of matrix calculations and print statements. However, when I ran tests to see the performance improvement, the single-threaded code ran faster.
My tests went as follows:
- For the single-threaded case: run a for loop 1:1000 that calls the function.
- For the multi-pthreaded case: spawn 100 pthreads, have a queue of 1000 items and a pthread_cond_wait, and have the threads run the function until the queue is empty.
Here is my code for the pthreads (single-threaded is just a for loop instead):
#include <iostream>
#include <string>
#include <pthread.h>
#include <queue>
using namespace std;
#define NUM_THREADS 100
int main();
queue<int> testQueue;
void *playQueue(void *arg);
void matrix_exponential_test01();
void matrix_exponential_test02();
pthread_mutex_t queueLock;
pthread_cond_t queue_cv;
int main()
{
pthread_t threads[NUM_THREADS];
pthread_mutex_init(&queueLock, NULL);
pthread_cond_init (&queue_cv, NULL);
for( int i=0; i < NUM_THREADS; i++ )
{
pthread_create(&threads[i], NULL, playQueue, (void*)NULL);
}
pthread_mutex_lock (&queueLock);
for(int z=0; z<1000; z++)
{
testQueue.push(1);
pthread_cond_signal(&queue_cv);
}
pthread_mutex_unlock (&queueLock);
pthread_mutex_destroy(&queueLock);
pthread_cond_destroy(&queue_cv);
pthread_cancel(NULL);
return 0;
}
void* playQueue(void* arg)
{
bool accept;
while(true)
{
pthread_cond_wait(&queue_cv, &queueLock);
accept = false;
if(!testQueue.empty())
{
testQueue.pop();
accept = true;
}
pthread_mutex_unlock (&queueLock);
if(accept)
{
runtest();
}
}
pthread_exit(NULL);
}
My intuition tells me that the multithreaded version should run faster, but it doesn't. Is there a reason, or is my code faulty? I am using C++ on Windows, and had to download a library to use pthreads.
First, your code is written in a way that only one thread will run at any time (your mutex is locked the whole time a thread is doing work). So at best you can expect your code to be as fast as the single-threaded version.
Also, all threads read and write the same memory each time. This forces your CPU cores to synchronize their caches, meaning actually more load on the bus than would be caused by a single thread. Since you are not doing any computationally expensive stuff, it is likely that memory bandwidth is your actual bottleneck, and thus the bus load added by cache synchronization slows down your program. Take a look at http://en.wikipedia.org/wiki/False_sharing for more information.
If runtest() is CPU-bound -- that is, doesn't do anything which might block on I/O or the like -- then there's not much point starting 100 threads unless you have 100 CPUs/cores! [Edit: I now notice that runtest() does some print statements... file I/O probably won't block... so won't release the CPU.]
The code as currently shown holds the mutex while filling the queue, so nothing will start until the queue is full. By the time the filling loop has finished signalling 1000 times, hopefully all the threads will have started -- so all 100 will be waiting on the mutex.
As currently shown the waiting in playQueue() is broken. It should be something along the lines of:
pthread_mutex_lock(&queueLock);
while (testQueue.empty())
    pthread_cond_wait(&queue_cv, &queueLock);
if (/* end-of-work marker dequeued */)
    val = NULL;
else
{
    val = testQueue.front();
    testQueue.pop();
}
pthread_mutex_unlock(&queueLock);
But even when this is all sorted out, there is no guarantee you will see an improvement in performance unless runtest() does a serious amount of work. [Edit: I now notice that it does "a lot of matrix calculations", which sounds like it could be plenty of work.]
One small suggestion: starting the worker threads and filling the queue could be overlapped, by (for instance) giving one worker the job of starting all the others, or starting a thread to fill the queue.
Without knowing more about the problem: if the work can be statically divided across the worker threads -- for instance, giving the first thread items 0..9, the second thread items 10..19, and so on -- then each worker can ignore all the others, reducing the number of synchronisation operations. A sketch of that follows.
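A sketch of that static division, reusing NUM_THREADS and runtest() from the question (ITEMS and playRange are invented names):
#include <pthread.h>
#include <cstdint>

#define ITEMS 1000

// Each worker owns a contiguous slice of the 1000 calls: no queue,
// mutex, or condition variable is needed.
void *playRange(void *arg)
{
    int id = (int)(intptr_t)arg;
    int perThread = ITEMS / NUM_THREADS;
    int first = id * perThread;
    int last = (id == NUM_THREADS - 1) ? ITEMS : first + perThread;
    for (int i = first; i < last; i++)
        runtest();
    return NULL;
}

// In main():
//   for (int i = 0; i < NUM_THREADS; i++)
//       pthread_create(&threads[i], NULL, playRange, (void *)(intptr_t)i);
//   for (int i = 0; i < NUM_THREADS; i++)
//       pthread_join(threads[i], NULL);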
In addition to the other good answers, you said your runtest() function does I/O.
So you could well be I/O bound, in which case all your threads have to wait in line like everybody else to empty out their buffers.
I attended an interview two days back. The interviewer was good at C++, but not at multithreading. He asked me to write code for two threads, where one thread prints 1,3,5,... and the other prints 2,4,6,..., but the output should be 1,2,3,4,5,... So I gave the code below (pseudocode):
mutex_Lock LOCK;
int last=2;
int last_Value = 0;
void function_Thread_1()
{
while(1)
{
mutex_Lock(&LOCK);
if(last == 2)
{
cout << ++last_Value << endl;
last = 1;
}
mutex_Unlock(&LOCK);
}
}
void function_Thread_2()
{
while(1)
{
mutex_Lock(&LOCK);
if(last == 1)
{
cout << ++last_Value << endl;
last = 2;
}
mutex_Unlock(&LOCK);
}
}
After this, he said "these threads will work correctly even without those locks. Those locks will reduce the efficiency". My point was that without the lock there will be situations where one thread checks whether last == 1 or 2 at the same time the other thread is changing the value to 2 or 1. So my conclusion is that it may happen to work without the lock, but that is not a correct/standard way. Now I want to know who is correct, and on what basis?
Without the lock, running the two functions concurrently would be undefined behaviour, because there's a data race in the access of last and last_Value. Moreover (though not causing UB), the printing order would be unpredictable.
With the lock, the program becomes essentially single-threaded, and is probably slower than the naive single-threaded code. But that's just in the nature of the problem (i.e. to produce a serialized sequence of events).
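For completeness, a sketch of that serialized version without busy-waiting, using a condition variable so each thread sleeps until it is its turn (the limit of 10 is arbitrary):
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int last = 2;        // 2 means thread 1 goes next, 1 means thread 2 goes next
int last_Value = 0;

void printer(int self, int other, int limit)
{
    while (true)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return last == other || last_Value >= limit; });
        if (last_Value >= limit)
            return;                  // done; the lock releases on return
        std::cout << ++last_Value << std::endl;
        last = self;
        cv.notify_all();             // wake the other thread for its turn
    }
}

int main()
{
    std::thread t1(printer, 1, 2, 10);
    std::thread t2(printer, 2, 1, 10);
    t1.join();
    t2.join();
}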
I think the interviewer might have thought about using atomic variables.
Each instantiation and full specialization of the std::atomic template defines an atomic type. Objects of atomic types are the only C++ objects that are free from data races; that is, if one thread writes to an atomic object while another thread reads from it, the behavior is well-defined.
In addition, accesses to atomic objects may establish inter-thread synchronization and order non-atomic memory accesses as specified by std::memory_order.
[Source]
By this I mean the only thing you should change is to remove the locks and change the last variable to std::atomic<int> last = 2; instead of int last = 2;
This should make it safe to access the last variable concurrently.
Out of curiosity I have edited your code a bit, and ran it on my Windows machine:
#include <iostream>
#include <atomic>
#include <thread>
#include <Windows.h>
std::atomic<int> last=2;
std::atomic<int> last_Value = 0;
std::atomic<bool> running = true;
void function_Thread_1()
{
while(running)
{
if(last == 2)
{
last_Value = last_Value + 1;
std::cout << last_Value << std::endl;
last = 1;
}
}
}
void function_Thread_2()
{
while(running)
{
if(last == 1)
{
last_Value = last_Value + 1;
std::cout << last_Value << std::endl;
last = 2;
}
}
}
int main()
{
std::thread a(function_Thread_1);
std::thread b(function_Thread_2);
while(last_Value != 6){}//we want to print 1 to 6
running = false;//inform threads we are about to stop
a.join();
b.join();//join
while(!GetAsyncKeyState('Q')){}//wait for 'Q' press
return 0;
}
and the output is always:
1
2
3
4
5
6
Ideone refuses to run this code (compilation errors)..
Edit: But here is a working linux version :) (thanks to soon)
The interviewer doesn't know what he is talking about. Without the locks you get races on both last and last_Value. The compiler could, for example, reorder the assignment to last before the print and increment of last_Value, which could lead to the other thread executing on stale data. Furthermore, you could get interleaved output, meaning things like two numbers not being separated by a line break.
Another thing which could go wrong is that the compiler might decide not to reload last and (less importantly) last_Value each iteration, since they can't (safely) change between those iterations anyway (since data races are illegal by the C++11 standard and aren't acknowledged in previous standards). This means that the code suggested by the interviewer actually has a good chance of creating infinite loops that do absolutely nothing.
While it is possible to make that code correct without mutexes, that absolutely needs atomic operations with appropriate ordering constraints (release semantics on the store to last and acquire semantics on the load of last inside the if statement); a sketch follows below.
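A sketch of what that lock-free variant could look like (the stop sentinel 0 and the limit of 10 are invented for the example):
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> last{2};   // whose turn it is; 0 is a stop sentinel
int last_Value = 0;         // plain int: handed off via release/acquire

void worker(int actWhen, int passTo)
{
    for (;;)
    {
        int turn = last.load(std::memory_order_acquire);
        if (turn == 0)
            return;          // the other thread signalled completion
        if (turn != actWhen)
            continue;        // not our turn yet: spin
        std::cout << ++last_Value << std::endl;
        // The release store publishes the increment; the other thread's
        // acquire load above synchronizes with it.
        last.store(last_Value >= 10 ? 0 : passTo, std::memory_order_release);
    }
}

int main()
{
    std::thread a(worker, 2, 1);   // like function_Thread_1
    std::thread b(worker, 1, 2);   // like function_Thread_2
    a.join();
    b.join();
}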
Of course your solution does lower efficiency, due to effectively serializing the whole execution. However, since the runtime is almost completely spent inside the stream output operation, which is almost certainly internally synchronized by the use of locks, your solution doesn't lower the efficiency any more than it already is. Waiting on the lock in your code might actually be faster than busy-waiting for it, depending on the available resources (the non-locking version using atomics would absolutely tank when executed on a single-core machine).
For example, I've got some work that is computed simultaneously by multiple threads.
For demonstration purposes the work is performed inside a while loop. In a single iteration each thread performs its own portion of the work; before the next iteration begins, a counter should be incremented once.
My problem is that the counter is updated by each thread.
As this seems like a relatively simple thing to want to do, I presume there is a 'best practice' or common way to go about it?
Here is some sample code to illustrate the issue and help the discussion along.
(I'm using Boost threads.)
class someTask {
public:
int mCounter; //initialized to 0
int mTotal; //initialized to i.e. 100000
boost::mutex cntmutex;
int getCount()
{
boost::mutex::scoped_lock lock( cntmutex );
return mCounter;
}
void process( int thread_id, int numThreads )
{
while ( getCount() < mTotal )
{
// The main task is performed here and is divided
// into sub-tasks based on the thread_id and numThreads
// Wait for all threads to get to this point
cntmutex.lock();
mCounter++; // < ---- how to ensure this is only updated once?
cntmutex.unlock();
}
}
};
The main problem I see here is that you reason at too low a level. Therefore, I am going to present an alternative solution based on the new C++11 thread API.
The main idea is that you essentially have a schedule -> dispatch -> do -> collect -> loop routine. In your example you try to reason about all of this within the do phase, which is quite hard. Your pattern can be much more easily expressed using the opposite approach.
First we isolate the work to be done in its own routine:
void process_thread(size_t id, size_t numThreads) {
// do something
}
Now, we can easily invoke this routine:
#include <future>
#include <thread>
#include <vector>
void process(size_t const total, size_t const numThreads) {
for (size_t count = 0; count != total; ++count) {
std::vector< std::future<void> > results;
// Create all threads, launch the work!
for (size_t id = 0; id != numThreads; ++id) {
results.push_back(std::async(process_thread, id, numThreads));
}
// The destruction of `std::future`
// requires waiting for the task to complete (*)
}
}
(*) See this question.
You can read more about std::async here, and a short introduction is offered here (they appear to be somewhat contradictory on the effect of the launch policy, oh well). It is simpler here to let the implementation decide whether or not to create OS threads: it can adapt depending on the number of available cores.
Note how the code is simplified by removing shared state. Because the threads share nothing, we no longer have to worry about synchronization explicitly!
You protected the counter with a mutex, ensuring that no two threads can access the counter at the same time. Your other option would be using Boost.Atomic, C++11 atomic operations, or platform-specific atomic operations.
However, your code seems to access mCounter without holding the mutex:
while ( mCounter < mTotal )
That's a problem. You need to hold the mutex to access the shared state.
You may prefer to use this idiom (a sketch follows the list):
1. Acquire lock.
2. Do tests and other things to decide whether we need to do work or not.
3. Adjust accounting to reflect the work we've decided to do.
4. Release lock. Do work. Acquire lock.
5. Adjust accounting to reflect the work we've done.
6. Loop back to step 2 unless we're totally done.
7. Release lock.
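Applied to the counter example from the question, a sketch of that idiom could look like this (do_sub_task is a made-up placeholder for the real work):
void process(int thread_id, int numThreads)
{
    boost::mutex::scoped_lock lock(cntmutex);        // step 1: acquire
    while (mCounter < mTotal)                        // step 2: decide
    {
        const int claimed = mCounter++;              // step 3: adjust accounting
        lock.unlock();                               // step 4: release...
        do_sub_task(claimed, thread_id, numThreads); //         ...do work unlocked
        lock.lock();                                 //         re-acquire
        // step 5 would adjust accounting for completed work here;
        // step 6: the while condition re-tests whether we're done
    }
}                                                    // step 7: released on scope exit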
You need to use a message-passing solution. This is more easily enabled by libraries like TBB or PPL. PPL is included for free in Visual Studio 2010 and above, and TBB can be downloaded for free under a FOSS licence from Intel.
concurrent_queue<unsigned int> done;
std::vector<Work> work;
// fill work here
parallel_for(0u, static_cast<unsigned int>(work.size()), [&](unsigned int i) {
processWorkItem(work[i]);
done.push(i);
});
It's lockless and you can have an external thread monitor the done variable to see how much, and what, has been completed.
I would like to disagree with David on doing multiple lock acquisitions to do the work.
Mutexes are expensive: with more threads contending for a mutex, acquiring it basically falls back to a system call, which results in a user-space to kernel-space context switch, with the calling thread(s) forced to sleep - thus a lot of overhead.
So if you are using a multiprocessor system, I would strongly recommend using spin locks instead [1].
So what I would do is:
=> Get rid of the scoped lock acquisition to check the condition.
=> Make your counter volatile to support the above.
=> In the while loop, do the condition check again after acquiring the lock.
class someTask {
public:
volatile int mCounter; //initialized to 0 : Make your counter Volatile
int mTotal; //initialized to i.e. 100000
boost::mutex cntmutex;
void process( int thread_id, int numThreads )
{
while ( mCounter < mTotal ) //compare without acquiring lock
{
// The main task is performed here and is divided
// into sub-tasks based on the thread_id and numThreads
cntmutex.lock();
//Now compare again to make sure that the condition still holds
//This saves all those acquisitions and lock releases we would otherwise do
//just to check whether the condition was true.
if(mCounter < mTotal)
{
mCounter++;
}
cntmutex.unlock();
}
}
};
[1] http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock