If I have an array that I want to be updated by multiple threads simultaneously, what's the best/fastest way to go about doing that? For example, say I have the following code:
std::vector<float> vec;
vec.push_back(0.f);
for(int i = 0; i < 10000; i++) {
    std::thread([&]{
        // SAFETY CONSTRUCTS GO HERE
        vec[0] += 1; // OR MAYBE HERE
        // AND HERE?
    }).detach(); // detached so the temporary std::thread doesn't call std::terminate
}
// wait a little while, i.e. I was too lazy to write out joins
std::cout << vec[0];
If I want this to be safe and finally print the value 10000, what would be the best/fastest way to do this?
In the example you've given, the best/safest way would be to not launch threads at all, and simply update vec[0] in the loop. The overhead of launching and synchronising threads will probably exceed any benefit you get by doing some operations in parallel.
vec is a non-atomic object (std::vector<float>) and vec[0] is actually a function call (std::vector::operator[]). Such objects, and their non-static member functions, cannot protect themselves from concurrent access by multiple threads. To use them from multiple threads, every direct use of vec (and vec[0]) must be synchronised.
Generally, safety involving concurrently executing threads is achieved by synchronising access to any variables (or, more generally, memory) that are updated and accessed by multiple threads.
If using a mutex, that normally means all threads which access shared data must first grab the mutex, do the operation on shared variables (e.g. update v[0]), and then release the mutex. If a thread has not grabbed (or has grabbed and then released) the mutex, then all operations it does must NOT touch the shared variables.
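For the question's code specifically, a minimal sketch of that pattern might look like the following (with the threads stored and joined rather than fired off anonymously, and the mutex held only for the update):
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<float> vec;
    vec.push_back(0.f);
    std::mutex mtx;
    std::vector<std::thread> threads;
    for (int i = 0; i < 10000; i++) {
        threads.emplace_back([&] {
            std::lock_guard<std::mutex> guard(mtx); // grab the mutex
            vec[0] += 1;                            // update the shared data
        });                                         // mutex released on scope exit
    }
    for (std::thread& t : threads) {
        t.join();
    }
    std::cout << vec[0] << '\n'; // prints 10000
}
Note that this is safe but not fast: every thread immediately contends for the same lock, so no useful work actually happens in parallel.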
If you want performance through threading, you will need to have a significant amount of the work done in each thread without ANY access to shared variables. That work, since parts can be executed concurrently, can potentially be executed in less total elapsed time. For that to represent a performance benefit, the gains (e.g. by doing a lot of operations concurrently) need to exceed the costs (of launching threads, of synchronising access to any data that is accessed by multiple threads).
That is highly unlikely in anything resembling the code you have shown.
The point is that there is always a trade-off between speed and safety, when threads share any data. Safety requires updating of shared variables to be synchronised - without exception. A performance gain is generally derived from the things that do not need to be synchronised (i.e. that don't access variables shared between threads) and can be executed in parallel.
There's no single magic technique to have highly performant parallel access to shared data, but there are a few general techniques you'll see fairly often.
I'll use the example of summing an array in parallel for my answer, but these techniques apply pretty generally to many parallel algorithms.
1) Avoid sharing data in the first place
This is likely to be the safest and fastest method. Instead of having your worker threads directly update the shared state, have each of them work with their own local state, and then have your main thread combine the results. For the array sum example, this could look something like this:
#include <algorithm>
#include <cmath>
#include <functional>
#include <iostream>
#include <thread>
#include <vector>

// getSomeVector() and NUM_THREADS are assumed to be defined elsewhere.
int main() {
    std::vector<int> toSum = getSomeVector();
    std::vector<int> sums(NUM_THREADS);
    std::vector<std::thread> threads;
    int chunkSize = (int)std::ceil(toSum.size() / (float)NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; ++i) {
        // Clamp both ends of the chunk so the last chunk can't run past the vector.
        std::size_t beginIdx = std::min<std::size_t>(std::size_t(i) * chunkSize, toSum.size());
        std::size_t endIdx = std::min<std::size_t>(beginIdx + chunkSize, toSum.size());
        auto chunkBegin = toSum.begin() + beginIdx;
        auto chunkEnd = toSum.begin() + endIdx;
        threads.emplace_back([chunkBegin, chunkEnd](int& result) mutable {
            for (; chunkBegin != chunkEnd; ++chunkBegin) {
                result += *chunkBegin;
            }
        }, std::ref(sums[i]));
    }
    for (std::thread& thd : threads) {
        thd.join();
    }
    int finalSum = 0;
    for (int partialSum : sums) {
        finalSum += partialSum;
    }
    std::cout << finalSum << '\n';
}
Since each thread only ever operates on its own partial sum, the threads cannot interfere with each other, and no extra synchronization is needed. You have to do a little bit of extra work at the end to add all the partial sums up, but the number of partial results is small, so this overhead should be minimal.
2) Mutual exclusion
Instead of having each thread operate on its own state, you can protect shared state with a locking mechanism. Fairly often this is a mutex, but there are lots of different locking primitives with slightly different roles. The point is to make sure only one thread is ever working with the shared state at a time. Be very careful when using this technique to avoid accessing the shared state within a tight loop: since only one thread can hold the lock at a time, it's very easy to accidentally transform your fancy parallel code back into single-threaded code by making it so that only one thread can ever be working at a time.
For example, consider the following:
#include <algorithm>
#include <cmath>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> toSum = getSomeVector();
    int sum = 0;
    std::vector<std::thread> threads;
    int chunkSize = (int)std::ceil(toSum.size() / (float)NUM_THREADS);
    std::mutex mtx;
    for (int i = 0; i < NUM_THREADS; ++i) {
        std::size_t beginIdx = std::min<std::size_t>(std::size_t(i) * chunkSize, toSum.size());
        std::size_t endIdx = std::min<std::size_t>(beginIdx + chunkSize, toSum.size());
        auto chunkBegin = toSum.begin() + beginIdx;
        auto chunkEnd = toSum.begin() + endIdx;
        threads.emplace_back([chunkBegin, chunkEnd, &mtx, &sum]() mutable {
            for (; chunkBegin != chunkEnd; ++chunkBegin) {
                std::lock_guard guard(mtx); // taken on every single iteration: bad!
                sum += *chunkBegin;
            }
        });
    }
    for (std::thread& thd : threads) {
        thd.join();
    }
    std::cout << sum << '\n';
}
Since each thread locks mtx within its loop, only one thread can ever be doing any work at a time. There is no parallelization here, and this code is likely to be slower than the equivalent single-threaded code due to the extra overhead of allocating threads and locking and unlocking the mutex.
Instead, try to do as much as possible independently, and access your shared state as infrequently as possible. For this example, you can do something similar to the example in (1) and build up partial sums within each thread, only adding them to the shared sum once at the end:
#include <algorithm>
#include <cmath>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::vector<int> toSum = getSomeVector();
    int sum = 0;
    std::vector<std::thread> threads;
    int chunkSize = (int)std::ceil(toSum.size() / (float)NUM_THREADS);
    std::mutex mtx;
    for (int i = 0; i < NUM_THREADS; ++i) {
        std::size_t beginIdx = std::min<std::size_t>(std::size_t(i) * chunkSize, toSum.size());
        std::size_t endIdx = std::min<std::size_t>(beginIdx + chunkSize, toSum.size());
        auto chunkBegin = toSum.begin() + beginIdx;
        auto chunkEnd = toSum.begin() + endIdx;
        threads.emplace_back([chunkBegin, chunkEnd, &mtx, &sum]() mutable {
            int partialSum = 0;
            for (; chunkBegin != chunkEnd; ++chunkBegin) {
                partialSum += *chunkBegin;
            }
            {
                std::lock_guard guard(mtx); // one lock per thread, not per element
                sum += partialSum;
            }
        });
    }
    for (std::thread& thd : threads) {
        thd.join();
    }
    std::cout << sum << '\n';
}
3) Atomic variables
Atomic variables are variables that can be "safely" shared between threads. They are very powerful, but also very easy to misuse. You have to worry about things like memory-ordering constraints, and getting them wrong can be very difficult to debug.
At their core, atomic variables could be implemented as simple variables whose operations are guarded by a mutex or something similar. The magic all lies in the implementation, which often uses special CPU instructions to coordinate access to the variables at the CPU level, avoiding a lot of the overhead of locking and unlocking.
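As a toy illustration of that mental model only (this is not how a real std::atomic is implemented; real ones typically compile down to dedicated CPU instructions), an atomic-style counter could be sketched as a mutex-guarded value:
#include <mutex>

// Toy model only: shows the semantics of fetch_add/load, not the real implementation.
template <class T>
class toy_atomic {
    T value;
    mutable std::mutex m;
public:
    explicit toy_atomic(T v) : value(v) {}
    T fetch_add(T arg) {
        std::lock_guard<std::mutex> guard(m);
        T old = value;
        value += arg;
        return old; // fetch_add returns the previous value
    }
    T load() const {
        std::lock_guard<std::mutex> guard(m);
        return value;
    }
};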
Atomics aren't a magic bullet though. There is still overhead involved, and you can still shoot yourself in the foot by accessing your atomics too frequently. Your CPU does a lot of caching, and having multiple threads writing to an atomic variable likely means spilling the contents back out to memory, or at least to a higher level of cache. Once again, if you can avoid accessing your shared state within tight loops in your threads, you should do so:
#include <algorithm>
#include <atomic>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<int> toSum = getSomeVector();
    std::atomic<int> sum(0);
    std::vector<std::thread> threads;
    int chunkSize = (int)std::ceil(toSum.size() / (float)NUM_THREADS);
    for (int i = 0; i < NUM_THREADS; ++i) {
        std::size_t beginIdx = std::min<std::size_t>(std::size_t(i) * chunkSize, toSum.size());
        std::size_t endIdx = std::min<std::size_t>(beginIdx + chunkSize, toSum.size());
        auto chunkBegin = toSum.begin() + beginIdx;
        auto chunkEnd = toSum.begin() + endIdx;
        threads.emplace_back([chunkBegin, chunkEnd, &sum]() mutable {
            int partialSum = 0;
            for (; chunkBegin != chunkEnd; ++chunkBegin) {
                partialSum += *chunkBegin;
            }
            // Since we don't care about the order in which the threads update
            // the sum, we can use memory_order_relaxed. That's a rabbit hole
            // I won't go too deep into here, though.
            sum.fetch_add(partialSum, std::memory_order_relaxed);
        });
    }
    for (std::thread& thd : threads) {
        thd.join();
    }
    std::cout << sum << '\n';
}
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also: I need the results of all simulations in a single vector (or array) in the end.
So I came up with the approach below:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
    Simulator sim{RandSeed};
    while(LastAdded < Max)
    {
        // Do some work to bring foo to the desired state
        // The duration of this work is subject to randomness
        vec[LastAdded++] = sim.GetResult(); // Produces SimResult.
    }
}
int main()
{
    // launch a bunch of std::async tasks that start filling vec
    auto fut1 = std::async(fill, 1);
    auto fut2 = std::async(fill, 2);
    // maybe some more tasks
    fut1.get();
    fut2.get();
    // do something with the results in vec
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); final result is immediately in the array; performant.
Reading about various approaches, it seems atomic is a good candidate, but I am not sure which settings will be most performant in my case. And I am not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your Simulator class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area is to create N Simulator objects with the same properties and give each one a different random seed. Then you can farm these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(N_runs); i++)
    {
        auto sim = Simulator(seed + i);
        results[i] = sim.GetResult();
    }
    return results;
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, dynamically split work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
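Putting the pieces together, a sketch of the earlier function with dynamic scheduling might look like this (Simulator and SimResult as defined in the question):
std::vector<SimResult> generateResultsDynamic(size_t N_runs, double seed)
{
    std::vector<SimResult> results(N_runs);
    // Each thread claims 16 runs at a time; threads whose runs finish quickly
    // simply come back for more, which balances out the random per-run durations.
    #pragma omp parallel for schedule(dynamic, 16)
    for (long i = 0; i < static_cast<long>(N_runs); i++)
    {
        auto sim = Simulator(seed + i);
        results[i] = sim.GetResult();
    }
    return results;
}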
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version at the end for an example.
Update
To accommodate vastly varying calculation times, you should keep your current code but avoid race conditions via a std::lock_guard. You will need a std::mutex that is the same for all threads, for example a global variable, or you can pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while(true)
    {
        {
            // enter critical area
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            // acquire next item
            if(LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded++;
            }
            else
            {
                break;
            }
            // lock is released when nextItemLock goes out of scope
        }
        // Do some work to bring foo to the desired state
        // The duration of this work is subject to randomness
        vec[workingIndex] = sim.GetResult(); // Produces SimResult.
    }
}
The problem with this is that synchronisation is quite expensive. But it is probably cheap compared to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
    Simulator sim{RandSeed};
    size_t workingIndex;
    while(true)
    {
        {
            std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
            if(LastAdded < Max)
            {
                workingIndex = LastAdded;
                LastAdded += blockSize;
            }
            else
            {
                break;
            }
        }
        for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
            vec[i] = sim.GetResult(); // Produces SimResult.
    }
}
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
    Simulator sim{RandSeed};
    for(size_t i = partitionStart; i < partitionEnd; i++)
    {
        // Do some work to bring foo to the desired state
        // The duration of this work is subject to randomness
        vec[i] = sim.GetResult(); // Produces SimResult.
    }
}
int main()
{
    // launch a bunch of std::async tasks
    auto fut1 = std::async(fill, 1, 0, Max / 2);
    auto fut2 = std::async(fill, 2, Max / 2, Max);
    // ...
    fut1.get();
    fut2.get();
}
I am experimenting with std::async to populate a vector. The idea behind it is to use multi-threading to save time. However, running some benchmark tests I find that my non-async method is faster!
#include <algorithm>
#include <vector>
#include <future>
std::vector<int> Generate(int i)
{
    std::vector<int> v;
    for (int j = i; j < i + 10; ++j)
    {
        v.push_back(j);
    }
    return v;
}
Async:
std::vector<std::future<std::vector<int>>> futures;
for (int i = 0; i < 200; i += 10)
{
    futures.push_back(std::async(
        [](int i) { return Generate(i); }, i));
}
std::vector<int> res;
for (auto &&f : futures)
{
    auto vec = f.get();
    res.insert(std::end(res), std::begin(vec), std::end(vec));
}
Non-async:
std::vector<int> res;
for (int i = 0; i < 200; i += 10)
{
    auto vec = Generate(i);
    res.insert(std::end(res), std::begin(vec), std::end(vec));
}
My benchmark test shows that the async method is 71 times slower than non-async. What am I doing wrong?
std::async has two modes of operation:
std::launch::async
std::launch::deferred
In this case, you've called std::async without specifying either one, which means it's allowed to choose either one. std::launch::deferred basically means do the work on the calling thread. So std::async returns a future, and with std::launch::deferred, the action you've requested won't be carried out until you call .get on that future. It can be kind of handy under a few circumstances, but it's probably not what you want here.
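A small sketch of the difference between the two policies (work() is just a placeholder task):
#include <future>
#include <iostream>

int work() { return 42; }

int main() {
    // Starts running on another thread right away:
    auto eager = std::async(std::launch::async, work);
    // Does nothing until .get() is called, then runs on the calling thread:
    auto lazy = std::async(std::launch::deferred, work);
    std::cout << eager.get() + lazy.get() << '\n'; // prints 84
}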
Even if you specify std::launch::async, you need to realize that this starts up a new thread of execution to carry out the action you've requested. It then has to create a future, and use some sort of signalling from the thread to the future to let you know when the computation you've requested is done.
All of that adds a fair amount of overhead--anywhere from microseconds to milliseconds or so, depending on the OS, CPU, etc.
So, for asynchronous execution to make sense, the "stuff" you do asynchronously typically needs to take tens of milliseconds at the very least (and hundreds of milliseconds might be a more sensible lower threshold). I wouldn't get too wrapped up in the exact cutoff, but it needs to be something that takes a while.
So, filling an array asynchronously probably only makes sense if the array is quite a lot larger than you're dealing with here.
For filling memory, you'll quickly run into another problem though: most CPUs are so much faster than main memory that, if all you're doing is writing to memory, a single thread will likely already saturate the path to memory. So even at best, doing the job asynchronously will only gain a little, and it may still easily cause a slow-down.
The ideal case for asynchronous operation would be something like one thread that's heavily memory bound, but another that (for example) reads a little bit of data, and does a lot of computation on that small amount of data. In this case, the computation thread will mostly operate on its data in the cache, so it won't get in the way of the memory thread doing its thing.
There are multiple factors that are causing the Multithreaded code to perform (much) slower than the Singlethreaded code.
Your array sizes are too small
Multithreading often has negligible-to-no effect on datasets that are particularly small. In both versions of your code, you're generating 2000 integers, and each Logical Thread (which, because std::async is often implemented in terms of thread pools, might not be the same as a Software Thread) is only generating 10 integers. The cost of spooling up a thread every 10 integers far outweighs the benefit of generating those integers in parallel.
You might see a performance gain if each thread were instead responsible for, say, 10,000 integers, but then you'll probably run into a different issue:
All your code is bottlenecked by an inherently serial process
Both versions of the code copy the generated integers into a host vector. It would be one thing if the act of generating those integers was itself a time consuming process, but in your case, it's likely just a matter of a small, fast bit of assembly generating each integer.
So the act of copying each integer into the final vector is probably not inherently faster than generating each integer, meaning a sizable chunk of the "work" being done is completely serial, defeating the whole purpose of multithreading your code.
Fixing the code
Compilers are very good at their jobs, so in trying to revise your code, I was only barely able to get multithreaded code that was faster than the serial code. Multiple executions had varying results, so my general assessment is that this kind of code is bad at being multithreaded.
But here's what I came up with:
#include <algorithm>
#include <vector>
#include <future>
#include <thread>
#include <chrono>
#include <iostream>
#include <iomanip>

// #1: Constants
constexpr int BLOCK_SIZE = 500000;
constexpr int NUM_OF_BLOCKS = 20;

std::vector<int> Generate(int i) {
    std::vector<int> v;
    for (int j = i; j < i + BLOCK_SIZE; ++j) {
        v.push_back(j);
    }
    return v;
}

void asynchronous_attempt() {
    std::vector<std::future<void>> futures;
    // #2: Preallocated vector
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i += BLOCK_SIZE)
    {
        futures.push_back(std::async(
            [it](int i) {
                auto vec = Generate(i);
                // #3: Copying done multithreaded
                std::copy(vec.begin(), vec.end(), it + i);
            }, i));
    }
    for (auto &&f : futures) {
        f.get();
    }
}

void serial_attempt() {
    // #4: Changes here to show a fair comparison
    std::vector<int> res(NUM_OF_BLOCKS * BLOCK_SIZE);
    auto it = res.begin();
    for (int i = 0; i < NUM_OF_BLOCKS * BLOCK_SIZE; i += BLOCK_SIZE) {
        auto vec = Generate(i);
        it = std::copy(vec.begin(), vec.end(), it);
    }
}

int main() {
    using clock = std::chrono::steady_clock;
    std::cout << "Theoretical # of Threads: " << std::thread::hardware_concurrency() << std::endl;
    auto begin = clock::now();
    asynchronous_attempt();
    auto end = clock::now();
    std::cout << "Duration of Multithreaded Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
    begin = clock::now();
    serial_attempt();
    end = clock::now();
    std::cout << "Duration of Serial Attempt: " << std::setw(10) << (end - begin).count() << "ns" << std::endl;
}
This resulted in the following output:
Theoretical # of Threads: 2
Duration of Multithreaded Attempt: 361149213ns
Duration of Serial Attempt: 364785676ns
Given that this was on an online compiler (here), I'm willing to bet the multithreaded code might win out on a dedicated machine, but I think this at least demonstrates that the two methods are on par in performance.
Below are the changes I made, which are ID'd in the code:
We've dramatically increased the number of integers being generated, to force the threads to do actual meaningful work instead of getting bogged down in OS-level housekeeping.
The vector has its size pre-allocated. No more frequent resizing.
Now that the space has been preallocated, we can multithread the copying instead of doing it in serial later.
We have to change the serial code so it also preallocates + copies so that it's a fair comparison.
Now, we've ensured that all the code is indeed running in parallel, and while it's not amounting to a substantial improvement over the serial code, it's at least no longer exhibiting the degenerate performance losses we were seeing before.
First of all, you are not forcing std::async to work asynchronously (you would need to specify the std::launch::async policy to do so). Second of all, it's overkill to asynchronously create a std::vector of only 10 ints; it's just not worth it. Remember, using more threads does not mean you will see a performance benefit! Creating a thread (or even using a thread pool) introduces overhead which, in this case, seems to dwarf the benefits of running tasks asynchronously.
Thanks #NathanOliver ;>
I'm making a parallel password cracker for an assignment. When I launch more than one thread, cracking takes longer the more threads I add. What is the problem here?
Secondly, what resource sharing techniques can I use for optimal performance too? I'm required to use either mutexes, atomic operations or barriers while also using semaphores, conditional variables or channels. Mutexes seem to slow my program down quite drastically.
Here is an example of my code for context:
std::mutex mtx;
std::condition_variable cv;

void run()
{
    std::unique_lock<std::mutex> lck(mtx);
    ready = true;
    cv.notify_all();
}

void crack()
{
    std::lock_guard<std::mutex> lk(mtx);
    ...do cracking stuff
}

int main()
{
    ....
    std::thread *t = new std::thread[uiThreadCount];
    for(int i = 0; i < uiThreadCount; i++)
    {
        t[i] = std::thread(crack, params);
    }
    run();
    for(int i = 0; i < uiThreadCount; i++)
    {
        t[i].join();
    }
}
When writing multi-threaded code, it's generally a good idea to share as few resources as possible, so you can avoid having to synchronize using a mutex or an atomic.
There are a lot of different ways to do password cracking, so I'll give a slightly simpler example. Let's say you have a hash function, and a hash, and you're trying to guess what input produces the hash (this is basically how a password would get cracked).
We can write the cracker like this. It takes the hash function and the password hash, checks a range of values, and invokes the callback function if it finds a match.
auto cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, auto callback) {
    for(auto i = min; i < max; i++) {
        auto output = hashFunc(i);
        if(output == passwdHash) {
            callback(i);
        }
    }
};
Now, we can write a parallel version. This version only has to synchronize when it finds a match, which is pretty rare.
auto parallel_cracker = [](auto passwdHash, auto hashFunc, auto min, auto max, int num_threads) {
    // Get a vector of threads
    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    // Make a vector of all the matches it discovered
    using input_t = decltype(min);
    std::vector<input_t> matches;
    std::mutex match_lock;
    // Whenever a match is found, this function gets called
    auto callback = [&](input_t match) {
        std::unique_lock<std::mutex> _lock(match_lock);
        std::cout << "Found match: " << match << '\n';
        matches.push_back(match);
    };
    for(int i = 0; i < num_threads; i++) {
        auto sub_min = min + ((max - min) * i) / num_threads;
        auto sub_max = min + ((max - min) * (i + 1)) / num_threads;
        threads.push_back(std::thread(cracker, passwdHash, hashFunc, sub_min, sub_max, callback));
    }
    // Join all the threads
    for(auto& thread : threads) {
        thread.join();
    }
    return matches;
};
Yes, and that is not surprising given the way it's written: by grabbing the mutex at the beginning of your thread function (crack), you effectively make the threads run sequentially.
I understand you want to achieve a "synchronous start" of the threads (hence the condition variable cv), but you don't use it properly: without a call to one of its wait methods, cv.notify_all() is useless. It does not do what you intended; your threads will simply run sequentially.
Using wait() from the std::condition_variable in your crack() function is imperative: it releases the mtx (which you just grabbed via the lock guard lk) and blocks the thread until cv.notify_all() is called. After that call, the woken threads still reacquire mtx one at a time, so if you really want parallel execution you then need to unlock the mtx.
Here is how your crack thread should look:
void crack()
{
    std::unique_lock<std::mutex> lk(mtx);
    // Wait with a predicate: this guards against spurious wakeups and against
    // a notify_all() that fires before this thread reaches the wait.
    cv.wait(lk, []{ return ready; });
    lk.unlock();
    ...do cracking stuff
}
btw, the ready flag in your run() is not redundant after all: use it as the predicate of the wait, as shown above. Without the predicate, a thread that only reaches wait() after notify_all() has already fired would block forever, and a spurious wakeup could release a thread too early.
I'm required to use either mutexes, atomic operations or barriers
while also using semaphores, conditional variables or channels
As for that list of required tools: different tools/techniques are good for different things, so that part of the question is too general to answer here.
I have a function that populates entries in a large matrix. As the computations are independent, I was thinking about exploiting std::thread so that chunks of the matrix can be processed by separate threads.
Instead of dividing the matrix in to n chunks where n is the limit on the maximum number of threads allowed to run simultaneously, I would like to make finer chunks, so that I could spawn a new thread when an existing thread is finished. (As the compute time will be widely different for different entries, and equally dividing the matrix will not be very efficient here. Hence the latter idea.)
What are the concepts in std::thread I should look into for doing this? (I came across async and condition_variables although I don't clearly see how they can be exploited for such kinds of spawning). Some example pseudo code would greatly help!
Why tax the OS scheduler with thread creation and destruction? (Assume these operations are expensive.) Make each thread do more work instead.
EDIT: If you do not want to split the work into equal chunks, then the best solution really is a thread pool. FYI, there is a thread_pool library in the works for C++14.
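As a rough sketch of the "make your threads work more" idea without a full thread pool, an atomic counter lets idle threads claim the next unprocessed row themselves (fill_one_row is a hypothetical stand-in for the per-row computation):
#include <atomic>
#include <thread>
#include <vector>

template <class FillOneRow>
void fill_dynamic(int nrows, int num_threads, FillOneRow fill_one_row)
{
    std::atomic<int> next_row{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            // Each thread repeatedly claims the next row; threads whose rows
            // finish quickly simply claim more rows.
            for (int row = next_row.fetch_add(1); row < nrows;
                 row = next_row.fetch_add(1)) {
                fill_one_row(row);
            }
        });
    }
    for (auto& th : threads) th.join();
}
With the matrix struct below, this could be invoked as fill_dynamic(m.nrows, num_threads, [&m](int r) { m.fill_rows(r, r + 1); });.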
What follows below assumes that you can split the work into equal chunks, so it is not exactly applicable to your question. END OF EDIT.
struct matrix
{
    int nrows, ncols;
    // assuming row-based processing; adjust for column-based processing
    void fill_rows(int first, int last);
};

int num_threads = std::thread::hardware_concurrency();
std::vector<std::thread> threads(num_threads);
matrix m; // must be initialized...

// here - every thread will process as many rows as needed
int nrows_per_thread = m.nrows / num_threads;
for(int i = 0; i != num_threads; ++i)
{
    // thread i will process these rows:
    int first = i * nrows_per_thread;
    int last = first + nrows_per_thread;
    // last thread gets the remaining rows
    last += (i == num_threads - 1) ? m.nrows % num_threads : 0;
    threads[i] = std::thread([&m, first, last]{
        m.fill_rows(first, last); });
}
for(int i = 0; i != num_threads; ++i)
{
    threads[i].join();
}
If this is an operation you do very frequently, then use a worker pool as #Igor Tandetnik suggests in the comments. For one-offs, it's not worth the trouble.
I am trying to multithread a piece of code using the boost library. The problem is that each thread has to access and modify a couple of global variables. I am using mutex to lock the shared resources, but the program ends up taking more time then when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the choose_ecount variable has to be locked, and I cannot take it out of the loop and lock it for only one update at the end of the loop, because it is needed with the newest values by the inside function.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx++){
    sentence s = d.sents[lang][sidx];
    int senlen = s.words.size();
    int end_symb = s.words[senlen-1].pos;
    inside(s, lbeta);
    outside(s, lbeta, lalpha);
    long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
    if (lambda[0] == 0){
        mtx_.lock();
        d.sents[lang][sidx].prob = sen_prob;
        mtx_.unlock();
    }
    for(int size = 1; size <= senlen; size++)
        for(int i = 0; i <= senlen - size; i++)
        {
            int j = i + size - 1;
            for(int k = i; k < j; k++)
            {
                int hidx = i; int head = s.words[hidx].pos;
                for(int r = k+1; r <= j; r++)
                {
                    int aidx = r; int arg = s.words[aidx].pos;
                    mtx_.lock();
                    for(int kids = ONE; kids <= MAX; kids++)
                    {
                        long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
                                          lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
                        long double gen_right_prob = (num / sen_prob);
                        choose_ecount[lang][head][arg] += gen_right_prob;        // LOCK
                        order_ecount[lang][head][arg][RIGHT] += gen_right_prob;  // LOCK
                    }
                    mtx_.unlock();
                }
            }
From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per-thread buffers to compute the sums, add them up after the outermost loop, and synchronise only that final operation?
Edit:
If you need to access the intermediate values of choose_ecount, how do you ensure the correct intermediate value is present? One thread might have finished two iterations of its loop in the meantime, producing different results than another thread expects.
It sounds like you might need a barrier for your computation instead.
It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.
I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you have that locked mutex there, you're effectively serializing all threads: you divide your problem into chunks which are then processed serially, in an unspecified order. So the only thing you gain is the overhead of doing multithreading.
Since you're operating on doubles, using atomic operations is not really an option for you: atomic read-modify-write operations are typically available for integral types only.
The only practical solution is to follow Kratz's suggestion: keep a copy of choose_ecount and order_ecount for each thread, and reduce them to a single copy after your threads have finished.
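A minimal sketch of that reduction pattern, with the counters flattened into one buffer per thread (the real lang/head/arg indexing and the actual probability computation are elided):
#include <cstddef>
#include <thread>
#include <vector>

int main() {
    const int num_threads = 4;
    const std::size_t num_counters = 1024; // stand-in for the flattened ecount tables
    // One private buffer per thread: no locking needed while accumulating.
    std::vector<std::vector<long double>> local(
        num_threads, std::vector<long double>(num_counters, 0.0L));
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&local, t, num_counters] {
            for (std::size_t c = 0; c < num_counters; ++c)
                local[t][c] += 1.0L; // stands in for += gen_right_prob
        });
    }
    for (auto& th : threads) th.join();
    // Single-threaded reduction after the joins: the threads are done,
    // so no mutex is needed here.
    std::vector<long double> total(num_counters, 0.0L);
    for (int t = 0; t < num_threads; ++t)
        for (std::size_t c = 0; c < num_counters; ++c)
            total[c] += local[t][c];
}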