C++ amp atomics

I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes, more specifically atomic_fetch_add, which apparently is only for integers?
I need to add a double_4 (or if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?
Is the best/only solution really to have a lock variable which my code can use to control the writes? I actually need to do atomic writes for a long list of output doubles, so I would essentially need a lock for every output.
I have already considered tiling this for better performance, but right now I am just in the first iteration.
EDIT:
Thanks for the quick answers already given.
I have a quick update to my question though.
I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tag along. I was expecting only the first thread in the warp to get the lock, so I must be missing something (note that it has been quite a few years since my CUDA days, so I have gotten rusty).
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It is as such not a big problem, since I should write into a shared variable at first and only write it to global memory after all threads in a tile have completed, but I just don't understand this behavior.

All the atomic add operations are only for integer types. For float_4 (I'm assuming this is 4 floats) you can do what you want without locks by using a 128-bit CAS (compare-and-swap) operation, but there is no 256-bit CAS, which is what you would need for double_4. What you have to do is write a loop that atomically reads the float_4 from memory, performs the float add in the regular way, and then uses CAS to swap in the new value only if memory still holds the original (and loops if not, i.e. some other thread changed the value between the read and the write). Note that 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
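The read, add, compare-and-swap loop described above can be sketched on the host side with std::atomic<float>. This is only an illustration of the retry pattern for a single float in standard C++, not C++ AMP code; the names atomic_add_float and run_atomic_add_demo are made up for this example. As far as I know, C++ AMP's atomic_compare_exchange operates on integer values, so a GPU version would presumably have to compare the float's bit pattern instead, which is not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomically add `delta` to `target` using a compare-and-swap loop.
// Host-side illustration only; this is not C++ AMP code.
inline void atomic_add_float(std::atomic<float>& target, float delta) {
    float expected = target.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak stores the value it actually found
    // into `expected`, so every retry recomputes the sum from fresh data.
    while (!target.compare_exchange_weak(expected, expected + delta)) {
    }
}

// Hypothetical driver: `threads` workers each add 1.0f `perThread` times.
inline float run_atomic_add_demo(int threads, int perThread) {
    assert(threads > 0 && perThread >= 0);
    std::atomic<float> total{0.0f};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&total, perThread] {
            for (int i = 0; i < perThread; ++i) atomic_add_float(total, 1.0f);
        });
    }
    for (std::thread& th : pool) th.join();
    return total.load();
}
```

Calling run_atomic_add_demo(4, 1000) should give 4000, since no update is lost even though the threads race on the same variable.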

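If the critical code is short, you can create your own lock using atomic operations:
int lock = 1; // 1 = free, 0 = held; must be visible to every thread that uses it
while(__sync_lock_test_and_set(&lock, 0) == 0) // returns the old value: 0 means someone else holds the lock
{
//yield the thread or go to sleep
}
//critical section, do the work
// release lock
__sync_synchronize(); // full barrier so the writes above are visible before the lock is freed
lock = 1;
The advantage is that you save the overhead of the OS locks.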

The question has as such been answered by others and the answer is that you need to handle double atomics yourself. There is no function for it in the library.
I would also like to elaborate on my own edit, in case others come here with the same failing lock.
In the following example, my error was in not realizing that when the exchange fails, it actually changes the expected value! The first thread expects the lock to be zero and writes a 1 into it. The next thread also expects 0 and fails to write a 1, but the failed exchange then writes the current value, 1, into the variable holding the expected value. This means that the next time that thread attempts the exchange, it expects a 1 in the lock! That is what it finds, so the exchange succeeds and the thread thinks it has the lock.
I was absolutely not aware that lock would receive a 1 through &lock on a failed exchange!
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
//note that, if locks[j]!=lock then lock=1
//meaning that ACE will be true the next time if locks[j]==1
//meaning the while will terminate even though someone else has the lock
while (!atomic_compare_exchange(&locks[j], &lock, 1));
//when one warp thread gets the lock, ALL threads continue on
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
It seems that a fix is to do
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
.....
for (int j = 0; j < attracted.extent.size(); j++)
{
...
int lock = 0; //the expected lock value
while (!atomic_compare_exchange(&locks[j], &lock, 1))
{
lock=0; //reset the expected value
}
//with the reset, a thread should only get past the loop after its own exchange succeeded
...
acceleration[j] += ...; //locked write
locks[j] = 0; //leaving the lock again
}
});
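The same pitfall exists with std::atomic on the CPU, which makes it easy to reproduce outside of C++ AMP. The sketch below is a host-side analogue under that assumption, not AMP code: expected_after_failed_cas shows that a failed exchange overwrites the expected value, and SpinLock applies the reset from the fix above. All names are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Returns what `expected` holds after a compare-exchange that fails
// because the lock is already taken (lock value 1, expected value 0).
inline int expected_after_failed_cas() {
    std::atomic<int> lock{1};  // someone else holds the lock
    int expected = 0;
    bool acquired = lock.compare_exchange_strong(expected, 1);
    assert(!acquired);
    (void)acquired;
    return expected;  // overwritten with the value that was observed
}

// Spin lock that resets the expected value before every new attempt.
class SpinLock {
public:
    void lock() {
        int expected = 0;
        while (!flag_.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
            expected = 0;  // the failed attempt replaced it with the current value
        }
    }
    void unlock() { flag_.store(0, std::memory_order_release); }

private:
    std::atomic<int> flag_{0};
};

// Hypothetical driver: several threads bump a plain counter under the lock.
inline long locked_count(int threads, int perThread) {
    SpinLock guard;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < perThread; ++i) {
                guard.lock();
                ++counter;
                guard.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```

Without the reset inside the loop, the second attempt would compare against 1, succeed while another thread still holds the lock, and the counter would come up short.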

Related

Safety vs speed of multithreading in C++ [closed]

If I have an array that I want to be updated by multiple threads simultaneously, what's the best/fastest way to go about doing that? For example, say I have the following code:
std::vector<float> vec;
vec.push_back(0.f);
for(int i = 0; i < 10000; i++) {
std::thread([&]{
// SAFETY CONSTRUCTS GO HERE
vec[0] += 1; // OR MAYBE HERE
// AND HERE?
});
}
// wait a little while, i.e. I was too lazy to write out joins
std::cout << vec[0];
If I want this to be safe and finally print the value 10000, what would be the best/fastest way to do this?

In the example you've given, the best/safest way would be to not launch threads, and simply update vec[0] in the loop. The overhead of launching and synchronising threads will probably exceed any benefit you get by doing some operations in parallel.
vec is a non-atomic object (std::vector<float>) and vec[0] is actually a function call. Such objects, and their non-static member functions, cannot protect themselves from concurrent access by multiple threads. To use them from multiple threads, every direct usage of vec (and vec[0]) must be synchronised.
Generally, safety involving concurrently executing threads is achieved by synchronising access to any variables (or, more generally, memory) that are updated and accessed by multiple threads.
If using a mutex, that normally means all threads which access shared data must first grab the mutex, do the operation on shared variables (e.g. update v[0]), and then release the mutex. If a thread has not grabbed (or has grabbed and then released) the mutex, then all operations it does must NOT touch the shared variables.
If you want performance through threading, you will need to have a significant amount of the work done in each thread without ANY access to shared variables. That work, since parts can be executed concurrently, can potentially be executed in less total elapsed time. For that to represent a performance benefit, the gains (e.g. by doing a lot of operations concurrently) need to exceed the costs (of launching threads, of synchronising access to any data that is accessed by multiple threads).
Which is highly unlikely in anything similar to the code you have shown.
The point is that there is always a trade-off between speed and safety, when threads share any data. Safety requires updating of shared variables to be synchronised - without exception. A performance gain is generally derived from the things that do not need to be synchronised (i.e. that don't access variables shared between threads) and can be executed in parallel.
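For reference, a minimal sketch of the mutex rule described above, applied to the example from the question. The function name count_with_mutex and the thread count are assumptions for illustration; the threads are also joined here, which the original snippet skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Every thread takes the same mutex before it touches vec[0].
inline float count_with_mutex(int numThreads) {
    assert(numThreads >= 0);
    std::vector<float> vec(1, 0.f);
    std::mutex vecMutex;
    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(numThreads));
    for (int i = 0; i < numThreads; ++i) {
        threads.emplace_back([&vec, &vecMutex] {
            std::lock_guard<std::mutex> guard(vecMutex);  // grab the mutex
            vec[0] += 1;                                  // shared access
        });                                               // guard releases it
    }
    for (std::thread& t : threads) t.join();  // wait for all of them
    return vec[0];
}
```

This is safe but, as argued above, almost certainly slower than a plain loop: each thread does one addition and spends the rest of its life being created, waiting for the lock, and being joined.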

There's no single magic technique to have highly performant parallel access to shared data, but there are a few general techniques you'll see fairly often.
I'll use the example of summing an array in parallel for my answer, but these techniques apply pretty generally to many parallel algorithms.
1) Avoid sharing data in the first place
This is likely to be the safest and fastest method. Instead of having your worker threads directly update the shared state, have each of them work with their own local state, and then have your main thread combine the results. For the array sum example, this could look something like this:
int main() {
std::vector<int> toSum = getSomeVector();
std::vector<int> sums(NUM_THREADS);
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
for (int i = 0; i < NUM_THREADS; ++i) {
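// Clamp both ends so the last chunks never run past toSum.end()
auto chunkBegin = toSum.begin() + std::min<std::size_t>(i * chunkSize, toSum.size());
auto chunkEnd = toSum.begin() + std::min<std::size_t>((i + 1) * chunkSize, toSum.size());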
threads.emplace_back([chunkBegin, chunkEnd](int& result) mutable {
for (; chunkBegin != chunkEnd; ++chunkBegin) {
result += *chunkBegin;
}
}, std::ref(sums[i]));
}
for (std::thread& thd : threads) {
thd.join();
}
int finalSum = 0;
for (int partialSum : sums) {
finalSum += partialSum;
}
std::cout << finalSum << '\n';
}
Since each thread only ever operates on its own partial sum, they cannot interfere with each other, and no extra synchronization is needed. You have to do a little bit of extra work at the end to add all the partial sums up, but the number of partial results is small, so this overhead should be pretty minimal.
2) Mutual exclusion
Instead of having each thread operate on its own state, you can protect shared state with a locking mechanism. Fairly often, this is a mutex, but there are lots of different locking primitives that have slightly different roles. The point here is to make sure only one thread is ever working with the shared state at a time. Be very careful when using this technique to avoid accessing the shared state within a tight loop. Since only one thread can hold the lock at a time, it's very easy to accidentally transform your fancy parallel code back into single-threaded code by making it so that only one thread can ever be working at a time.
For example, consider the following:
int main() {
std::vector<int> toSum = getSomeVector();
int sum = 0;
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
std::mutex mtx;
for (int i = 0; i < NUM_THREADS; ++i) {
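// Clamp both ends so the last chunks never run past toSum.end()
auto chunkBegin = toSum.begin() + std::min<std::size_t>(i * chunkSize, toSum.size());
auto chunkEnd = toSum.begin() + std::min<std::size_t>((i + 1) * chunkSize, toSum.size());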
threads.emplace_back([chunkBegin, chunkEnd, &mtx, &sum]() mutable {
for (; chunkBegin != chunkEnd; ++chunkBegin) {
std::lock_guard guard(mtx);
sum += *chunkBegin;
}
});
}
for (std::thread& thd : threads) {
thd.join();
}
std::cout << sum << '\n';
}
Since each thread locks mtx within its loop, only one thread can ever be doing any work at a time. There is no parallelization here, and this code is likely to be slower than the equivalent single-threaded code due to the extra overhead of allocating threads and locking and unlocking the mutex.
Instead, try to do as much as possible independently, and access your shared state as infrequently as possible. For this example, you can do something similar to the example in (1) and build up partial sums within each thread, only adding them to the shared sum once at the end:
int main() {
std::vector<int> toSum = getSomeVector();
int sum = 0;
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
std::mutex mtx;
for (int i = 0; i < NUM_THREADS; ++i) {
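// Clamp both ends so the last chunks never run past toSum.end()
auto chunkBegin = toSum.begin() + std::min<std::size_t>(i * chunkSize, toSum.size());
auto chunkEnd = toSum.begin() + std::min<std::size_t>((i + 1) * chunkSize, toSum.size());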
threads.emplace_back([chunkBegin, chunkEnd, &mtx, &sum]() mutable {
int partialSum = 0;
for (; chunkBegin != chunkEnd; ++chunkBegin) {
partialSum += *chunkBegin;
}
{
std::lock_guard guard(mtx);
sum += partialSum;
}
});
}
for (std::thread& thd : threads) {
thd.join();
}
std::cout << sum << '\n';
}
3) Atomic variables
Atomic variables are variables that can be "safely" shared between threads. They are very powerful, but also very easy to get wrong. You have to worry about things like memory-ordering constraints, and when you get them wrong it can be very difficult to debug and figure out what you did wrong.
At their core, atomic variables could be implemented as a simple variable whose operations are guarded by a mutex or similar. The magic all lies in the implementation, which often uses special CPU instructions to coordinate access to the variables at the CPU level to avoid a lot of the overhead of locking and unlocking.
Atomics aren't a magic bullet though. There is still overhead involved, and you can still shoot yourself in the foot by accessing your atomics too frequently. Your CPU does a lot of caching, and having multiple threads writing to an atomic variable likely means spilling the contents back out to memory, or at least to a higher level of cache. Once again, if you can avoid accessing your shared state within tight loops in your thread, you should do so:
int main() {
std::vector<int> toSum = getSomeVector();
std::atomic<int> sum(0);
std::vector<std::thread> threads;
int chunkSize = std::ceil(toSum.size() / (float)NUM_THREADS);
for (int i = 0; i < NUM_THREADS; ++i) {
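// Clamp both ends so the last chunks never run past toSum.end()
auto chunkBegin = toSum.begin() + std::min<std::size_t>(i * chunkSize, toSum.size());
auto chunkEnd = toSum.begin() + std::min<std::size_t>((i + 1) * chunkSize, toSum.size());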
threads.emplace_back([chunkBegin, chunkEnd, &sum]() mutable {
int partialSum = 0;
for (; chunkBegin != chunkEnd; ++chunkBegin) {
partialSum += *chunkBegin;
}
// Since we don't care about the order that the threads update the sum,
// we can use memory_order_relaxed. This is a rabbit-hole I won't get
// too deep into here though.
sum.fetch_add(partialSum, std::memory_order_relaxed);
});
}
for (std::thread& thd : threads) {
thd.join();
}
std::cout << sum << '\n';
}

what is the optimal Multithreading scenario for processing a long file lines?

I have a big file and I want to read it and process every even line with multiple threads.
One suggestion is to read the whole file and split it into multiple files (as many as there are threads), then let every thread process its own file. Since this reads the whole file, writes it out again and then reads the pieces, it seems slow (3x I/O), and I think there must be better approaches.
I thought this could be a better approach:
One thread reads the file and puts the data in a global variable, and the other threads read the data from that variable and process it. In more detail:
One thread reads the main file by running func1 and puts each even line into a buffer, line1Buffer, of maximum size MAX_BUFFER_SIZE; the other threads pop their data from the buffer and process it by running func2. In code:
Global variables:
#define MAX_BUFFER_SIZE 100
// note: line1Buffer and keepRunning are shared between threads without any synchronization
vector<string> line1Buffer;
bool keepRunning = true;// set to false to end thread 2 through the last thread ("continue" is a reserved word)
string file = "reads.fq";
Function func1 : (thread 1)
void func1(){
ifstream ifstr(file.c_str());
string ReadSeq;
for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
getline(ifstr,ReadSeq);
getline(ifstr,ReadSeq);// reading even lines
while( line1Buffer.size() == MAX_BUFFER_SIZE )
; // to delay when the buffer is full
line1Buffer.push_back(ReadSeq);
}
keepRunning = false;
return;
}
And function func2 : (other threads)
void func2(){
string ReadSeq;
while(keepRunning){
if(line1Buffer.size() > 0 ){
ReadSeq = line1Buffer.back(); // pop_back() returns void, so copy the element first
line1Buffer.pop_back();
// do the processing....
}
}
}
About the speed:
If the reading part is slower, the total time will be equal to the time to read the file once (the buffer may contain just one line at a time, so only one other thread will be able to work alongside thread 1). If the processing part is slower, the total time will be equal to the time for the whole processing with numberOfThreads - 1 threads. Both cases are faster than reading the file and writing it out to multiple files with one thread, and then reading and processing those files with multiple threads.
So there are 2 questions:
1- How do I start the threads so that thread 1 runs func1 and the others run func2?
2- Is there any faster approach?
3-[Deleted] Can anyone extend this idea to M threads for reading and N threads for processing? Obviously M+N==numberOfThreads must hold.
Edit: the 3rd question is not right, as multiple threads can't help in reading a single file.
Thanks all

Another approach could be interleaved threads.
Reading is done by every thread, but only by one at a time.
Because of the waiting in the very first iteration, the threads become interleaved.
But this only scales if work() is the bottleneck (otherwise any non-parallel execution would be better).
Thread:
while (!end) {
// should be fair!
lock();
read();
unlock();
work();
}
basic example: (you should probably add some error-handling)
void thread_exec(ifstream* file,std::mutex* mutex,int* global_line_counter) {
std::string line;
std::vector<std::string> data;
int i;
do {
i = 0;
// only 1 concurrent reader
mutex->lock();
// try to read the maximum number of lines
while(i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file,line)) {
// only the even lines we want to process
if (*global_line_counter % 2 == 0) {
data.push_back(line);
i++;
}
(*global_line_counter)++;
}
mutex->unlock();
// execute work for every line
for (int j=0; j < data.size(); j++) {
work(data[j]);
}
// free old data
data.clear();
//until EOF was not reached
} while(i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
// counter for checking if line is even
int global_line_counter = 0;
// open file
ifstream ifstr(file.c_str());
// mutex for synchronization
// maybe a fair-lock would be a better solution
std::mutex mutex;
// create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
std::vector<std::thread> threads(NUM_THREADS);
for (int i=0; i < NUM_THREADS; i++) {
threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
}
// wait until all threads have finished
for (int i=0; i < NUM_THREADS; i++) {
threads[i].join();
}
}

What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out of it, as you've hit the limits of the hardware. Sequential reads are by far faster than jumping around the file. Having multiple threads trying to read your file will almost certainly reduce the overall speed, as it will increase disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.

Assume we have #p threads and consider the two scenarios mentioned in the post and the answers:
1) Reading with one thread and processing with the other threads. In this case #p-1 threads process while only one thread reads. Let jobTime be the time for the full operation and pTime(n) the time for processing with n threads. Then:
the worst case occurs when reading is much slower than processing, where jobTime = pTime(1)+readTime, and the best case is when processing is slower than reading, where jobTime = pTime(#p-1)+readTime.
2) Reading and processing with all #p threads. In this scenario every thread does two steps. The first step is to read a part of the file of size MAX_BUFFER_SIZE; this is sequential, meaning no two threads can read at the same time. The second step is processing the data that was read, which can run in parallel. In the worst case jobTime is pTime(1)+readTime as before (but see *), while the best case is pTime(#p)+readTime, which is better than in the previous scenario.
*: In the worst case of the 2nd approach reading is still the slower part, but you can find an optimized MAX_BUFFER_SIZE for which (in the worst case) some reading by one thread overlaps with some processing by another thread. With this optimized MAX_BUFFER_SIZE, jobTime will be less than pTime(1)+readTime and could approach readTime.

First off, reading a file is a slow operation, so unless you are doing some super-heavy processing, the file reading will be the limiting factor.
If you do decide to go the multithreaded route, a queue is the right approach. Just make sure you push at the front and pop at the back. A std::deque should work well. You will also need to lock the queue with a mutex and synchronize it with a condition variable.
One last thing: you will need to limit the size of the queue for the scenario where we are pushing faster than we are popping.
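A rough sketch of such a queue, assuming C++11 is available. The class name BoundedLineQueue, the close() method used to signal end of input, the capacity of 4 and the character-counting "work" are all made up for illustration; real processing would replace the line.size() step.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Bounded queue: push blocks while the queue is full, pop blocks while it is empty.
class BoundedLineQueue {
public:
    explicit BoundedLineQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    void push(std::string line) {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return lines_.size() < capacity_; });
        lines_.push_front(std::move(line));  // push at the front
        notEmpty_.notify_one();
    }

    // Returns false once the queue has been closed and drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !lines_.empty() || closed_; });
        if (lines_.empty()) return false;
        out = std::move(lines_.back());      // pop at the back
        lines_.pop_back();
        notFull_.notify_one();
        return true;
    }

    // Called by the reader when there is no more input.
    void close() {
        std::lock_guard<std::mutex> lock(mutex_);
        closed_ = true;
        notEmpty_.notify_all();
    }

private:
    std::size_t capacity_;
    std::deque<std::string> lines_;
    std::mutex mutex_;
    std::condition_variable notFull_, notEmpty_;
    bool closed_ = false;
};

// Hypothetical driver: the calling thread reads, `workers` threads process.
inline std::size_t process_lines(const std::vector<std::string>& input, int workers) {
    assert(workers > 0);
    BoundedLineQueue queue(4);
    std::mutex totalMutex;
    std::size_t total = 0;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            std::string line;
            std::size_t local = 0;
            while (queue.pop(line)) local += line.size();  // stand-in for real work
            std::lock_guard<std::mutex> guard(totalMutex);
            total += local;
        });
    }
    for (const std::string& line : input) queue.push(line);
    queue.close();
    for (std::thread& t : pool) t.join();
    return total;
}
```

The capacity bound is what keeps a fast reader from filling memory when the workers fall behind: push simply blocks until a worker has taken something out.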

Threading and Mutex

I'm working on a program that simulates a gas station. Each car at the station is its own thread. Each car must loop through a single bitmask to check if a pump is open, and if it is, update the bitmask, fill up, and notify other cars that the pump is now open. My current code works but there are some issues with load balancing. Ideally all the pumps are used the same amount and all cars get equal fill-ups.
EDIT: My program basically takes a number of cars, pumps, and a length of time to run the test for. During that time, cars will check for an open pump by constantly calling this function.
int Station::fillUp()
{
// loop through the pumps using the bitmask to check if they are available
for (int i = 0; i < pumpsInStation; i++)
{
//Check bitmask to see if pump is open
stationMutex->lock();
if ((freeMask & (1 << i)) == 0 )
{
//Turning the bit on
freeMask |= (1 << i);
stationMutex->unlock();
// Sleeps thread for 30ms and increments counts
pumps[i].fillTankUp();
// Turning the bit back off
stationMutex->lock();
freeMask &= ~(1 << i);
stationCondition->notify_one();
stationMutex->unlock();
// Sleep long enough for all cars to have a chance to fill up first.
this_thread::sleep_for(std::chrono::milliseconds((((carsInStation-1) * 30) / pumpsInStation)-30));
return 1;
}
stationMutex->unlock();
}
// If no pumps are available, wait until one becomes available.
stationCondition->wait(std::unique_lock<std::mutex>(*stationMutex));
return -1;
}
I feel the issue has something to do with locking the bitmask when I read it. Do I need to have some sort of mutex or lock around the if check?

It looks like every car checks the availability of pump #0 first, and if that pump is busy it then checks pump #1, and so on. Given that, it seems expected to me that pump #0 would service the most cars, followed by pump #1 serving the second-most cars, all the way down to pump #(pumpsInStation-1) which only ever gets used in the (relatively rare) situation where all of the pumps are in use simultaneously at the time a new car pulls in.
If you'd like to get better load-balancing, you should probably have each car choose a different random ordering to iterate over the pumps, rather than having them all check the pumps' availability in the same order.
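One possible way to get such a random ordering is to shuffle the pump indices once per attempt and then probe them in that order. The helper name randomPumpOrder is hypothetical, and each car thread is assumed to own its own random engine so that the engine itself needs no locking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns the pump indices 0..pumpCount-1 in a random order.
inline std::vector<int> randomPumpOrder(int pumpCount, std::mt19937& rng) {
    assert(pumpCount >= 0);
    std::vector<int> order(static_cast<std::size_t>(pumpCount));
    std::iota(order.begin(), order.end(), 0);       // 0, 1, 2, ...
    std::shuffle(order.begin(), order.end(), rng);  // random permutation
    return order;
}
```

Inside fillUp the loop would then run over the returned vector instead of counting i up from 0, so no single pump is always tried first.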

Normally I wouldn't suggest refactoring as it's kind of rude and doesn't go straight to the answer, but here I think it would help you a bit to break your logic into three parts, like so, to better show where the contention lies:
int Station::acquirePump()
{
// loop through the pumps using the bitmask to check if they are available
ScopedLocker locker(&stationMutex);
for (int i = 0; i < pumpsInStation; i++)
{
// Check bitmask to see if pump is open
if ((freeMask & (1 << i)) == 0 )
{
//Turning the bit on
freeMask |= (1 << i);
return i;
}
}
return -1;
}
void Station::releasePump(int n)
{
ScopedLocker locker(&stationMutex);
freeMask &= ~(1 << n);
stationCondition->notify_one();
}
bool Station::fillUp()
{
// If a pump is available:
int i = acquirePump();
if (i != -1)
{
// Sleeps thread for 30ms and increments counts
pumps[i].fillTankUp();
releasePump(i);
// Sleep long enough for all cars to have a chance to fill up first.
this_thread::sleep_for(std::chrono::milliseconds((((carsInStation-1) * 30) / pumpsInStation)-30));
return true;
}
// If no pumps are available, wait until one becomes available.
// wait() takes the lock by non-const reference, so it needs a named object.
std::unique_lock<std::mutex> waitLock(*stationMutex);
stationCondition->wait(waitLock);
return false;
}
With the code in this form, the load-balancing issue is easier to see, and it is important to fix if you don't want to "exhaust" one pump, or if the pump itself might have a lock inside. The issue lies in acquirePump, where you check the pumps for availability in the same order for every car. A simple tweak that balances it better looks like this:
int Station::acquirePump()
{
// loop through the pumps using the bitmask to check if they are available
ScopedLocker locker(&stationMutex);
for (int n = 0, i = startIndex; n < pumpsInStation; ++n, i = (i+1) % pumpsInStation)
{
// Check bitmask to see if pump is open
if ((freeMask & (1 << i)) == 0 )
{
// Change the starting index used to search for a free pump for
// the next car.
startIndex = (startIndex+1) % pumpsInStation;
// Turning the bit on
freeMask |= (1 << i);
return i;
}
}
return -1;
}
Another thing I have to ask is if it's really necessary (ex: for memory efficiency) to use bit flags to indicate whether a pump is used. If you can use an array of bool instead, you'll be able to avoid locking completely and simply use atomic operations to acquire and release pumps, and that'll avoid creating a traffic jam of locked threads.
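As a sketch of that last idea, assuming C++11 atomics: one std::atomic<bool> per pump, claimed with exchange. The class name PumpFlags is made up for this example. It only covers acquiring and releasing; a car that finds every pump busy still needs some way to wait or retry, which is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

class PumpFlags {
public:
    explicit PumpFlags(std::size_t pumpCount) : busy_(pumpCount) {
        // Set every flag explicitly; before C++20 a default-constructed
        // std::atomic is not guaranteed to hold a defined value.
        for (std::atomic<bool>& flag : busy_) flag.store(false);
    }

    // Returns the index of the pump that was claimed, or -1 if all are busy.
    int acquire() {
        for (std::size_t i = 0; i < busy_.size(); ++i) {
            // exchange returns the previous value: false means we claimed it.
            if (!busy_[i].exchange(true, std::memory_order_acquire)) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    void release(int pump) {
        assert(pump >= 0 && static_cast<std::size_t>(pump) < busy_.size());
        busy_[static_cast<std::size_t>(pump)].store(false, std::memory_order_release);
    }

private:
    std::vector<std::atomic<bool>> busy_;
};
```

Because each flag is claimed with a single atomic exchange, two cars can never both see a pump as free and take it, and cars checking different pumps do not block each other.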

Imagine that the mutex has a queue associated with it, containing the waiting threads. Now, one of your threads manages to get the mutex that protects the bitmask of occupied stations, checks if one specific place is free. If it isn't, it releases the mutex again and loops, only to go back to the end of the queue of threads waiting for the mutex. Firstly, this is unfair, because the first one to wait is not guaranteed to get the next free slot, only if that slot happens to be the one on its loop counter. Secondly, it causes an extreme amount of context switches, which is bad for performance. Note that your approach should still produce correct results in that no two cars collide while accessing a single filling station, but the behaviour is suboptimal.
What you should do instead is this:
1) lock the mutex to get exclusive access to the possible filling stations
2) locate the next free filling station
3) if none of the stations are free, wait for the condition variable and restart at point 2
4) mark the slot as occupied and release the mutex
5) fill up the car (this is where the sleep in the simulation actually makes sense, the other one doesn't)
6) lock the mutex
7) mark the slot as free and signal the condition variable to wake up others
8) release the mutex again
Just in case that part isn't clear to you, waiting on a condition variable implicitly releases the mutex while waiting and reacquires it afterwards!
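The steps above could be sketched roughly as follows. This is a standalone illustration, not the asker's Station class: it uses a vector of flags instead of the bitmask, a 1 ms sleep as the fill-up, and a made-up simulate driver.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

class Station {
public:
    explicit Station(std::size_t pumpCount) : busy_(pumpCount, false) {}

    // Blocks until a pump is free, uses it, and returns its index.
    std::size_t fillUp() {
        std::size_t pump = 0;
        {
            std::unique_lock<std::mutex> lock(mutex_);              // step 1
            while (!findFreePump(pump)) {                           // step 2
                pumpFreed_.wait(lock);                              // step 3: unlocks while waiting
            }
            busy_[pump] = true;                                     // step 4
        }                                                           //         (mutex released here)
        std::this_thread::sleep_for(std::chrono::milliseconds(1));  // step 5
        {
            std::lock_guard<std::mutex> lock(mutex_);               // step 6
            busy_[pump] = false;                                    // step 7
            pumpFreed_.notify_one();
        }                                                           // step 8
        return pump;
    }

private:
    // Caller must hold mutex_. Writes the first free index into `pump`.
    bool findFreePump(std::size_t& pump) const {
        for (std::size_t i = 0; i < busy_.size(); ++i) {
            if (!busy_[i]) { pump = i; return true; }
        }
        return false;
    }

    std::mutex mutex_;
    std::condition_variable pumpFreed_;
    std::vector<bool> busy_;
};

// Hypothetical driver: `cars` threads each fill up `rounds` times.
inline int simulate(std::size_t pumps, int cars, int rounds) {
    assert(pumps > 0);
    Station station(pumps);
    std::mutex countMutex;
    int fills = 0;
    std::vector<std::thread> pool;
    for (int c = 0; c < cars; ++c) {
        pool.emplace_back([&] {
            for (int r = 0; r < rounds; ++r) {
                station.fillUp();
                std::lock_guard<std::mutex> guard(countMutex);
                ++fills;
            }
        });
    }
    for (std::thread& t : pool) t.join();
    return fills;
}
```

The while loop around wait matters: a woken car re-checks for a free pump under the mutex, so a spurious wake-up or a pump taken by another car in the meantime just sends it back to waiting.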

c++ multithreading shared resources

I am trying to multithread a piece of code using the boost library. The problem is that each thread has to access and modify a couple of global variables. I am using a mutex to lock the shared resources, but the program ends up taking more time than when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the *choose_ecount* variable has to be locked, and I cannot take it out of the loop and lock it for only an update at the end of the loop because it is needed with the newest values by the inside function.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx ++){
sentence s = d.sents[lang][sidx];
int senlen = s.words.size();
int end_symb = s.words[senlen-1].pos;
inside(s, lbeta);
outside(s,lbeta, lalpha);
long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
if (lambda[0] == 0){
mtx_.lock();
d.sents[lang][sidx].prob = sen_prob;
mtx_.unlock();
}
for(int size = 1; size <= senlen; size++)
for(int i = 0; i <= senlen - size ; i++)
{
int j = i + size - 1;
for(int k = i; k < j; k++)
{
int hidx = i; int head = s.words[hidx].pos;
for(int r = k+1; r <=j; r++)
{
int aidx = r; int arg = s.words[aidx].pos;
mtx_.lock();
for(int kids = ONE; kids <= MAX; kids++)
{
long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
long double gen_right_prob = (num / sen_prob);
choose_ecount[lang][head][arg] += gen_right_prob; //LOCK
order_ecount[lang][head][arg][RIGHT] += gen_right_prob; //LOCK
}
mtx_.unlock();
}
}

From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per thread buffers to compute the sum and then add them up after the outermost loop and only sync this operation?
Edit:
If you need to access the intermediate values of choose_ecount how do you assure the correct intermediate value is present? One thread might have finished 2 iterations of its loop in the meantime producing different results in another thread.
It kind of sounds like you need to use a barrier for your computation instead.

It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.

I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you have that locked mutex there, you're effectively serializing all threads:
You divide your problem into some chunks which are processed in serial, yet unspecified order.
That is, the only thing you get is the overhead of multithreading.
Since you're operating on doubles, using atomic operations is not a choice for you: they're typically implemented for integral types only.
The only possible solution is to follow Kratz' suggestion to have a copy of choose_ecount and order_ecount for each thread and reduce them to a single one after your threads have finished.
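A minimal sketch of the per-thread-copy idea, using std::thread rather than boost and a flat array of doubles in place of choose_ecount. The function name, the key/weight inputs and the strided work split are assumptions for illustration. Note that it does not give threads access to each other's intermediate values, which is the open point raised in the other answer.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread adds into its own row of `partial`; the rows are merged after join.
inline std::vector<double> accumulate_counts(const std::vector<int>& keys,
                                             const std::vector<double>& weights,
                                             std::size_t numKeys,
                                             std::size_t numThreads) {
    assert(keys.size() == weights.size() && numThreads > 0);
    std::vector<std::vector<double>> partial(numThreads,
                                             std::vector<double>(numKeys, 0.0));
    std::vector<std::thread> pool;
    for (std::size_t t = 0; t < numThreads; ++t) {
        pool.emplace_back([&, t] {
            // Strided split of the input; no other thread touches partial[t],
            // so the inner loop needs no lock at all.
            for (std::size_t i = t; i < keys.size(); i += numThreads) {
                partial[t][static_cast<std::size_t>(keys[i])] += weights[i];
            }
        });
    }
    for (std::thread& th : pool) th.join();

    // Single-threaded reduction once every worker has finished.
    std::vector<double> total(numKeys, 0.0);
    for (const std::vector<double>& row : partial) {
        for (std::size_t k = 0; k < numKeys; ++k) total[k] += row[k];
    }
    return total;
}
```

The cost is one extra copy of the count table per thread plus the final reduction, which is usually far cheaper than taking a mutex in the innermost loop.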

Implementation of Long Atomic Int

I would like to use an atomic counter (multi-thread computation) that counts to typically 2^40, so I cannot use a 32 bit int atomic counter directly. I do not have c++11 yet (I will migrate to it but not yet as this has a cost for me) and I have to compile on 32bit and 64bit platforms.
I currently use Qt, so I can use QAtomicInt.
Here is what I'm thinking of:
(initialization...)
QAtomicInt counterLo = 0;
QAtomicInt counterHi = 0;
void increment()
{
int before = counterLo.fetchAndAddOrdered(1);
if(before==INT_MAX)
{
counterHi.fetchAndAddOrdered(1); //Increment high word
counterLo.fetchAndAddOrdered(INT_MAX); //Increments low word to -1
counterLo.fetchAndAddOrdered(1); //Increments low word to 0
}
}
uint64_t value()
{
//Wait until the low word is non-negative
int lo = counterLo;
while(lo<0)
lo = counterLo;
return (uint64_t)counterHi * ((uint64_t)INT_MAX+1) + (uint64_t)lo;
}
Is this correct? I already tried to make the counter with a mutex, but I'm losing around 10% performance. This is called about 1 million times a second, shared between 8 threads (sample counter for a Monte Carlo simulation).
Thanks!

This is not overall atomic, see the following example:
hi=0,lo=INT_MAX
T1 calls value(), gets lo=INT_MAX, is interrupted
T2 calls increment() increments hi to 1
T1 resumes and reads counterHi, gets 1, and returns 2^31 + INT_MAX (value() multiplies counterHi by INT_MAX+1 = 2^31), almost twice the real count
This is likely not what you want. Can't you just split your sample space and let each thread calculate n/8 items without contending for a lock?
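A sketch of that suggestion, written with C++11 std::thread for brevity even though the question says C++11 is not available yet; the same structure should carry over to QThread or any other threading API. The function name count_samples is made up, and the increment stands in for one Monte Carlo sample.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Each thread counts its own share in a private 64-bit variable.
inline std::uint64_t count_samples(std::uint64_t totalSamples, unsigned numThreads) {
    assert(numThreads > 0);
    std::vector<std::uint64_t> perThread(numThreads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < numThreads; ++t) {
        // Spread the remainder over the first few threads.
        std::uint64_t share =
            totalSamples / numThreads + (t < totalSamples % numThreads ? 1 : 0);
        pool.emplace_back([&perThread, t, share] {
            std::uint64_t local = 0;  // private to this thread, no atomics needed
            for (std::uint64_t i = 0; i < share; ++i) {
                ++local;              // stand-in for one Monte Carlo sample
            }
            perThread[t] = local;     // one write per thread, at the very end
        });
    }
    for (std::thread& th : pool) th.join();

    std::uint64_t total = 0;
    for (std::uint64_t n : perThread) total += n;
    return total;
}
```

Since nothing is shared while the samples are being computed, a plain 64-bit integer is enough on both 32-bit and 64-bit platforms, and the total is only assembled after all threads have been joined.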

Of course this is not atomic. A sequence of atomic operations can be interrupted between any two of them. I recommend using protection (a mutex or critical section).
