I have a procedure that calculates the outcome of a game by random sampling; it is passed a number of iterations, runs a loop of that size storing the outcomes in a local array (subHits), and, after the loop is done, adds the totals from the local array into a class-level member variable (m_Hits), to wit:
void Game::LogOutcomes(long periodSize) {
int subHits[11];
for (int i = 0; i < 11; ++i) {
subHits[i] = 0;
}
for (int iters = 0; iters < periodSize; ++iters) {
// ... snipped out code calculating rankIndex by random sample.
++subHits[rankIndex];
}
for (int i = 0; i < 11; ++i) {
m_Hits[i] += subHits[i];
}
}
... of course, it uses a local array as temporary storage so that the procedure can be run in parallel, which I invoke with:
dispatch_queue_t globalQ = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
dispatch_apply(m_BatchSize / m_PeriodSize, globalQ, ^(size_t periodCount) {
LogOutcomes(m_PeriodSize);
});
... and it seems to work perfectly (all results are sufficiently close to the statistically expected values). I can't help but think there's something wrong, because nowhere am I specifically 'locking' the class-level variable when updating it with the contents of the local variable, and that I'm getting the right results through sheer good fortune.
Is there something I'm missing?
You're getting lucky. You should either have a dedicated (serial) queue for updating the shared state, or use one of the OSAtomicAdd functions (e.g. OSAtomicAdd32) to add to it. Without this you'll be losing updates occasionally.
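For illustration, here is a minimal sketch of the serial-queue approach. m_HitsQueue is a hypothetical member (it is not in the original code) that would be created once, e.g. in the constructor; everything else in LogOutcomes stays as it is:

// Hypothetical one-time setup (e.g. in Game's constructor); needs <dispatch/dispatch.h>:
//     m_HitsQueue = dispatch_queue_create("com.example.game.hits", DISPATCH_QUEUE_SERIAL);

// The final merge in LogOutcomes() then becomes:
const int *hits = subHits;  // blocks can't capture C arrays directly; a pointer is safe
                            // here because dispatch_sync returns only after the block
                            // has run, while subHits is still alive
dispatch_sync(m_HitsQueue, ^{
    for (int i = 0; i < 11; ++i) {
        m_Hits[i] += hits[i];  // all writers are funnelled through one serial queue,
                               // so these read-modify-writes can never interleave
    }
});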
I'm trying to use steady_clock to benchmark parts of my code and I'm pulling some hair out here. It seems that sometimes it returns the difference between two time points, and sometimes it just returns 0.
I have the following code. This is not the real code in my program, but it illustrates the problem:
typedef std::chrono::steady_clock::time_point clock_point;
for ( int i = 0; i < 2; i++ ) {
clock_point start_overall = std::chrono::steady_clock::now();
for ( int j = 0; j < 10000000; j++ ) {
int q = 4;
}
clock_point end_phase_1 = std::chrono::steady_clock::now();
std::cout << "DIFF=" << std::chrono::duration_cast<std::chrono::microseconds>( end_phase_1 - start_overall ).count() << "\n";
}
This gives me the following output from running the program 4 times:
DIFF=15622
DIFF=0
DIFF=12968
DIFF=13001
DIFF=12966
DIFF=13997
DIFF=0
DIFF=0
Very frustrating!! I need some consistent times here. And it's not like the time needed to loop 10,000,000 times is completely irrelevant. In my actual program there's much more going on in the loop and it takes significantly longer, but I still sometimes get 0 values for the time differences.
What's going on? How can I fix this so I get reliable time differences? Thanks
EDIT: OK, since the explanation I'm getting is that the compiler is optimizing out the loop because nothing actually happens in it, I'm going to show you the actual code that runs between the two clock points:
// need to reset some variables with each situation
// these are global vars so can access throughout (ewww)
this_sit_phase_1_complete = dataVars.phase_1_complete;
this_sit_on_the_play = dataVars.on_the_play;
this_sit_start_drawing_cards = dataVars.start_drawing_cards;
this_sit_current_turn = dataVars.current_turn;
this_sit_max_turn = dataVars.max_turn;
// note: do i want a separate split count for each scenario?
// mmm yeah.. THIS IS WHAT I SHOULD DO INSTEAD OF GLOBAL VARS....
dataVars.scen_active_index = i;
// point to the split count we want to use
// dataVars.use_scen_split_count = &dataVars.scen_phase_1and2_split_counts[i];
dataVars.split_count[i] += 1;
// PHASE 1:
// if we're on the play, we execute first turn without drawing a card
// just a single split to start in a single que
// phase 1 won't be complete until we draw a card tho
// create the all_splits_phase_1 for each situation
all_splits_que all_splits_phase_1;
// SPLIT STRUCT
// create the first split in the scenario
split_struct first_split_struct;
// set vars to track splits
first_split_struct.split_id = dataVars.split_count[i];
// first_split_struct.split_trail = std::to_string(dataVars.split_count[i]);
// set remaining vars
first_split_struct.cards_in_hand_numbs = dataVars.scen_hand_card_numbs[i];
first_split_struct.cards_in_deck_numbs = dataVars.scen_initial_decks[i];
first_split_struct.cards_bf_numbs = dataVars.scen_bf_card_numbs[i];
first_split_struct.played_a_land = false;
// store the split struct as the initial split
all_splits_phase_1 = { first_split_struct };
// if we're on the play, execute first turn without
// drawing any cards
if ( this_sit_on_the_play ) {
// execute the turn on the play before drawing anything
execute_turn(all_splits_phase_1);
// move to next turn
this_sit_current_turn += 1;
}
// ok so now, regardless of if we were on the play or not, we have to draw
// a card for every remaining card in each split, and then execute a turn
// once these splits are done, we can convert over to phase 2
do_draw_every_card( all_splits_phase_1 );
// execute another turn after drawing one of everything,
// we wont actually draw anything within the turn
execute_turn( all_splits_phase_1 );
// next turn
this_sit_current_turn += 1;
clock_point end_phase_1 = std::chrono::steady_clock::now();
benchmarker[dataVars.scen_active_index].phase_1_t = std::chrono::duration_cast<std::chrono::microseconds>( end_phase_1 - start_overall ).count();
There is LOTS happening here, lots and lots; the compiler would never optimize out this block. And yet I'm getting 0s, as I explained.
From the OP's code:
for ( int j = 0; j < 10000000; j++ ) {
int q = 4;
}
This is a repeated assignment to a local variable which isn't used anywhere.
I strongly suspect that the compiler is clever enough to recognize that the loop has no side effects. Hence it doesn't emit any code for the loop – a proper (and legal) optimization.
To check this, I completed the OP's code snippet into the following MCVE:
#include <chrono>
#include <iostream>
typedef std::chrono::steady_clock::time_point clock_point;
int main()
{
for ( int i = 0; i < 2; i++ ) {
clock_point start_overall = std::chrono::steady_clock::now();
for ( int j = 0; j < 10000000; j++ ) {
int q = 4;
}
clock_point end_phase_1 = std::chrono::steady_clock::now();
std::cout << "DIFF=" << std::chrono::duration_cast<std::chrono::microseconds>( end_phase_1 - start_overall ).count() << "\n";
}
}
and compiled with -O2 -Wall -std=c++17 on CompilerExplorer:
Live Demo on CompilerExplorer
Please note that the lines for the loop are not colored.
The reason is (as I assumed) that no code is emitted for the for loop.
So, the OP measures two consecutive calls of std::chrono::steady_clock::now(), which may (or may not) happen within a single clock tick. Thus, it looks as if no time has passed between these calls.
To prevent such optimizations, the code has to contain something that causes side effects the compiler cannot foresee at compile time. Input/output operations are an option, so the loop could read from a variable determined by input and write results to a container used for output.
Marking the variable as volatile is an option as well, because it forces the compiler to perform every store even if it cannot "see" any side effects.
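For illustration, here is a minimal variation of the MCVE above (same flags assumed); with q declared volatile the stores can no longer be dropped, so the loop is actually timed:

#include <chrono>
#include <iostream>

int main()
{
    typedef std::chrono::steady_clock::time_point clock_point;
    clock_point start = std::chrono::steady_clock::now();
    volatile int q = 0;                 // volatile: the compiler must emit every store below
    for (int j = 0; j < 10000000; j++) {
        q = 4;
    }
    clock_point end = std::chrono::steady_clock::now();
    std::cout << "DIFF="
              << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << "\n";
}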
I ran your code. In debug I see the correct difference; in release it's all zeroes. My assumption is that the assignment is optimized away. Try it in debug, or change int q to volatile int q.
Suppose I have some tasks (Monte Carlo simulations) that I want to run in parallel. I want to complete a given number of tasks, but the tasks take different amounts of time, so it is not easy to divide the work evenly over the threads. Also, I need the results of all simulations in a single vector (or array) at the end.
So I came up with the approach below:
int Max{1000000};
//SimResult is some struct with well-defined default value.
std::vector<SimResult> vec(/*length*/Max);//Initialize with default values of SimResult
int LastAdded{0};
void fill(int RandSeed)
{
Simulator sim{RandSeed};
while(LastAdded < Max)
{
// Do some work to bring foo to the desired state
//The duration of this work is subject to randomness
vec[LastAdded++]
= sim.GetResult();//Produces SimResult.
}
}
int main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1);
auto fut2 = std::async(fill,2);
//maybe some more tasks.
fut1.get();
fut2.get();
//do something with the results in vec.
}
The above code will give race conditions, I guess. I am looking for a performant approach to avoid that. Requirements: avoid race conditions (fill the entire array, no skips); the final result is immediately in the array; performant.
Reading up on various approaches, it seems std::atomic is a good candidate, but I am not sure which settings will be most performant in my case. I'm not even sure whether atomic will cut it; maybe a mutex guarding LastAdded is needed?
One thing I would say is that you need to be very careful with the standard library random number functions. If your 'Simulator' class creates an instance of a generator, you should not run Monte Carlo simulations in parallel using the same object, because you'll likely get repeated patterns of random numbers between the runs, which will give you inaccurate results.
The best practice in this area would be to create N Simulator objects with the same properties, and give each one a different random seed. Then you could farm these objects out over multiple threads using OpenMP, which is a common parallel programming model for scientific software development.
std::vector<SimResult> generateResults(size_t N_runs, double seed)
{
std::vector<SimResult> results(N_runs);
#pragma omp parallel for
for(size_t i = 0; i < N_runs; i++)
{
auto sim = Simulator(seed + i);
results[i] = sim.GetResult();
}
return results;
}
Edit: With OpenMP, you can choose different scheduling models, which allow you to, for example, dynamically split the work between threads. You can do this with:
#pragma omp parallel for schedule(dynamic, 16)
which would give each thread chunks of 16 items to work on at a time.
Since you already know how many elements you are going to work with and never change the size of the vector, the easiest solution is to let each thread work on its own part of the vector; see the Simple Version at the end of this answer for an example.
Update
To accommodate vastly varying calculation times, you should keep your current code, but avoid race conditions via a std::lock_guard. You will need a std::mutex that is shared by all threads, for example a global variable, or pass a reference to the mutex to each thread.
void fill(int RandSeed, std::mutex &nextItemMutex)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
// enter critical area
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
// Acquire next item
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded++;
}
else
{
break;
}
// lock is released when nextItemLock goes out of scope
}
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[workingIndex] = sim.GetResult();//Produces SimResult.
}
}
The problem with this is that the synchronisation is quite expensive. But it's probably cheap in comparison to the simulation you run, so it shouldn't be too bad.
Version 2:
To reduce the amount of synchronisation that is required, you could acquire blocks to work on, instead of single items:
void fill(int RandSeed, std::mutex &nextItemMutex, size_t blockSize)
{
Simulator sim{RandSeed};
size_t workingIndex;
while(true)
{
{
std::lock_guard<std::mutex> nextItemLock(nextItemMutex);
if(LastAdded < Max)
{
workingIndex = LastAdded;
LastAdded += blockSize;
}
else
{
break;
}
}
for(size_t i = workingIndex; i < workingIndex + blockSize && i < Max; i++)
vec[i] = sim.GetResult();//Produces SimResult.
}
}
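As an alternative to the mutex (and closer to the atomic idea from the question), the shared index can be a std::atomic and blocks can be claimed with fetch_add. This is only a sketch, reusing vec, Max, Simulator and SimResult from the question:

#include <algorithm>
#include <atomic>

std::atomic<size_t> NextIndex{0}; // replaces the plain int LastAdded

void fill(int RandSeed, size_t blockSize)
{
    Simulator sim{RandSeed};
    while (true)
    {
        // claim a block of indices; fetch_add returns the previous value atomically,
        // so no two threads can ever receive the same index
        size_t start = NextIndex.fetch_add(blockSize);
        if (start >= static_cast<size_t>(Max))
            break;
        size_t end = std::min(start + blockSize, static_cast<size_t>(Max));
        for (size_t i = start; i < end; i++)
            vec[i] = sim.GetResult(); // Produces SimResult.
    }
}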
Simple Version
void fill(int RandSeed, size_t partitionStart, size_t partitionEnd)
{
Simulator sim{RandSeed};
for(size_t i = partitionStart; i < partitionEnd; i++)
{
// Do some work to bring foo to the desired state
// The duration of this work is subject to randomness
vec[i] = sim.GetResult();//Produces SimResult.
}
}
int main()
{
//launch a bunch of std::async that start
auto fut1 = std::async(fill,1, 0, Max / 2);
auto fut2 = std::async(fill,2, Max / 2, Max);
// ...
}
The QtConcurrent namespace is really great for simplifying the management of multi-threaded calculations. Overall this works great, and I have been able to use QtConcurrent::run(), map(), and other variants in the way they are described in the API.
Overall Goal:
I would like to query, cancel(), or pause() a numerically intensive calculation from QML. So far this is working the way I would like, except that I cannot access the sequence numbers in the calculation. Here is a link that describes a similar QML setup.
Below is an image from a small test app that I created to encapsulate what I am trying to do:
In the example above the calculation has nearly completed and all the cores have been enqueued with work properly, as can be seen from a system query:
But what I really would like to do is use the sequence numbers from a given list of items IN THE multi-threaded calculation itself. E.g., one approach might be to simply set up the sequence numbers directly in a QList or QVector (other C++ STL containers could work as well), like this:
void TaskDialog::mapTask()
{
// Number of times the map function will be called:
int N = 5;
// Prepare the vector that we operate on with mapFunction:
QList<int> vectorOfInts;
for (int i = 0; i < N; i++) {
vectorOfInts << i;
}
// Start the calc:
QFuture<void> future = QtConcurrent::map(vectorOfInts, mapFunction);
_futureWatcher.setFuture(future);
//_futureWatcher.waitForFinished();
}
The calculation is non-blocking with the line _futureWatcher.waitForFinished(); commented out, as shown in the code above. Note that when set up as a non-blocking calculation, the GUI thread is responsive and the progress bar updates as desired.
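(The wiring for the bar isn't shown above; it is just the watcher's progress signals driving the progress display, along these lines, where _progressBar is a placeholder name for the QProgressBar in the test dialog:)

// Placeholder names: _futureWatcher is the QFutureWatcher<void> member used above,
// _progressBar stands in for whatever shows progress in the dialog.
connect(&_futureWatcher, &QFutureWatcher<void>::progressRangeChanged,
        _progressBar, &QProgressBar::setRange);
connect(&_futureWatcher, &QFutureWatcher<void>::progressValueChanged,
        _progressBar, &QProgressBar::setValue);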
But when the values in the QList container are queried during the calculation, I see what appear to be the uninitialized garbage values one would expect when an array is not properly initialized.
Below is the example function I am calling:
void mapFunction(int& n)
{
// Check the n values:
qDebug() << "n = " << n;
/* Below is an arbitrary task, but note that we left out n,
* although normally we would want to use it: */
const long work = 10000 * 10000 * 10;
long s = 0;
for (long j = 0; j < work; j++)
s++;
}
And the output of qDebug() is:
n = 30458288
n = 204778
n = 270195923
n = 0
n = 270385260
The n-values are useless but the sum values, s, are correct (although not shown) when the calculation is mapped in this fashion (non-blocking).
Now, if I uncomment the _futureWatcher.waitForFinished(); line then I get the expected values (the order is irrelevant):
n = 0
n = 2
n = 4
n = 3
n = 1
But in this case, with _futureWatcher.waitForFinished(); enabled, my GUI thread is blocked and the progress bar does not update.
What then would be the advantage of using QtConcurrent::map() with blocking enabled, if the goal is to not block the main GUI thread?
Secondly, how can I get the correct values of n in the non-blocking case, allowing the GUI to remain responsive and the progress bar to keep updating?
My only option may be to use QThread directly but I wanted to take advantage of all the nice tools setup for us in QtConcurrent.
Thoughts? Suggestions? Other options? Thanks.
EDIT: Thanks to user2025983 for the insight which helped me to solve this. The bottom line is that I first needed to dynamically allocate the QList:
QList<int>* vectorOfInts = new QList<int>;
for (int i = 0; i < N; i++)
vectorOfInts->push_back(i);
Next, the vectorOfInts is passed by reference to the map function by de-referencing the pointer, like this:
QFuture<void> future = QtConcurrent::map(*vectorOfInts, mapFunction);
Note also that the prototype of the mapFunction remains the same:
void mapFunction(int& n)
And then it all works properly: the GUI remains responsive, the progress bar updates, the values of n are all correct, etc., WITHOUT the need to add blocking through the call:
_futureWatcher.waitForFinished();
Hope these extra details can help someone else.
The problem here is that your QList goes out of scope when mapTask() finishes.
Since mapFunction(int &n) takes its parameter by reference, it gets references to integer values which are now part of an array that has gone out of scope! The computer is then free to do whatever it likes with that memory, which is why you see garbage values. If you are just using integer parameters, I would recommend passing the parameters by value, and then everything should work.
Alternatively, if you must pass by reference, you can have the future watcher delete the array when it's finished:
QList<int>* vectorOfInts = new QList<int>;
// push back into structure
// QList is not a QObject, so delete it from a lambda once the watcher reports completion
connect(&_futureWatcher, &QFutureWatcher<void>::finished, [vectorOfInts]() { delete vectorOfInts; });
// launch stuff
QtConcurrent::map...
// profit
I am trying to multithread a piece of code using the Boost library. The problem is that each thread has to access and modify a couple of global variables. I am using a mutex to lock the shared resources, but the program ends up taking more time than when it was not multithreaded. Any advice on how to optimize the shared access?
Thanks a lot!
In the example below, the choose_ecount variable has to be locked, and I cannot take the update out of the loop and do it only once at the end, because the inside() function needs it with the newest values.
for(int sidx = startStep; sidx <= endStep && sidx < d.sents[lang].size(); sidx ++){
sentence s = d.sents[lang][sidx];
int senlen = s.words.size();
int end_symb = s.words[senlen-1].pos;
inside(s, lbeta);
outside(s,lbeta, lalpha);
long double sen_prob = lbeta[senlen-1][F][NO][0][senlen-1];
if (lambda[0] == 0){
mtx_.lock();
d.sents[lang][sidx].prob = sen_prob;
mtx_.unlock();
}
for(int size = 1; size <= senlen; size++)
for(int i = 0; i <= senlen - size ; i++)
{
int j = i + size - 1;
for(int k = i; k < j; k++)
{
int hidx = i; int head = s.words[hidx].pos;
for(int r = k+1; r <=j; r++)
{
int aidx = r; int arg = s.words[aidx].pos;
mtx_.lock();
for(int kids = ONE; kids <= MAX; kids++)
{
long double num = lalpha[hidx][R][kids][i][j] * get_choose_prob(s, hidx, aidx) *
lbeta[hidx][R][kids - 1][i][k] * lbeta[aidx][F][NO][k+1][j];
long double gen_right_prob = (num / sen_prob);
choose_ecount[lang][head][arg] += gen_right_prob; //LOCK
order_ecount[lang][head][arg][RIGHT] += gen_right_prob; //LOCK
}
mtx_.unlock();
}
}
From the code you have posted I can see only writes to choose_ecount and order_ecount. So why not use local per-thread buffers to compute the sums, add them to the shared tables after the outermost loop, and synchronize only that operation?
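A schematic of that idea; the dimension V and the container types are assumptions, since the real declarations of choose_ecount and order_ecount aren't shown, and order_ecount would be handled the same way:

// Per-thread accumulator with the same shape as the shared table (shape assumed).
std::vector<std::vector<long double>> local_choose(V, std::vector<long double>(V, 0.0L));

for (int sidx = startStep; sidx <= endStep && sidx < (int)d.sents[lang].size(); sidx++) {
    // ... all the per-sentence work stays exactly as before, except that the
    // accumulation goes into the local buffer and needs no lock:
    //     local_choose[head][arg] += gen_right_prob;
}

// One synchronized merge per thread, instead of locking inside the innermost loops.
mtx_.lock();
for (int head = 0; head < V; head++)
    for (int arg = 0; arg < V; arg++)
        choose_ecount[lang][head][arg] += local_choose[head][arg];
mtx_.unlock();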
Edit:
If you need to access the intermediate values of choose_ecount, how do you ensure the correct intermediate value is present? One thread might have finished two iterations of its loop in the meantime, so another thread would see a different intermediate value.
It kind of sounds like you need to use a barrier for your computation instead.
It's unlikely you're going to get acceptable performance using a mutex in an inner loop. Concurrent programming is difficult, not just for the programmer but also for the computer. A large portion of the performance of modern CPUs comes from being able to treat blocks of code as sequences independent of external data. Algorithms that are efficient for single-threaded execution are often unsuitable for multi-threaded execution.
You might want to have a look at boost::atomic, which can provide lock-free synchronization, but the memory barriers required for atomic operations are still not free, so you may still run into problems, and you will probably have to re-think your algorithm.
I guess that you divide your complete problem into chunks ranging from startStep to endStep to get processed by each thread.
Since you have that locked mutex in there, you're effectively serializing all threads:
you divide your problem into some chunks which are processed serially, in an unspecified order.
That is, the only thing you get is the overhead of doing multithreading.
Since you're operating on doubles, using atomic operations is not an option for you: atomic read-modify-write operations are typically provided for integral types only.
The only possible solution is to follow Kratz's suggestion to have a copy of choose_ecount and order_ecount for each thread and reduce them into a single one after your threads have finished.
I've been learning C++ from the internet for the past 2 years and finally the need has arisen for me to delve into MPI. I've been scouring stackoverflow and the rest of the internet (including http://people.sc.fsu.edu/~jburkardt/cpp_src/mpi/mpi.html and https://computing.llnl.gov/tutorials/mpi/#LLNL). I think I've got some of the logic down, but I'm having a hard time wrapping my head around the following:
#include (stuff)
using namespace std;
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows);
int main(int argc, char** argv)
{
vector<double> result;//represents a regular 1D vector
int id_proc, tot_proc, root_proc = 0;
int dim;//set to number of "columns" in A and B below
int rows;//set to number of "rows" of A and B below
vector<double> A(dim*rows), B(dim*rows);//represent matrices as 1D vectors
MPI::Init(argc,argv);
id_proc = MPI::COMM_WORLD.Get_rank();
tot_proc = MPI::COMM_WORLD.Get_size();
/*
initialize A and B here on root_proc with RNG and Bcast to everyone else
*/
//allow all processors to call function() so they can each work on a portion of A
result = function(A,B,dim,rows);
//all processors do stuff with A
//root_proc does stuff with result (doesn't matter if other processors have updated result)
MPI::Finalize();
return 0;
}
vector<double> function(vector<double> &foo, const vector<double> &bar, int dim, int rows)
{
/*
purpose of function() is two-fold:
1. update foo because all processors need the updated "matrix"
2. get the average of the "rows" of foo and return that to main (only root processor needs this)
*/
vector<double> output(dim,0);
//add matrices the way I would normally do it in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
foo[i*dim + j] += bar[i*dim + j];//perform "matrix" addition (+= ON PURPOSE)
}
}
//obtain average of rows in foo in serial
for (int i = 0; i < rows; i++)
{
for (int j = 0; j < dim; j++)
{
output[j] += foo[i*dim + j];//sum rows of A
}
}
for (int j = 0; j < dim; j++)
{
output[j] /= rows;//divide to obtain average
}
return output;
}
The code above is to illustrate the concept only. My main concern is to parallelize the matrix addition but what boggles my mind is this:
1) If each processor only works on a portion of that loop (naturally I'd have to modify the loop parameters per processor), what command do I use to merge all portions of A back into a single, updated A that all processors have in their memory? My guess is that I have to do some kind of Alltoall where each processor sends its portion of A to all other processors, but how do I guarantee that (for example) row 3 worked on by processor 3 overwrites row 3 on the other processors, and not row 1 by accident?
2) If I use an Alltoall inside function(), do all processors have to be allowed to step into function(), or can I isolate function() using...
if (id_proc == root_proc)
{
result = function(A,B,dim,rows);
}
… and then inside function() handle all the parallelization. As silly as it sounds, I'm trying to do a lot of the work on one processor (with broadcasts), and just parallelize the big time-consuming for loops. Just trying to keep the code conceptually simple so I can get my results and move on.
3) For the averaging part, I'm sure I can just use a reducing command if I wanted to parallelize it, correct?
Also, as an aside: is there a way to call Bcast() such that it is blocking? I'd like to use it to synchronize all my processors (the Boost libraries are not an option). If not, then I'll just go with Barrier(). Thank you for your answer to this question, and to the community of stackoverflow for teaching me how to program over the past two years! :)
1) The function you are looking for is MPI_Allgather. MPI_Allgather will let you send a row from each processor and receive the result on all processors (a sketch is given below, after point 3).
2) Yes, you can use only some of the processors in your function. Since MPI collective functions work with communicators, you have to create a separate communicator for this purpose. I don't know how this is done in the C++ bindings, but the C bindings use the MPI_Comm_create function.
3) Yes, see MPI_Allreduce; it is also sketched below.
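To make (1) concrete, here is a sketch using the C bindings (see the extra note below about the deprecated C++ bindings). It assumes rows is divisible by tot_proc; otherwise MPI_Allgatherv would be needed:

// Each rank updates only its own contiguous block of rows of foo in place.
int rows_per_proc = rows / tot_proc;
int my_first_row  = id_proc * rows_per_proc;

for (int i = my_first_row; i < my_first_row + rows_per_proc; i++)
    for (int j = 0; j < dim; j++)
        foo[i*dim + j] += bar[i*dim + j];

// In-place allgather: every rank contributes its rows_per_proc*dim doubles and
// receives everyone else's, so all ranks end up with the complete updated foo.
// Row k always lands at offset k*dim, so rank 3's rows can never overwrite rank 1's.
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
              foo.data(), rows_per_proc * dim, MPI_DOUBLE,
              MPI_COMM_WORLD);

And a sketch for (3), with the same row distribution: each rank sums only its own rows, and one MPI_Allreduce combines the partial sums on every rank (MPI_Reduce would do if only root_proc needs the result):

std::vector<double> partial(dim, 0.0), output(dim, 0.0);
for (int i = my_first_row; i < my_first_row + rows_per_proc; i++)
    for (int j = 0; j < dim; j++)
        partial[j] += foo[i*dim + j];

MPI_Allreduce(partial.data(), output.data(), dim, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

for (int j = 0; j < dim; j++)
    output[j] /= rows; // same averaging as the serial version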
aside: Bcast blocks a process until the send/receive operation assigned to that process is finished. If you want to wait for all processors to finish their work (I don't have any idea why you would want to do this), you should use Barrier().
extra note: I wouldn't recommend using the C++ bindings, as they are deprecated and you won't find specific examples of how to use them. Boost MPI is the library to use if you want C++ bindings; however, it does not cover all of the MPI functions.
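For a taste of what that looks like, here is a minimal Boost.MPI sketch (not tied to the code above): the environment/communicator objects take the place of Init/Finalize and Get_rank/Get_size, and the collectives work directly on std::vector.

#include <boost/mpi.hpp>
#include <vector>

int main(int argc, char** argv)
{
    boost::mpi::environment env(argc, argv); // replaces MPI::Init / MPI::Finalize
    boost::mpi::communicator world;          // wraps MPI_COMM_WORLD

    std::vector<double> A(100, 0.0);
    if (world.rank() == 0) {
        // ... fill A with the RNG on the root process ...
    }
    boost::mpi::broadcast(world, A, 0);      // every rank now has root's A

    // ... each rank works on its share of A ...
    return 0;
}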