I am using 3 threads to chunk a for loop, and since 'data' is a global array, I want to lock that part inside the 'calculateAll' function:
std::vector<int> calculateAll(int ***data, std::vector<LineIndex> indexList, int s, int e)
{
    std::vector<int> v_a;
    for (int a = s; a < e; a++)
    {
        mylock.lock();
        v_a.push_back(/*something related with data*/); // was 'v_b', which was never declared
        mylock.unlock();
    }
    return v_a;
}
std::thread t[3];
for (int i = 0; i < 3; i++)
{
    int s = firstone + i * chunk;
    int e = ((s + chunk) < indexList.size()) ? (s + chunk) : indexList.size();
    t[i] = std::thread(calculateAll, data, indexList, s, e);
}
for (int i = 0; i < 3; ++i)
{
t[i].join();
}
My question is: how can I get the return value, which is a vector, from each thread and then combine them? The reason I want to do this is that if I declare 'v_a' as a global vector, when each thread tries to push_back its values into 'v_a', there may be a crash (or not?). So I am thinking of declaring a vector for each thread, then combining them into a new vector for further use (as I do without threads).
Or is there a better method to deal with this concurrency problem? The order of 'v_a' does not matter.
I appreciate any suggestions.
First of all, rather than the explicit lock() and unlock() seen in your code, always use std::lock_guard where possible. Secondly, it would be better to use std::future for this. Launch each thread with std::async, then get the results in another loop, aggregating them as you go. Like this:
using Vector = std::vector<int>;
using Future = std::future<Vector>;

std::vector<Future> futures;
for (int i = 0; i < 3; i++)
{
    int s = firstone + i * chunk;
    int e = ((s + chunk) < indexList.size()) ? (s + chunk) : indexList.size();
    auto fut = std::async(std::launch::async, calculateAll, data, indexList, s, e);
    futures.push_back( std::move(fut) );
}

// Combine the results
std::vector<int> result;
for (auto& fut : futures) {   // iterate in the order the futures were created
    auto vec = fut.get();     // get the result of the future
    result.insert(result.end(), vec.begin(), vec.end()); // append it to result, in order
}
Here is a minimal, complete and working example based on your code which demonstrates what I mean: Live On Coliru
Create a structure with a vector and a lock. Pass an instance of that structure, with the lock pre-locked, to each thread as it starts. Wait for all locks to become unlocked. Each thread then does its work and unlocks its lock when done.
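A minimal sketch of that idea. One caveat: a std::mutex must be unlocked by the same thread that locked it, so this sketch uses a std::promise/std::future pair as the "pre-locked lock"; all names here are hypothetical:

#include <future>
#include <thread>
#include <vector>

struct WorkSlot {
    std::promise<void> done; // plays the role of the pre-locked lock
    std::vector<int> result; // the thread fills this in
};

int main() {
    WorkSlot slots[3];
    std::thread threads[3];
    for (int i = 0; i < 3; ++i)
        threads[i] = std::thread([&slots, i] {
            slots[i].result.push_back(i); // do the work
            slots[i].done.set_value();    // "unlock" when done
        });
    for (auto& s : slots)
        s.done.get_future().wait();       // wait for every "lock" to open
    for (auto& t : threads)
        t.join();
}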
How is it possible that, on the line just after the if statement checking that the variables are unequal, the variables are already equal in the pull() method? I have already added a Mutex variable, but it has not helped.
int fQ::pull(void){ // pull element from the queue
    while(MutexF);
    MutexF = 1;
    if (last != first){
        fQueue[first++]();
        first %= lengthQ;
        MutexF = 0;
        return 0;
    }
    else{
        MutexF = 0;
        return 1;
    }
}
STL containers are too heavy for me; I am preparing this for a tiny MCU, which is why I tried to avoid all this complex stuff (std::mutex, std::atomic, etc.). The multithreading is needed only for test purposes, in place of testing with the tiny MCU's interrupts, for a while. I intended not to use any STL/thread libraries at all.
https://github.com/WeSpeakEnglish/nortos/blob/master/C_plus_plus_implementation/main.cpp
https://github.com/WeSpeakEnglish/nortos/blob/master/C_plus_plus_implementation/nortos.h
First, you'd better use std::atomic and/or std::mutex for synchronization purposes. At the very least, use std::atomic_flag. volatile has issues in general: it isn't suited for atomic operations and has a different purpose altogether.
Second, there is a bug in your code, and I don't know how to solve it properly with volatile:
while(MutexF);
MutexF = 1;
Imagine someone sets MutexF to 0, and then two threads simultaneously exit the while loop before either sets MutexF = 1. Both threads now enter the critical section at once, so the 'mutex' protects nothing.
Perhaps you can synchronize two threads (one for pull and one for push) in this manner, but you'd better abandon such an approach.
#include <atomic> // std::atomic
#include <mutex>  // std::mutex

typedef void(*FunctionPointer)(void);

class fQ {
private:
    std::atomic<int> first;
    std::atomic<int> last;
    FunctionPointer * fQueue;
    int lengthQ;
    std::mutex mtx;
public:
    fQ(int sizeQ);
    ~fQ();
    int push(FunctionPointer);
    int pull(void);
};

fQ::fQ(int sizeQ){ // initialization of queue
    fQueue = new FunctionPointer[sizeQ];
    last = 0;
    first = 0;
    lengthQ = sizeQ;
}

fQ::~fQ(){ // destruction of queue
    delete [] fQueue;
}

int fQ::push(FunctionPointer pointerF){ // push element onto the queue
    mtx.lock();
    if ((last+1)%lengthQ == first){ // queue full
        mtx.unlock();
        return 1;
    }
    fQueue[last++] = pointerF;
    last = last%lengthQ;
    mtx.unlock();
    return 0;
}

int fQ::pull(void){ // pull element from the queue
    mtx.lock();
    if (last != first){
        fQueue[first++]();
        first = first%lengthQ;
        mtx.unlock();
        return 0;
    }
    else{
        mtx.unlock();
        return 1;
    }
}
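A minimal usage sketch of the class above, with a hypothetical task function:

void blink(void) { /* e.g., toggle an LED */ }

int main() {
    fQ q(8);        // room for 7 tasks; one slot stays empty to tell full from empty
    q.push(&blink); // returns 0 on success, 1 if the queue is full
    q.pull();       // invokes blink(); returns 0, or 1 if the queue is empty
}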
I want to split a vector into small vectors, process each of them separately on a thread, then merge them. I want to use std::async for creating threads, and my code looks something like this:
void func(std::vector<int>& vec)
{
    //do some stuff
}

// Calling part
std::vector<std::future<void>> futures;
std::vector<std::vector<int>> temps;
for (int i = 1; i <= threadCount; ++i)
{
    auto curBegin = m_vec.begin() + (i - 1) * size / threadCount;
    auto curEnd = m_vec.begin() + i * size / threadCount;
    std::vector<int> tmp(curBegin, curEnd);
    temps.push_back(std::move(tmp));
    futures.push_back(std::async(std::launch::async, &func, std::ref(temps.back())));
}
for (auto& f : futures)
{
    f.wait();
}
std::vector<int> finalVector;
for (int i = 0; i < temps.size() - 1; ++i)
{
    std::merge(temps[i].begin(), temps[i].end(), temps[i + 1].begin(), temps[i + 1].end(), std::back_inserter(finalVector));
}
Here m_vec is the main vector, which is being split into small vectors.
The problem is that when I pass a vector to func(), inside the function it becomes invalid: either its size is 0 or it has invalid elements. But when I call the function without std::async, everything works fine.
So what's the problem with std::async and is there anything special that I should do?
Thank you for your time!
If a reallocation happens while you're iteratively expanding the temps vector, then it is very likely that the std::ref(temps.back()) the thread operates on references an already invalidated memory area. You can avoid reallocations by reserving the memory prior to the consecutive push_backs:
temps.reserve(threadCount);
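A minimal sketch of the calling part with the reserve applied (same names as in the question; size and threadCount assumed to be defined):

std::vector<std::future<void>> futures;
std::vector<std::vector<int>> temps;
temps.reserve(threadCount); // capacity fixed up front: push_back no longer reallocates
for (int i = 1; i <= threadCount; ++i)
{
    auto curBegin = m_vec.begin() + (i - 1) * size / threadCount;
    auto curEnd = m_vec.begin() + i * size / threadCount;
    temps.emplace_back(curBegin, curEnd); // element addresses now stay stable
    futures.push_back(std::async(std::launch::async, &func, std::ref(temps.back())));
}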
Piotr S. has already given a correct answer with a solution; I will just add some explanation.
So what's the problem with std::async and is there anything special that I should do?
The problem is not with async.
You would get exactly the same effect if you did:
std::vector<std::function<void()>> futures;
// ...
for (int i = 1; i <= threadCount; ++i)
{
    // ...
    futures.push_back(std::bind(&func, std::ref(temps.back())));
}
for (auto f : futures)
    f();
In this version I don't use async; I create several function objects, then run them all one by one. You will see the same problem with this code: the function objects (or in your case, the tasks being run by async) hold references to vector elements that get destroyed when you insert into temps and cause it to reallocate.
To solve the problem you need to ensure the elements in temps are stable, i.e. do not get destroyed and recreated at a different location, as Piotr S shows in his answer.
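Another way to get that stability, as a minimal sketch (my addition, not part of Piotr S.'s answer): hold the chunks in a std::deque instead, since its push_back never invalidates references to existing elements:

std::deque<std::vector<int>> temps; // references to elements survive push_back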
I have a vector<Mat> my_vect where each Mat is of type float and 90*90 in size. I load 16000 matrices from disk into that vector. After I finish working with those matrices, I clear the vector. Here is my code for loading and clearing it:
Mat mat1(90,90,CV_32F);
load_vector_of_matrices("filename", my_vect); // this loads 16K elements
//do something
for (size_t i = 0; i < my_vect.size(); ++i)
    correlate(mat1, my_vect.at(i));
my_vect.clear();
I'm loading 16K elements together for the sake of efficiency.
Now my question is: reading all these matrices takes 3-4 seconds, and my_vect.clear() takes approximately the same amount of time, which is a lot.
According to this answer, it should take O(n) time, as I assume vector<Mat> doesn't have a trivial destructor.
Why does clearing take so much time? Does the matrix destructor overwrite each element of the matrix? Is there a way to decrease the time for clearing the vector?
EDIT:
I'm using Visual Studio 2010 and the level of optimization is Maximize Speed(/O2).
First, a streaming loader. Provide it with a function that, given a max, returns a vector of data (aka loader<T>). It can store internal state, but it will be copied, so store that internal state in a std::shared_ptr. I guarantee that only one copy of it will be invoked.
You are not responsible for returning all max data from your loader, but as written you must return at least 1 element. Returning more is gravy, and may reduce threading overhead.
You then call streaming_loader<T>( your_loader, count ).
It returns a std::shared_ptr< std::vector< std::future< T > > >. You can wait on these futures, but you must wait on them in order (the 2nd one is not guaranteed to be ready to be waited on until the first one has provided data).
#include <functional>
#include <future>
#include <iostream>
#include <memory>
#include <vector>

template<class T>
using loader = std::function< std::vector<T>(size_t max) >;

template<class T>
using stream_data = std::shared_ptr< std::vector< std::future<T> > >;
namespace details {
    template<class T>
    T streaming_load_some( loader<T> l, size_t start, stream_data<T> data ) {
        auto loaded = l(data->size()-start);
        // populate the stuff after start first, so they are ready:
        for( size_t i = 1; i < loaded.size(); ++i ) {
            std::promise<T> promise;
            promise.set_value( std::move(loaded[i]) );
            (*data)[start+i] = promise.get_future();
        }
        if (start+loaded.size() < data->size()) {
            // recurse:
            std::size_t new_start = start+loaded.size();
            (*data)[new_start] = std::async(
                std::launch::async,
                [l, new_start, data]{ return streaming_load_some<T>( l, new_start, data ); }
            );
        }
        // populate the future:
        return std::move(loaded.front());
    }
}
template<class T>
stream_data<T>
streaming_loader( loader<T> l, size_t n ) {
    auto retval = std::make_shared<std::vector< std::future<T> >>(n);
    if (retval->empty()) return retval;
    retval->front() = std::async(
        std::launch::async,
        [retval, l]()->T{ return details::streaming_load_some<T>( l, 0, retval ); }
    );
    return retval;
}
For use, you take the stream_data<T> (aka a shared pointer to vector of future data), iterate over it, and .get() each in turn. Then do your processing. If you need a block of 50 of them, call .get() on each in turn until you get to 50 -- do not skip to number 50.
Here is a completely toy loader and test harness:
struct loader_state {
    int index = 0;
};
struct test_loader {
    std::shared_ptr<loader_state> state; // current loading state stored here
    std::vector<int> operator()( std::size_t max ) const {
        std::size_t amt = max/2+1; // really, really stupid way to decide how much to load
        std::vector<int> retval;
        retval.reserve(amt);
        for (size_t i = 0; i < amt; ++i) {
            retval.push_back( -(int)(state->index + i) ); // populate the return value
        }
        state->index += amt;
        return retval;
    }
    // in real code, make this constructor do something:
    test_loader():state(std::make_shared<loader_state>()) {}
};
int main() {
    auto data = streaming_loader<int>( test_loader{}, 1024 );
    std::size_t count = 0;
    for( std::future<int>& x : *data ) {
        ++count;
        int value = x.get(); // get data
        // process. In this case, print out 100 in blocks of 10:
        if (count * 100 / data->size() > (count-1) * 100 / data->size())
            std::cout << value << ", ";
        if (count * 10 / data->size() > (count-1) * 10 / data->size())
            std::cout << "\n";
    }
    std::cout << std::endl;
    return 0;
}
count may or may not be worthless to you. The internal state of the loader above is pretty darn worthless; I just use it to demonstrate how to store some state.
You can do something similar to destroy a pile of objects without waiting for their destructors to complete. Or, you can rely on the fact that destroying your data can happen while you are working on it and waiting for the next data to load.
live example
In an industrial strength solution, you'd need to include ways to abort all this stuff, among other things. Exceptions might be one way. Also, feedback to the loader about how far behind the processing code is can be helpful (if it is at your heels, return smaller chunks -- if it is way behind, return larger chunks). In theory, that can be arranged via a back channel in loader<T>.
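As for destroying a pile of objects without waiting for their destructors (mentioned above), a minimal sketch of one way to do it, assuming the vector<Mat> from the earlier question (needs <thread>):

// move the vector into a detached thread; the 16K destructors then run
// off the main thread, which is left with an empty, reusable my_vect
std::vector<Mat>* doomed = new std::vector<Mat>(std::move(my_vect));
std::thread([doomed]{ delete doomed; }).detach();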
Now that I have played with the above for a bit, a probably better fit is:
#include <iostream>
#include <future>
#include <functional>
#include <vector>
#include <memory>
// if it returns empty, there is nothing more to load:
template<class T>
using loader = std::function< std::vector<T>() >;

template<class T>
struct next_data;

template<class T>
struct streamer {
    std::vector<T> data;
    std::unique_ptr<next_data<T>> next;
};

template<class T>
struct next_data:std::future<streamer<T>> {
    using parent = std::future<streamer<T>>;
    using parent::parent;
    next_data( parent&& o ):parent(std::move(o)){}
};
live example. It requires some infrastructure to populate that very first streamer<T>, but the code will be simpler, and the strange requirement (of knowing in advance how much data there is, and only doing a .get() from the first element) goes away.
template<class T>
streamer<T> stream_step( loader<T> l ) {
    streamer<T> retval;
    retval.data = l();
    if (retval.data.empty())
        return retval;
    retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
    return retval;
}

template<class T>
streamer<T> start_stream( loader<T> l ) {
    streamer<T> retval;
    retval.next.reset( new next_data<T>(std::async( std::launch::async, [l](){ return stream_step(l); })));
    return retval;
}
A downside is that writing a ranged-based iterator becomes a bit trickier.
Here is a sample use of the second implementation:
struct counter {
    std::size_t max;
    std::size_t current = 0;
    counter( std::size_t m ):max(m) {}
    std::vector<int> operator()() {
        std::vector<int> retval;
        std::size_t do_at_most = 100;
        while( current < max && (do_at_most-->0)) {
            retval.push_back( int(current) );
            ++current;
        }
        return retval;
    }
};
int main() {
    streamer<int> s = start_stream<int>( counter(1024) );
    while(true) {
        for (int x : s.data) {
            std::cout << x << ",";
        }
        std::cout << std::endl;
        if (!s.next)
            break;
        s = std::move(s.next->get());
    }
    return 0;
}
where counter is a trivial loader (an object that reads data into a std::vector<T> in whatever sized chunks it feels like). The processing of the data is in the main code, where we just print them out in get-sized chunks.
The loading happens in a different thread, and will continue asynchronously whatever the main thread does. The main thread just gets delivered std::vector<T> to do with as they will. In your case, you'd make T a Mat.
The Mat objects are complex objects with internal memory allocations. When you clear the vector, it must iterate through every contained Mat instance and run its destructor, which is itself a non-trivial operation.
Also remember that free-store memory operations are non-trivial, so depending on your heap implementation, the heap may decide to merge cells, etc.
If this is a problem, run your clear through a profiler and find out where the bottleneck is.
Be careful using optimization; it can drive the debugger crazy.
What if you were to do this in a function and simply let the vector go out of scope?
Since the elements are not pointers, I think this would work.
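A minimal sketch of that idea (hypothetical function name; the destructor cost is unchanged, it just runs when the scope ends):

void process_matrices()
{
    std::vector<Mat> my_vect;
    load_vector_of_matrices("filename", my_vect);
    // ... correlate(mat1, my_vect.at(i)) over all 16K elements ...
} // my_vect is destroyed here instead of via an explicit clear()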
I am using the std::deque at() function to access elements without popping them from the queue, since I am using the same queue in different iterations. My solution is based on coarse-grained multithreading. Now I want to make it a fine-grained multithreading solution. For that, I am using tbb::concurrent_queue, but I need the equivalent of std::deque's at() operation in tbb::concurrent_queue.
EDIT
This is how I am implementing it with std::deque (coarse-grained multithreading).
Keep in mind that dq is a static queue (i.e., used many times in different iterations):
vertext_found = true;
std::deque<T> dq;
while (i < dq.size())
{
    EnterCriticalSection(&h);
    if (i < dq.size())
    {
        v = dq.at(i); // accessing element of queue without popping
        i++;
        vertext_found = true;
    }
    LeaveCriticalSection(&h);
    if (vertext_found && (i < dq.size()) && v != NULL)
    {
        // operation on 'v'
        vertext_found = false;
    }
}
How can I achieve the same result with tbb::concurrent_queue?
If your algorithm has separate passes that fill the queue or consume the queue, consider using tbb::concurrent_vector. It has a push_back method that could be used for the fill pass, and an at() method for the consumption passes. If threads contend to pop elements in a consumption pass, consider using a tbb::atomic counter to generate indices for at().
If there is no such clean separation of filling and consuming, using at() would probably create more problems than it solves, even if it existed, because it would be racing against a consumer.
If a consumption pass just needs to loop over the concurrent_vector in parallel, consider using tbb::parallel_for for the loop. tbb::concurrent_vector has a range() method that supports this idiom.
void consume( tbb::concurrent_vector<T>& vec ) {
    tbb::parallel_for( vec.range(), [&]( const tbb::concurrent_vector<T>::range_type& r ) {
        for( auto i=r.begin(); i!=r.end(); ++i ) {
            T value = *i;
            // ...process value...
        }
    });
}
If a consumption pass cannot use tbb::parallel_for, consider using a TBB atomic counter to generate the indices. Initialize the counter to zero and use ++ to increment it. Here is an example:
tbb::atomic<size_t> head;
tbb::concurrent_vector<T> vec;

bool pop_one( T& result ) { // try to grab next item from vec
    size_t i = head++;      // fetch-and-increment must be a single atomic operation
    if( i < vec.size() ) {
        result = vec[i];
        return true;
    } else {
        return false;       // failed
    }
}
In general, this solution will be less scalable than using tbb::parallel_for, because the counter "head" introduces a point of contention in the memory system.
According to the Doxygen docs on the TBB site (TBB Doxy docs), there is no at operation in the queue. You can push and try_pop elements with a tbb::strict_ppl::concurrent_queue.
If you're using a tbb::deprecated::concurrent_queue (older versions of TBB), the push_if_not_full and pop_if_present operations are available.
In both queues, "multiple threads may each push and pop concurrently," as stated in the brief section of the docs.
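A minimal sketch of the push/try_pop style these queues support instead of at() (pop-based consumption, not indexed access):

#include <tbb/concurrent_queue.h>

tbb::concurrent_queue<int> q;

void producer() { q.push(42); }

void consumer() {
    int v;
    while (q.try_pop(v)) { // non-blocking; returns false once the queue is empty
        // ... operate on v ...
    }
}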
How to return a class object from threads or how to persist its state?
struct DataStructure
{
    MapSmoother *m1;
    std::vector<Vertex*> v1;
    std::vector<Vertex*>::iterator vit;
    DataStructure() {
        m1 = NULL;
    }
};

DWORD WINAPI thread_fun(void* p)
{
    DataStructure *input = (DataStructure*)p;
    for( ; (input->vit) != (input->v1).end(); ){
        Vertex *v = *input->vit++;
        (*(input->m1)).relax(v);
    }
    return 0;
}
int main()
{
    // Reading srcMesh
    // All the vertices in srcMesh will be encoded with color
    MapSmoother msmoother(srcMesh, dstMesh); // initial dstMesh will be created with no edge weights
    DataStructure *input = new DataStructure; // struct which holds the msmoother object and the vector "verList"; I pass this to the threads as the function argument
    HANDLE hThread[100];
    for(int color = 1; color <= 7; color++)
    {
        srcMesh.reportVertex(color, verList); // all the vertices in srcMesh with the same color index are stored in verList (a vector)
        std::vector<Vertex*>::iterator vit = verList.begin();
        input->vit = vit;
        for(int i = 0; i < 100; i++)
            hThread[i] = CreateThread(0, 0, &thread_fun, input, 0, NULL);
        WaitForMultipleObjects(100, hThread, TRUE, INFINITE);
        for(int i = 0; i < 100; i++)
            CloseHandle(hThread[i]);
    }
    msmoother.computeEnergy(); // compute harmonic energy based on edge weights
}
In thread_fun, I call a method on the msmoother object in order to update it with edge weights, as well as dstMesh. dstMesh is updated perfectly by the thread function. To perform computeEnergy on the msmoother object, the object should be returned to the main thread or its state should be persisted. But it returns the energy as '0'. How can I achieve this?
Memory is shared between threads, so all modifications they make to shared data eventually become visible without any additional effort to return or persist anything.
Your problem, apparently, is that you don't wait for the threads to complete before attempting to use the data they should have prepared. As you already have an array of thread handles, WaitForMultipleObjects is a convenient way to wait for all threads to complete (note the bWaitAll parameter). Be aware that WaitForMultipleObjects can't wait for more than 64 objects at once, so with 100 threads you need two calls.
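A minimal sketch of waiting on your 100 threads in two batches, given the MAXIMUM_WAIT_OBJECTS (64) limit:

WaitForMultipleObjects(64, hThread, TRUE, INFINITE);      // first 64 handles
WaitForMultipleObjects(36, hThread + 64, TRUE, INFINITE); // remaining 36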
If computeEnergy() requires all threads to have completed, you can pass the handles of the threads to WaitForMultipleObjects, which supports waiting for threads to complete. Within each thread you can add or modify a value within the msmoother object (as passed by pointer to thread_fun).
The msmoother object will live until the threads all return, so passing a pointer to it is acceptable.