Equivalent of std::deque at in tbb::concurrent_queue container? - c++

I am using the std::deque at() function to access elements without popping them from the queue, since I use the same queue in different iterations. My solution is based on coarse-grained multithreading, and I now want to make it fine-grained. For that I am using tbb::concurrent_queue, but I need the equivalent of std::deque's at() operation in tbb::concurrent_queue.
EDIT
This is how I am implementing it with std::deque (coarse-grained multithreading).
Keep in mind that dq is a static queue (i.e. it is used many times in different iterations).
vertext_found = true;
std::deque<T> dq;
while ( i < dq.size())
{
EnterCriticalSection(&h);
if( i < dq.size() )
{
v = dq.at(i); // accessing element of queue without popping
i++;
vertext_found = true;
}
LeaveCriticalSection(&h);
if (vertext_found && (i < dq.size()) && v != NULL)
{
**operation on 'v'
vertext_found = false;
}
}
How can I achieve the same result with tbb::concurrent_queue?

If your algorithm has separate passes that fill the queue or consume the queue, consider using tbb::concurrent_vector. It has a push_back method that could be used for the fill pass, and an at() method for the consumption passes. If threads contend to pop elements in a consumption pass, consider using a tbb::atomic counter to generate indices for at().
If there is no such clean separation of filling and consuming, using at() would probably create more problems than it solves, even if it existed, because it would be racing against a consumer.
If a consumption pass just needs to loop over the concurrent_vector in parallel, consider using tbb::parallel_for for the loop. tbb::concurrent_vector has a range() method that supports this idiom.
void consume( tbb::concurrent_vector<T>& vec ) {
tbb::parallel_for( vec.range(), [&]( const tbb::concurrent_vector<T>::range_type& r ) {
for( auto i=r.begin(); i!=r.end(); ++i ) {
T value = *i;
...process value...;
}
});
}
If a consumption pass cannot use tbb::parallel_for, consider using a TBB atomic counter to generate the indices. Initialize the counter to zero and use ++ to increment it. Here is an example:
tbb::atomic<size_t> head;
tbb::concurrent_vector<T> vec;
bool pop_one( T& result ) { // Try to grab next item from vec
size_t i = head++; // Fetch-and-increment must be single atomic operation
if( i<vec.size() ) {
result = vec[i];
return true;
} else {
return false; // Failed
}
}
In general, this solution will be less scalable than using tbb::parallel_for, because the counter "head" introduces a point of contention in the memory system.
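The counter idea above does not depend on TBB. As a rough sketch using only the standard library, the following assumes a container that is fully filled before the consumption pass starts, with std::atomic standing in for tbb::atomic and a plain std::vector standing in for tbb::concurrent_vector. The function name sum_with_claimed_indices and the summing workload are hypothetical and only serve to make the example checkable.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: every worker claims indices from one shared atomic
// counter, so each element of `items` is read by exactly one thread.
// `items` must not be modified while the workers run.
long long sum_with_claimed_indices(const std::vector<int>& items, unsigned workers) {
    assert(workers > 0);
    std::atomic<std::size_t> head{0};
    std::vector<long long> partial(workers, 0);  // one result slot per worker
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&items, &head, &partial, w] {
            for (;;) {
                std::size_t i = head.fetch_add(1);  // atomic fetch-and-increment
                if (i >= items.size()) break;       // nothing left to claim
                partial[w] += items[i];             // "process" the element
            }
        });
    }
    for (auto& t : pool) t.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Each worker writes only to its own slot in partial, so the shared counter is the only point of contention, which is the scalability cost mentioned above.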

According to the Doxygen docs on the TBB site (TBB Doxy docs), the queue has no at operation. You can push and try_pop elements with a tbb::strict_ppl::concurrent_queue.
If you're using a tbb::deprecated::concurrent_queue (older versions of TBB), the push_if_not_full and pop_if_present operations are available.
In both queues, "multiple threads may each push and pop concurrently", as stated in the brief section.

Related

Simple Read Write Lock

I find that many read-write spin lock implementations on the internet are unnecessarily complex. I have written a simple read-write lock in C++.
Could anybody tell me if I am missing anything?
int r = 0;
int w = 0;
read_lock(void)
{
atomic_inc(r); //increment value atomically
while( w != 0);
}
read_unlock(void)
{
atomic_dec(r); // Decrement value atomically
}
write_lock(void)
{
while( (r != 0) &&
( w != 0))
atomic_inc(w); //increment value atomically
}
write_unlock(void)
{
atomic_dec(w); //Decrement value atomically
}
The usage would be as below.
read_lock()
// Critical Section
read_unlock();
write_lock()
// Critical Section
write_unlock();
Edit:
Thanks for the answers.
I have now changed the code to its atomic equivalent.
If threads access r and w concurrently, they have a data-race. If a C++ program has a data-race, the behaviour of the program is undefined.
int is not guaranteed by the C++ standard to be atomic. Even if we assume a system where accessing an int is atomic, operator++ would probably not be an atomic operation even on such systems. As such, simultaneous increments could "disappear".
Furthermore after the loop in write_lock, another thread could also end their loop before w is incremented, thereby allowing multiple simultaneous writers - which I assume this lock is supposed to prevent.
Lastly, this appears to be an attempt at implementing a spinlock. Spinlocks have advantages and disadvantages. Their disadvantage is that they consume all CPU cycles of their thread while blocking. This is highly inefficient use of resources, and bad for battery time, and bad for other processes that could have used those cycles. But it can be optimal if the wait time is short.
The simplest implementation would be to use a single integral value: -1 means a write is in progress, 0 means it is not being read or written, and a positive value means it is being read by that many threads.
Use atomic_int and compare_exchange_weak (or strong but weak should suffice)
std::atomic_int l{0}; // brace-init: copy-initialization of an atomic is ill-formed before C++17
void write_lock() {
int v = 0;
while( !l.compare_exchange_weak( v, -1 ) )
v = 0; // it will set it to what it currently held
}
void write_unlock() {
l = 0; // no need to compare_exchange
}
void read_lock() {
int v = l.load();
while( v < 0 || !l.compare_exchange_weak(v, v+1) )
v = l.load();
}
void read_unlock() {
--l; // no need to do anything else
}
I think that should work. You should also have RAII objects, i.e. an automatic object for each lock type that locks on construction and unlocks on destruction.
That could be done like this:
class AtomicWriteSpinScopedLock
{
private:
atomic_int& l_;
public:
// handle copy/assign/move issues
explicit AtomicWriteSpinScopedLock( atomic_int& l ) :
l_(l)
{
int v = 0;
while( !l.compare_exchange_weak( v, -1 ) )
v = 0; // it will set it to what it currently held
}
~AtomicWriteSpinScopedLock()
{
l_ = 0;
}
};
class AtomicReadSpinScopedLock
{
private:
atomic_int& l_;
public:
// handle copy/assign/move issues
explicit AtomicReadSpinScopedLock( atomic_int& l ) :
l_(l)
{
int v = l.load();
while( v < 0 || !l.compare_exchange_weak(v, v+1) )
v = l.load();
}
~AtomicReadSpinScopedLock()
{
--l_;
}
};
On locking to write the value must be 0 and you must swap it to -1, so just keep trying to do that.
On locking to read the value must be non-negative and then you attempt to increase it, so there may be retries against other readers, not in acquiring the lock but in setting its count.
compare_exchange_weak sets to the first parameter what it actually held if the exchange failed, and the second parameter is what you are trying to change it to. It returns true if it swapped and false if it did not.
How efficient? It's a spin-lock. It will use CPU cycles whilst waiting, so it had better be available very soon: the update or the reading of the data should be swift.
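As an alternative to hand-written scoped lock classes, the same counter encoding can be wrapped in a class whose member names match what the standard lock wrappers expect, so that std::lock_guard and std::shared_lock (C++14) provide the RAII part. This is only a sketch under that assumption: the class name RWSpinLock and the helper count_with_writers are hypothetical, and the default sequentially consistent memory ordering is used throughout.

```cpp
#include <atomic>
#include <mutex>         // std::lock_guard
#include <shared_mutex>  // std::shared_lock (C++14)
#include <thread>
#include <vector>

// -1: a writer holds the lock, 0: free, n > 0: n readers hold it.
class RWSpinLock {
    std::atomic_int state_{0};
public:
    void lock() {  // exclusive (write) lock
        int expected = 0;
        while (!state_.compare_exchange_weak(expected, -1))
            expected = 0;  // a failed exchange overwrote `expected`; reset it
    }
    void unlock() { state_ = 0; }
    void lock_shared() {  // shared (read) lock
        int v = state_.load();
        while (v < 0 || !state_.compare_exchange_weak(v, v + 1))
            v = state_.load();
    }
    void unlock_shared() { --state_; }
};

// Hypothetical stress test: several writers bump a plain int under the
// exclusive lock, then the result is read under the shared lock.
int count_with_writers(int writers, int increments_each) {
    RWSpinLock rw;
    int shared = 0;
    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w)
        pool.emplace_back([&rw, &shared, increments_each] {
            for (int i = 0; i < increments_each; ++i) {
                std::lock_guard<RWSpinLock> guard(rw);  // exclusive
                ++shared;
            }
        });
    for (auto& t : pool) t.join();
    std::shared_lock<RWSpinLock> guard(rw);  // shared
    return shared;
}
```

If no increments are lost, count_with_writers(4, 10000) returns 40000. As with any spin lock, this only makes sense when the critical sections are very short.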

c++ multithreading return value

I am using 3 threads to split up a for loop. 'data' is a global array, so I want to lock that part in the 'calculateAll' function:
std::vector< int > calculateAll(int ***data,std::vector<LineIndex> indexList)
{
std::vector<int> v_a=std::vector<int>();
for(int a=0;a<indexList.size();a++)
{
mylock.lock();
v_b.push_back(/*something related with data*/);
mylock.unlock();
v_a.push_back(a);
}
return v_a;
}
for(int i=0;i<3;i++)
{
int s =firstone+i*chunk;
int e = ((s+chunk)<indexList.size())? (s+chunk) : indexList.size();
t[i]=std::thread(calculateAll,data,indexList,s,e);
}
for (int i = 0; i < 3; ++i)
{
t[i].join();
}
My question is: how can I get the return value, which is a vector, from each thread and then combine them? The reason I want to do that is that if I declare 'v_a' as a global vector, each thread pushing its values into it concurrently might cause a crash (or not?). So I am thinking of declaring a vector for each thread and then combining them into a new vector for further use (as I would without threads).
Or is there a better method to deal with this concurrency problem? The order of 'v_a' does not matter.
I would appreciate any suggestions.
First of all, rather than the explicit lock() and unlock() seen in your code, always use std::lock_guard where possible. Secondly, it would be better to use std::future for this. Launch each task with std::async, then get the results in another loop while aggregating them at the same time. Like this:
using Vector = std::vector<int>;
using Future = std::future<Vector>;
std::vector<Future> futures;
for(int i=0;i<3;i++)
{
int s =firstone+i*chunk;
int e = ((s+chunk)<indexList.size())? (s+chunk) : indexList.size();
auto fut = std::async(std::launch::async, calculateAll, data, indexList, s, e);
futures.push_back( std::move(fut) );
}
//Combine the results
std::vector<int> result;
for(auto& fut : futures){ //Iterate through in the order the future was created
auto vec = fut.get(); //Get the result of the future
result.insert(result.end(), vec.begin(), vec.end()); //append it to results in order
}
Here is a minimal, complete and working example based on your code - which demonstrates what I mean: Live On Coliru
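Since the linked example is not reproduced here, the following is a self-contained sketch of the same std::async pattern. The worker square_range is a hypothetical stand-in for calculateAll (it just squares a slice of the input), and square_all is a hypothetical driver; the chunking and the in-order aggregation follow the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for calculateAll: returns the squares of input[s..e).
std::vector<int> square_range(const std::vector<int>& input, std::size_t s, std::size_t e) {
    assert(s <= e && e <= input.size());
    std::vector<int> out;
    out.reserve(e - s);
    for (std::size_t i = s; i < e; ++i) out.push_back(input[i] * input[i]);
    return out;
}

// Splits the input into at most `tasks` chunks, runs one async task per
// chunk, and concatenates the partial results in chunk order.
std::vector<int> square_all(const std::vector<int>& input, std::size_t tasks) {
    if (tasks == 0) tasks = 1;
    const std::size_t chunk = (input.size() + tasks - 1) / tasks;  // ceiling division
    std::vector<std::future<std::vector<int>>> futures;
    for (std::size_t s = 0; s < input.size(); s += chunk) {
        const std::size_t e = std::min(s + chunk, input.size());
        // std::cref avoids copying the input into every task
        futures.push_back(std::async(std::launch::async, square_range, std::cref(input), s, e));
    }
    std::vector<int> result;
    result.reserve(input.size());
    for (auto& fut : futures) {
        std::vector<int> part = fut.get();  // blocks until that task is done
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}
```

Each task fills its own local vector, so no lock is needed for the results; a lock is only required for state that the tasks genuinely share.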
Create a structure with a vector and a lock. Pass an instance of that structure, with the lock pre-locked, to each thread as it starts. Wait for all locks to become unlocked. Each thread then does its work and unlocks its lock when done.

How to remove elements from a vector based on a condition in another vector?

I have two equal length vectors from which I want to remove elements based on a condition in one of the vectors. The same removal operation should be applied to both so that the indices match.
I have come up with a solution using vector::erase, but it is extremely slow:
vector<myClass> a = ...;
vector<otherClass> b = ...;
assert(a.size() == b.size());
for(size_t i=0; i<a.size(); i++)
{
if( !a[i].alive() )
{
a.erase(a.begin() + i);
b.erase(b.begin() + i);
i--;
}
}
Is there a way that I can do this more efficiently and preferably using stl algorithms?
If order doesn't matter you could swap the elements to the back of the vector and pop them.
for(size_t i=0; i<a.size();)
{
if( !a[i].alive() )
{
std::swap(a[i], a.back());
a.pop_back();
std::swap(b[i], b.back());
b.pop_back();
}
else
++i;
}
If you have to maintain the order you could use std::remove_if. See this answer how to get the index of the dereferenced element in the remove predicate:
// b first: its predicate reads a, which must still be intact at this point
b.erase(remove_if(begin(b), end(b),
[&a, &b](const otherClass& d) { return !a[&d - &*begin(b)].alive(); }),
end(b));
a.erase(remove_if(begin(a), end(a),
[](const myClass& d) { return !d.alive(); }),
end(a));
The reason it's slow is probably the O(n^2) complexity. Why not use a list instead? Making a pair of a and b is a good idea too.
A quick win would be to run the loop backwards: i.e. start at the end of the vector. This tends to minimise the number of backward shifts due to element removal.
Another approach would be to consider std::vector<std::unique_ptr<myClass>> etc.: then you'll be essentially moving pointers rather than values.
I propose you create 2 new vectors, reserve memory and swap vectors content in the end.
vector<myClass> a = ...;
vector<otherClass> b = ...;
vector<myClass> new_a;
vector<otherClass> new_b;
new_a.reserve(a.size());
new_b.reserve(b.size());
assert(a.size() == b.size());
for(size_t i=0; i<a.size(); i++)
{
if( a[i].alive() )
{
new_a.push_back(a[i]);
new_b.push_back(b[i]);
}
}
swap(a, new_a);
swap(b, new_b);
It may consume more memory, but it should be fast.
Erasing from the middle of a vector is slow because everything after the deletion point has to be shifted. Consider using another container that makes erasing quicker. It depends on your use case: will you be iterating often? Does the data need to be in order? If you aren't iterating often, consider a list. If you need to maintain order, consider a set. If you are iterating often and need to maintain order, then depending on the number of elements it may be quicker to push back all alive elements to a new vector and set a/b to point to that instead.
Also, since the data is intrinsically linked, it seems to make sense to have just one vector containing data a and b in a pair or small struct.
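With the two vectors merged into one, the removal reduces to the ordinary erase-remove idiom. A minimal sketch, assuming a hypothetical Entry struct whose alive flag stands in for myClass::alive() and whose data member stands in for the otherClass payload:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical element type pairing what used to live in `a` and `b`.
struct Entry {
    bool alive;        // stands in for myClass::alive()
    std::string data;  // stands in for the otherClass payload
};

// Erase-remove idiom: a single O(n) pass that keeps the survivors in order.
void remove_dead(std::vector<Entry>& entries) {
    entries.erase(std::remove_if(entries.begin(), entries.end(),
                                 [](const Entry& e) { return !e.alive; }),
                  entries.end());
}
```

Because both pieces of data move together, the indices cannot get out of sync, and no index arithmetic is needed inside the predicate.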
For performance reasons, consider the following.
Use
vector<pair<myClass, otherClass>>
as @Basheba says, together with std::sort.
Use the form of std::sort that takes a comparison predicate, and do not enumerate from 0 to n: use std::lower_bound instead, because the vector will be sorted. Insert elements as CashCow describes in this question: "how do you insert the value in a sorted vector?"
I had a similar problem where I had two vectors:
std::vector<Eigen::Vector3d> points;
std::vector<Eigen::Vector3d> colors;
for 3D pointclouds in Open3D and after removing the floor, I wanted to delete all points and colors if the points' z coordinate is greater than 0.05. I ended up overwriting the points based on the index and resizing the vector afterward.
bool invert = true;
std::vector<bool> mask = std::vector<bool>(points.size(), invert);
size_t pos = 0;
for (auto & point : points) {
if (point(2) < CONSTANTS::FLOOR_HEIGHT) {
mask.at(pos) = false;
}
++pos;
}
size_t counter = 0;
for (size_t i = 0; i < points.size(); i++) {
if (mask[i]) {
points.at(counter) = points.at(i);
colors.at(counter) = colors.at(i);
++counter;
}
}
points.resize(counter);
colors.resize(counter);
This maintains order and, at least in my case, worked almost twice as fast as the remove_if method from the accepted answer:
for 921600 points the runtimes were:
33 ms for the accepted answer
17 ms for this approach.

Removing code dynamically?

I'm in a for loop in C++ and I want an "if" clause inside it to disappear in the next iteration (for the sake of performance) after it has evaluated to true once. Is that possible in C++ or in any other language?
There is no magic way to change the code dynamically. Executing an if clause in a loop is probably not super-expensive if the if condition is cheap to evaluate.
If the condition is expensive to evaluate, you may want to protect it with an extra boolean variable:
bool mustCheck = true;
size_t const n = ...; // number of iterations
for (size_t i = 0; i < n; ++i) {
if (mustCheck && theExpensiveCheck(...)) {
mustCheck = false; // turn off the check now
....
}
...
}
If the goal is to execute the check on the first iteration only, you could test if the loop index is 0:
for (size_t i = 0; i < n; ++i) {
if (i == 0 && theExpensiveCheck(...)) {
....
}
...
}
Another option that does not have an if inside the loop is to pull out the if completely, and execute it before the loop if the loop has at least one iteration:
size_t const n = ...; // number of iterations
if (n > 0) {
// do check and execute loop body for first item
if (theExpensiveCheck()) {
....
}
}
// start regular loop, starting at index 1
for (size_t i = 1; i < n; ++i) {
// execute loop body for other items
...
}
The modifications above add extra complexity (and thus potential bugs) to your code. I would recommend not performing any of these modifications unless it is clear that there actually is a performance problem with the loop or the if condition. Often enough, applying the above modifications will not result in substantial performance gains, but clearly it depends on the if condition.
Compilers nowadays also provide several powerful loop optimization techniques, so you should make sure you're compiling with all these optimizations turned on.
This is not possible in C++. Once it is compiled, that's it. But, you shouldn't worry about checking a value once. The performance impact is insignificant.
I suppose if the value check was pretty involved (ie more than simple checking if it is T or F), you could add some sort of flag to check first and then skip the rest of the check if it is true. This obviously requires its own check/assignment and is most likely not worth doing.
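One more way to make the "if" go away after it has fired once, without changing code at run time, is to split the loop in two: a first loop that still contains the check and stops as soon as it is true, and a second loop that resumes from the same index without it. This is a sketch of that loop-splitting idea rather than something taken from the answers above; the function name and the "add a bonus on the first negative value" workload are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Sums all values; the first time a negative value is seen, `bonus` is added
// once. After that, the check is no longer executed at all.
long long sum_with_one_time_bonus(const std::vector<int>& values, int bonus) {
    long long total = 0;
    std::size_t i = 0;
    // Phase 1: the check is still live.
    for (; i < values.size(); ++i) {
        total += values[i];
        if (values[i] < 0) {  // the condition we only care about once
            total += bonus;
            ++i;              // this element is done; resume after it
            break;
        }
    }
    // Phase 2: same loop body, but without the check.
    for (; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}
```

The cost is a duplicated loop body, so this is only worth considering when profiling shows that the check itself matters.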

performing vector intersection in C++

I have a vector of vectors of unsigned. I need to find the intersection of all these vectors, and to do so I wrote the following code:
int func()
{
vector<vector<unsigned> > t;
vector<unsigned> intersectedValues;
bool firstIntersection=true;
for(int i=0;i<(t).size();i++)
{
if(firstIntersection)
{
intersectedValues=t[0];
firstIntersection=false;
}else{
vector<unsigned> tempIntersectedSubjects;
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.begin()));
intersectedValues=tempIntersectedSubjects;
}
if(intersectedValues.size()==0)
break;
}
}
Each individual vector has 9000 elements and there are many such vectors in "t". When I profiled my code I found that set_intersection takes the most time and hence makes the code slow when there are many invocations of func(). Can someone please suggest how I can make the code more efficient?
I am using: gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)
EDIT: Individual vectors in vector "t" are sorted.
I don't have a framework to profile the operations but I'd certainly change the code to reuse the readily allocated vector. In addition, I'd hoist the initial intersection out of the loop. Also, std::back_inserter() should make sure that elements are added in the correct location rather than in the beginning:
std::vector<unsigned> func()
{
vector<vector<unsigned> > t = some_initialization();
if (t.empty()) {
return {}; // nothing to intersect
}
vector<unsigned> intersectedValues(t[0]);
vector<unsigned> tempIntersectedSubjects;
for (std::vector<std::vector<unsigned>>::size_type i(1u);
i < t.size() && !intersectedValues.empty(); ++i) {
std::set_intersection(t[i].begin(), t[i].end(),
intersectedValues.begin(), intersectedValues.end(),
std::back_inserter(tempIntersectedSubjects));
std::swap(intersectedValues, tempIntersectedSubjects);
tempIntersectedSubjects.clear(); // keeps its capacity for the next round
}
return intersectedValues;
}
I think this code has a fair chance to be faster. It may also be reasonable to intersect the sets differently: instead of keeping one set and intersecting with that, you could create a new intersection for each pair of adjacent sets and then intersect those results with their respective neighbours:
std::vector<std::vector<unsigned>> intersections(
std::vector<std::vector<unsigned>> const& t) {
std::vector<std::vector<unsigned>> r;
std::vector<std::vector<unsigned>>::size_type i(0);
for (; i + 1 < t.size(); i += 2) {
r.push_back(intersect(t[i], t[i + 1]));
}
if (i < t.size()) {
r.push_back(t[i]);
}
return r;
}
std::vector<unsigned> func(std::vector<std::vector<unsigned>> const& t) {
if (t.empty()) { /* deal with t being empty... */ }
std::vector<std::vector<unsigned>> r(intersections(t));
return r.size() == 1? r[0]: func(r);
}
Of course, you wouldn't really implement it like this: you'd use Stepanov's binary counter to keep the intermediate sets. This approach assumes that the result is most likely non-empty. If the expectation is that the result will be empty that may not be an improvement.
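The intersect helper called in the pairwise code above is not defined in the answer. A minimal sketch of what it could look like, assuming both inputs are sorted in ascending order (as stated in the question's edit):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical definition of the `intersect` helper used above.
// Both inputs must be sorted ascending; the result is sorted as well.
std::vector<unsigned> intersect(const std::vector<unsigned>& a,
                                const std::vector<unsigned>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<unsigned> out;
    out.reserve(std::min(a.size(), b.size()));  // upper bound on the result size
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

Reserving the size of the smaller input avoids reallocations while the result is being filled.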
I can't test this but maybe something like this would be faster?
int func()
{
vector<vector<unsigned> > t;
vector<unsigned> intersectedValues;
// remove if() branching from loop
if(t.empty())
return -1;
intersectedValues = t[0];
// now start from 1
for(size_t i = 1; i < t.size(); ++i)
{
vector<unsigned> tempIntersectedSubjects;
tempIntersectedSubjects.reserve(intersectedValues.size()); // pre-allocate
// insert at end() not begin()
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));
// as these are not used again you can move them rather than copy
intersectedValues = std::move(tempIntersectedSubjects);
if(intersectedValues.empty())
break;
}
return 0;
}
Another possibility:
Thinking about it using swap() could optimize the exchange of data and remove the need to re-allocate. Also then the temp constructor can be moved out of the loop.
int func()
{
vector<vector<unsigned> > t;
vector<unsigned> intersectedValues;
// remove if() branching from loop
if(t.empty())
return -1;
intersectedValues = t[0];
// no need to construct this every loop
vector<unsigned> tempIntersectedSubjects;
// now start from 1
for(size_t i = 1; i < t.size(); ++i)
{
// should already be the correct size from previous loop
// but just in case this should be cheap
// (profile removing this line)
tempIntersectedSubjects.reserve(intersectedValues.size());
// insert at end() not begin()
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::inserter(tempIntersectedSubjects, tempIntersectedSubjects.end()));
// swap should leave tempIntersectedSubjects preallocated to the
// correct size
intersectedValues.swap(tempIntersectedSubjects);
tempIntersectedSubjects.clear(); // will not deallocate
if(intersectedValues.empty())
break;
}
return 0;
}
You can make std::set_intersection as well as a bunch of other standard library algorithms run in parallel by defining _GLIBCXX_PARALLEL during compilation. That probably has the best work-gain ratio. For documentation, see the libstdc++ parallel mode manual.
Obligatory pitfall warning:
Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates such as std::search, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.
(quoted from the libstdc++ parallel mode manual)
Another simple, though probably insignificantly small, optimization would be reserving enough space before filling your vectors.
Also, try to find out whether inserting the values at the back instead of the front and then reversing the vector helps. (Although I even think that your code is wrong right now and your intersectedValues is sorted the wrong way. If I'm not mistaken, you should use std::back_inserter instead of std::inserter(...,begin) and then not reverse.) While shifting stuff through memory is pretty fast, not shifting should be even faster.
Copying the elements between the vectors in a for loop with emplace_back() may save time. The flag is also unnecessary if you change the starting index of the for loop, so the loop can be simplified and the condition check removed from each iteration.
void func()
{
vector<vector<unsigned > > t;
if(t.empty())
return;
// start from t[0]; the loop index begins at 1, so no flag is needed
vector<unsigned int > intersectedValues=t[0];
for(unsigned int i=1;i<t.size();i++)
{
vector<unsigned > tempIntersectedSubjects;
set_intersection(t[i].begin(),
t[i].end(), intersectedValues.begin(),
intersectedValues.end(),
std::back_inserter(tempIntersectedSubjects));
// replace the running result with this round's intersection
intersectedValues.clear();
for(auto &ele: tempIntersectedSubjects)
intersectedValues.emplace_back(ele);
if( intersectedValues.empty())
break;
}
}
std::set_intersection can be rather slow for large vectors. It's possible to create a similar function that uses lower_bound. Something like this:
template<typename Iterator1, typename Iterator2, typename Function>
void lower_bound_intersection(Iterator1 begin_1, Iterator1 end_1, Iterator2 begin_2, Iterator2 end_2, Function func)
{
for (; begin_1 != end_1 && begin_2 != end_2;)
{
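if (*begin_1 < *begin_2)
{
// iterators have no lower_bound member; use the free function from <algorithm>
begin_1 = std::lower_bound(begin_1, end_1, *begin_2);
// ++begin_1; // linear-step alternative
}
else if (*begin_2 < *begin_1)
{
begin_2 = std::lower_bound(begin_2, end_2, *begin_1);
// ++begin_2; // linear-step alternative
}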
else // equivalent
{
func(*begin_1);
++begin_1;
++begin_2;
}
}
}
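For reference, here is a self-contained variant of the same idea specialised to sorted std::vector<unsigned> inputs and built on the free function std::lower_bound. The function name lower_bound_intersect is hypothetical, and this is a sketch rather than a tuned implementation: jumping with a binary search tends to pay off when matches are sparse or one input is much smaller than the other, while for inputs of similar size and density the plain linear merge of std::set_intersection may well be faster, so it is worth profiling both.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical vector-specific variant; both inputs must be sorted ascending.
std::vector<unsigned> lower_bound_intersect(const std::vector<unsigned>& a,
                                            const std::vector<unsigned>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<unsigned> out;
    auto i = a.begin();
    auto j = b.begin();
    while (i != a.end() && j != b.end()) {
        if (*i < *j) {
            i = std::lower_bound(i, a.end(), *j);  // binary-search jump in a
        } else if (*j < *i) {
            j = std::lower_bound(j, b.end(), *i);  // binary-search jump in b
        } else {
            out.push_back(*i);  // present in both inputs
            ++i;
            ++j;
        }
    }
    return out;
}
```

Each jump restarts the search from the current position rather than from the beginning, so elements that were already skipped are never visited again.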