C++ async only uses 2 cores

I am using std::async to run a method concurrently, but when I check my CPU, only 2 of my 8 cores are in use; CPU utilization stays around 13%-16% the whole time.
I thought async creates a new thread with every call and should therefore be able to use more cores. Did I misunderstand something?
Here's my code:
for (map<string, Cell>::iterator a = cells.begin(); a != cells.end(); ++a)
{
    for (map<string, Cell>::iterator b = cells.begin(); b != cells.end(); ++b)
    {
        if (a->first == b->first)
            continue;
        if (_paths.count("path_" + b->first + "_" + a->first) > 0)
        {
            continue;
        }
        tmp = "path_" + a->first + "_" + b->first;
        auto future = async(launch::async, &Pathfinder::findPath, this, &a->second, &b->second, collisionZone);
        _paths[tmp] = future.get();
    }
}
Did I get the concept wrong?
EDIT:
Thanks guys, I figured it out now. I didn't know that calling .get() on the future would wait for it to finish, which in hindsight seems only logical...
However, I have now edited my code:
for (map<string, Cell>::iterator a = cells.begin(); a != cells.end(); ++a)
{
    for (map<string, Cell>::iterator b = cells.begin(); b != cells.end(); ++b)
    {
        if (a->first == b->first)
            continue;
        if (_paths.count("path_" + b->first + "_" + a->first) > 0)
        {
            continue;
        }
        tmp = "path_" + a->first + "_" + b->first;
        mapBuffer[tmp] = async(launch::async, &Pathfinder::findPath, this, &a->second, &b->second, collisionZone);
    }
}
for (map<string, future<list<Point>>>::iterator i = mapBuffer.begin(); i != mapBuffer.end(); ++i)
{
    _paths[i->first] = i->second.get();
}
It works. Now it spawns threads properly and uses all my CPU power. You saved me a lot of trouble! Thanks again.

To answer the underlying problem:
You should refactor the code by splitting the loop. In the first loop, create all the futures and put them in a map keyed by tmp. In the second loop, iterate over that map and get the value from each future, storing the results in _paths.
After the first loop you'll have many futures running in parallel, so your cores should be busy enough. If cells is big enough (> numCores), it may even suffice to split only the inner loop.

std::async runs the specified function asynchronously and returns immediately. That's it.
It's up to the implementation how to do this: some implementations create a thread per async operation, some use a thread pool.
I recommend reading this: https://stackoverflow.com/a/15775870/2786682
By the way, your code does not really use std::async, since you make a synchronous call to future.get just after 'spawning' the async operation.

YES, you did get it wrong. Parallel code requires some thought before writing any code.
Your code creates a future (which may, and probably will, spawn a new thread), and immediately afterwards blocks on that future (calling its .get() method) to synchronize and retrieve the result.
So, with this strategy, your code will never utilize more than 2 CPU cores at any point in time. It can't.
Actually, most of the time your code utilizes only a single core!
The trick is to actually parallelize your code: launch all the tasks first, and only then collect the results.
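To make the pattern concrete, here is a minimal sketch, independent of the pathfinding code above; the work function is a placeholder, not part of the original question:
#include <future>
#include <iostream>
#include <vector>

// Placeholder for an expensive computation.
int work(int input) { return input * input; }

int main()
{
    std::vector<std::future<int>> futures;

    // Phase 1: launch all tasks; nothing blocks here.
    for (int i = 0; i < 100; ++i)
        futures.push_back(std::async(std::launch::async, work, i));

    // Phase 2: collect the results; by now the tasks have been
    // running in parallel across the available cores.
    long long sum = 0;
    for (auto& f : futures)
        sum += f.get();

    std::cout << sum << '\n';
}
Note that launch::async with one task per element can oversubscribe the machine when the input is large; a pool sized to the core count avoids that.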

Related

Repeatedly releasing and acquiring locks makes performance worse unexpectedly

My program has 8 writer threads and one persistence thread. The following code is the core of the persistence thread:
std::string longLine;
myMutex.lock();
while (!myQueue.empty()) {
    std::string& head = myQueue.front();
    const int hSize = head.size();
    if (hSize < blockMaxSize)
        break;
    longLine += head;
    myQueue.pop_front();
}
myMutex.unlock();
flushToFile(longLine);
The performance is acceptable (millions of writes finished in hundreds of milliseconds). I still hoped to improve the code by avoiding the string copy, so I changed it as follows:
myMutex.lock();
while (!myQueue.empty()) {
    const int hSize = myQueue.front().size();
    if (hSize < blockMaxSize)
        break;
    std::string head{std::move(myQueue.front())};
    myQueue.pop_front();
    myMutex.unlock();
    flushToFile(head);
    myMutex.lock();
}
myMutex.unlock();
Surprisingly, the performance drops sharply: millions of writes now take several seconds. Debugging shows that most of the time is spent waiting for the lock after flushing the file.
I don't understand why more time is spent waiting for the lock. Can anyone help?
Possibly faster: do all your string concatenation inside the flush function. That way the concatenation won't block the writer threads trying to append to the queue. This is possibly a micro-optimization.
While we're at it, let's make myQueue a vector rather than a queue or list class. This will be faster, since the only operations on the collection are an append and a total erase.
std::vector<std::string> tempQueue;
myMutex.lock();
if (myQueue.size() >= blockMaxSize) {
    tempQueue = std::move(myQueue);
    myQueue = {}; // not sure if this is needed
}
myMutex.unlock();
flushToFileWithQueue(tempQueue);
Where flushToFileWithQueue is this:
void flushToFileWithQueue(std::vector<std::string>& queue) {
    std::string longLine;
    for (size_t i = 0; i < queue.size(); i++) {
        longLine += queue[i];
    }
    queue.resize(0); // faster than calling .pop() N times
    flushToFile(longLine);
}
You didn't show what wakes up the persistence thread. If it's polling instead of using a proper condition variable, let me know and I'll show you how to use one.
Also make use of the .reserve() method on these vector instances so the queue has all the memory it needs to grow. Again, possibly a micro-optimization.
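For reference, here is a minimal sketch of the condition-variable wakeup mentioned above. The shuttingDown flag and the enqueue helper are assumptions added for completeness; flushToFileWithQueue is the function defined earlier:
#include <condition_variable>
#include <mutex>
#include <string>
#include <vector>

std::mutex myMutex;
std::condition_variable queueCv;
std::vector<std::string> myQueue;
bool shuttingDown = false; // assumed flag, set under myMutex at shutdown

void flushToFileWithQueue(std::vector<std::string>& queue); // as defined above

// Writer threads: append under the lock, then wake the persistence thread.
void enqueue(std::string line) {
    {
        std::lock_guard<std::mutex> lock(myMutex);
        myQueue.push_back(std::move(line));
    }
    queueCv.notify_one();
}

// Persistence thread: sleep until there is work, swap the whole queue out,
// and do the concatenation and file I/O outside the lock.
void persistenceLoop() {
    for (;;) {
        std::vector<std::string> tempQueue;
        {
            std::unique_lock<std::mutex> lock(myMutex);
            queueCv.wait(lock, [] { return !myQueue.empty() || shuttingDown; });
            if (myQueue.empty())
                return; // shutting down and nothing left to flush
            tempQueue = std::move(myQueue);
            myQueue.clear();
        }
        flushToFileWithQueue(tempQueue);
    }
}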

Why is filtering by primality in an infinite stream of numbers taking forever if processed in parallel?

I'm creating an infinite stream of Integers starting at 200 million, filtering it with a naive primality test (intended to generate load), and limiting the result to 10.
Predicate<Integer> isPrime = new Predicate<Integer>() {
    @Override
    public boolean test(Integer n) {
        for (int i = 2; i < n; i++) {
            if (n % i == 0) return false;
        }
        return true;
    }
};
Stream.iterate(200_000_000, n -> ++n)
      .filter(isPrime)
      .limit(10)
      .forEach(i -> System.out.print(i + " "));
This works as expected.
Now, if I add a call to parallel() before filtering, nothing is produced and the processing does not complete.
Stream.iterate(200_000_000, n -> ++n)
      .parallel()
      .filter(isPrime)
      .limit(10)
      .forEach(i -> System.out.print(i + " "));
Can someone point me in the right direction of what's happening here?
EDIT: I am not looking for better primality test implementations (it is intended to be long-running) but for an explanation of the negative impact of using a parallel stream.
Processing actually completes, though it may take quite a long time, depending on the number of hardware threads on your machine. The API documentation for limit warns that it might be slow for parallel streams.
A parallel stream first splits the computation into several parts according to the available parallelism level, performs the computation for each part, then joins the results together. How many parts does your task have? One per common FJP (ForkJoinPool) thread (= Runtime.getRuntime().availableProcessors()), plus (sometimes) one for the current thread if it's not in the FJP. You can control this by adding
System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
In practice, for your task, the lower the number you set, the faster it will compute.
How is the unlimited task split? Your particular task is handled by IteratorSpliterator, whose trySplit method creates chunks of ever-increasing size, starting at 1024. You can try it yourself:
Spliterator<Integer> spliterator = Stream.iterate(200_000_000, n -> ++n).spliterator();
Spliterator[] spliterators = new Spliterator[10];
for (int i = 0; i < spliterators.length; i++) {
    spliterators[i] = spliterator.trySplit();
}
for (int i = 0; i < spliterators.length; i++) {
    System.out.print((i + 1) + ": ");
    spliterators[i].tryAdvance(System.out::println);
}
So the first chunk handles the range 200000000-200001023, the second handles 200001024-200003071, and so on. If you have only 1 hardware thread, your task is split into two chunks, so 3072 numbers are checked. If you have 8 hardware threads, your task is split into 9 chunks and 46080 numbers are checked. Only after all the chunks are processed does the parallel computation stop. The heuristic of splitting the task into such big chunks doesn't work well in your case, but you would see a performance boost if prime numbers in that region appeared only once every several thousand numbers.
Probably your particular scenario could be optimized internally (i.e., the computation could stop as soon as one thread finds that the limit condition is already reached). Feel free to report a bug to the Java bug tracker.
Update: after digging more inside the Stream API, I concluded that the current behavior is a bug, raised an issue, and posted a patch. It's likely that the patch will be accepted for JDK 9 and probably even backported to the JDK 8u branch. With my patch the parallel version still does not improve performance, but at least its running time is comparable to the sequential stream's.
The reason the parallel stream takes so long is that all parallel streams use the common fork-join thread pool. Since you are submitting long-running tasks (because your isPrime implementation is inefficient), you block all threads in the pool, and as a result every other task using a parallel stream is blocked as well.
To make the parallel version faster, implement isPrime more efficiently. For example:
Predicate<Integer> isPrime = new Predicate<Integer>() {
    @Override
    public boolean test(Integer n) {
        if (n < 2) return false;
        if (n == 2 || n == 3) return true;
        if (n % 2 == 0 || n % 3 == 0) return false;
        long sqrtN = (long) Math.sqrt(n) + 1;
        for (long i = 6L; i <= sqrtN; i += 6) {
            if (n % (i - 1) == 0 || n % (i + 1) == 0) return false;
        }
        return true;
    }
};
You will immediately notice the improvement in performance. In general, avoid using parallel streams whenever there is a possibility of blocking the pool's threads.

C++ AMP atomics

I am rewriting an algorithm in C++ AMP and just ran into an issue with atomic writes, more specifically atomic_fetch_add, which apparently works only on integers?
I need to add a double_4 (or, if I have to, a float_4) in an atomic fashion. How do I accomplish that with C++ AMP's atomics?
Is the best/only solution really to have a lock variable that my code can use to control the writes? I actually need atomic writes for a long list of output doubles, so I would essentially need a lock for every output.
I have already considered tiling this for better performance, but right now I am just in the first iteration.
EDIT:
Thanks for the quick answers already given.
I have a quick update to my question though.
I made the following lock attempt, but it seems that when one thread in a warp gets past the lock, all the other threads in the same warp just tag along. I was expecting only the first warp thread to get the lock, but I must be missing something (note that it has been quite a few years since my CUDA days, so I have gotten rusty).
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
    .....
    for (int j = 0; j < attracted.extent.size(); j++)
    {
        ...
        int lock = 0; // the expected lock value
        while (!atomic_compare_exchange(&locks[j], &lock, 1));
        // when one warp thread gets the lock, ALL threads continue on
        ...
        acceleration[j] += ...; // locked write
        locks[j] = 0; // leaving the lock again
    }
});
This is not a big problem as such, since I should write into a tile-shared variable first and only write to global memory after all threads in a tile have completed, but I just don't understand this behavior.
All the atomic add ops work only on integer types. You can do what you want without locks using 128-bit CAS (compare-and-swap) operations for float_4 (I'm assuming this is 4 floats), but there are no 256-bit CAS ops, which is what you would need for double_4. What you have to do is write a loop that atomically reads the float_4 from memory, performs the float add in the regular way, and then uses CAS to test & swap the value if it's still the original (and loops if not, i.e. some other thread changed the value between the read and the write). Note that 128-bit CAS is only available on 64-bit architectures and that your data needs to be properly aligned.
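To make that loop concrete, here is a minimal sketch of the read-modify-CAS pattern for a single float, written with std::atomic<int> in plain C++ rather than the AMP intrinsics (an assumption, purely to show the loop structure):
#include <atomic>
#include <cstring>

// The float is stored as a 32-bit int so the integer compare-exchange can
// detect concurrent modifications; memcpy performs the bit reinterpretation.
float atomic_add_float(std::atomic<int>& storage, float value)
{
    int expected = storage.load();
    for (;;) {
        float oldVal;
        std::memcpy(&oldVal, &expected, sizeof(float)); // bits -> float
        float newVal = oldVal + value;                  // the regular add
        int desired;
        std::memcpy(&desired, &newVal, sizeof(float));  // float -> bits
        // Publish our value only if nobody changed it in the meantime;
        // on failure, 'expected' is refreshed and we retry.
        if (storage.compare_exchange_weak(expected, desired))
            return oldVal;
    }
}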
If the critical code is short, you can create your own lock using atomic operations:
int lock = 0; // 0 = free, 1 = held; must be shared between the threads
while (__sync_lock_test_and_set(&lock, 1) == 1)
{
    // lock was already held: yield the thread or go to sleep
}
// critical section, do the work
__sync_lock_release(&lock); // release the lock (stores 0 with release semantics)
The advantage is that you save the overhead of the OS locks.
The question as such has been answered by others, and the answer is that you need to handle double atomics yourself; there is no function for it in the library.
I would also like to elaborate on my own edit, in case others come here with the same failing lock.
In the following example, my error was in not realizing that when the exchange fails, it actually changes the expected value! Thus the first thread expects lock to be zero and writes a 1 into it. The next thread expects 0 and fails to write a 1, but the exchange then writes a 1 into the variable holding the expected value. This means that the next time that thread tries the exchange, it expects a 1 in the lock! This it gets, and then it thinks it has acquired the lock.
I was absolutely not aware that &lock would receive a 1 on a failed exchange!
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
    .....
    for (int j = 0; j < attracted.extent.size(); j++)
    {
        ...
        int lock = 0; // the expected lock value
        // note that, if locks[j] != lock, then lock is set to 1,
        // meaning that ACE will be true the next time if locks[j] == 1,
        // meaning the while will terminate even though someone else has the lock
        while (!atomic_compare_exchange(&locks[j], &lock, 1));
        // when one warp thread gets the lock, ALL threads continue on
        ...
        acceleration[j] += ...; // locked write
        locks[j] = 0; // leaving the lock again
    }
});
It seems that the fix is to do:
parallel_for_each(attracting.extent, [=](index<1> idx) restrict(amp)
{
    .....
    for (int j = 0; j < attracted.extent.size(); j++)
    {
        ...
        int lock = 0; // the expected lock value
        while (!atomic_compare_exchange(&locks[j], &lock, 1))
        {
            lock = 0; // reset the expected value
        }
        // when one warp thread gets the lock, ALL threads continue on
        ...
        acceleration[j] += ...; // locked write
        locks[j] = 0; // leaving the lock again
    }
});

Multiple threads adding values to an unordered_map at the same time make it crash

unordered_map<std::string,unordered_map<std::string, std::string> >* storing_vars;
I have this variable declared at class scope.
It is initialized in the constructor:
this->storing_vars = new unordered_map<std::string,unordered_map<std::string, std::string> >();
Then I call a function over and over again from my BackgroundWorker:
for (int i2 = 0; i2 < 30; i2++) {
    int index_pos_curr = i2;
    // Start the threads HERE
    this->backgroundWorker2 = gcnew System::ComponentModel::BackgroundWorker;
    this->backgroundWorker2->WorkerReportsProgress = true;
    this->backgroundWorker2->WorkerSupportsCancellation = true;
    //this->backgroundWorker2->FieldSetter(L"std::string",L"test","damnnit");
    backgroundWorker2->DoWork += gcnew DoWorkEventHandler( this, &MainFacebook::backgroundWorker2_DoWork );
    backgroundWorker2->RunWorkerCompleted += gcnew RunWorkerCompletedEventHandler( this, &MainFacebook::backgroundWorker2_RunWorkerCompleted );
    backgroundWorker2->ProgressChanged += gcnew ProgressChangedEventHandler( this, &MainFacebook::backgroundWorker2_ProgressChanged );
    backgroundWorker2->RunWorkerAsync(index_pos_curr);
    Sleep(50); // THE PROBLEM IS HERE: if I comment this out it won't work, probably because many calls try to add values to the same variable (even though the indexes are different in each call)
}
Once started, the worker calls the DoWork function:
void backgroundWorker2_DoWork(Object^ sender, DoWorkEventArgs^ e) {
    BackgroundWorker^ worker = dynamic_cast<BackgroundWorker^>(sender);
    e->Result = SendThem( safe_cast<Int32>(e->Argument), worker, e );
}
// signature matches the call site above; worker and e are unused here
int SendThem(int index, BackgroundWorker^ worker, DoWorkEventArgs^ e) {
    stringstream st;
    st << index;
    //...
    (*this->storing_vars)[st.str()]["index"] = "testing1";
    (*this->storing_vars)[st.str()]["rs"] = "testing2";
    return 0;
}
As the comment on the Sleep(50) line says, I believe the problem is that, since the background threads all call the same function, storing the data fails when the function is called many times at once, probably without waiting for the other stores to finish. This causes an error in the "xhash.h" file. The error is masked by Sleep(50), but I can't use that: it freezes my UI, and 50 milliseconds is just the time I'm assuming a store takes. What if it takes longer on slower computers? It's not the right approach.
How do I fix that?
I want to be able to UPDATE the unordered_map WITHOUT the use of SLEEP.
Thanks in advance.
You can only modify standard library containers (including, but not limited to, unordered_map) from one thread at a time. The solution is to use critical sections, mutexes, or locks to synchronize access. If you don't know what these are, you need to learn them before you try to create multiple threads.
No ifs, buts, or whys.
If you have multiple threads, you need a mechanism to synchronize them and serialize access to shared data. Common synchronization mechanisms are the ones mentioned above, so go look them up.
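For the native container itself, a minimal sketch of serializing access with std::mutex (plain standard C++ here rather than the C++/CLI types from the question; storeResult and storingMutex are hypothetical names):
#include <mutex>
#include <string>
#include <unordered_map>

std::unordered_map<std::string, std::unordered_map<std::string, std::string>> storing_vars;
std::mutex storingMutex; // guards every access to storing_vars

// Called concurrently from many worker threads; the lock_guard serializes
// all modifications, so the container's internal state cannot be corrupted.
void storeResult(const std::string& index)
{
    std::lock_guard<std::mutex> lock(storingMutex);
    storing_vars[index]["index"] = "testing1";
    storing_vars[index]["rs"] = "testing2";
}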
After so many downvotes I actually started to look into the Mutex people were talking about here. After a while I found out that it's really simple to use, and it's the correct way, as my fellows here told me. Thank you all for the help =D
Here is what I did; I just had to add:
// declare the mutex once
static Mutex^ mut = gcnew Mutex;
// then, inside the function that is called over and over again:
mut->WaitOne();
// declare/update the variables
mut->ReleaseMutex(); // then release it
It works perfectly. Thank you all for the help and the criticism, haha.
I also found another solution: predefining the indexes of the unordered_map before starting the workers. The problem is only creating the index; updating an existing entry seems to be OK with multiple threads.
for (int i2 = 0; i2 < 30; i2++) {
    int index_pos_curr = i2;
    // Start the threads HERE
    this->backgroundWorker2 = gcnew System::ComponentModel::BackgroundWorker;
    this->backgroundWorker2->WorkerReportsProgress = true;
    this->backgroundWorker2->WorkerSupportsCancellation = true;
    backgroundWorker2->DoWork += gcnew DoWorkEventHandler( this, &MainFacebook::backgroundWorker2_DoWork );
    backgroundWorker2->RunWorkerCompleted += gcnew RunWorkerCompletedEventHandler( this, &MainFacebook::backgroundWorker2_RunWorkerCompleted );
    stringstream st;
    st << index_pos_curr;
    (*this->storing_vars)[st.str()]["index"] = "";
    // This ^^^^^ initializes the entry; the BackgroundWorker then only
    // updates it, and this way it doesn't crash. :)
    backgroundWorker2->ProgressChanged += gcnew ProgressChangedEventHandler( this, &MainFacebook::backgroundWorker2_ProgressChanged );
    backgroundWorker2->RunWorkerAsync(index_pos_curr);
    Sleep(50); // THE PROBLEM IS HERE: if I comment this out it won't work, probably because many calls try to add values to the same variable (even though the indexes are different in each call)
}

boost::thread_group - is it ok to call create_thread after join_all?

I have the following situation:
I create a boost::thread_group instance, then create threads for parallel processing of some data, then call join_all on the threads.
Initially I created the threads for every X elements of data, like so:
// begin = someVector.begin();
// end = someVector.end();
// batchDispatcher = boost::function<void(It, It)>(...);
boost::thread_group processors;
// create a dispatching thread every ASYNCH_PROCESSING_THRESHOLD notifications
while (end - begin > ASYNCH_PROCESSING_THRESHOLD)
{
    NotifItr split = begin + ASYNCH_PROCESSING_THRESHOLD;
    processors.create_thread(boost::bind(batchDispatcher, begin, split));
    begin = split;
}
// create a dispatching thread for the remainder
if (begin < end)
{
    processors.create_thread(boost::bind(batchDispatcher, begin, end));
}
// wait for parallel processing to finish
processors.join_all();
But I have a problem with this: when I have lots of data, this code generates lots of threads (more than 40), which keeps the processor busy with context switching between threads.
My question is this: is it possible to call create_thread on the thread_group after the call to join_all?
That is, can I change my code to this?
boost::thread_group processors;
size_t processorThreads = 0; // NEW CODE
// create a dispatching thread every ASYNCH_PROCESSING_THRESHOLD notifications
while (end - begin > ASYNCH_PROCESSING_THRESHOLD)
{
    NotifItr split = begin + ASYNCH_PROCESSING_THRESHOLD;
    processors.create_thread(boost::bind(batchDispatcher, begin, split));
    begin = split;
    if (++processorThreads >= MAX_ASYNCH_PROCESSORS) // NEW CODE
    {                                                // NEW CODE
        processors.join_all();                       // NEW CODE
        processorThreads = 0;                        // NEW CODE
    }                                                // NEW CODE
}
// ...
Whoever has experience with this, thanks for any insight.
I believe this is not possible. The solution you want might actually be to implement a producer-consumer or a master-worker pattern (the main 'master' thread divides the work into several fixed-size tasks, creates a pool of 'worker' threads, and sends one task to each worker until all tasks are done).
These solutions demand some synchronization through semaphores, but they balance the load well: you can create one thread for each available core in the machine, avoiding time wasted on context switches.
Another not-so-good-and-fancy option is to join one thread at a time: keep a vector with 4 active threads, join one, and create another. The problem with this approach is that you may waste processing time if your tasks are heterogeneous. A sketch of the master-worker idea follows.
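As a minimal sketch of the master-worker idea, using std::thread rather than boost (an assumption; the same structure works with boost::thread_group), with a hypothetical process_in_batches helper:
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Fixed-size pool: each worker atomically claims the next batch index,
// so no batch is processed twice and no worker idles while work remains.
template <typename It, typename Dispatcher>
void process_in_batches(It begin, It end, std::size_t batchSize, Dispatcher dispatch)
{
    const std::size_t total = static_cast<std::size_t>(end - begin);
    const std::size_t numBatches = (total + batchSize - 1) / batchSize;
    std::atomic<std::size_t> nextBatch{0};

    auto worker = [&] {
        for (;;) {
            const std::size_t b = nextBatch.fetch_add(1);
            if (b >= numBatches) return;           // no work left
            It first = begin + b * batchSize;
            It last = (b + 1 == numBatches) ? end : first + batchSize;
            dispatch(first, last);                 // process one batch
        }
    };

    // One thread per available core avoids oversubscription.
    const unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < cores; ++i)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}
With this shape, the code from the question would call process_in_batches(begin, end, ASYNCH_PROCESSING_THRESHOLD, batchDispatcher) instead of creating one thread per batch.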