My C++ program calculates many features for 3D points. This takes a while, so I do the calculations in multiple threads. At the end, the features from all threads have to be stored in one file.
On my local machine the multithreaded implementation was a great success (4 threads reduced the runtime by 73%).
However, on my server (40 slow 2 GHz cores, 80 hardware threads) it's even slower than with 4 threads on my local machine.
Runtimes:
Local 1 Core: 7.5 minutes
Local 4 Core: 2 minutes
Server 80 Threads: 3.1 minutes (slower than my local 4 cores)
Server 20 Threads: 6.2 minutes
Server 4 Threads: 4.75 minutes (interesting - it's faster with fewer threads)
My code is appended.
I tried:
Making the critical section smaller/faster by building a string within each thread and only writing it to the file inside the critical section: no improvement
Only writing the results to disk at the very end: no improvement (I thought it could be disk I/O)
Using schedule(guided) for the OpenMP loop to get bigger chunk sizes: no improvement
...
std::vector<double*> points;
for(unsigned int j = 0; j < xyz.size(); j++) {
points.push_back(new double[3]{xyz[j][0], xyz[j][1], xyz[j][2]});
}
ofstream featuresOut;
featuresOut.open(...);
...
KDtree t(&points[0], points.size()); // Build tree
float batchSize = ((float)points.size())/jobs;
unsigned int first = job * batchSize;
unsigned int last = ((job+1) * batchSize) - 1;
// Generate features
#ifdef _OPENMP
omp_set_num_threads(OPENMP_NUM_THREADS);
#pragma omp parallel for schedule(dynamic)
#endif
for(unsigned int r = first; r <= last; r++) {
if (r % 100000 == 0) {
cout << "Calculating features for point nr. " << r << endl;
}
#ifdef _OPENMP
int thread_num = omp_get_thread_num();
#else
int thread_num = 0;
#endif
double features[FEATURE_VECTOR_SIZE];
if (!(ignoreClass0 && type.valid() && type[r]==0)) {
double reference[3] = {xyz[r][0], xyz[r][1], xyz[r][2]};
vector<Point> neighbors = t.kNearestNeighbors(reference, kMax, thread_num); // here we have an ordered set of at most kMax neighbors (could be fewer for small scans)
//std::vector<double> features = generateNeighborhoodFeatures(t, reference, kMin, kMax, kDelta, cylRadius);
unsigned int kOpt = determineKOpt(t, reference, neighbors, kMin, kMax, kDelta);
generateNeighborhoodFeatures(features, t, reference, neighbors, kOpt, cylRadius, false, thread_num);
#pragma omp critical
{
featuresOut << xyz[r][0] << "," << xyz[r][1] << "," << xyz[r][2] << ",";
featuresOut << kOpt;
for(unsigned int j = 0; j < FEATURE_VECTOR_SIZE; j++) {
if (isfinite(features[j])) {
featuresOut << "," << features[j];
}
else {
cout << "Attention! Feature " << j << " (+5) was not finite. Replaced it with a very big number" << endl;
featuresOut << "," << DBL_MAX;
}
}
featuresOut << ",";
if (type.valid()) {
featuresOut << type[r];
} else {
featuresOut << 0;
}
featuresOut << endl;
}
}
}
Only writing to disk at the very end (aggregating the threads' results) does not solve the problem (see the answer by J.Svejda). Keeping one KDtree per thread also gives no speedup.
Thanks for your help.
I believe it's because of your critical section. Writing to disk usually takes much longer than computing something on the CPU. I don't know the complexity of the work you do on the KD-tree, but a write to disk can take milliseconds, whereas CPU instructions are on the order of nanoseconds. Even though the data probably isn't flushed to featuresOut until you send endl, this would explain your poor scaling: the critical section simply takes too long, and the threads have to wait for each other.
Maybe increase the amount of work per thread, say a thread handles 5% of the points, and then output aggregated data for all of those points in one go. See if it improves the speedup.
Related
I'm working on a program that calculates the Recamán sequence. I want to calculate, and later visualize, thousands or millions of terms of the sequence. However, I've noticed that it takes up 10% of the CPU, and the task manager says it has very high power usage. I don't want to damage the computer, and am willing to sacrifice speed for its safety. Is there a way to limit the CPU usage or battery consumption of this application?
This is for Windows 10.
//My Function for calculating the sequence
//If it helps, you could look up 'Recaman Sequence' on google
void processSequence(int numberOfTerms) {
int* terms;
terms = new int[numberOfTerms];
terms[0] = 0;
cout << "Term Number " << 0 << " is: " << 0 << endl;
int currentTermNumber = 1;
int lastTerm = 0;
int largestTerm = 0;
for (; currentTermNumber < numberOfTerms; currentTermNumber++) {
    int thisTerm;
    bool termTaken = false;
    for (int j = 0; j < currentTermNumber; j++) { // only terms computed so far are valid
        if (terms[j] == lastTerm - currentTermNumber) {
            termTaken = true;
            break;
        }
    }
    if (!termTaken && lastTerm - currentTermNumber > 0) {
        thisTerm = lastTerm - currentTermNumber;
    }
    else {
        thisTerm = lastTerm + currentTermNumber;
    }
    if (thisTerm > largestTerm) {
        largestTerm = thisTerm;
    }
    terms[currentTermNumber] = thisTerm; // remember the term for later lookups
    lastTerm = thisTerm;
    cout << "Term Number " << currentTermNumber << " is: " << thisTerm << endl;
}
cout << "The Largest Term Number Was: " << largestTerm << endl;
delete[] terms;
}
The simplest way to use less CPU is to sleep from time to time for a short period, for example one or a few milliseconds.
You can do so by calling Sleep (Windows API), or std::this_thread::sleep_for (standard since C++11).
However,
You won't physically damage your computer by using all cores at 100%; most video games do so greedily anyway. The worst that can happen is an abrupt shutdown and the inability to turn the machine on again for the next few minutes, in case the CPU has reached its limit temperature (80-100°C). This safety mechanism exists precisely to prevent anything dangerous and/or unrecoverable.
It rarely makes sense to intentionally slow down your program like this. If you are experiencing slowness in the user interface, you should move the intensive processing to a non-UI thread.
To make a long story short, I ran into the Monty Hall problem and was interested in throwing something together so I could test it computationally. That worked out fine, but in the process I got curious about multithreading in C++. I'm a CS student, but I've only covered that topic briefly, in a different language. I wanted to see if I could use some of my extra CPU cores to make the Monty Hall simulation go a bit faster.
It seems like I got it working, but alas it doesn't actually have any performance increase. The program performs a large number of iterations over a simple function that essentially boils down to a few rand_r() calls and a couple comparisons. I would expect it to be a trivial example of something that could be split between threads, basically just having each thread handle an equal fraction of the total iterations.
I'm just trying to understand this, and I'm wondering if I'm making a mistake or if there's something going on in the background that's multithreading the execution even if I'm only specifying 1 thread in the code.
Anyway, take a look and share your thoughts. Please also bear in mind that I'm just doing this as a learning experience and didn't originally plan for anyone else to read it :D
#include <cstdlib>
#include <climits>
#include <ctime>
#include <iostream>
#include <thread>
#include <chrono>
enum strategy {STAY = 0, SWITCH = 1};
unsigned ITERATIONS = 1;
unsigned THREADS = 5;
struct counts
{
unsigned stay_correct_c;
unsigned switch_correct_c;
};
void simulate (struct counts&, unsigned&);
bool game (enum strategy, unsigned&);
int main (int argc, char **argv)
{
if (argc < 2)
std::cout << "Usage: " << argv[0] << " -i [t|s|m|l|x] -t [1|2|4|5|10]\n", exit(1);
if (argv[1][1] == 'i') {
switch (argv[2][0]) {
case 's':
ITERATIONS = 1000;
break;
case 'm':
ITERATIONS = 100000;
break;
case 'l':
ITERATIONS = 10000000;
break;
case 'x':
ITERATIONS = 1000000000;
break;
default:
std::cerr << "Invalid argument.\n", exit(1);
}
}
if (argv[3][1] == 't') {
switch (argv[4][0])
{
case '1':
if (argv[4][1] != '0')
THREADS = 1;
else if (argv[4][1] == '0')
THREADS = 10;
break;
case '2':
THREADS = 2;
break;
case '4':
THREADS = 4;
break;
case '5':
THREADS = 5;
break;
}
}
srand(time(NULL));
auto start = std::chrono::high_resolution_clock::now();
struct counts total_counts;
total_counts.stay_correct_c = 0;
total_counts.switch_correct_c = 0;
struct counts per_thread_count[THREADS];
std::thread* threads[THREADS];
unsigned seeds[THREADS];
for (unsigned i = 0; i < THREADS; ++i) {
seeds[i] = rand() % UINT_MAX;
threads[i] = new std::thread (simulate, std::ref(per_thread_count[i]), std::ref(seeds[i]));
}
for (unsigned i = 0; i < THREADS; ++i) {
std::cout << "Waiting for thread " << i << " to finish...\n";
threads[i]->join();
}
for (unsigned i = 0; i < THREADS; ++i) {
total_counts.stay_correct_c += per_thread_count[i].stay_correct_c;
total_counts.switch_correct_c += per_thread_count[i].switch_correct_c;
}
auto stop = std::chrono::high_resolution_clock::now();
std::cout <<
"The simulation performed " << ITERATIONS <<
" iterations on " << THREADS << " threads of both the stay and switch strategies " <<
"taking " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() <<
" ms." << std::endl <<
"Score:" << std::endl <<
" Stay Strategy: " << total_counts.stay_correct_c << std::endl <<
" Switch Strategy: " << total_counts.switch_correct_c << std::endl << std::endl <<
"Ratios:" << std::endl <<
" Stay Strategy: " << (double)total_counts.stay_correct_c / (double)ITERATIONS << std::endl <<
" Switch Strategy: " << (double)total_counts.switch_correct_c / (double)ITERATIONS << std::endl << std::endl;
}
void simulate (struct counts& c, unsigned& seed)
{
c.stay_correct_c = 0;
c.switch_correct_c = 0;
for (unsigned i = 0; i < (ITERATIONS / THREADS); ++i) {
if (game (STAY, seed))
++c.stay_correct_c;
if (game (SWITCH, seed))
++c.switch_correct_c;
}
}
bool game (enum strategy player_strat, unsigned& seed)
{
unsigned correct_door = rand_r(&seed) % 3;
unsigned player_choice = rand_r(&seed) % 3;
unsigned elim_door;
do {
elim_door = rand_r(&seed) % 3;
}
while ((elim_door == correct_door) || (elim_door == player_choice)); // keep drawing until the eliminated door is neither the prize nor the pick
seed = rand_r(&seed);
if (player_strat == SWITCH) {
do
player_choice = (player_choice + 1) % 3;
while (player_choice == elim_door); // switch to the door that is neither the original pick nor the eliminated one
return correct_door == player_choice;
}
else
return correct_door == player_choice;
}
Edit: Going to add some supplementary information on the suggestion of some solid comments below.
I'm running on a 6 core/12 thread AMD Ryzen r5 1600. Htop shows the number of logical cores at high utilization that you would expect from the command line arguments. Number of PID's was the same as the number of threads specified plus one, and the number of logical cores with utilization ~= 100% was the same as number of threads specified in every case.
In terms of numbers, here are some data that I gathered using the l flag for a large number of iterations:
CORES AVG MIN MAX
1 102541 102503 102613
4 90183 86770 96248
10 72119 63581 91438
With something as simple to divide as this program, I would have expected to see a linear decrease in total time as I added threads, but I'm clearly missing something. My thinking was that if 1 thread could perform x simulations in y time, that thread should be able to perform x/4 simulations in y/4 time. What am I misunderstanding here?
Edit 2: I should add that as the code exists above, the difference in time was less noticeable with different threads, but I made a couple small optimizations that made the delta a little larger.
Thanks for posting the code; it doesn't compile on my machine (Apple LLVM version 9.0.0, clang-900.0.39.2). Love standards.
I hacked it into a C version, and your problem appears to be false sharing; that is, each thread hits its "seed" entry a lot, but because memory caches aggregate adjacent locations into "lines", your CPUs spend all of their time copying these lines back and forth. If you change your definition of "seed" to something like:
struct myseed {
unsigned seed;
unsigned dont_share_me[15];
};
you should see the scalability you expect. You might want to do the same to your struct counts.
Typically, malloc makes this adjustment for you, so if you stamp your ‘per thread’ context into a bag and malloc it, it returns properly cache aligned locations.
I want to know how to properly implement a program in C++, in which I have a function func that I want to be executed in a single thread. I want to do this, because I want to test the Single Core Speed of my CPU. I will loop this function(func) for about 20 times, and record the execution time of each repetition, then I will sum the results and get the average execution time.
#include <thread>
int func(long long x)
{
int div = 0;
for(long i = 1; i <= x / 2; i++)
if(x % i == 0)
div++;
return div + 1;
}
int main()
{
std::thread one_thread (func,100000000);
one_thread.join();
return 0;
}
So , in this program, does the func is executed on a single particular core ?
Here is the source code of my program:
#include <iostream>
#include <thread>
#include <iomanip>
#include <windows.h>
#include "font.h"
#include "timer.h"
using namespace std;
#define steps 20
int func(long long x)
{
int div = 0;
for(long i = 1; i <= x / 2; i++)
if(x % i == 0)
div++;
return div + 1;
}
int main()
{
SetFontConsolas(); // Set font consolas
ShowConsoleCursor(false); // Turn off the cursor
timer t;
short int number = 0;
cout << number << "%";
for(int i = 0 ; i < steps ; i++)
{
t.restart(); // start recording
std::thread one_thread (func,100000000);
one_thread.join(); // wait function return
t.stop(); // stop recording
t.record(); // save the time in vector
number += 5;
cout << "\r ";
cout << "\r" << number << "%";
}
double time = 0.0;
for(int i = 0 ; i < steps ; i++)
time += t.times[i]; // sum all recorded times
time /= steps; // get the average execution time
cout << "\nExecution time: " << fixed << setprecision(4) << time << '\n';
double score = 0.0;
score = (1.0 * 100) / time; // calculating benchmark score
cout << "Score: ";
SetColor(12);
cout << setprecision(2) << score << " pts";
SetColor(15);
cout << "\nPress any key to continue.\n";
cin.get();
return 0;
}
No, your program has at least two threads: main, and the one you've created to run func. Moreover, neither of these threads is guaranteed to execute on a particular core; depending on the OS scheduler they may switch cores in unpredictable ways, though the main thread will mostly just wait. If you want to lock thread execution to a particular core, you need to set the thread's core affinity with some platform-specific method, such as SetThreadAffinityMask on Windows. But you don't really need to go that deep, because there is no core-switch-sensitive code in your example. There is not even a need to spawn a separate thread dedicated to the calculations.
If your program doesn't have multiple threads in the source and if the compiler does not insert automatic parallelization, the program should run on a single core (at a time).
Now depending on your compiler you can use appropriate optimization levels to ensure that it doesn't parallelize.
On the other hand what might happen is that the compiler can completely eliminate the loop in the function if it can statically compute the result. That however doesn't seem to be the issue with your case.
I don't think any C++ compiler makes use of multiple cores behind your back. There would be large language issues in doing that. If you neither spawn threads nor use a parallel library such as MPI, the program should execute on only one core.
Can iterating over an unsorted data structure, like an array or a tree, with multiple threads make it faster?
For example I have big array with unsorted data.
int array[1000];
I'm searching for an element where array[i] == 8.
Can running:
Thread 1:
for(auto i = 0; i < 500; i++)
{
if(array[i] == 8)
std::cout << "found" << std::endl;
}
Thread 2:
for(auto i = 500; i < 1000; i++)
{
if(array[i] == 8)
std::cout << "found" << std::endl;
}
be faster than normal iteration?
#update
I've written a simple test which describes the problem better:
For searching int* array = new int[100000000];
and repeating it 1000 times
I got the result:
a
Number of threads = 2
End of multithread iteration
End of normal iteration
Time with 2 threads 73581
Time with 1 thread 154070
Bool values:0
0
0
Process returned 0 (0x0) execution time : 256.216 s
Press any key to continue.
What's more, when the program was running with 2 threads, the CPU usage of the process was around ~90%, and when iterating with 1 thread it was never more than 50%.
So Smeeheey and erip are right that it can make iteration faster.
Of course it can be trickier for less trivial problems.
And as I've learned from this test, the compiler can optimize away work in the main thread (when I was not printing the booleans that store the results, the search loop in the main thread was skipped entirely), but it will not do that for the other threads.
This is code I have used:
#include<cstdlib>
#include<thread>
#include<ctime>
#include<iostream>
#define SIZE_OF_ARRAY 100000000
#define REPEAT 1000
inline bool threadSearch(int* array){
for(auto i = 0; i < SIZE_OF_ARRAY/2; i++)
if(array[i] == 101) // there is no array[i]==101
return true;
return false;
}
int main(){
int i;
std::cin >> i; // stops program enabling to set real time priority of the process
clock_t with_multi_thread;
clock_t normal;
srand(time(NULL));
std::cout << "Number of threads = "
<< std::thread::hardware_concurrency() << std::endl;
int* array = new int[SIZE_OF_ARRAY];
bool true_if_found_t1 =false;
bool true_if_found_t2 =false;
bool true_if_found_normal =false;
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
array[i] = rand()%100;
with_multi_thread=clock();
for(auto j=0; j<REPEAT; j++){
std::thread t([&](){
if(threadSearch(array))
true_if_found_t1=true;
});
std::thread u([&](){
if(threadSearch(array+SIZE_OF_ARRAY/2))
true_if_found_t2=true;
});
if(t.joinable())
t.join();
if(u.joinable())
u.join();
}
with_multi_thread=(clock()-with_multi_thread);
std::cout << "End of multithread iteration" << std::endl;
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
array[i] = rand()%100;
normal=clock();
for(auto j=0; j<REPEAT; j++)
for(auto i = 0; i < SIZE_OF_ARRAY; i++)
if(array[i] == 101) // there is no array[i]==101
true_if_found_normal=true;
normal=(clock()-normal);
std::cout << "End of normal iteration" << std::endl;
std::cout << "Time with 2 threads " << with_multi_thread<<std::endl;
std::cout << "Time with 1 thread " << normal<<std::endl;
std::cout << "Bool values:" << true_if_found_t1<<std::endl
<< true_if_found_t2<<std::endl
<<true_if_found_normal<<std::endl;// showing bool values to prevent compiler from optimization
return 0;
}
The answer is yes, it can make it faster, but not necessarily. In your case, when you're iterating over a pretty small array, it is likely that the overhead of launching new threads will be much higher than the benefit gained. If your array were much bigger, this overhead would shrink as a proportion of the overall runtime and eventually become worth it. Note that you will only get a speedup if your system has more than one physical core available.
Additionally, you should note that whilst the code that reads the array in your case is perfectly thread-safe, writing to std::cout is not (you will get very strange-looking output if you try this). Instead, each thread could do something like return an integer indicating the number of instances found.
I am trying to learn how to use openmp for multi threading.
Here is my code:
#include <iostream>
#include <math.h>
#include <omp.h>
//#include <time.h>
//#include <cstdlib>
using namespace std;
bool isprime(long long num);
int main()
{
cout << "There are " << omp_get_num_procs() << " cores." << endl;
cout << 2 << endl;
//clock_t start = clock();
//clock_t current = start;
#pragma omp parallel num_threads(6)
{
#pragma omp for schedule(dynamic, 1000)
for(long long i = 3LL; i <= 1000000000000; i = i + 2LL)
{
/*if((current - start)/CLOCKS_PER_SEC > 60)
{
exit(0);
}*/
if(isprime(i))
{
cout << i << " Thread: " << omp_get_thread_num() << endl;
}
}
}
}
bool isprime(long long num)
{
if(num == 1)
{
return 0;
}
for(long long i = 2LL; i <= sqrt(num); i++)
{
if (num % i == 0)
{
return 0;
}
}
return 1;
}
The problem is that I want OpenMP to automatically create a number of threads based on how many cores are available. If I take out num_threads(6), it just uses one thread, even though omp_get_num_procs() correctly outputs 64.
How do I get this to work?
You neglected to mention which compiler and OpenMP implementation you are using. I'm going to guess you're using one of the ones, like PGI, which does not automatically assume the number of threads to create in a default parallel region unless asked to do so. Since you did not specify the compiler I cannot be certain that these options will actually help you, but for PGI's compilers the necessary option is -mp=allcores when compiling and linking the executable. With that added, it will cause the system to create one thread per core for parallel regions which do not specify the number of threads or have the appropriate environment variable set.
The number you're getting from omp_get_num_procs is used by default to set the limit on the number of threads, but not necessarily the number created. If you want to dynamically set the number created, set the environment variable OMP_NUM_THREADS to the desired number before running your application and it should behave as expected.
I'm not sure if I understand your question correctly, but it seems that you are almost there. Do you mean something like:
#include <omp.h>
#include <iostream>
int main(){
const int num_procs = omp_get_num_procs();
std::cout<<num_procs;
#pragma omp parallel for num_threads(num_procs) default(none)
for(int i=0; i<1000000000; ++i){ // note: (int)1E20 would overflow int; use a bound that fits
}
return 0;
}
Unless I'm rather badly mistaken, OpenMP normally serializes I/O (at least to a single stream) so that's probably at least part of where your problem is arising. Removing that from the loop, and massaging a bit of the rest (not much point in working at parallelizing until you have reasonably efficient serial code), I end up with something like this:
#include <iostream>
#include <math.h>
#include <omp.h>
using namespace std;
bool isprime(long long num);
int main()
{
unsigned long long total = 0;
cout << "There are " << omp_get_num_procs() << " cores.\n";
#pragma omp parallel for reduction(+:total)
for(long long i = 3LL; i < 100000000; i += 2LL)
if(isprime(i))
total += i;
cout << "Total: " << total << "\n";
}
bool isprime(long long num) {
if (num == 2)
return 1;
if(num == 1 || num % 2 == 0)
return 0;
unsigned long long limit = sqrt(num);
for(long long i = 3LL; i <= limit; i+=2)
if (num % i == 0)
return 0;
return 1;
}
This doesn't print out the thread number, but timing it I get something like this:
Real 78.0686
User 489.781
Sys 0.125
Note that the "User" time is more than 6x the "Real" time, indicating that the load is being distributed across the 8 cores available on this machine at about 80% efficiency. With a little more work you might improve that further, but even this simple version uses considerably more than one core (on your 64-core machine, we should see at least a 50:1 improvement over single-threaded code, and probably quite a bit better than that).
The only problem I see with your code is that you need to put the output in a critical section, otherwise multiple threads can write to the same line at the same time.
See my code corrections.
In terms of seeing only one thread, I think what you might be observing is due to using dynamic scheduling. A thread running over small numbers is much quicker than one running over large numbers. When the thread with small numbers finishes and gets another chunk of small numbers, it finishes quickly again, while the thread with large numbers is still running. This does not mean you're only running one thread, though. In my output I see long streams of the same thread finding primes, but eventually others report as well. You have also set the chunk size to 1000, so if, for example, you only ran over 1000 numbers, only one thread would be used in the loop.
It looks to me like you're trying to find a list of primes or a sum of the number of primes. You're using trial division for that, which is much less efficient than the Sieve of Eratosthenes.
Here is an example of the Sieve of Eratosthenes which finds the primes among the first billion numbers in less than one second on my 4-core system with OpenMP.
http://create.stephan-brumme.com/eratosthenes/
I cleaned up your code a bit but did not try to optimize anything since the algorithm is inefficient anyway.
int main() {
//long long int n = 1000000000000;
long long int n = 1000000;
cout << "There are " << omp_get_num_procs() << " cores." << endl;
double dtime = omp_get_wtime();
#pragma omp parallel
{
#pragma omp for schedule(dynamic)
for(long long i = 3LL; i <= n; i = i + 2LL) {
if(isprime(i)) {
#pragma omp critical
{
cout << i << "\tThread: " << omp_get_thread_num() << endl;
}
}
}
}
dtime = omp_get_wtime() - dtime;
cout << "time " << dtime << endl;
}