I am trying to learn how to use OpenMP for multithreading.
Here is my code:
#include <iostream>
#include <math.h>
#include <omp.h>
//#include <time.h>
//#include <cstdlib>
using namespace std;
bool isprime(long long num);
int main()
{
    cout << "There are " << omp_get_num_procs() << " cores." << endl;
    cout << 2 << endl;
    //clock_t start = clock();
    //clock_t current = start;
    #pragma omp parallel num_threads(6)
    {
        #pragma omp for schedule(dynamic, 1000)
        for(long long i = 3LL; i <= 1000000000000; i = i + 2LL)
        {
            /*if((current - start)/CLOCKS_PER_SEC > 60)
            {
                exit(0);
            }*/
            if(isprime(i))
            {
                cout << i << " Thread: " << omp_get_thread_num() << endl;
            }
        }
    }
}

bool isprime(long long num)
{
    if(num == 1)
    {
        return false;
    }
    for(long long i = 2LL; i <= sqrt(num); i++)
    {
        if (num % i == 0)
        {
            return false;
        }
    }
    return true;
}
The problem is that I want OpenMP to create a number of threads automatically, based on how many cores are available. If I take out num_threads(6), it uses only one thread, yet omp_get_num_procs() correctly reports 64.
How do I get this to work?
You neglected to mention which compiler and OpenMP implementation you are using, so I'm going to guess you're using one of the implementations, like PGI's, that does not automatically choose the number of threads for a default parallel region unless asked to do so. Since you did not specify the compiler, I cannot be certain these options will actually help you, but for PGI's compilers the necessary option is -mp=allcores when compiling and linking the executable. With that added, the runtime will create one thread per core for parallel regions which do not specify the number of threads or have the appropriate environment variable set.
The number you get from omp_get_num_procs is used by default to cap the number of threads, but not necessarily to set the number created. If you want to set the number created dynamically, set the environment variable OMP_NUM_THREADS to the desired number before running your application and it should behave as expected.
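If you would rather set this in code than through the environment, here is a minimal sketch using the standard OpenMP API (not specific to any one compiler):

#include <omp.h>

int main()
{
    // Ask for one thread per available processor in subsequent parallel regions.
    omp_set_num_threads(omp_get_num_procs());

    #pragma omp parallel
    {
        // ... should run with 64 threads on the 64-core machine described above ...
    }
}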
I'm not sure if I understand your question correctly, but it seems that you are almost there. Do you mean something like:
#include <omp.h>
#include <iostream>
int main(){
    const int num_procs = omp_get_num_procs();
    std::cout << num_procs << '\n';
    // Note: the loop bound must fit the loop variable's type; 1E20 does not
    // fit in an int, so use a long long counter for a very large test loop.
    #pragma omp parallel for num_threads(num_procs) default(none)
    for(long long i = 0; i < 100000000000LL; ++i){
    }
    return 0;
}
Unless I'm rather badly mistaken, OpenMP normally serializes I/O (at least to a single stream), so that's probably at least part of where your problem arises. Removing that from the loop, and massaging the rest a bit (there's not much point in working at parallelizing until you have reasonably efficient serial code), I end up with something like this:
#include <iostream>
#include <math.h>
#include <omp.h>
using namespace std;
bool isprime(long long num);
int main()
{
    unsigned long long total = 0;
    cout << "There are " << omp_get_num_procs() << " cores.\n";
    #pragma omp parallel for reduction(+:total)
    for(long long i = 3LL; i < 100000000; i += 2LL)
        if(isprime(i))
            total += i;
    cout << "Total: " << total << "\n";
}

bool isprime(long long num) {
    if (num == 2)
        return true;
    if(num == 1 || num % 2 == 0)
        return false;
    unsigned long long limit = sqrt(num);
    for(long long i = 3LL; i <= limit; i += 2)
        if (num % i == 0)
            return false;
    return true;
}
This doesn't print out the thread number, but timing it I get something like this:
Real 78.0686
User 489.781
Sys 0.125
Note that the "User" time is more than 6x the "Real" time, indicating that the load is being distributed across the 8 cores available on this machine with about 80% efficiency. With a little more work you might be able to improve that further, but even this simple version uses considerably more than one core (on your 64-core machine, you should see at least a 50:1 improvement over single-threaded code, and probably quite a bit better than that).
The only problem I see with your code is that when you do the output you need to put it in a critical section, otherwise multiple threads can write to the same line at the same time.
See my code corrections.
As for seeing only one thread, I think what you're observing is due to using dynamic scheduling. A thread running over small numbers is much quicker than one running over large numbers. When the thread with small numbers finishes and gets another chunk of small numbers to run, it finishes quickly again, while the thread with large numbers is still running. This does not mean you're only running one thread, though. In my output I see long streams of the same thread finding primes, but eventually others report as well. You have also set the chunk size to 1000, so if, for example, you only ran over 1000 numbers, only one thread would be used in the loop.
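As an illustration (a sketch, not the exact loop from the question): with a smaller chunk size, even a short range splits into many chunks, so more threads get a share of the work.

#include <omp.h>
#include <cstdio>

int main()
{
    // Chunk size 10: a loop over only ~500 odd numbers still yields ~50 chunks,
    // so several threads receive work even on a short range.
    #pragma omp parallel for schedule(dynamic, 10)
    for (long long i = 3; i <= 1000; i += 2)
        std::printf("%lld from thread %d\n", i, omp_get_thread_num());
}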
It looks to me like you're trying to find a list of primes or a sum of the number of primes. You're using trial division for that, which is much less efficient than the Sieve of Eratosthenes.
Here is an example of the Sieve of Eratosthenes which finds the primes in the first billion numbers in less than one second on my 4-core system with OpenMP.
http://create.stephan-brumme.com/eratosthenes/
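For reference, here is a minimal serial sketch of the sieve (my own illustration, not the code behind that link):

#include <iostream>
#include <vector>

int main()
{
    const long long n = 1000000;                 // sieve limit; adjust as needed
    std::vector<bool> composite(n + 1, false);
    for (long long i = 2; i * i <= n; ++i)       // cross off multiples up to sqrt(n)
        if (!composite[i])
            for (long long j = i * i; j <= n; j += i)
                composite[j] = true;
    long long count = 0;
    for (long long i = 2; i <= n; ++i)
        if (!composite[i])
            ++count;
    std::cout << count << " primes up to " << n << "\n";
}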
I cleaned up your code a bit but did not try to optimize anything since the algorithm is inefficient anyway.
#include <iostream>
#include <omp.h>
using namespace std;

bool isprime(long long num); // as defined in the question

int main() {
    //long long int n = 1000000000000;
    long long int n = 1000000;
    cout << "There are " << omp_get_num_procs() << " cores." << endl;
    double dtime = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic)
        for(long long i = 3LL; i <= n; i = i + 2LL) {
            if(isprime(i)) {
                #pragma omp critical
                {
                    cout << i << "\tThread: " << omp_get_thread_num() << endl;
                }
            }
        }
    }
    dtime = omp_get_wtime() - dtime;
    cout << "time " << dtime << endl;
}
I made a factoring program that needs to loop as quickly as possible. However, I also want to track the progress with minimal code. To do this, I display the current value of i every second by comparing the difference between time_t start and time_t end against an incrementing marker value.
using namespace std; // cause I'm a noob
// logic stuff
int divisor = 0, marker = 0;
int limit = sqrt(num);
for (int i = 1; i <= limit; i++) // odd number = odd factors
{
    if (num % i == 0)
    {
        cout << "\x1b[2K" << "\x1b[1F" << "\x1b[1E"; // clear, up, down
        if (i != 1)
            cout << "\n";
        divisor = num / i;
        cout << i << "," << divisor << "\n";
    }
    end = time(&end); // PROBLEM HERE
    if ((end - start) > marker)
    {
        cout << "\x1b[2K" << "\x1b[1F" << "\x1b[1E"; // clear, up, down
        cout << "\t\t\t\t" << i;
        marker++;
    }
}
Of course, the actual code is much more optimized and uses boost::multiprecision, but I don't think that's the problem. When I remove the line end = time(&end), I see a performance gain of at least 10%. I'm just wondering how I can track the time (or at least approximate seconds) without unconditionally calling a function on every loop iteration. Or is there a faster function?
You observe: "When I remove the line end = time(&end), I see a performance gain of at least 10%." I am not surprised; reading the time can easily be expensive compared to pure CPU calculations.
I therefore assume that reading the time is what eats the performance you regain by removing the line.
You could estimate the minimum number of iterations your loop completes within a second and then only check the time when a multiple of (half of) that number of iterations has passed.
That is, if you only need to be aware of time at a resolution of seconds, you should do the time-consuming reading of the time only marginally more often than once per second.
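A minimal sketch of that idea (CHECK_INTERVAL and the limit value are placeholders I chose; the progress line stands in for yours):

#include <ctime>
#include <iostream>

int main()
{
    const long limit = 100000000;        // stand-in for your loop bound
    const long CHECK_INTERVAL = 1000000; // tune to roughly a fraction of a second of work
    time_t start = time(nullptr);
    int marker = 0;
    for (long i = 1; i <= limit; i++)
    {
        // ... factoring work would go here ...
        if (i % CHECK_INTERVAL == 0)     // read the clock only rarely
        {
            time_t now = time(nullptr);
            if ((now - start) > marker)
            {
                std::cout << "\t\t\t\t" << i << "\n"; // progress output
                marker++;
            }
        }
    }
}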
I would use a totally different approach, where you separate the measurement/display code from the loop completely and even run it on another thread.
Live demo here: https://onlinegdb.com/8nNsGy7EX
#include <iostream>
#include <chrono>   // for all things time
#include <atomic>   // for std::atomic
#include <thread>   // for std::this_thread::sleep_for
#include <future>   // for std::async, which allows us to run functions on other threads

void function()
{
    const std::size_t max_loop_count{ 500 };
    std::atomic<std::size_t> n{ 0ul }; // make access to the loop counter threadsafe
    // start another thread that will do the reporting independent of the
    // actual work you are doing in your loop.
    // for this, capture n (the loop counter) by reference (so this thread can look at it)
    auto future = std::async(std::launch::async, [&n, max_loop_count]
    {
        while (n < max_loop_count)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            std::cout << "\rprogress = " << (100 * n) / max_loop_count << "%" << std::flush;
        }
    });
    // do not initialize n here again, since we share it with the reporting thread
    for (; n < max_loop_count; n++)
    {
        // do your loop's work; just a short sleep here to mimic actual work
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    // synchronize with the reporting thread
    future.get();
}

int main()
{
    function();
    return 0;
}
If you have any questions regarding this example let me know.
My C++ program calculates lots of features for 3D points. This takes a while, so I do the calculations in different threads. At the end, all features (of all threads) have to be stored in one file.
On my local machine the multithreaded implementation was a great success (4 threads -> runtime reduced by 73%).
However, on my server (40 slow 2GHz cores, 80 threads) it's even slower than with my local 4 threads.
Runtimes:
Local 1 Core: 7.5 minutes
Local 4 Core: 2 minutes
Server 80 Threads: 3.1 minutes (slower than my local 4 cores)
Server 20 Threads: 6.2 minutes
Server 4 Threads: 4.75 minutes (interesting - it's faster with fewer threads)
My code is appended.
I tried:
Making the critical part smaller/faster by building a string within each thread and only writing it to the file inside the critical part: no improvement
Only writing the results to disk at the very end: no improvement (I thought it could be disk I/O)
Using schedule(guided) for the OpenMP loop to get bigger chunk sizes: no improvement
...
std::vector<double*> points;
for(unsigned int j = 0; j < xyz.size(); j++) {
    points.push_back(new double[3]{xyz[j][0], xyz[j][1], xyz[j][2]});
}
ofstream featuresOut;
featuresOut.open(...);
...
KDtree t(&points[0], points.size()); // Build tree
float batchSize = ((float)points.size())/jobs;
unsigned int first = job * batchSize;
unsigned int last = ((job+1) * batchSize) - 1;
// Generate features
#ifdef _OPENMP
omp_set_num_threads(OPENMP_NUM_THREADS);
#pragma omp parallel for schedule(dynamic)
#endif
for(unsigned int r = first; r <= last; r++) {
    if (r % 100000 == 0) {
        cout << "Calculating features for point nr. " << r << endl;
    }
#ifdef _OPENMP
    int thread_num = omp_get_thread_num();
#else
    int thread_num = 0;
#endif
    double features[FEATURE_VECTOR_SIZE];
    if (!(ignoreClass0 && type.valid() && type[r]==0)) {
        double reference[3] = {xyz[r][0], xyz[r][1], xyz[r][2]};
        vector<Point> neighbors = t.kNearestNeighbors(reference, kMax, thread_num); // here we have an ordered set of kMax neighbors (at maximum - could be less for small scans)
        //std::vector<double> features = generateNeighborhoodFeatures(t, reference, kMin, kMax, kDelta, cylRadius);
        unsigned int kOpt = determineKOpt(t, reference, neighbors, kMin, kMax, kDelta);
        generateNeighborhoodFeatures(features, t, reference, neighbors, kOpt, cylRadius, false, thread_num);
        #pragma omp critical
        {
            featuresOut << xyz[r][0] << "," << xyz[r][1] << "," << xyz[r][2] << ",";
            featuresOut << kOpt;
            for(unsigned int j = 0; j < FEATURE_VECTOR_SIZE; j++) {
                if (isfinite(features[j])) {
                    featuresOut << "," << features[j];
                }
                else {
                    cout << "Attention! Feature " << j << " (+5) was not finite. Replaced it with very big number" << endl;
                    featuresOut << "," << DBL_MAX;
                }
            }
            featuresOut << ",";
            if (type.valid()) {
                featuresOut << type[r];
            } else {
                featuresOut << 0;
            }
            featuresOut << endl;
        }
    }
}
Only writing to disk at the very end (aggregating the results of all threads) does not solve the problem (see the answer by @J.Svejda). Keeping one KDtree per thread also results in no speedup.
Thanks for your help.
I believe it's because of your critical section. Writing to disk usually takes much longer than computing something on the CPU. I don't know the complexity of the work you do on the KDtree, but writing to disk can take milliseconds, whereas CPU instructions are on the order of nanoseconds. Though the data probably does not get flushed to featuresOut until you send an endl, this would explain your poor scaling: the critical section simply takes too long, and the threads have to wait for each other.
Maybe increase the amount of work per thread, so that, say, a thread handles 5% of the points.
It would then output aggregated data from more points to the file in one go. See if that improves the speedup.
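A rough sketch of that suggestion (featuresOut and the row formatting are placeholders standing in for the code in the question; the chunk size is a guess to tune): each thread formats rows into its own buffer and only enters the critical section to flush a large chunk at once.

#include <fstream>
#include <sstream>
#include <omp.h>

void writeFeatures(std::ofstream& featuresOut, long long numPoints)
{
    #pragma omp parallel
    {
        std::ostringstream buffer;                  // private to each thread
        long long buffered = 0;
        #pragma omp for schedule(dynamic)
        for (long long r = 0; r < numPoints; ++r)
        {
            buffer << r << ",...features for point r...\n"; // placeholder row
            if (++buffered == 50000)                // flush in large chunks
            {
                #pragma omp critical
                featuresOut << buffer.str();
                buffer.str("");
                buffered = 0;
            }
        }
        #pragma omp critical                        // flush the remainder
        featuresOut << buffer.str();
    }
}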
To make a long story short, I ran into the Monty Hall problem and was interested in throwing something together so I could test it computationally. That worked out fine, but in the process I got curious about multithreading applications in C++. I'm a CS student, but I've only covered that topic briefly, and in a different language. I wanted to see if I could utilize some of my extra CPU cores to make the Monty Hall simulation go a bit faster.
It seems like I got it working, but alas there isn't actually any performance increase. The program performs a large number of iterations of a simple function that essentially boils down to a few rand_r() calls and a couple of comparisons. I would expect it to be a trivial example of something that could be split between threads, basically just having each thread handle an equal fraction of the total iterations.
I'm just trying to understand this, and I'm wondering if I'm making a mistake, or if there's something going on in the background that's multithreading the execution even when I only specify 1 thread in the code.
Anyway, take a look and share your thoughts. Please also bear in mind that I'm just doing this as a learning experience and didn't originally plan for anyone else to read it :D
#include <cstdlib>
#include <climits>
#include <ctime>
#include <iostream>
#include <thread>
#include <chrono>

enum strategy {STAY = 0, SWITCH = 1};

unsigned ITERATIONS = 1;
unsigned THREADS = 5;

struct counts
{
    unsigned stay_correct_c;
    unsigned switch_correct_c;
};

void simulate (struct counts&, unsigned&);
bool game (enum strategy, unsigned&);

int main (int argc, char **argv)
{
    if (argc < 2)
        std::cout << "Usage: " << argv[0] << " -i [t|s|m|l|x] -t [1|2|4|5|10]\n", exit(1);
    if (argv[1][1] == 'i') {
        switch (argv[2][0]) {
        case 's':
            ITERATIONS = 1000;
            break;
        case 'm':
            ITERATIONS = 100000;
            break;
        case 'l':
            ITERATIONS = 10000000;
            break;
        case 'x':
            ITERATIONS = 1000000000;
            break;
        default:
            std::cerr << "Invalid argument.\n", exit(1);
        }
    }
    if (argv[3][1] == 't') {
        switch (argv[4][0])
        {
        case '1':
            if (argv[4][1] != '0')
                THREADS = 1;
            else if (argv[4][1] == '0')
                THREADS = 10;
            break;
        case '2':
            THREADS = 2;
            break;
        case '4':
            THREADS = 4;
            break;
        case '5':
            THREADS = 5;
            break;
        }
    }
    srand(time(NULL));
    auto start = std::chrono::high_resolution_clock::now();
    struct counts total_counts;
    total_counts.stay_correct_c = 0;
    total_counts.switch_correct_c = 0;
    struct counts per_thread_count[THREADS];
    std::thread* threads[THREADS];
    unsigned seeds[THREADS];
    for (unsigned i = 0; i < THREADS; ++i) {
        seeds[i] = rand() % UINT_MAX;
        threads[i] = new std::thread (simulate, std::ref(per_thread_count[i]), std::ref(seeds[i]));
    }
    for (unsigned i = 0; i < THREADS; ++i) {
        std::cout << "Waiting for thread " << i << " to finish...\n";
        threads[i]->join();
    }
    for (unsigned i = 0; i < THREADS; ++i) {
        total_counts.stay_correct_c += per_thread_count[i].stay_correct_c;
        total_counts.switch_correct_c += per_thread_count[i].switch_correct_c;
    }
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout <<
        "The simulation performed " << ITERATIONS <<
        " iterations on " << THREADS << " threads of both the stay and switch strategies " <<
        "taking " << std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count() <<
        " ms." << std::endl <<
        "Score:" << std::endl <<
        " Stay Strategy: " << total_counts.stay_correct_c << std::endl <<
        " Switch Strategy: " << total_counts.switch_correct_c << std::endl << std::endl <<
        "Ratios:" << std::endl <<
        " Stay Strategy: " << (double)total_counts.stay_correct_c / (double)ITERATIONS << std::endl <<
        " Switch Strategy: " << (double)total_counts.switch_correct_c / (double)ITERATIONS << std::endl << std::endl;
}

void simulate (struct counts& c, unsigned& seed)
{
    c.stay_correct_c = 0;
    c.switch_correct_c = 0;
    for (unsigned i = 0; i < (ITERATIONS / THREADS); ++i) {
        if (game (STAY, seed))
            ++c.stay_correct_c;
        if (game (SWITCH, seed))
            ++c.switch_correct_c;
    }
}

bool game (enum strategy player_strat, unsigned& seed)
{
    unsigned correct_door = rand_r(&seed) % 3;
    unsigned player_choice = rand_r(&seed) % 3;
    unsigned elim_door;
    do {
        elim_door = rand_r(&seed) % 3;
    } while ((elim_door != correct_door) && (elim_door != player_choice));
    seed = rand_r(&seed);
    if (player_strat == SWITCH) {
        do
            player_choice = (player_choice + 1) % 3;
        while (player_choice != elim_door);
        return correct_door == player_choice;
    }
    else
        return correct_door == player_choice;
}
Edit: I'm going to add some supplementary information at the suggestion of some solid comments below.
I'm running on a 6-core/12-thread AMD Ryzen R5 1600. htop shows the number of logical cores at high utilization that you would expect from the command-line arguments. The number of PIDs was the number of threads specified plus one, and the number of logical cores at ~100% utilization matched the number of threads specified in every case.
In terms of numbers, here are some data that I gathered using the l flag for a large number of iterations:
CORES AVG MIN MAX
1 102541 102503 102613
4 90183 86770 96248
10 72119 63581 91438
With something as simple to divide as this program, I would have expected to see a linear decrease in total time as I added threads, but I'm clearly missing something. My thinking was that if 1 thread could perform x simulations in y time, then each of 4 threads should be able to perform x/4 simulations in y/4 time, so running them concurrently should finish the whole job in y/4. What am I misunderstanding here?
Edit 2: I should add that as the code exists above, the difference in time was less noticeable with different threads, but I made a couple small optimizations that made the delta a little larger.
Thanks for posting the code; it doesn't compile on my machine (Apple LLVM version 9.0.0 (clang-900.0.39.2)). Love standards.
I hacked it into a C version, and your problem appears to be false sharing; that is, each thread is hitting its "seed" entry a lot, but because memory caches aggregate adjacent locations into "lines", your CPUs spend all of their time copying these lines back and forth. If you change your definition of "seed" to something like:
struct myseed {
    unsigned seed;
    unsigned dont_share_me[15];
};
you should see the scalability you expect. You might want to do the same to your struct counts.
Typically, malloc makes this adjustment for you, so if you stamp your ‘per thread’ context into a bag and malloc it, it returns properly cache aligned locations.
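In C++11 you can express the same padding more directly with alignas; 64 bytes here is an assumption about the cache-line size (C++17 also provides std::hardware_destructive_interference_size in <new> for this):

// Each per-thread slot occupies its own cache line, so threads updating
// adjacent array entries no longer invalidate each other's caches.
struct alignas(64) padded_seed
{
    unsigned seed;
};

struct alignas(64) padded_counts
{
    unsigned stay_correct_c;
    unsigned switch_correct_c;
};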
I'm a beginner at thread usage in C++. I've read the basics about std::thread and mutex, and I think I understand the purpose of using mutexes.
I decided to check whether threads really are so dangerous without mutexes (well, I believe the books, but prefer to see it with my own eyes). As a test case of "what I shouldn't do in the future" I created two versions of the same concept: there are 2 threads, one of them increments a number several times (NUMBER_OF_ITERATIONS), the other decrements the same number the same number of times, so we expect to see the same value after the code is executed as before it. The code is attached.
First I run 2 threads which do it in an unsafe manner - without any mutexes, just to see what can happen. And after this part is finished, I run 2 threads which do the same thing but in a safe manner (with mutexes).
Expected results: without mutexes the result can differ from the initial value, because data can be corrupted if two threads work with it simultaneously. This is especially likely for a huge NUMBER_OF_ITERATIONS, because the probability of corrupting the data is higher. That result I can understand.
I also measured the time spent by both the "safe" and "unsafe" parts. For a huge number of iterations the safe part spends much more time than the unsafe one, as I expected: there is some time spent on mutex checks. But for small numbers of iterations (400, 4000) the safe part's execution time is less than the unsafe one's. Why is that possible? Is it something the operating system does? Or is there some compiler optimization I'm not aware of? I spent some time thinking about it and decided to ask here.
I use Windows and the MSVS12 compiler.
Thus the question is: why can the safe part execute faster than the unsafe one (for small NUMBER_OF_ITERATIONS < 1000*n)?
And another one: why is it related to NUMBER_OF_ITERATIONS: for smaller values (4000) the "safe" part with mutexes is faster, but for huge ones (400000) the "safe" part is slower?
main.cpp
#include <iostream>
#include <vector>
#include <thread>
#include <mutex>
#include <windows.h>
//
///change number of iterations for different results
const long long NUMBER_OF_ITERATIONS = 400;
//
/// time check counter
class Counter{
    double PCFreq_ = 0.0;
    __int64 CounterStart_ = 0;
public:
    Counter(){
        LARGE_INTEGER li;
        if(!QueryPerformanceFrequency(&li))
            std::cerr << "QueryPerformanceFrequency failed!\n";
        PCFreq_ = double(li.QuadPart)/1000.0;
        QueryPerformanceCounter(&li);
        CounterStart_ = li.QuadPart;
    }
    double GetCounter(){
        LARGE_INTEGER li;
        QueryPerformanceCounter(&li);
        return double(li.QuadPart-CounterStart_)/PCFreq_;
    }
};

/// "dangerous" functions for unsafe threads: increment and decrement number
void incr(long long* j){
    for (long long i = 0; i < NUMBER_OF_ITERATIONS; i++) (*j)++;
    std::cout << "incr finished" << std::endl;
}
void decr(long long* j){
    for (long long i = 0; i < NUMBER_OF_ITERATIONS; i++) (*j)--;
    std::cout << "decr finished" << std::endl;
}

///class for safe thread operations with increment and decrement
template<typename T>
class Safe_number {
public:
    Safe_number(int i){number_ = T(i);}
    Safe_number(long long i){number_ = T(i);}
    bool inc(){
        if(m_.try_lock()){
            number_++;
            m_.unlock();
            return true;
        }
        else
            return false;
    }
    bool dec(){
        if(m_.try_lock()){
            number_--;
            m_.unlock();
            return true;
        }
        else
            return false;
    }
    T val(){return number_;}
private:
    T number_;
    std::mutex m_;
};

///
template<typename T>
void incr(Safe_number<T>* n){
    long long i = 0;
    while(i < NUMBER_OF_ITERATIONS){
        if (n->inc()) i++;
    }
    std::cout << "incr <T> finished" << std::endl;
}
///
template<typename T>
void decr(Safe_number<T>* n){
    long long i = 0;
    while(i < NUMBER_OF_ITERATIONS){
        if (n->dec()) i++;
    }
    std::cout << "decr <T> finished" << std::endl;
}

using namespace std;

// run increments and decrements of the same number
// in threads in "safe" and "unsafe" way
int main()
{
    //init numbers to 0
    long long number = 0;
    Safe_number<long long> sNum(number);
    Counter cnt;//init time counter
    //
    //run 2 unsafe threads for ++ and --
    std::thread t1(incr, &number);
    std::thread t2(decr, &number);
    t1.join();
    t2.join();
    //check time of execution of unsafe part
    double time1 = cnt.GetCounter();
    cout <<"finished first thr" << endl;
    //
    // run 2 safe threads for ++ and --, now we expect final value 0
    std::thread t3(incr<long long>, &sNum);
    std::thread t4(decr<long long>, &sNum);
    t3.join();
    t4.join();
    //check time of execution of safe part
    double time2 = cnt.GetCounter() - time1;
    cout << "unsafe part, number = " << number << " time1 = " << time1 << endl;
    cout << "safe part, Safe number = " << sNum.val() << " time2 = " << time2 << endl << endl;
    return 0;
}
You should not draw conclusions about the speed of any given algorithm if the input size is very small. What defines "very small" can be somewhat arbitrary, but on modern hardware, under usual conditions, "small" can refer to any collection of fewer than a few hundred thousand objects, and "large" to any collection bigger than that.
Obviously, Your Mileage May Vary.
In this case, the overhead of constructing threads - which, while usually slow, can also be rather inconsistent - could be a larger factor in the speed of your code than what the actual algorithm is doing. It's also possible that the compiler can apply powerful optimizations for smaller input sizes (which it can know about, since the input size is hard-coded into the program itself) that it cannot perform on larger inputs.
The broader point is that you should always prefer larger inputs when testing algorithm speed, and have the same program repeat its tests (preferably in random order!) to "smooth out" irregularities in the timings.
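A rough sketch of that advice (the helper name, repetition count, and dummy workload are my own choices):

#include <chrono>
#include <iostream>

// Time one run of some callable and return milliseconds.
template <typename F>
double time_ms(F&& work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main()
{
    const int repetitions = 10; // repeat to smooth out timing irregularities
    double total = 0.0;
    for (int i = 0; i < repetitions; ++i)
        total += time_ms([] {
            volatile long long sink = 0; // stand-in for the real workload
            for (long long j = 0; j < 100000000; ++j)
                sink += j;
        });
    std::cout << "average ms: " << total / repetitions << "\n";
}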
I want to know how to properly implement a program in C++ in which I have a function func that I want to be executed in a single thread. I want to do this because I want to test the single-core speed of my CPU. I will loop this function (func) about 20 times, recording the execution time of each repetition; then I will sum the results and take the average execution time.
#include <thread>

int func(long long x)
{
    int div = 0;
    for(long i = 1; i <= x / 2; i++)
        if(x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    std::thread one_thread (func,100000000);
    one_thread.join();
    return 0;
}
So, in this program, is func executed on a single particular core?
Here is the source code of my program:
#include <iostream>
#include <thread>
#include <iomanip>
#include <windows.h>
#include "font.h"
#include "timer.h"
using namespace std;

#define steps 20

int func(long long x)
{
    int div = 0;
    for(long i = 1; i <= x / 2; i++)
        if(x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    SetFontConsolas();        // Set font consolas
    ShowConsoleCursor(false); // Turn off the cursor
    timer t;
    short int number = 0;
    cout << number << "%";
    for(int i = 0 ; i < steps ; i++)
    {
        t.restart();                            // start recording
        std::thread one_thread (func,100000000);
        one_thread.join();                      // wait function return
        t.stop();                               // stop recording
        t.record();                             // save the time in vector
        number += 5;
        cout << "\r ";
        cout << "\r" << number << "%";
    }
    double time = 0.0;
    for(int i = 0 ; i < steps ; i++)
        time += t.times[i]; // sum all recorded times
    time /= steps;          // get the average execution time
    cout << "\nExecution time: " << fixed << setprecision(4) << time << '\n';
    double score = 0.0;
    score = (1.0 * 100) / time; // calculating benchmark score
    cout << "Score: ";
    SetColor(12);
    cout << setprecision(2) << score << " pts";
    SetColor(15);
    cout << "\nPress any key to continue.\n";
    cin.get();
    return 0;
}
No, your program has at least two threads: main, and the one you've created to run func. Moreover, neither of these threads is guaranteed to execute on a particular core; depending on the OS scheduler, they may switch cores in an unpredictable manner (though the main thread will mostly just wait). If you want to lock thread execution to a particular core, you need to set the thread's core affinity by some platform-specific method, such as SetThreadAffinityMask on Windows. But you don't really need to go that deep, because there is no core-switch-sensitive code in your example. There is even no need to spawn a separate thread dedicated to performing the calculations.
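For completeness, a minimal Windows-specific sketch of that (assuming MSVC, where std::thread::native_handle() yields a HANDLE that SetThreadAffinityMask accepts):

#include <windows.h>
#include <thread>

// Stand-in for the func from the question.
int func(long long x)
{
    int div = 0;
    for (long i = 1; i <= x / 2; i++)
        if (x % i == 0)
            div++;
    return div + 1;
}

int main()
{
    std::thread worker(func, 100000000LL);
    // Bit i of the mask selects logical core i; a mask of 1 pins the thread to core 0.
    SetThreadAffinityMask(worker.native_handle(), 1);
    worker.join();
    return 0;
}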
If your program doesn't have multiple threads in the source and the compiler does not insert automatic parallelization, the program should run on a single core (at a time).
Depending on your compiler, you can use appropriate optimization levels to ensure that it doesn't parallelize.
On the other hand, the compiler might completely eliminate the loop in the function if it can statically compute the result. That, however, doesn't seem to be the issue in your case.
I don't think any C++ compiler makes use of multiple cores behind your back; there would be large language issues in doing that. If you neither spawn threads nor use a parallel library such as MPI, the program should execute on only one core.