Is this simple c++ program multithreaded? - c++

I am sort of new to C++, but I was wondering if this basic (and possibly sloppy) program is actually running multiple threads at a time or if it its just pooling:
This is a console application running in visual c++ 2015:
#include <string>
#include "stdafx.h"
#include <iostream>
#include <sstream>
#include <thread>
using namespace std;
#include <stdio.h>
int temp1 = 0;
int num1 = 0;
int temp2 = 0;
int num2 = 0;
void math1() {
int running_total = 23;
for (int i = 0; i < 999999999; i++)
{
running_total = 58 * running_total + i;
}
}
int math2() {
int running_total = 23;
for (int i = 0; i < 999999999; i++)
{
running_total = 58 * running_total + i;
}
return 0;
}
int main()
{
unsigned concurentThreadsSupported = std::thread::hardware_concurrency();
cout << "Current Number of CPU threads: " << concurentThreadsSupported << endl;
thread t1(math1);
thread t2(math2);
t1.join();
t2.join();
cout << "1: " << num1 << endl;
cout << "2: " << num2 << endl;
system("pause");
return 0;
}
I notice when I run the code with thread t1(math1);thread t2(math2);t1.join();t2.join();, it uses 25% total of my cpu for 3.5 seconds, but when I use
thread t1(math1);
t1.join();
thread t2(math2);
t2.join();
it uses ~13% of the CPU for almost 7 seconds.
Is this actually multithreading?

thread t1(math1); thread t2(math2); t1.join(); t2.join(); waits for t1 to finish while also running t2. The math1 and math2 functions do the same thing, so they'll finish approx. at once, which is optimal (it could be just one function as well).
To the numbers you're seeing, you clearly have a CPU with 8 logical cores. The multithreaded version uses two hardware threads (2 / 8) = 25%, while single-threaded just one (1 / 8) = 12,5%. It also runs two times slower, simple.

Related

OpenMP parallel for does not speed up array sum code [duplicate]

This question already has answers here:
C++: Timing in Linux (using clock()) is out of sync (due to OpenMP?)
(3 answers)
Closed 4 months ago.
I'm trying to test the speed up of OpenMP on an array sum program.
The elements are generated using random generator to avoid optimization.
The length of array is also set large enough to indicate the performance difference.
This program is built using g++ -fopenmp -g -O0 -o main main.cpp, -g -O0 are used to avoid optimization.
However OpenMP parallel for code is significant slower than sequential code.
Test result:
Your thread count is: 12
Filling arrays
filling time:66718888
Now running omp code
2thread omp time:11154095
result: 4294903886
Now running omp code
4thread omp time:10832414
result: 4294903886
Now running omp code
6thread omp time:11165054
result: 4294903886
Now running sequential code
sequential time: 3525371
result: 4294903886
#include <iostream>
#include <stdio.h>
#include <omp.h>
#include <ctime>
#include <random>
using namespace std;
long long llsum(char *vec, size_t size, int threadCount) {
long long result = 0;
size_t i;
#pragma omp parallel for num_threads(threadCount) reduction(+: result) schedule(guided)
for (i = 0; i < size; ++i) {
result += vec[i];
}
return result;
}
int main(int argc, char **argv) {
int threadCount = 12;
omp_set_num_threads(threadCount);
cout << "Your thread count is: " << threadCount << endl;
const size_t TEST_SIZE = 8000000000;
char *testArray = new char[TEST_SIZE];
std::mt19937 rng;
rng.seed(std::random_device()());
std::uniform_int_distribution<std::mt19937::result_type> dist6(0, 4);
cout << "Filling arrays\n";
auto fillingStartTime = clock();
for (int i = 0; i < TEST_SIZE; ++i) {
testArray[i] = dist6(rng);
}
auto fillingEndTime = clock();
auto fillingTime = fillingEndTime - fillingStartTime;
cout << "filling time:" << fillingTime << endl;
// test omp time
for (int i = 1; i <= 3; ++i) {
cout << "Now running omp code\n";
auto ompStartTime = clock();
auto ompResult = llsum(testArray, TEST_SIZE, i * 2);
auto ompEndTime = clock();
auto ompTime = ompEndTime - ompStartTime;
cout << i * 2 << "thread omp time:" << ompTime << endl << "result: " << ompResult << endl;
}
// test sequential addition time
cout << "Now running sequential code\n";
auto seqStartTime = clock();
long long expectedResult = 0;
for (int i = 0; i < TEST_SIZE; ++i) {
expectedResult += testArray[i];
}
auto seqEndTime = clock();
auto seqTime = seqEndTime - seqStartTime;
cout << "sequential time: " << seqTime << endl << "result: " << expectedResult << endl;
delete[]testArray;
return 0;
}
As pointed out by #High Performance Mark, I should use omp_get_wtime() instead of clock().
clock() is 'active processor time', not 'elapsed time.
See
OpenMP time and clock() give two different results
https://en.cppreference.com/w/c/chrono/clock
After using omp_get_wtime(), and fixing the int i to size_t i, the result is more meaningful:
Your thread count is: 12
Filling arrays
filling time:267.038
Now running omp code
2thread omp time:26.1421
result: 15999820788
Now running omp code
4thread omp time:7.16911
result: 15999820788
Now running omp code
6thread omp time:5.66505
result: 15999820788
Now running sequential code
sequential time: 30.4056
result: 15999820788

threaded for loop reaching values it's not supposed to reach

I'm messing around with multithreading in c++ and here is my code:
#include <iostream>
#include <vector>
#include <string>
#include <thread>
void read(int i);
bool isThreadEnabled;
std::thread threads[100];
int main()
{
isThreadEnabled = true; // I change this to compare the threaded vs non threaded method
if (isThreadEnabled)
{
for (int i = 0;i < 100;i++) //this for loop is what I'm confused about
{
threads[i] = std::thread(read,i);
}
for (int i = 0; i < 100; i++)
{
threads[i].join();
}
}
else
{
for (int i = 0; i < 100; i++)
{
read(i);
}
}
}
void read(int i)
{
int w = 0;
while (true) // wasting cpu cycles to actually see the difference between the threaded and non threaded
{
++w;
if (w == 100000000) break;
}
std::cout << i << std::endl;
}
in the for loop that uses threads the console prints values in a random order ex(5,40,26...) which is expected and totally fine since threads don't run in the same order as they were initiated...
but what confuses me is that the values printed are sometimes more than the maximum value that int i can reach (which is 100), values like 8000,2032,274... are also printed to the console even though i will never reach that number, I don't understand why ?
This line:
std::cout << i << std::endl;
is actually equivalent to
std::cout << i;
std::cout << std::endl;
And thus while thread safe (meaning there's no undefined behaviour), the order of execution is undefined. Given two threads the following execution is possible:
T20: std::cout << 20
T32: std::cout << 32
T20: std::cout << std::endl
T32: std::cout << std::endl
which results in 2032 in console (glued numbers) and an empty line.
The simplest (not necessarily the best) fix for that is to wrap this line with a shared mutex:
{
std::lock_guard lg { mutex };
std::cout << i << std::endl;
}
(the brackets for a separate scope are not needed if the std::cout << i << std::endl; is the last line in the function)

C++ high_resolution_clock::now() showing inconsistent results with wallclock time

I was trying to calculate the time required to sample 7680 bit primes in FLINT library. I had a loop running for 100 iterations and finally calculated the average time required. On my mac, the below code took more than 5 hours to run (I left the code running in mac at 2:00clock without closing the lid. When I saw again at 7:00clock, the code is still running). But finally, it showed an output of "33.3442" seconds. How is this possible?
#include "fmpz.h"
#include <iostream>
#include <chrono>
using namespace std;
using namespace std::chrono;
int main() {
int count = 100;
int length = 7680;
fmpz_t primes[count];
flint_rand_t state;
flint_randinit(state);
for (int i = 0; i < count; i++)
fmpz_init(primes[i]);
auto start = high_resolution_clock::now();
for (int i = 0; i < count; i++)
{
while (true)
{
fmpz_randbits(primes[i], state, length);
if (fmpz_is_probabprime(primes[i]))
break;
}
}
auto stop = high_resolution_clock::now();
auto duration = duration_cast<microseconds>(stop - start);
cout << "Generating random primes of length " << length << " " << ((double)duration.count()/1000000)/count << endl;
}

How to develop a program that use only one single core?

I want to know how to properly implement a program in C++, in which I have a function func that I want to be executed in a single thread. I want to do this, because I want to test the Single Core Speed of my CPU. I will loop this function(func) for about 20 times, and record the execution time of each repetition, then I will sum the results and get the average execution time.
#include <thread>
int func(long long x)
{
int div = 0;
for(long i = 1; i <= x / 2; i++)
if(x % i == 0)
div++;
return div + 1;
}
int main()
{
std::thread one_thread (func,100000000);
one_thread.join();
return 0;
}
So , in this program, does the func is executed on a single particular core ?
Here is the source code of my program:
#include <iostream>
#include <thread>
#include <iomanip>
#include <windows.h>
#include "font.h"
#include "timer.h"
using namespace std;
#define steps 20
int func(long long x)
{
int div = 0;
for(long i = 1; i <= x / 2; i++)
if(x % i == 0)
div++;
return div + 1;
}
int main()
{
SetFontConsolas(); // Set font consolas
ShowConsoleCursor(false); // Turn off the cursor
timer t;
short int number = 0;
cout << number << "%";
for(int i = 0 ; i < steps ; i++)
{
t.restart(); // start recording
std::thread one_thread (func,100000000);
one_thread.join(); // wait function return
t.stop(); // stop recording
t.record(); // save the time in vector
number += 5;
cout << "\r ";
cout << "\r" << number << "%";
}
double time = 0.0;
for(int i = 0 ; i < steps ; i++)
time += t.times[i]; // sum all recorded times
time /= steps; // get the average execution time
cout << "\nExecution time: " << fixed << setprecision(4) << time << '\n';
double score = 0.0;
score = (1.0 * 100) / time; // calculating benchmark score
cout << "Score: ";
SetColor(12);
cout << setprecision(2) << score << " pts";
SetColor(15);
cout << "\nPress any key to continue.\n";
cin.get();
return 0;
}
No, your program has at least two treads: main, and the one you've created to run func. Moreover, neither of these threads is guaranteed to get executed on particular core. Depending on OS scheduler they may switch cores in unpredictable manner. Though main thread will mostly just wait. If you want to lock thread execution on particular core then you need to set thread core affinity by some platform-specific method such as SetThreadAffinityMask on Windows. But you don't really need to go that deep because there is no core switch sensitive code in your example. There is even no need to spawn separate thread dedicated to perform calculations.
If your program doesn't have multiple threads in the source and if the compiler does not insert automatic parallelization, the program should run on a single core (at a time).
Now depending on your compiler you can use appropriate optimization levels to ensure that it doesn't parallelize.
On the other hand what might happen is that the compiler can completely eliminate the loop in the function if it can statically compute the result. That however doesn't seem to be the issue with your case.
I don't think any C++ compiler makes use of multiple core, behind your back. There would be large language issues in doing that. If you neither spawn threads nor use a parallel library such as MPI, the program should execute on only one core.

Forcing race between threads using C++11 threads

Just got started on multithreading (and multithreading in general) using C++11 threading library, and and wrote small short snipped of code.
#include <iostream>
#include <thread>
int x = 5; //variable to be effected by race
//This function will be called from a thread
void call_from_thread1() {
for (int i = 0; i < 5; i++) {
x++;
std::cout << "In Thread 1 :" << x << std::endl;
}
}
int main() {
//Launch a thread
std::thread t1(call_from_thread1);
for (int j = 0; j < 5; j++) {
x--;
std::cout << "In Thread 0 :" << x << std::endl;
}
//Join the thread with the main thread
t1.join();
std::cout << x << std::endl;
return 0;
}
Was expecting to get different results every time (or nearly every time) I ran this program, due to race between two threads. However, output is always: 0, i.e. two threads run as if they ran sequentially. Why am I getting same results and is there any ways to simulate or force race between two threads ?
Your sample size is rather small, and somewhat self-stalls on the continuous stdout flushes. In short, you need a bigger hammer.
If you want to see a real race condition in action, consider the following. I purposely added an atomic and non-atomic counter, sending both to the threads of the sample. Some test-run results are posted after the code:
#include <iostream>
#include <atomic>
#include <thread>
#include <vector>
void racer(std::atomic_int& cnt, int& val)
{
for (int i=0;i<1000000; ++i)
{
++val;
++cnt;
}
}
int main(int argc, char *argv[])
{
unsigned int N = std::thread::hardware_concurrency();
std::atomic_int cnt = ATOMIC_VAR_INIT(0);
int val = 0;
std::vector<std::thread> thrds;
std::generate_n(std::back_inserter(thrds), N,
[&cnt,&val](){ return std::thread(racer, std::ref(cnt), std::ref(val));});
std::for_each(thrds.begin(), thrds.end(),
[](std::thread& thrd){ thrd.join();});
std::cout << "cnt = " << cnt << std::endl;
std::cout << "val = " << val << std::endl;
return 0;
}
Some sample runs from the above code:
cnt = 4000000
val = 1871016
cnt = 4000000
val = 1914659
cnt = 4000000
val = 2197354
Note that the atomic counter is accurate (I'm running on a duo-core i7 macbook air laptop with hyper threading, so 4x threads, thus 4-million). The same cannot be said for the non-atomic counter.
There will be significant startup overhead to get the second thread going, so its execution will almost always begin after the first thread has finished the for loop, which by comparison will take almost no time at all. To see a race condition you will need to run a computation that takes much longer, or includes i/o or other operations that take significant time, so that the execution of the two computations actually overlap.