Threads not improving code performance - C++

I am trying to split a basic long-running loop across threads to improve its performance.
Here is the threaded version:
#include <iostream>
#include <thread>
#include <chrono>

using namespace std;
using namespace std::chrono;

void funcSum(long long int start, long long int end, long long int *sum)
{
    for (auto i = start; i <= end; ++i)
    {
        *sum += i;
    }
}

int main()
{
    long long int start = 10, end = 1900000000;
    long long int sum = 0;

    auto startTime = high_resolution_clock::now();

    thread t1(funcSum, start, end / 2, &sum);
    thread t2(funcSum, end / 2 + 1, end, &sum);
    t1.join();
    t2.join();

    auto stopTime = high_resolution_clock::now();
    auto duration = duration_cast<seconds>(stopTime - startTime);

    cout << "Sum: " << sum << endl;
    cout << duration.count() << " Seconds";
    return 0;
}
And here is the normal code (without threads):
#include <iostream>
#include <thread>
#include <chrono>

using namespace std;
using namespace std::chrono;

void funcSum(long long int start, long long int end, long long int *sum)
{
    for (auto i = start; i <= end; ++i)
    {
        *sum += i;
    }
}

int main()
{
    long long int start = 10, end = 1900000000;
    long long int sum = 0;

    auto startTime = high_resolution_clock::now();

    funcSum(start, end, &sum);

    auto stopTime = high_resolution_clock::now();
    auto duration = duration_cast<seconds>(stopTime - startTime);

    cout << "Sum: " << sum << endl;
    cout << duration.count() << " Seconds";
    return 0;
}
Sum: 1805000000949999955
5 Seconds
Process finished with exit code 0
In both cases the time spent is 5 seconds.
Why does the threaded version not improve performance? How do I decrease the time for this sum over a range by using threads?

Fixed version of threaded code:
// Compute the sum of start ... end
#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>

using namespace std;
using namespace std::chrono;

class Summer {
public:
    long long int start;
    long long int end;
    long long int sum = 0;

    Summer(long long int aStart, long long int aEnd)
        : start(aStart),
          end(aEnd)
    {
    }

    void funcSum()
    {
        sum = 0;
        for (auto i = start; i <= end; ++i)
        {
            sum += i;
        }
    }
};

class SummerFunctor {
    Summer& mSummer;
public:
    SummerFunctor(Summer& aSummer)
        : mSummer(aSummer)
    {
    }
    void operator()()
    {
        mSummer.funcSum();
    }
};

// The free function funcSum from the question, used for the single-threaded case.
void funcSum(long long int start, long long int end, long long int *sum)
{
    for (auto i = start; i <= end; ++i)
    {
        *sum += i;
    }
}

// Version with n thread objects reports
// 1 threads, sum = 1805000000949999955, 1587 ms
// 2 threads, sum = 1805000000949999955, 2547 ms
// 4 threads, sum = 1805000000949999955, 1251 ms
// 6 threads, sum = 1805000000949999955, 916 ms
int main()
{
    long long int start = 10, end = 1900000000;
    long long int sum = 0;

    auto startTime = high_resolution_clock::now();

    const size_t threadCount = 6;
    if (threadCount < 2) {
        funcSum(start, end, &sum);
    } else {
        Summer* summers[threadCount];
        std::thread* threads[threadCount];

        // Start threads; each one gets its own partition of the range
        auto partitionSize = (end - start) / threadCount;
        for (size_t i = 0; i < threadCount; ++i) {
            auto partitionEnd = std::min(start + partitionSize, end);
            summers[i] = new Summer(start, partitionEnd);
            start = partitionEnd + 1;
            SummerFunctor functor(*summers[i]);
            threads[i] = new std::thread(functor);
        }

        // Join threads and accumulate the partial sums
        for (size_t i = 0; i < threadCount; ++i) {
            threads[i]->join();
            sum += summers[i]->sum;
            delete threads[i];
            delete summers[i];
        }
    }

    auto stopTime = high_resolution_clock::now();
    auto duration = duration_cast<milliseconds>(stopTime - startTime);

    cout << threadCount << " threads, sum = " << sum << ", " << duration.count() << " ms" << std::endl;
    return 0;
}
I had to wrap the Summer object with a functor because std::thread makes a copy of any functor handed to it, and that copy cannot be accessed afterwards (an alternative using std::ref is sketched after the list below). Execution gets better when more threads are used (see the running times in the comments above). Possible reasons for this:
- The CPU has to synchronize access to the memory pages even though the threads use separate variables here, because those variables likely lie on the same page.
- If only one thread is running on a CPU, that thread may run at a boosted clock frequency, whereas several threads may only run at the normal frequency.
- CPU cores often share arithmetic units.
- Without threads, the compiler can make optimizations that are not possible with threads; in theory, it could unroll the loop and directly print the result.
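For reference, the functor wrapper is not strictly required: std::thread copies its arguments, but std::ref (or a pointer to the object) makes the new thread operate on the caller's Summer directly. Below is a minimal sketch of the partitioning loop rewritten that way, reusing the Summer class and the start/end/sum/threadCount variables defined above; it is not part of the measured code and additionally needs <functional> and <vector>.

// Each thread fills in its own Summer; std::ref avoids the copy that
// std::thread would otherwise make, so the results stay accessible afterwards.
std::vector<Summer> summers;
std::vector<std::thread> threads;
summers.reserve(threadCount);

auto partitionSize = (end - start) / threadCount;
for (size_t i = 0; i < threadCount; ++i) {
    auto partitionEnd = std::min(start + partitionSize, end);
    summers.emplace_back(start, partitionEnd);
    start = partitionEnd + 1;
}
for (auto& s : summers)
    threads.emplace_back(&Summer::funcSum, std::ref(s));
for (size_t i = 0; i < threadCount; ++i) {
    threads[i].join();
    sum += summers[i].sum;
}

Each thread still writes only its own Summer::sum, so no synchronization is needed beyond join(), and the timing caveats listed above still apply.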

Related

How to improve performance of code when dealing with large input in C++?

How could this code be made to run faster in C++? It takes a lot of time to run. The purpose is to determine how many gates are required to handle a prescribed
arrivals-and-departures schedule.
#include <vector>

struct Airplane {
    int arrival_time_seconds;
    int departure_time_seconds;
};

class Schedule {
private:
    const std::vector<Airplane> airplanes_;

public:
    Schedule(const std::vector<Airplane>& airplanes) :
        airplanes_(airplanes) {}

    int MaximumNumberOfPlanes() const {
        int rv = 0;
        for (const Airplane& airplane : airplanes_) {
            int num_planes = NumberOfPlanes(airplane.arrival_time_seconds);
            if (num_planes > rv) {
                rv = num_planes;
            }
        }
        return rv;
    }

private:
    int NumberOfPlanes(int time_seconds) const {
        int rv = 0;
        for (const Airplane& airplane : airplanes_) {
            if (airplane.arrival_time_seconds < time_seconds &&
                time_seconds <= airplane.departure_time_seconds) {
                rv++;
            }
        }
        return rv;
    }
};
A lot of people stated that this can be made O(N), and that is possible to some extent. At least I was able to make it O(max(N, 86400)), which is better than your version for N > 294 and better than an O(N log N) solution for N > 6788.
I assume that if a plane departs the next day it has departure_time_seconds = 86400 (the number of seconds in a day), while all arrival_time_seconds values are lower than 86400.
You can build a vector of the changes in the number of planes in O(N) and then use it to compute the current number of planes in the airport at every second in O(86400):
int MaximumNumberOfPlanes2() const {
    int delta[24 * 60 * 60 + 1] = { 0 };
    for (const Airplane& x : airplanes_) {
        delta[x.arrival_time_seconds]++;
        delta[x.departure_time_seconds]--;
    }
    int rv = 0;
    int np = 0;
    for (int i = 0; i < 24 * 60 * 60; ++i) {
        np += delta[i];
        rv = std::max(rv, np);
    }
    return rv;
}
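For comparison, the O(N log N) approach referred to above (sort the 2N arrival/departure events, then sweep them once) could look roughly like the sketch below. It is not part of the original answer; it would sit in Schedule next to MaximumNumberOfPlanes2 and needs <algorithm> and <utility>. Tie handling at equal times may differ slightly from the original NumberOfPlanes predicate.

// Sketch of the O(N log N) event-sweep alternative mentioned above.
int MaximumNumberOfPlanesSorted() const {
    std::vector<std::pair<int, int>> events;   // (time, +1 arrival / -1 departure)
    events.reserve(2 * airplanes_.size());
    for (const Airplane& x : airplanes_) {
        events.push_back({ x.arrival_time_seconds, +1 });
        events.push_back({ x.departure_time_seconds, -1 });
    }
    // Departures (-1) sort before arrivals (+1) at equal times, so a plane
    // leaving at second t and one arriving at second t do not overlap.
    std::sort(events.begin(), events.end());
    int rv = 0, np = 0;
    for (const auto& e : events) {
        np += e.second;
        rv = std::max(rv, np);
    }
    return rv;
}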
A test program with some timing:
#include <vector>
#include <iostream>
#include <fstream>
#include <random>
#include <chrono>
#include <queue>
#include <algorithm>
#include <cmath>

int main()
{
    using namespace std;
    using namespace std::chrono;

    default_random_engine eng;
    uniform_int_distribution<int> arr_dist(0, 24 * 60 * 60);
    gamma_distribution<double> dep_dist(5, 3);

    std::vector<Airplane> a;
    for (int i = 0; i < 100000; ++i) {
        int arrival = arr_dist(eng);
        int departure = arrival + (20 + lround(dep_dist(eng))) * 60;
        departure = min(departure, 24 * 60 * 60);
        a.push_back({ arrival, departure });
    }

    Schedule s(a);
    {
        const auto& start = steady_clock::now();
        int mnp = s.MaximumNumberOfPlanes();
        const auto& stop = steady_clock::now();
        duration<double> elapsed = stop - start;
        std::cout << "MaximumNumberOfPlanes : " << mnp << " - Elapsed: " << elapsed.count() << " s\n";
    }
    {
        const auto& start = steady_clock::now();
        int mnp = s.MaximumNumberOfPlanes2();
        const auto& stop = steady_clock::now();
        duration<double> elapsed = stop - start;
        std::cout << "MaximumNumberOfPlanes2: " << mnp << " - Elapsed: " << elapsed.count() << " s\n";
    }
    return 0;
}
This gives (on my laptop):
MaximumNumberOfPlanes : 2572 - Elapsed: 48.8979 s
MaximumNumberOfPlanes2: 2572 - Elapsed: 0.0010778 s

How to dynamically allocate work to threads

I am trying to write code to find whether pairwise sums are even or not (among all possible pairs from 0 to 100000). I have written code using pthreads where the work allocation is done statically. Here is the code:
#include <iostream>
#include <chrono>
#include <iomanip>
#include <vector>
#include <pthread.h>

using namespace std;

#define MAX_THREAD 4

vector<long long> cnt(MAX_THREAD, 0);
long long n = 100000;
int work_per_thread;

void *count_array(void* arg)
{
    int t = *((int*)arg);
    long long sum = 0;
    int counter = 0;
    for (int i = t * work_per_thread + 1; i <= (t + 1) * work_per_thread; i++)
        for (int j = i - 1; j >= 0; j--)
        {
            sum = i + j;
            if (sum % 2 == 0)
                counter++;
        }
    cnt[t] = counter;
    cout << "thread" << t << " finished work" << endl;
    return NULL;
}

int main()
{
    pthread_t threads[MAX_THREAD];
    vector<int> arr;
    for (int i = 0; i < MAX_THREAD; i++)
        arr.push_back(i);

    long long total_count = 0;
    work_per_thread = n / MAX_THREAD;

    auto start = chrono::high_resolution_clock::now();
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, count_array, &arr[i]);
    for (int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < MAX_THREAD; i++)
        total_count += cnt[i];
    cout << "count is " << total_count << endl;
    auto end = chrono::high_resolution_clock::now();

    double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    time_taken *= 1e-9;
    cout << "Time taken by program is : " << fixed << time_taken << setprecision(9) << " secs" << endl;
    return 0;
}
Now I want to do the work allocation dynamically. To be specific, let's say I have 5 threads. Initially I give each thread a certain range to work with, say thread1 works on all pairs from 0-1249, thread2 from 1250-2549, and so on. As soon as a thread completes its work I want to give it a new range to work on. This way no thread will be idle most of the time, as was the case with static allocation.
This is the classic use case for a thread pool. Typically you set up a synchronized queue that any number of threads can push to and pull from. Then you start N threads, the "thread pool". These threads wait on a condition variable associated with a mutex. When the main thread has work to hand out, it pushes the work into the queue (it can be as simple as a struct with a range) and then signals the condition variable, which wakes up one thread.
See this answer: https://codereview.stackexchange.com/questions/221617/thread-pool-c-implementation
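For illustration, here is a minimal sketch of the queue-plus-condition-variable pattern described above, written with C++ standard threads rather than pthreads (it needs C++17 for std::optional). The names Range and WorkQueue are made up for this sketch, not taken from the linked answer, and the worker body is only a placeholder for the real pair counting.

#include <algorithm>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

struct Range { long long begin, end; };   // one unit of work: [begin, end)

class WorkQueue {
    std::queue<Range> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(Range r) {
        { std::lock_guard<std::mutex> lk(m_); tasks_.push(r); }
        cv_.notify_one();                 // wake one waiting worker
    }
    void close() {                        // no more work will be pushed
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
    // Blocks until work is available; returns nullopt once closed and drained.
    std::optional<Range> pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return std::nullopt;
        Range r = tasks_.front();
        tasks_.pop();
        return r;
    }
};

int main() {
    WorkQueue q;
    std::vector<std::thread> pool;
    std::vector<long long> processed(4, 0);

    // The "thread pool": workers pull ranges until the queue is closed and empty.
    for (int t = 0; t < 4; ++t)
        pool.emplace_back([&q, &processed, t] {
            while (auto r = q.pop())
                processed[t] += r->end - r->begin;   // placeholder for the real work
        });

    // The main thread hands out work in small chunks, so faster workers
    // simply come back for more (dynamic allocation).
    for (long long b = 0; b < 100000; b += 1250)
        q.push({ b, std::min(b + 1250, 100000LL) });
    q.close();

    for (auto& th : pool) th.join();

    long long total = 0;
    for (auto c : processed) total += c;
    std::cout << "processed " << total << " items\n";
}

Handing out the work in small chunks is what makes the allocation dynamic: a thread that finishes its chunk early simply pulls the next one instead of sitting idle.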

Most insanely efficient way to find index of the minimum of four numbers

#include <iostream>
#include <chrono>
#include <random>
#include <cstdlib>

using namespace std;

class MyTimer
{
private:
    std::chrono::time_point<std::chrono::steady_clock> starter;
    std::chrono::time_point<std::chrono::steady_clock> ender;

public:
    void startCounter() {
        starter = std::chrono::steady_clock::now();
    }

    long long getCounter() {
        ender = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::microseconds>(ender - starter).count();
    }
};

int findBestKey(int keys[4], int values[4])
{
    int index = 0;
    for (int i = 1; i <= 3; i++)
        if (keys[index] > keys[i])
            index = i;
    return values[index];
}

int findBestKeyPro(int keys[4], int values[4])
{
    int index = keys[0] > keys[1];
    if (keys[index] > keys[2]) index = 2;
    if (keys[index] > keys[3]) return values[3];
    else return values[index];
}

int findBestKeyProMax(int keys[4], int values[4])
{
    // fill your implementation here. Not necessary to read the parts below
    return 0;
}

void benchMethod(int (*findBestKeyFunc)(int keys[4], int values[4]), int n, int* keys, int* values, int& res, double& totalTime)
{
    MyTimer timer;
    timer.startCounter();
    // In my actual problem, the values of the "keys" arrays are completely unrelated.
    // They are not the same contiguous values in memory. The loop below is just an example for benchmark purposes.
    for (int i = 0; i < n - 4; i += 4)
        res += findBestKeyFunc(&keys[i], &values[i]);
    totalTime += timer.getCounter();

    /*
    it is possible to calculate 4 arrays "keys","values", then process them all at once.
    for (int i = 0; i < n - 4; i += 16)
    {
        keys[4][4] = ...; values[4][4] = ...;
        res += find4BestKeyAtOnce(&keys, &values);
    }
    */
}

double totalTimeNormal = 0, totalTimePro = 0, totalTimeProMax = 0;

void benching(int& res1, int& res2, int& res3)
{
    const int n = 10000000;
    int* keys1 = new int[n], * values1 = new int[n];
    int* keys2 = new int[n], * values2 = new int[n];

    MyTimer timer;
    double tmp;
    for (int i = 0; i < n; i++) {
        keys1[i] = rand() % 100;   // need 2 arrays to prevent caching
        keys2[i] = rand() % 100;   // this should be % (256*256)
        values1[i] = rand() % 100; // and % 256
        values2[i] = rand() % 100; // but I use % 100 so that in this example it doesn't overflow int32
    }

    // the size of keys2/values2 is big enough to flush keys1/values1 out of the cache completely,
    // so the order of execution doesn't affect performance here
    benchMethod(&findBestKey, n, keys1, values1, res1, totalTimeNormal);
    benchMethod(&findBestKey, n, keys2, values2, res1, totalTimeNormal);
    benchMethod(&findBestKeyPro, n, keys1, values1, res2, totalTimePro);
    benchMethod(&findBestKeyPro, n, keys2, values2, res2, totalTimePro);
    benchMethod(&findBestKeyProMax, n, keys1, values1, res2, totalTimeProMax);
    benchMethod(&findBestKeyProMax, n, keys2, values2, res2, totalTimeProMax);

    delete[] keys1;
    delete[] keys2;
    delete[] values1;
    delete[] values2;
}

void testIf()
{
    int res1 = 0, res2 = 0, res3 = 0;
    for (int t = 1; t <= 100; t++) {
        benching(res1, res2, res3);
        res1 %= 100;
        res2 %= 100;
        res3 %= 100;
        cout << "Lap " << t << "\n";
        cout << "time normal = " << totalTimeNormal / 1000 << " ms\n";
        cout << "time pro = " << totalTimePro / 1000 << " ms\n";
        cout << "time pro max = " << totalTimeProMax / 1000 << " ms\n";
        cout << "\n";
    }
    cout << "**********************\n" << res1 << " " << res2 << "\n";
}

int main()
{
    testIf();
    return 0;
}
There are two arrays, keys and values, both completely random. The function returns the value that has the minimum key, i.e. index = indexOfMin(keys); return values[index]; (see the function findBestKey). I need to fill in findBestKeyProMax.
findBestKeyPro is around 30-35% faster than findBestKey, both on my computer and on https://www.onlinegdb.com/online_c++_compiler . The compiler options are -std=c++14 -O2. Update: I get ~5-10% more performance just by changing to -O3.
Is there any way I can make this faster? Every nanosecond matters, since this function is called ~10^6-10^7 times (once for each pixel); saving 1 ns per call would translate to 1 ms less overall, which is the difference between 200 fps and 250 fps.
Edit: no multi-threading or GPU. That is already done (each thread performs findBestKey on distinct keys/values arrays), so I want to improve this function directly. Maybe something like SIMD for the CPU, or a branchless function?
Also, the findBest... functions are what matter; benchMethod() is just there for benchmarking.
Edit 2: the target architecture is CPUs with AVX2 (256-bit) support, mainly Intel Skylake or AMD Zen 2.
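One branchless candidate for findBestKeyProMax is a two-level tournament; this is only a sketch, not a measured result, and whether it actually beats findBestKeyPro depends on whether the compiler turns the final ternary into a conditional move, so it needs benchmarking with -O2/-O3 on the target CPU:

int findBestKeyProMax(int keys[4], int values[4])
{
    // Tournament without branches: pick the smaller of each pair,
    // then the smaller of the two winners.
    int i01 = keys[1] < keys[0];        // index of min(keys[0], keys[1]): 0 or 1
    int i23 = 2 + (keys[3] < keys[2]);  // index of min(keys[2], keys[3]): 2 or 3
    int best = keys[i23] < keys[i01] ? i23 : i01;
    return values[best];
}

It preserves the tie-breaking of findBestKey (the lowest index wins on equal keys).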

Get average of time spent using std::chrono

I have a function that runs more than a million times. I want to report how long it takes to run by printing the sum of the durations of every 10,000 calls to the function.
Around each call to the function I have something like this:
int counter = 0;
auto duration_total = 0; // not sure about the type
std::chrono::high_resolution_clock::time_point t1, t2, duration;

t1 = std::chrono::high_resolution_clock::now();
Function f(){
    counter++;
}
t2 = std::chrono::high_resolution_clock::now();

duration = std::chrono::duration_cast<std::chrono::nanoseconds>( t2 - t1 ).count();
duration_total += duration;

if(counter % 10000 == 0){
    long int average_duration = duration_total / 10000;
    duration_total = 0;
    cout << average_duration << "\n";
}
I can't find a way to add durations and then get their average.
If you look at std::chrono::duration<Rep,Period>::count, you can see that you can use
int duration = std::chrono::duration_cast<std::chrono::nanoseconds>( t2 - t1 ).count();
(or some other integer type, e.g. unsigned long), since the return value is
the number of ticks for this duration.
A complete example:
#include <iostream>
#include <chrono>

int main()
{
    int counter = 0;
    auto duration_total = 0; // not sure about the type
    std::chrono::high_resolution_clock::time_point t1, t2;

    t1 = std::chrono::high_resolution_clock::now();
    t2 = std::chrono::high_resolution_clock::now();

    int duration = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    duration_total += duration;

    if (counter % 10000 == 0) {
        long int average_duration = duration_total / 10000;
        duration_total = 0;
        std::cout << average_duration << "\n";
    }
}
See it in Coliru.
You take a time point from the clock when you start and another when you stop.
Subtracting the start time point from the stop time point gives you a duration. Divide that duration by the number of iterations.
Example:
#include <chrono>
#include <cstdlib>
#include <functional>
#include <iostream>

template<typename T>
auto timeit(size_t iterations, std::function<void()> func_to_test) {
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < iterations; ++i)
        func_to_test();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<T>(end - start) / iterations;
}

int main() {
    auto dur =
        timeit<std::chrono::microseconds>(10000, [] { system("echo Hello World"); });
    std::cout << dur.count() << " µs\n";
}
If you need to sum up individual runs, keep a duration variable that you add to. I'm reusing the same timeit function, but you can remove the iteration stuff in it if you only want to run it once.
int main() {
    std::chrono::microseconds tot{0};
    size_t iterations = 0;

    for (size_t i = 0; i < 10; ++i) {
        // sum up the total time spent
        tot += timeit<decltype(tot)>(1, [] { system("echo Hello World"); });
        ++iterations;
    }

    // divide by the number of iterations
    std::cout << (tot / iterations).count() << " µs\n";
}
First, note that the type deduced here is int:
auto duration_total = 0;
Instead, do something like this:
auto t1 = std::chrono::steady_clock::now();
//do some work
auto t2 = std::chrono::steady_clock::now();
double duration_in_seconds = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1).count();
Note that I'm casting the duration to double. Then you can use the duration value more freely.
If you prefer nanoseconds:
double duration_in_nanoseconds = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();

Why is reverse sorting faster than re-sorting an already sorted vector?

I wrote this test:
std::vector<int> test_vector;
for (int i = 0; i < 100000000; ++i) {
    test_vector.push_back(i);
}

QElapsedTimer timer;
timer.start();
std::sort(test_vector.begin(), test_vector.end(), [](int a, int b) { return a < b; });
qDebug() << "The slow operation took" << timer.elapsed() << "milliseconds";
qDebug() << "The slow operation took" << timer.nsecsElapsed() << "nanoseconds";
Here I re-sort an already sorted (ascending) vector, and the result is:
The slow operation took 4091 milliseconds
The slow operation took 4091842000 nanoseconds
but when I changed the comparison to
std::sort(test_vector.begin(),test_vector.end(), [](int a, int b) { return a > b; });
the result is:
The slow operation took 2867 milliseconds
The slow operation took 2867591800 nanoseconds
I tested on Qt_5_12_3_MinGW_64_bit-Release and can't understand why reverse sorting is faster than re-sorting.
Resolved!
I tested the same example with Qt_5_12_3_MSVC2017_64bit and the issue went away; the problem was with MinGW_64.
However, I still have a question: why, if I fill the vector with 100,000,000 copies of the value 10,
#include <chrono>
#include <iostream>
#include <vector>
#include <algorithm>

int main() {
    std::vector<int> test_vector;
    for (int i = 0; i < 100000000; ++i) {
        test_vector.push_back(10);
    }

    auto begin = std::chrono::high_resolution_clock::now();
    std::sort(test_vector.begin(), test_vector.end(), [](int a, int b) { return a < b; });
    auto end = std::chrono::high_resolution_clock::now();

    auto dur = end - begin;
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(dur).count();
    std::cout << ms << std::endl;
    return 0;
}
does sorting take only 167 milliseconds, while re-sorting the ascending vector built like this takes 2553 milliseconds?
for (int i = 0; i < 100000000; ++i) {
    test_vector.push_back(i);
}