When working on a specific problem, I often come across multiple solutions and I'm not sure how to choose between them. My first idea is to compute the complexity of each solution, but sometimes they share the same complexity, or they differ but the input range is small enough that the constant factor matters.
My second idea is to benchmark both solutions. However, I'm not sure how to time them in C++. I have found this question:
How to Calculate Execution Time of a Code Snippet in C++, but I don't know how to properly deal with compiler optimizations or processor inconsistencies.
In short: is the code provided in the question above sufficient for everyday tests? Are there options I should enable in the compiler before running the tests? (I'm using Visual C++.) How many runs should I do, and how large does the time difference between the two benchmarks have to be before it matters?
Here is an example of code I want to test. Which of these is faster? How can I measure that myself?
#include <cmath>  // for sqrt and pow

unsigned long long fiborecursion(int rank){
    if (rank == 0) return 1;
    else if (rank < 0) return 0;
    return fiborecursion(rank-1) + fiborecursion(rank-2);
}

double sq5 = sqrt(5);

unsigned long long fiboconstant(int rank){
    return pow((1 + sq5) / 2, rank + 1) / sq5 + 0.5;
}
Using the clock from this answer
#include <iostream>
#include <chrono>

class Timer
{
public:
    Timer() : beg_(clock_::now()) {}
    void reset() { beg_ = clock_::now(); }
    double elapsed() const {
        return std::chrono::duration_cast<second_>
            (clock_::now() - beg_).count();
    }
private:
    typedef std::chrono::high_resolution_clock clock_;
    typedef std::chrono::duration<double, std::ratio<1> > second_;
    std::chrono::time_point<clock_> beg_;
};
You can write a program to time both of your functions.
int main() {
    const int N = 10000;
    Timer tmr;

    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiborecursion(i % 50);
    }
    double time1 = tmr.elapsed();

    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiboconstant(i % 50);
    }
    double time2 = tmr.elapsed();

    std::cout << "Recursion"
              << "\n\tTotal: " << time1
              << "\n\tAvg: " << time1 / N
              << "\n"
              << "\nConstant"
              << "\n\tTotal: " << time2
              << "\n\tAvg: " << time2 / N
              << "\n";
}
I would try compiling with no compiler optimizations (-O0) and max compiler optimizations (-O3) just to see what the differences are. Note that at max optimizations the compiler may eliminate the loops entirely, since their results are never used.
Related
I'm trying to optimize my code using multithreading, but not only is the program not twice as fast, as you would expect on this dual-core computer, it is much slower. I just want to know whether I'm doing something wrong, or whether it's normal that multithreading doesn't help in this case. I made this recreation of how I used multithreading, and on my computer the parallel version takes 4 times as long as the normal version:
#include <iostream>
#include <random>
#include <thread>
#include <chrono>
#include <numeric>  // for std::accumulate

using namespace std;

default_random_engine ran;

inline bool get(){
    return ran() % 3;
}

void normal_serie(unsigned repetitions, unsigned &result){
    for (unsigned i = 0; i < repetitions; ++i)
        result += get();
}

unsigned parallel_series(unsigned repetitions){
    const unsigned hardware_threads = std::thread::hardware_concurrency();
    cout << "Threads in this computer: " << hardware_threads << endl;
    const unsigned threads_number = (hardware_threads != 0) ? hardware_threads : 2;
    const unsigned its_per_thread = repetitions / threads_number;
    unsigned *results = new unsigned[threads_number]();
    std::thread *threads = new std::thread[threads_number - 1];
    for (unsigned i = 0; i < threads_number - 1; ++i)
        threads[i] = std::thread(normal_serie, its_per_thread, std::ref(results[i]));
    normal_serie(its_per_thread, results[threads_number - 1]);
    for (unsigned i = 0; i < threads_number - 1; ++i)
        threads[i].join();
    auto result = std::accumulate(results, results + threads_number, 0);
    delete[] results;
    delete[] threads;
    return result;
}

int main()
{
    constexpr unsigned repetitions = 100000000;
    auto to = std::chrono::high_resolution_clock::now();
    cout << parallel_series(repetitions) << endl;
    auto tf = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
    cout << "Parallel duration: " << duration << "ms" << endl;

    to = std::chrono::high_resolution_clock::now();
    unsigned r = 0;
    normal_serie(repetitions, r);
    cout << r << endl;
    tf = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(tf - to).count();
    cout << "Normal duration: " << duration << "ms" << endl;
    return 0;
}
Things that I already know but left out to keep this code shorter:
I should cap the iterations per thread, because you don't want to do only 10 iterations per thread; but here we are doing 100 million iterations, so that is not going to happen.
The number of iterations must be divisible by the number of threads, otherwise the code will not do exactly the requested amount of work.
This is the output I get on my computer:
Threads in this computer: 2
66665160
Parallel duration: 4545ms
66664432
Normal duration: 1019ms
(Partially solved by making these changes:)
inline bool get(default_random_engine &ran){
    return ran() % 3;
}

void normal_serie(unsigned repetitions, unsigned &result){
    default_random_engine eng;
    unsigned saver_result = 0;
    for (unsigned i = 0; i < repetitions; ++i)
        saver_result += get(eng);
    result += saver_result;
}
All your threads are tripping over each other fighting for access to ran, which can only perform one operation at a time because it has only one state, and each operation advances that state. There is no point in running operations in parallel if the vast majority of each operation involves a choke point that cannot support any concurrency.
All elements of results are likely to share a cache line, which means there is a lot of inter-core communication going on (false sharing).
Try modifying normal_serie to accumulate into a local variable and only write it to results in the end.
I'm trying to write a simple benchmarking program to compare different operations. Before I moved on to the actual functions, I wanted to check a trivial case with a well-documented outcome: multiplication vs. division.
Division should lose by a fair margin, according to the literature I have read. When I compiled and ran the algorithm, the times were just about zero. I added an accumulator that is printed, to make sure the operations are actually carried out, and tried again. Then I changed the loop, the numbers, shuffled the data, and more, all in order to prevent anything that could cause "divide" to do anything but floating-point division. To no avail. The times are still basically equal.
At this point I don't see where it could weasel its way out of the floating-point divide, and I give up. It wins. But I am really curious why the times are so close, what caveats/bugs I missed, and how to fix them.
(I know that filling the vector with random data and then shuffling it is redundant, but I wanted to make sure the data was accessed, not just initialized, before the loop.)
("String compares are evil", I am aware. If they are the cause of the equal times, I will gladly join the witch hunt. If not, please don't mention it.)
compile:
g++ -std=c++14 main.cc
tests:
./a.out multiply
2.42202e+09
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.218529
Average length of function : 2.18529e-07 seconds
./a.out divide
2.56147e+06
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.242061
Average length of function : 2.42061e-07 seconds
the code :
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <random>
#include <sys/time.h>
#include <sys/resource.h>

double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec * 1e-6;
}

double multiply(double lhs, double rhs){
    return lhs * rhs;
}

double divide(double lhs, double rhs){
    return lhs / rhs;
}

int main(int argc, char *argv[]){
    if (argc == 1)
        return 0;
    double grounder = 0; //prevent optimizations
    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(1.0, 100.0);
    size_t loop1 = argc > 2 ? std::stoi(argv[2]) : 1000;
    size_t loop2 = argc > 3 ? std::stoi(argv[3]) : 1000;
    std::vector<double> vecL1(loop1); // double, not size_t, so values aren't truncated
    std::generate(vecL1.begin(), vecL1.end(), [generator, distribution] () mutable { return distribution(generator); });
    std::vector<double> vecL2(loop2);
    std::generate(vecL2.begin(), vecL2.end(), [generator, distribution] () mutable { return distribution(generator); });
    double (*fp)(double, double) = nullptr;
    std::string function(argv[1]);
    if (function == "multiply")
        fp = multiply;
    else if (function == "divide")
        fp = divide;
    else
        return 1; // unknown operation; fp would otherwise be uninitialized
    std::random_shuffle(vecL1.begin(), vecL1.end());
    std::random_shuffle(vecL2.begin(), vecL2.end());
    double t1 = get_time();
    for (auto outer = vecL1.begin(); outer != vecL1.end(); outer++)
        for (auto inner = vecL2.begin(); inner != vecL2.end(); inner++)
            grounder += (*fp)(*inner, *outer);
    double t2 = get_time();
    std::cout << grounder << '\n';
    std::cout << (loop1 * loop2) << '\n';
    std::cout << "t1 = " << t1 << "\tt2 = " << t2
              << "\ndifference = " << (t2 - t1) << '\n';
    std::cout << "Average length of function : " << (t2 - t1) * 1 / (loop1 * loop2) << " seconds \n";
    return 0;
}
You aren't just measuring the speed of multiply/divide. If you put your code into https://godbolt.org/ you can see the assembly generated.
You are measuring the speed of calling a function and then doing a multiply or divide inside that function. The time taken for the single multiply/divide instruction is tiny compared to the cost of the function calls, so it gets lost in the noise. If you move the loop to inside your function, you'll probably see more of a difference. Note that with the loop inside the function, your compiler may decide to vectorise the code; that will still show whether there is a difference between multiply and divide, but it won't be measuring the cost of a single mul/div instruction.
I have run into a problem while trying to optimize my program, which calculates Nmin values for increasing values of N and an error approximation.
I am not from a programming background and have just started to take it up.
This calculation is inefficient because it keeps computing even after Nmin has been found.
To reduce the time, I made the changes below to cut down on function calls, with no improvement:
#include <iostream>
#include <cmath>
#include <time.h>
#include <iomanip>

using namespace std;

double f(int);

int main(void)
{
    double err;
    double pi = 4.0 * atan(1.0);
    cout << fixed << setprecision(7);
    clock_t start = clock();
    for (int n = 1;; n++)
    {
        if ((f(n) - pi) >= 1e-6)
        {
            cout << "n_min is " << n << "\t" << f(n) - pi << endl;
        }
        else
        {
            break;
        }
    }
    clock_t stop = clock();
    //double elapsed = (double)(stop - start) * 1000.0 / CLOCKS_PER_SEC; //milliseconds
    cout << "time: " << (stop - start) / double(CLOCKS_PER_SEC) * 1000 << endl; //also milliseconds
    return 0;
}

double f(int n)
{
    double sum = 0;
    for (int i = 1; i <= n; i++)
    {
        sum += 1 / (1 + pow((i - 0.5) / n, 2));
    }
    return (4.0 / n) * sum;
}
Is there any way to reduce the time and make the second version efficient?
Any help would be greatly appreciated.
I do not see any immediate way of optimizing the algorithm itself. You could, however, reduce the time significantly by not writing to the standard output on every iteration. Also, do not calculate f(n) more than once per iteration.
for (int n = 1;; n++)
{
    double val = f(n);
    double diff = val - pi;
    if (diff < 1e-6)
    {
        cout << "n_min is " << n << "\t" << diff << endl;
        break;
    }
}
Note however that this will yield a higher n_min (increased by 1 compared to the result of your version) since we changed the condition to diff < 1e-6.
According to my professor, loops are faster and more efficient than recursion, yet I came up with this C++ code that calculates the Fibonacci series using both recursion and loops, and the results show that they are very similar. So I maxed out the possible input to see if there was a difference in performance, and for some reason recursion clocked in better than the loop. Does anyone know why? Thanks in advance.
Here's the code:
#include "stdafx.h"
#include "iostream"
#include <time.h>

using namespace std;

double F[200000000];
//double F[5];

/*int Fib(int num)
{
    if (num == 0)
    {
        return 0;
    }
    if (num == 1)
    {
        return 1;
    }
    return Fib(num - 1) + Fib(num - 2);
}*/

double FiboNR(int n) // array of size n
{
    for (int i = 2; i <= n; i++)
    {
        F[i] = F[i - 1] + F[i - 2];
    }
    return (F[n]);
}

double FibMod(int i, int n) // array of size n
{
    if (i == n)
    {
        return F[i];
    }
    F[i] = F[i - 1] + F[i - 2];
    return (F[n]);
}

int _tmain(int argc, _TCHAR* argv[])
{
    /*cout << "----------------Recursion--------------" << endl;
    for (int i = 0; i < 36; i = i + 5)
    {
        clock_t tStart = clock();
        cout << Fib(i);
        printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
        cout << " : Fib(" << i << ")" << endl;
    }*/
    cout << "----------------Linear--------------" << endl;
    for (int i = 0; i < 200000000; i = i + 20000000)
    //for (int i = 0; i < 50; i = i + 5)
    {
        clock_t tStart = clock();
        F[0] = 0; F[1] = 1;
        cout << FiboNR(i);
        printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
        cout << " : Fib(" << i << ")" << endl;
    }
    cout << "----------------Recursion Modified--------------" << endl;
    for (int i = 0; i < 200000000; i = i + 20000000)
    //for (int i = 0; i < 50; i = i + 5)
    {
        clock_t tStart = clock();
        F[0] = 0; F[1] = 1;
        cout << FibMod(0, i);
        printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
        cout << " : Fib(" << i << ")" << endl;
    }
    std::cin.ignore();
    return 0;
}
If you go by the conventional programming approach, loops are faster. But there is a category of languages, called functional programming languages, which do not contain loops. I am a big fan of functional programming and an avid Haskell user. Haskell is a functional programming language; in it, instead of loops, you use recursion.
To implement fast recursion there is something known as tail recursion. Basically, to avoid pushing a lot of extra information onto the system stack, you write the function in such a way that all the computation is carried in the function parameters, so that nothing needs to be stored on the stack other than the return address. Once the final recursive call has been made, instead of unwinding the stack, the program can return directly from the first stack entry. Functional-language compilers are designed to handle this, and nowadays even non-functional languages implement tail-call optimization.
For example, consider the recursive solution for finding the factorial of a positive number. The basic implementation in C would be:
int fact(int n)
{
    if (n == 1 || n == 0)
        return 1;
    return n * fact(n - 1);
}
In the above approach, each time a recursive call is made, n is stored on the stack so that it can later be multiplied with the result of fact(n-1); this multiplication happens during stack unwinding. Now check out the following implementation.
int fact(int n, int result)
{
    if (n == 1 || n == 0)
        return result;
    return fact(n - 1, n * result);
}
In this approach we pass the partial computation along in the parameter result, so in the end the answer arrives directly in result. The only thing you have to do is pass 1 for result in the initial call; the stack can then be unwound directly to its first entry. C and C++ compilers such as GCC and Clang do perform tail-call optimization at higher optimization levels, though the language standards do not require it; functional programming languages typically guarantee it.
Your "recursion modified" version doesn't have recursion at all.
In fact, the only thing enabling a non-recursive version that fills in exactly one new entry of the array is the for-loop in your main function -- so it is actually a solution using iteration also (props to immibis and BlastFurnace for noticing that).
But your version doesn't even do that correctly. Rather, since it is always called with i == 0, it illegally reads F[-1] and F[-2]. You are lucky (?)1 the program didn't crash.
The reason you are getting correct results is that the entire F array is prefilled by the correct version.
Your attempt to calculate Fib(2000....) isn't successful anyway, since you overflow a double. Did you even try running that code?
Here's a version that works correctly (to the precision of double, anyway) and doesn't use a global array (it really is iteration vs recursion and not iteration vs memoization).
#include <cstdio>
#include <ctime>
#include <utility>

double FiboIterative(int n)
{
    double a = 0.0, b = 1.0;
    if (n <= 0) return a;
    for (int i = 2; i <= n; i++)
    {
        b += a;
        a = b - a;
    }
    return b;
}

std::pair<double, double> FiboRecursive(int n)
{
    if (n <= 0) return {};
    if (n == 1) return {0, 1};
    auto rec = FiboRecursive(n - 1);
    return {rec.second, rec.first + rec.second};
}

int main(void)
{
    const int repetitions = 1000000;
    const int n = 100;
    volatile double result;

    std::puts("----------------Iterative--------------");
    std::clock_t tStart = std::clock();
    for (int i = 0; i < repetitions; ++i)
        result = FiboIterative(n);
    std::printf("[%d] = %f\n", n, result);
    std::printf("Time taken: %.2f us\n", (std::clock() - tStart) / 1.0 / CLOCKS_PER_SEC);

    std::puts("----------------Recursive--------------");
    tStart = std::clock();
    for (int i = 0; i < repetitions; ++i)
        result = FiboRecursive(n).second;
    std::printf("[%d] = %f\n", n, result);
    std::printf("Time taken: %.2f us\n", (std::clock() - tStart) / 1.0 / CLOCKS_PER_SEC);

    return 0;
}
--
1Arguably anything that hides a bug is actually unlucky.
I don't think this is a bad question. And maybe the answer to the "why" is somewhat interesting.
First let me say that, generally, the statement is probably true. But...
Questions about the performance of C++ programs are very localized; it's never possible to give a good general answer, and every example should be profiled and analyzed separately, which involves lots of technicalities. C++ compilers are allowed to modify a program practically as they wish, as long as they don't produce visible side effects (whatever precisely that means). So as long as your computation gives the same result, it's fine. This technically allows transforming one version of your program into an equivalent one, even from a recursive version into a loop-based one and vice versa. So it depends on compiler optimizations and compiler effort.
Also, to compare one version to another, you would need to prove that the versions you compare are actually equivalent.
It might also happen that a recursive implementation of an algorithm is faster than a loop-based one, if it's easier for the compiler to optimize. Usually iterative versions are more complex, and generally the simpler the code is, the easier it is for the compiler to optimize, because it can make assumptions about invariants, etc.
I have this function prototype code for factorial calculation by iteration.
How do I include a timer to produce the total time spent looping 100 times over the function?
for (unsigned long i=number; i>=1; i--) result *=i;
My C++ knowledge is barely basic, so I'm not sure whether "loop" is even the right term here.
However, I was hinted to use .
Please advise.
Thank you.
Here's a proper C++11 version of some timing logic:
using namespace std;
using namespace chrono;

auto start_time = system_clock::now();

// your loop goes here:
for (unsigned long i = number; i >= 1; i--) result *= i;

auto end_time = system_clock::now();
auto durationInMicroSeconds = duration_cast<microseconds>(end_time - start_time);
cout << "Looping " << number << " times took " << durationInMicroSeconds.count() << " microseconds" << endl;
Just for sport, here's a simple RAII-based variation:
class Timer {
public:
    explicit Timer(const string& name)
        : name_(name)
        , start_time_(system_clock::now()) {
    }

    ~Timer() {
        auto end_time = system_clock::now();
        auto durationInMicroSeconds = duration_cast<microseconds>(end_time - start_time_);
        cout << "Timer: " << name_ << " took " << durationInMicroSeconds.count() << " microseconds" << endl;
    }

private:
    string name_;
    system_clock::time_point start_time_;
};
Sure, it's a bit more code, but once you have that, you can reuse it fairly efficiently:
{
    Timer timer("loops");
    // your loop goes here:
    for (unsigned long i = number; i >= 1; i--) result *= i;
}
If you are looking for the time spent executing a number of looping statements in the program code, try making use of gettimeofday() as below:
#include <sys/time.h>

struct timeval tv1, tv2;
gettimeofday(&tv1, NULL);

/* Your loop code to execute here */

gettimeofday(&tv2, NULL);
printf("Time taken in execution = %f seconds\n",
       (double) (tv2.tv_usec - tv1.tv_usec) / 1000000 +
       (double) (tv2.tv_sec - tv1.tv_sec));
This solution leans more toward C, but it can be employed in your case to calculate the time spent.
This is a perfect situation for a lambda. Honestly I don't remember the exact syntax offhand, but it should be something like this (written as a function template so it accepts any callable):

template <typename F>
auto timer(F f) {
    auto start = system_clock::now();
    f();
    return system_clock::now() - start;
}
To use it, you wrap your code in a lambda and pass it to the timer. The effect is very similar to @Martin J.'s code.

auto code_time = timer([&]() {
    // put any code that you want to time here
});

auto loop_time = timer([&]() {
    for (unsigned long i = number; i >= 1; i--) {
        result *= i;
    }
});