I'm trying to write a simple benchmarking program to compare different operations. Before moving on to the actual functions, I wanted to check a trivial case with a well-documented outcome: multiplication vs. division.
From the literature I have read, division should lose by a fair margin. When I compiled and ran the program, the times were almost identical. I added an accumulator that is printed, to make sure the operations are actually carried out, and tried again. Then I changed the loop, the numbers, shuffled the data, and more, all to rule out anything that could let "divide" do anything other than a floating-point division. To no avail: the times are still essentially equal.
At this point I don't see where it could weasel its way out of the floating-point divide, and I give up. It wins. But I am really curious why the times are so close, which caveats or bugs I missed, and how to fix them.
(I know filling the vector with random data and then shuffling is redundant, but I wanted to make sure the data was accessed, not just initialized, before the loop.)
("String compares are evil", I am aware. If they are the cause of the equal times, I will gladly join the witch hunt; if not, please don't mention it.)
compile:
g++ -std=c++14 main.cc
tests:
./a.out multiply
2.42202e+09
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.218529
Average length of function : 2.18529e-07 seconds
./a.out divide
2.56147e+06
1000000
t1 = 1.52422e+09 t2 = 1.52422e+09
difference = 0.242061
Average length of function : 2.42061e-07 seconds
the code:
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <random>
#include <sys/time.h>
#include <sys/resource.h>
double get_time()
{
    struct timeval t;
    struct timezone tzp;
    gettimeofday(&t, &tzp);
    return t.tv_sec + t.tv_usec * 1e-6;
}

double multiply(double lhs, double rhs){
    return lhs * rhs;
}

double divide(double lhs, double rhs){
    return lhs / rhs;
}
int main(int argc, char *argv[]){
    if (argc == 1)
        return 0;
    double grounder = 0; // accumulate results to prevent the loops from being optimized away
    std::default_random_engine generator;
    std::uniform_real_distribution<double> distribution(1.0, 100.0);
    size_t loop1 = argc > 2 ? std::stoi(argv[2]) : 1000;
    size_t loop2 = argc > 3 ? std::stoi(argv[3]) : 1000;
    // note: the element type must be double; with std::vector<size_t> the
    // generated values would be truncated to integers
    std::vector<double> vecL1(loop1);
    std::generate(vecL1.begin(), vecL1.end(), [generator, distribution] () mutable { return distribution(generator); });
    std::vector<double> vecL2(loop2);
    std::generate(vecL2.begin(), vecL2.end(), [generator, distribution] () mutable { return distribution(generator); });
    double (*fp)(double, double) = nullptr;
    std::string function(argv[1]);
    if (function == "multiply")
        fp = multiply;
    else if (function == "divide")
        fp = divide;
    if (fp == nullptr) // unknown operation name
        return 0;
    std::random_shuffle(vecL1.begin(), vecL1.end());
    std::random_shuffle(vecL2.begin(), vecL2.end());
    double t1 = get_time();
    for (auto outer = vecL1.begin(); outer != vecL1.end(); outer++)
        for (auto inner = vecL2.begin(); inner != vecL2.end(); inner++)
            grounder += (*fp)(*inner, *outer);
    double t2 = get_time();
    std::cout << grounder << '\n';
    std::cout << (loop1 * loop2) << '\n';
    std::cout << "t1 = " << t1 << "\tt2 = " << t2
              << "\ndifference = " << (t2 - t1) << '\n';
    std::cout << "Average length of function : " << (t2 - t1) / (loop1 * loop2) << " seconds\n";
    return 0;
}
You aren't just measuring the speed of multiply/divide. If you put your code into https://godbolt.org/ you can see the assembly generated.
You are measuring the speed of calling a function through a pointer and then doing a multiply/divide inside it. The time taken by the single mul/div instruction is tiny compared to the cost of the function call, so it gets lost in the noise. If you move the loop inside your function you'll probably see more of a difference. Note that with the loop inside the function the compiler may decide to vectorise the code; that will still show whether there is a difference between multiply and divide, but it won't be measuring the difference for a single mul/div instruction.
Related
This assignment is about a boat race on a lake.
I have an array of N wind speeds.
I am given a number K, which determines how many consecutive days must have a wind speed between 10 and 100.
If I find K consecutive elements in that range, I have to print the index of the first element of that sequence.
The goal is to find the day on which the race can be started.
For example:
S[10] = {50,40,0,5,0,80,70,90,100,120}
K=3
The output has to be 6, because the qualifying sequence starts at the 6th element of the array.
I don't know how to implement this check.
I tried this:
for (int i = 0; i < N - 2; i++) {
    if (((10 <= S[i]) && (S[i] <= 100)) && ((10 <= S[i+1]) && (S[i+1] <= 100)) && ((10 <= S[i+2]) && (S[i+2] <= 100))) {
        canBeStarted = true;
        whichDayItCanBeStarted = i;
    }
}
cout << whichDayItCanBeStarted << endl;
But I realised that K can be any number, so I have to examine K elements at once.
Making use of the algorithms standard library
(Restriction: the following answer provides an approach valid for C++17 and beyond)
For a problem such as this one, rather than re-inventing the wheel, you might want to consider turning to the algorithms library in the standard library, making use of std::transform and std::search_n to
first map each wind speed to a bool indicating whether it is within limits, and then
search the result of the transform for a run of K consecutive true (valid wind speed) elements.
E.g.:
#include <algorithm> // std::search_n, std::transform
#include <cstdint>   // uint8_t (for wind speeds)
#include <iostream>  // std::cout
#include <iterator>  // std::back_inserter, std::distance
#include <vector>    // std::vector

int main() {
    // Wind data and wind restrictions.
    const std::vector<uint8_t> wind_speed{50U, 40U, 0U, 5U, 0U,
                                          80U, 70U, 90U, 100U, 120U};
    const uint8_t minimum_wind_speed = 10U;
    const uint8_t maximum_wind_speed = 100U;
    const std::size_t minimum_consecutive_days = 3;

    // Map wind speeds -> wind speed within limits.
    std::vector<bool> wind_within_limits;
    std::transform(wind_speed.begin(), wind_speed.end(),
                   std::back_inserter(wind_within_limits),
                   [](uint8_t wind_speed) -> bool {
                       return (wind_speed >= minimum_wind_speed) &&
                              (wind_speed <= maximum_wind_speed);
                   });

    // Find the first K (minimum_consecutive_days) consecutive days with
    // wind speed within limits.
    const auto starting_day =
        std::search_n(wind_within_limits.begin(), wind_within_limits.end(),
                      minimum_consecutive_days, true);

    if (starting_day != wind_within_limits.end()) {
        std::cout << "Race may start at day "
                  << std::distance(wind_within_limits.begin(), starting_day) + 1
                  << ".";
    } else {
        std::cout
            << "Wind speeds during the specified days exceed race conditions.";
    }
}
Alternatively, we can fold the transform into a binary predicate passed to the std::search_n invocation. This yields a more compact solution, but, in my opinion, with somewhat worse semantics and readability.
#include <algorithm> // std::search_n
#include <cstdint>   // uint8_t (for wind speeds)
#include <iostream>  // std::cout
#include <iterator>  // std::distance
#include <vector>    // std::vector

int main() {
    // Wind data and wind restrictions.
    const std::vector<uint8_t> wind_speed{50U, 40U, 0U, 5U, 0U,
                                          80U, 70U, 90U, 100U, 120U};
    const uint8_t minimum_wind_speed = 10U;
    const uint8_t maximum_wind_speed = 100U;
    const std::size_t minimum_consecutive_days = 3;

    // Find any K (minimum_consecutive_days) consecutive days with wind speed
    // within limits.
    const auto starting_day = std::search_n(
        wind_speed.begin(), wind_speed.end(), minimum_consecutive_days, true,
        [](uint8_t wind_speed, bool) -> bool {
            return (wind_speed >= minimum_wind_speed) &&
                   (wind_speed <= maximum_wind_speed);
        });

    if (starting_day != wind_speed.end()) {
        std::cout << "Race may start at day "
                  << std::distance(wind_speed.begin(), starting_day) + 1 << ".";
    } else {
        std::cout
            << "Wind speeds during the specified days exceed race conditions.";
    }
}
Both of the programs above, given the particular (hard-coded) wind data and restrictions that you've provided, result in:
Race may start at day 6.
You'd need a counter variable that's initially set to 0, and another variable to store the index where the sequence begins. Iterate through the array one element at a time. If you find an element between 10 and 100, check whether the counter is 0; if it is, store the current index in the other variable. Then increment the counter. If the counter equals K, you're done, so break out of the loop. Otherwise, if the element isn't between 10 and 100, reset the counter to 0.
When working on a specific problem, I often come up with different solutions, and I'm not sure how to choose between them. The first idea is to compare the complexity of the two solutions, but sometimes they share the same complexity, or they differ but the input range is small enough that the constant factor matters.
The second idea is to benchmark both solutions. However, I'm not sure how to time them in C++. I have found this question:
How to Calculate Execution Time of a Code Snippet in C++, but I don't know how to properly deal with compiler optimizations or processor inconsistencies.
In short: is the code in the question above sufficient for everyday tests? Are there options I should enable in the compiler before running the tests? (I'm using Visual C++.) How many runs should I do, and how large a time difference between the two benchmarks matters?
Here is an example of code I want to test. Which of these is faster? How can I determine that myself?
unsigned long long fiborecursion(int rank){
    if (rank == 0) return 1;
    else if (rank < 0) return 0;
    return fiborecursion(rank-1) + fiborecursion(rank-2);
}

double sq5 = sqrt(5); // needs <cmath>
unsigned long long fiboconstant(int rank){
    return pow((1 + sq5) / 2, rank + 1) / sq5 + 0.5;
}
Using the clock from this answer
#include <iostream>
#include <chrono>

class Timer
{
public:
    Timer() : beg_(clock_::now()) {}
    void reset() { beg_ = clock_::now(); }
    double elapsed() const {
        return std::chrono::duration_cast<second_>(clock_::now() - beg_).count();
    }
private:
    typedef std::chrono::high_resolution_clock clock_;
    typedef std::chrono::duration<double, std::ratio<1>> second_;
    std::chrono::time_point<clock_> beg_;
};
You can write a program to time both of your functions.
int main() {
    const int N = 10000;
    Timer tmr;

    tmr.reset();
    for (int i = 0; i < N; i++) {
        // cap the rank at 30: naive recursion at rank ~50 would take hours
        auto value = fiborecursion(i % 30);
    }
    double time1 = tmr.elapsed();

    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiboconstant(i % 30);
    }
    double time2 = tmr.elapsed();

    std::cout << "Recursion"
              << "\n\tTotal: " << time1
              << "\n\tAvg: " << time1 / N
              << "\n"
              << "\nConstant"
              << "\n\tTotal: " << time2
              << "\n\tAvg: " << time2 / N
              << "\n";
}
I would try compiling with no compiler optimizations (-O0) and maximum optimizations (-O3) just to see what the differences are. Be aware that at maximum optimization the compiler may eliminate the loops entirely, since their results are never used.
I am playing around with Eigen, doing some calculations with matrices and logs/exps, but I find the expressions I end up with a bit clumsy (and possibly slower?). Is there a better way to write calculations like this?
MatrixXd m = MatrixXd::Random(3,3);
m = m * (m.array().log()).matrix();
That is, without having to convert to arrays and then back to a matrix?
If you are mixing array and matrix operations you can't really avoid the conversions, except for the functions that have a cwise variant working directly on matrices (e.g., cwiseSqrt(), cwiseAbs()).
However, neither .array() nor .matrix() has any runtime cost when compiled with optimizations (on any reasonable compiler).
If you find it more readable, you can also work with unaryExpr().
I agree fully with chtz's answer, and reiterate that there is no runtime cost to the "casts." You can confirm using the following toy program:
#include "Eigen/Core"
#include <iostream>
#include <chrono>
#include <cmath>

using namespace Eigen;

int main()
{
    typedef MatrixXd matType;
    //typedef MatrixXf matType;
    volatile int vN = 1024 * 4;
    int N = vN;

    auto startAlloc = std::chrono::system_clock::now();
    matType m = matType::Random(N, N).array().abs();
    matType r1 = matType::Zero(N, N);
    matType r2 = matType::Zero(N, N);
    auto finishAlloc = std::chrono::system_clock::now();

    r1 = m * (m.array().log()).matrix();
    auto finishLog = std::chrono::system_clock::now();

    // note: the function pointer type must match the scalar type of matType
    // (double here; use float (*)(float) with MatrixXf)
    r2 = m * m.unaryExpr<double (*)(double)>(&std::log);
    auto finishUnary = std::chrono::system_clock::now();

    std::cout << (r1 - r2).array().abs().maxCoeff() << '\n';
    std::cout << "Allocation\t" << std::chrono::duration<double>(finishAlloc - startAlloc).count() << '\n';
    std::cout << "Log\t\t" << std::chrono::duration<double>(finishLog - finishAlloc).count() << '\n';
    std::cout << "unaryExpr\t" << std::chrono::duration<double>(finishUnary - finishLog).count() << '\n';

    return 0;
}
On my computer, there is a slight advantage (~4%) to the first form, which probably has to do with the way the memory is loaded (unchecked). Beyond that, the reason for "casting" the type is to remove ambiguity. For a clear example, consider operator*: in the matrix world it means matrix multiplication, whereas in the array world it means coefficient-wise multiplication. The ambiguity in the case of exp and log is with the matrix exponential and matrix logarithm, respectively. Presumably you want the element-wise exp and log, and therefore the cast is necessary.
I have this very simple function that computes the value of (N^(N-1))^(N-2):
#include <iostream>
#include <cmath>
using namespace std;

int main() {
    // Declare variables
    double n;
    double answer;
    // Read input and compute
    cout << "Please enter a double number >= 3: ";
    cin >> n;
    answer = pow(n, (n-1)*(n-2));
    cout << "(n to the n-1) to the n-2 for doubles is " << answer << endl;
}
Based on this formula, it is evident the result will eventually reach infinity, but I am curious at what value of n it first does. Using a loop seems extremely inefficient, but that's all I can think of: basically, let n range from 1 to 100 and iterate until the result is infinite.
Is there a more efficient approach to this problem?
I think you are approaching this the wrong way.
Let F(N) be the function N^((N-1)(N-2)), which equals (N^(N-1))^(N-2).
You know the largest finite value that can be stored in a double: DBL_MAX, roughly 1.7976931348623157e308.
So set F(N) = DBL_MAX and just solve for N.
Does this answer your question?
Two things. First, (N^(N-1))^(N-2) can be written as N^((N-1)*(N-2)), which removes one pow call and makes your code faster:
pow(n, (n-1)*(n-2));
Second, to find the practical limit, testing all N literally takes a fraction of a second, so there really is no reason to look for another approach.
You could work it out by hand from the variable size limits, but testing it is definitely faster. An example of code (C++11, since I use std::isinf):
#include <iostream>
#include <cmath>
#include <iomanip>

int main() {
    double N = 1.0, diff = 10.0;
    const unsigned digits = 10;
    unsigned counter = digits;
    while ( true ) {
        double X = std::pow( N, (N-1.0) * (N-2.0) );
        if ( std::isinf(X) ) {
            --counter;
            if ( !counter ) {
                std::cout << std::setprecision(digits) << N << "\n";
                break;
            }
            N -= diff;
            diff /= 10;
        }
        N += diff;
    }
    return 0;
}
This example takes less than a millisecond on my computer, and prints 17.28894235
I have written the following R and C++ code, which implement the same algorithm:
a) Simulate the random variable X 500 times (X is 0.9 with probability 0.5 and 1.1 with probability 0.5).
b) Multiply the 500 simulated values together and save the product in a container.
c) Repeat 10000000 times, so the container holds 10000000 values.
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>

const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;

void generatereturns(size_t steps, int RUNS){
    mutex2.lock();
    // setting seed
    try{
        std::mt19937 tmpgenerator(seed_);
        seed_ = tmpgenerator();
        std::cout << "SEED : " << seed_ << std::endl;
    }catch(int exception){
        mutex2.unlock();
    }
    mutex2.unlock();

    // Creating generator
    std::binomial_distribution<int> distribution(steps, 0.5);
    std::mt19937 generator(seed_);
    for(int i = 0; i != RUNS; ++i){
        double power;
        double returns;
        power = distribution(generator);
        returns = pow(0.9, power) * pow(1.1, (double)steps - power);
        std::lock_guard<std::mutex> guard(mutex1);
        cache.push_back(returns);
    }
}
int main(){
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    size_t steps = 500;
    seed_ = 777;
    unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(), (unsigned)1);
    int remainder = MCsize % concurentThreadsSupported;
    std::vector<std::thread> threads;

    // starting sub-thread simulations
    if(concurentThreadsSupported != 1){
        for(int i = 0; i != concurentThreadsSupported - 1; ++i){
            if(remainder != 0){
                threads.push_back(std::thread(generatereturns, steps, MCsize / concurentThreadsSupported + 1));
                remainder--;
            }else{
                threads.push_back(std::thread(generatereturns, steps, MCsize / concurentThreadsSupported));
            }
        }
    }

    // starting main thread simulation
    if(remainder != 0){
        generatereturns(steps, MCsize / concurentThreadsSupported + 1);
        remainder--;
    }else{
        generatereturns(steps, MCsize / concurentThreadsSupported);
    }
    for (auto& th : threads) th.join();

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    typedef std::chrono::duration<int, std::milli> millisecs_t;
    millisecs_t duration(std::chrono::duration_cast<millisecs_t>(end - start));
    std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n";
    return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs. 12 s), even though I have used four threads in the C++ version. Can anyone enlighten me? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....
// Creating generator
std::binomial_distribution<int> distribution(steps, 0.5);
std::mt19937 generator(seed_);
std::vector<double> tmpvec(RUNS);
for(int i = 0; i != RUNS; ++i){
    double power;
    double returns;
    power = distribution(generator);
    returns = pow(0.9, power) * pow(1.1, (double)steps - power);
    tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(), tmpvec.end(), currit);
currit += RUNS;
Instead of locking on every iteration, I fill a temporary vector and then use std::move to shift its elements into cache under a single lock. The elapsed time is now down to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15s to ~4.5s on my laptop (windows 7, i5 3210M).
Also, reducing the number of threads to 2 instead of 4 in my case (I just have 2 cores but with hyperthreading) further reduced the running time to ~2.4s.
Changing the variable power to int (as jimifiki also suggested) also offered a slight boost, reducing the time to ~2.3s.
I really enjoyed your question and I tried the code at home. I experimented with changing the random number generator; my implementation of std::binomial_distribution requires on average about 9.6 calls to generator().
I know the question is more about comparing R with C++ performance, but since you ask "How should I improve my C++ code to make it run faster?" I'll insist on the pow optimization. You can easily avoid half of the calls by precomputing 0.9^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9, steps);
double ratio = 1.1 / 0.9;
for(int i = 0; i != RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)steps - power);
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- pow1 * ratio^a
...
Probably doesn't help you that much, but you could start by using pow(double, int) when your exponent is an int:
int power;
...
returns = pow(0.9, power) * pow(1.1, (int)steps - power);
Do you see any improvement?