Huge differences in implementations? - C++

I am writing a few functions for distributions and used the normal distribution to run tests comparing my implementation against C++ Boost.
Given the probability density function (pdf: http://www.mathworks.com/help/stats/normpdf.html), which I wrote like this:
double NormalDistribution1D::prob(double x) {
    return (1 / (sigma * (std::sqrt(boost::math::constants::pi<double>() * 2)))) *
           std::exp((-1 / 2) * (((x - mu) / sigma) * ((x - mu) / sigma)));
}
To compare my results, here is how it is done with C++ Boost:
boost::math::normal_distribution <> d(mu, sigma);
return boost::math::pdf(d, x);
I was not pleasantly surprised: my version took 44278 nanoseconds, Boost only 326.
So I played around a bit and wrote the method probboost in my NormalDistribution1D class and compared all three:
void MATTest::runNormalDistribution1DTest1() {
    double mu = 0;
    double sigma = 1;
    double x = 0;

    std::chrono::high_resolution_clock::time_point tn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
    double nres = n->prob(x);
    std::chrono::high_resolution_clock::time_point tn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point tdn_start = std::chrono::high_resolution_clock::now();
    NormalDistribution1D *n1 = new NormalDistribution1D(mu, sigma);
    double nres1 = n1->probboost(x);
    std::chrono::high_resolution_clock::time_point tdn_end = std::chrono::high_resolution_clock::now();

    std::chrono::high_resolution_clock::time_point td_start = std::chrono::high_resolution_clock::now();
    boost::math::normal_distribution<> d(mu, sigma);
    double dres = boost::math::pdf(d, x);
    std::chrono::high_resolution_clock::time_point td_end = std::chrono::high_resolution_clock::now();

    std::cout << "Mu: " << mu << "; Sigma: " << sigma << "; x: " << x << std::endl;
    if (nres == dres) {
        std::cout << "Result: " << nres << std::endl;
    } else {
        std::cout << "\033[1;31mRes incorrect: " << nres << "; Correct: " << dres << "\033[0m" << std::endl;
    }

    auto duration_n = std::chrono::duration_cast<std::chrono::nanoseconds>(tn_end - tn_start).count();
    auto duration_d = std::chrono::duration_cast<std::chrono::nanoseconds>(td_end - td_start).count();
    auto duration_dn = std::chrono::duration_cast<std::chrono::nanoseconds>(tdn_end - tdn_start).count();

    std::cout << "own boost: " << duration_dn << std::endl;
    if (duration_n < duration_d) {
        std::cout << "Boost: " << (duration_d) << "; own implementation: " << duration_n << std::endl;
    } else {
        std::cout << "\033[1;31mBoost faster: " << (duration_d) << "; than own implementation: " << duration_n << "\033[0m" << std::endl;
    }
}
The results are (I compiled and ran the checking method 3 times):
own boost: 1082
Boost faster: 326; than own implementation: 44278
own boost: 774
Boost faster: 216; than own implementation: 34291
own boost: 769
Boost faster: 230; than own implementation: 33456
Now this puzzles me:
How is it possible that the method from the class takes three times longer than the same statements called directly?
My compile options:
g++ -O2 -c -g -std=c++11 -MMD -MP -MF "build/Debug/GNU-Linux-x86/main.o.d" -o build/Debug/GNU-Linux-x86/main.o main.cpp
g++ -O2 -o ***Classes***

First of all, you are allocating your object dynamically, with new:
NormalDistribution1D *n = new NormalDistribution1D(mu, sigma);
double nres = n->prob(x);
If you did it just as you did with Boost, that alone would be enough to get the same (or comparable) speed:
NormalDistribution1D n(mu, sigma);
double nres = n.prob(x);
Now, I don't know whether the way you spelled out your expression in NormalDistribution1D::prob() is significant, but I am skeptical that writing it in a more "optimized" way would make any difference, because arithmetic expressions like that are exactly the kind of thing compilers optimize very well. It might get faster with the -ffast-math switch, which gives the compiler more freedom over floating-point optimizations.
Also, if the definition of double NormalDistribution1D::prob(double x) is in another compilation unit (another .cpp file), the compiler cannot inline it, and that also adds noticeable overhead (perhaps making the call up to twice as slow). In Boost, almost everything is implemented inside headers, so inlining always happens when the compiler sees fit. You can overcome this problem by compiling and linking with gcc's -flto switch.
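For illustration, a sketch of how those two switches might be added to the compile commands from the question (the output name and object list below are placeholders, since the original link line is elided):
g++ -O2 -ffast-math -flto -c -g -std=c++11 -o build/Debug/GNU-Linux-x86/main.o main.cpp
g++ -O2 -ffast-math -flto -o myprog build/Debug/GNU-Linux-x86/*.o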

You didn't compile with the -ffast-math option. That means the compiler cannot (in fact, must not!) simplify (-1 / 2)*(((x - mu) / sigma)*((x - mu) / sigma)) into a form similar to the one used in boost::math::pdf:
expo = (x - mu) / sigma
expo *= -expo
expo /= 2
result = std::exp(expo)
result /= sigma * std::sqrt(2 * boost::math::constants::pi<double>())
Writing it out step by step like this steers the compiler toward the fast (but potentially less accurate) evaluation without resorting to -ffast-math.
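Folded back into the member function, a sketch of that form could look like this (assuming mu and sigma are double members of NormalDistribution1D, as in the question):
double NormalDistribution1D::prob(double x) {
    double expo = (x - mu) / sigma;   // standardized distance
    expo *= -expo;                    // -((x - mu) / sigma)^2
    expo /= 2;                        // exponent of the pdf
    double result = std::exp(expo);
    result /= sigma * std::sqrt(2 * boost::math::constants::pi<double>());
    return result;
}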
Secondly, the time difference between the above and your code is minimal compared to the cost of allocating from the heap (new) versus the stack (a local variable). You are mostly timing the cost of the dynamic memory allocation.

Related

Preventing compiler from optimising a loop

I want to measure the time it takes to call a function.
Here is the code:
for (int i = 0; i < 5; ++i) {
    std::cout << "Pass : " << i << "\n";
    const auto t0 = std::chrono::high_resolution_clock::now();
    system1.euler_intregration(0.0166667);
    const auto t1 = std::chrono::high_resolution_clock::now();
    std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count() << "\n";
}
but the compiler keeps optimising the loop away, so no time is measured and it reports zero.
I have tried using asm("") and __asm__("") as advised here, but nothing works for me.
I must admit that I don't really know how these asm() statements work, so I might be using them in the wrong way.
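For reference, the asm trick mentioned above is usually wrapped in a small helper like the one below. This is a GCC/Clang-specific sketch, not standard C++, and system1 is the object from the question's own loop:
// Empty asm statement that claims to read `value` and clobber memory,
// so the optimizer must keep the work that produced it (GCC/Clang only).
template <class T>
void do_not_optimize(T const& value) {
    asm volatile("" : : "g"(&value) : "memory");
}

// inside the timed loop, right after the call being measured:
// system1.euler_intregration(0.0166667);
// do_not_optimize(system1);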

Chrono C++ timings not correct

I'm just comparing the speed of a couple of Fibonacci functions. One gives its output almost immediately and reports that it finished in 500 nanoseconds, while the other, depending on the depth, may sit there computing for many seconds, yet when it is done it reports that it took only 100 nanoseconds... after I just sat there and waited about 20 seconds for it.
It's not a big deal, as I can prove the other one is slower with raw human perception alone, but why would chrono not be working? Something to do with recursion?
PS: I know that fibonacci2() doesn't give the correct output for odd-numbered depths; I'm just testing some things, and the output is only there so the compiler doesn't optimize it away. Go ahead and copy this code and you'll see fibonacci2() output immediately, but you'll have to wait about 5 seconds for fibonacci(). Thank you.
#include <iostream>
#include <chrono>

int fibonacci2(int depth) {
    static int a = 0;
    static int b = 1;
    if (b > a) {
        a += b; //std::cout << a << '\n';
    } else {
        b += a; //std::cout << b << '\n';
    }
    if (depth > 1) {
        fibonacci2(depth - 1);
    }
    return a;
}

int fibonacci(int n) {
    if (n <= 1) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

int main() {
    int f = 0;

    auto start2 = std::chrono::steady_clock::now();
    f = fibonacci2(44);
    auto stop2 = std::chrono::steady_clock::now();
    std::cout << f << '\n';
    auto duration2 = std::chrono::duration_cast<std::chrono::nanoseconds>(stop2 - start2);
    std::cout << "faster function time: " << duration2.count() << '\n';

    auto start = std::chrono::steady_clock::now();
    f = fibonacci(44);
    auto stop = std::chrono::steady_clock::now();
    std::cout << f << '\n';
    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    std::cout << "way slower function with incorrect time: " << duration.count() << '\n';
}
I don't know which compiler and options you are using, but I tested x64 MSVC v19.28 with /O2 on godbolt. There the compiled instructions are reordered such that the performance counter is queried twice before fibonacci(int) is ever invoked, which in source terms would look like:
auto start = ...;
auto stop = ...;
f = fibonacci(44);
A solution to disallow this reordering might be to use an atomic_thread_fence just before and after the fibonacci function call.
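A minimal sketch of that suggestion, dropped into the question's main() (std::atomic_thread_fence lives in <atomic>; whether the fences are strictly required here is debatable, but they state the intent to the compiler):
auto start = std::chrono::steady_clock::now();
std::atomic_thread_fence(std::memory_order_seq_cst); // discourage hoisting the call above the first timestamp
f = fibonacci(44);
std::atomic_thread_fence(std::memory_order_seq_cst); // ...or sinking it below the second one
auto stop = std::chrono::steady_clock::now();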
As Mestkon answered, the compiler can reorder your code.
Examples of how to prevent the compiler from reordering: Memory Ordering - Compile Time Memory Barrier
It would be beneficial in the future if you provided information about which compiler you are using.
gcc 7.5 with -O2, for example, does not reorder the timer instructions in this scenario.
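The compile-time barrier from that link is typically written as below (again a GCC/Clang-specific sketch, not standard C++); it emits no instructions but forbids the optimizer from moving memory accesses across it:
// Classic GCC/Clang compiler barrier; place it immediately before and after f = fibonacci(44);
#define COMPILER_BARRIER() asm volatile("" ::: "memory")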

Timing Functions: Double Returning 0 MS

I am writing an in-depth test program for a data structure I had to write for a class. I am trying to time how long functions take to execute and store the results in an array for later printing. To double-check that it was working, I decided to print the values immediately, and I found out it is not working.
Here is the code where I take the times and store them in an array inside a struct:
void test1(ArrayLinkedBag<ItemType> &bag, TestAnalytics &analytics) {
    clock_t totalStart;
    clock_t incrementalStart;
    clock_t stop; // Both timers stop at the same time

    // Start TEST 1
    totalStart = clock();
    bag.debugPrint();
    cout << "Bag Should Be Empty, Checking..." << endl;

    incrementalStart = clock();
    checkEmpty<ItemType>(bag);
    stop = clock();

    analytics.test1Times[0] = analytics.addTimes(incrementalStart, stop);
    analytics.test1Times[1] = analytics.addTimes(totalStart, stop);

    cout << analytics.test1Times[0] << setprecision(5) << "ms" << endl;
    std::cout << "Time: " << setprecision(5) << (stop - totalStart) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;
    cout << "===========================================" << endl; // So I can find the line easier
}
Here is the code where I do the calculation that goes into the array; this function is part of the TestAnalytics struct:
double addTimes(double start, double stop) {
    return (stop - start) / (double)(CLOCKS_PER_SEC / 1000);
}
Here is a snippet of the output I am getting:
Current Head: -1
Current Size: 0
Cell: 1, Index: 0, Item: 6317568, Next Index: -2
Cell: 2, Index: 1, Item: 4098, Next Index: -2
Cell: 3, Index: 2, Item: 6317544, Next Index: -2
Cell: 4, Index: 3, Item: -683175280, Next Index: -2
Cell: 5, Index: 4, Item: 4201274, Next Index: -2
Cell: 6, Index: 5, Item: 6317536, Next Index: -2
Bag Should Be Empty, Checking...
The Bag Is Empty
0ms
Time: 0 ms
===========================================
I am trying to calculate the time as described in a different post on this site.
I am using the clang compiler on a UNIX system. Is it possible that the number is still too small to register above 0?
Unless you're stuck with an old (pre-C++11) compiler/library, I'd use the functions from the <chrono> header:
template <class ItemType>
void test1(ArrayLinkedBag<ItemType> &bag) {
    using namespace std::chrono;

    auto start = high_resolution_clock::now();
    bag.debugPrint();
    auto first = high_resolution_clock::now();
    checkEmpty(bag);
    auto stop = high_resolution_clock::now();

    std::cout << " first time: " << duration_cast<microseconds>(first - start).count() << " us\n";
    std::cout << "second time: " << duration_cast<microseconds>(stop - start).count() << " us\n";
}
Some parts are a bit verbose (to put it nicely), but it still works reasonably well. duration_cast supports difference types down to (at least) nanoseconds, which is typically sufficient for timing even relatively small/fast pieces of code (though there is no guarantee the underlying clock actually has nanosecond precision).
In addition to Jerry's good answer (which I've upvoted), I wanted to add just a little more information that might be helpful.
For timing I recommend steady_clock over high_resolution_clock because steady_clock is guaranteed to not be adjusted (especially backwards) during your timing. Now on Visual Studio and clang, this can't possibly happen because high_resolution_clock and steady_clock are exactly the same type. However if you're using gcc, high_resolution_clock is the same type as system_clock, which is subject to being adjusted at any time (say by an NTP correction).
But if you use steady_clock, then on every platform you have a stop-watch-like timer: Not good for telling you the time of day, but not subject to being corrected at an inopportune moment.
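As a minimal sketch, the only change needed in the earlier <chrono> code is the clock type:
using Clock = std::chrono::steady_clock;   // monotonic; never adjusted backwards

auto t0 = Clock::now();
// ... code being timed ...
auto t1 = Clock::now();
auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();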
Also, if you use my free, open-source, header-only <chrono> extension library, it can stream out durations in a much friendlier manner, without having to use duration_cast or .count(). It prints the duration units right along with the value.
Finally, if you call steady_clock::now() multiple times in a row (with nothing in between), and print out that difference, then you can get a feel for how precisely your implementation is able to time things. Can it time something as short as femtoseconds? Probably not. Is it as coarse as milliseconds? We hope not.
Putting this all together, the following program was compiled like this:
clang++ test.cpp -std=c++14 -O3 -I../date/include
The program:
#include "date/date.h"
#include <iostream>
int
main()
{
using namespace std::chrono;
using date::operator<<;
for (int i = 0; i < 100; ++i)
{
auto t0 = steady_clock::now();
auto t1 = steady_clock::now();
auto t2 = steady_clock::now();
auto t3 = steady_clock::now();
auto t4 = steady_clock::now();
auto t5 = steady_clock::now();
auto t6 = steady_clock::now();
std::cout << t1-t0 << '\n';
std::cout << t2-t1 << '\n';
std::cout << t3-t2 << '\n';
std::cout << t4-t3 << '\n';
std::cout << t5-t4 << '\n';
std::cout << t6-t5 << '\n';
}
}
And output for me on macOS:
150ns
80ns
69ns
53ns
63ns
64ns
88ns
54ns
66ns
66ns
59ns
56ns
59ns
69ns
76ns
74ns
73ns
73ns
64ns
60ns
58ns
...

Clang performance drop for specific C++ random number generation

Using C++11's random module, I encountered an odd performance drop when using std::mt19937 (both the 32- and 64-bit versions) in combination with a uniform_real_distribution (float or double, doesn't matter). Compared to a g++ compile, it's more than an order of magnitude slower!
The culprit isn't just the mt generator, as it's fast with a uniform_int_distribution. And it isn't a general flaw in the uniform_real_distribution since that's fast with other generators like default_random_engine. Just that specific combination is oddly slow.
I'm not very familiar with the intrinsics, but the Mersenne Twister algorithm is more or less strictly defined, so a difference in implementation couldn't account for this gap, I guess? The measurement program is below, but here are my results for clang 3.4 and gcc 4.8.1 on a 64-bit Linux machine:
gcc 4.8.1
runtime_int_default: 185.6
runtime_int_mt: 179.198
runtime_int_mt_64: 175.195
runtime_float_default: 45.375
runtime_float_mt: 58.144
runtime_float_mt_64: 94.188
clang 3.4
runtime_int_default: 215.096
runtime_int_mt: 201.064
runtime_int_mt_64: 199.836
runtime_float_default: 55.143
runtime_float_mt: 744.072 <--- this and
runtime_float_mt_64: 783.293 <- this is slow
Program to generate this and try out yourself:
#include <iostream>
#include <vector>
#include <chrono>
#include <random>

template <typename T_rng, typename T_dist>
double time_rngs(T_rng& rng, T_dist& dist, int n) {
    std::vector<typename T_dist::result_type> vec(n, 0);
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i)
        vec[i] = dist(rng);
    auto t2 = std::chrono::high_resolution_clock::now();
    auto runtime = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / 1000.0;
    auto sum = vec[0]; // access to avoid the compiler skipping the loop
    return runtime;
}

int main() {
    const int n = 10000000;
    unsigned seed = std::chrono::system_clock::now().time_since_epoch().count();

    std::default_random_engine rng_default(seed);
    std::mt19937 rng_mt(seed);
    std::mt19937_64 rng_mt_64(seed);
    std::uniform_int_distribution<int> dist_int(0, 1000);
    std::uniform_real_distribution<float> dist_float(0.0, 1.0);

    // print max values
    std::cout << "rng_default_random.max(): " << rng_default.max() << std::endl;
    std::cout << "rng_mt.max(): " << rng_mt.max() << std::endl;
    std::cout << "rng_mt_64.max(): " << rng_mt_64.max() << std::endl << std::endl;

    std::cout << "runtime_int_default: " << time_rngs(rng_default, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;
    std::cout << "runtime_int_mt_64: " << time_rngs(rng_mt_64, dist_int, n) << std::endl;
    std::cout << "runtime_float_default: " << time_rngs(rng_default, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt: " << time_rngs(rng_mt, dist_float, n) << std::endl;
    std::cout << "runtime_float_mt_64: " << time_rngs(rng_mt_64, dist_float, n) << std::endl;
}
Compile via clang++ -O3 -std=c++11 random.cpp, or with g++ accordingly. Any ideas?
edit: Finally, Matthieu M. had a great idea: The culprit is inlining, or rather a lack thereof. Increasing the clang inlining limit eliminated the performance penalty. That actually solved a number of performance oddities I encountered. Thanks, I learned something new.
As already stated in the comments, the problem is caused by the fact that gcc inlines more aggressively than clang. If we make clang inline very aggressively, the effect disappears:
Compiling your code with g++ -O3 yields
runtime_int_default: 3000.32
runtime_int_mt: 3112.11
runtime_int_mt_64: 3069.48
runtime_float_default: 859.14
runtime_float_mt: 1027.05
runtime_float_mt_64: 1777.48
while clang++ -O3 -mllvm -inline-threshold=10000 yields
runtime_int_default: 3623.89
runtime_int_mt: 751.484
runtime_int_mt_64: 751.132
runtime_float_default: 1072.53
runtime_float_mt: 968.967
runtime_float_mt_64: 1781.34
Apparently, clang now out-inlines gcc in the int_mt cases, but all of the other runtimes are now in the same order of magnitude. I used gcc 4.8.3 and clang 3.4 on Fedora 20 64 bit.

C++ clock measures time incorrectly

I have a program which reads two input files. The first file contains random words, which are inserted into a BST and an AVL tree. The program then looks up the words listed in the second file, reports whether they exist in the trees, and writes an output file with the information gathered. While doing this, the program prints the time spent finding each item. However, the program does not seem to be measuring that time at all.
BST* b = new BST();
AVLTree* t = new AVLTree();
string s;
ifstream in;
in.open(argv[1]);
while (!in.eof())
{
    in >> s;
    b->insert(s);
    t->insert(s);
}

ifstream q;
q.open(argv[2]);
ofstream out;
out.open(argv[3]);

int bstItem = 0;
int avlItem = 0;
float diff1 = 0;
float diff2 = 0;
clock_t t1, t1e, t2, t2e;

while (!q.eof())
{
    q >> s;

    t1 = clock();
    bstItem = b->findItem(s);
    t1e = clock();
    diff1 = (float)(t1e - t1) / CLOCKS_PER_SEC;

    t2 = clock();
    avlItem = t->findItem(s);
    t2e = clock();
    diff2 = (float)(t2e - t2) / CLOCKS_PER_SEC;

    if (avlItem == 0 && bstItem == 0)
        cout << "Query " << s << " not found in " << diff1 << " microseconds in BST, " << diff2 << " microseconds in AVL" << endl;
    else
        cout << "Query " << s << " found in " << diff1 << " microseconds in BST, " << diff2 << " microseconds in AVL" << endl;

    out << bstItem << " " << avlItem << " " << s << "\n";
}
The clock() value I get just before entering the while loop and just after finishing it is exactly the same, so it looks as if the program never runs the loop at all and simply prints 0. I know that is not the case, since the program takes around 10 seconds to finish, as it should. The output file also contains correct results, so broken findItem() functions are not the explanation either.
I did a bit of research on Stack Overflow and saw that many people run into the same problem, but none of the answers I read solved it.
I solved my problem by using a higher-resolution clock, though clock resolution turned out not to be my real problem. I used clock_gettime() from time.h. As far as I know, clocks with higher resolution than clock() are platform dependent, and this particular method is only available on Linux. I still haven't figured out why I couldn't obtain sensible results from clock(), but I suspect platform dependency again.
An important note: using clock_gettime() requires linking against the POSIX real-time extension library when building the code.
So you should do:
g++ a.cpp b.cpp c.cpp -lrt -o myProg
where -lrt is the flag that links in the POSIX real-time extensions.
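A minimal sketch of that approach around one of the lookups from the question (Linux/POSIX; CLOCK_MONOTONIC is the usual choice for measuring intervals):
#include <time.h>   // clock_gettime, struct timespec (POSIX)

timespec ts1, ts2;
clock_gettime(CLOCK_MONOTONIC, &ts1);
bstItem = b->findItem(s);            // the call being timed, as in the question
clock_gettime(CLOCK_MONOTONIC, &ts2);
double us = (ts2.tv_sec - ts1.tv_sec) * 1e6 + (ts2.tv_nsec - ts1.tv_nsec) / 1e3;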
If (t1e - t1) is < CLOCKS_PER_SEC your result will always be 0 because integer division is truncated. Cast CLOCKS_PER_SEC to float.
diff1 = (t1e - t1)/((float)CLOCKS_PER_SEC);
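A tiny self-contained sketch of why that cast matters, on typical platforms where clock_t is an integer type (the tick values here are made up):
#include <ctime>
#include <iostream>

int main() {
    clock_t t1 = 0, t1e = 500;  // hypothetical: 500 ticks elapsed, less than CLOCKS_PER_SEC
    std::cout << (t1e - t1) / CLOCKS_PER_SEC << '\n';         // integer division: prints 0
    std::cout << (t1e - t1) / (float)CLOCKS_PER_SEC << '\n';  // float division: small non-zero value
}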