Speed difference between assigning a function's return value to a variable and not assigning it - C++

So this is really a mystery to me. I am measuring the time of my own sine function and comparing it to the standard sin(). There is a strange behavior, though. When I use the functions standalone, like:
sin(something);
I get an average time (measuring 1000000 calls in 10 rounds) of 3.1276 ms for the standard sine function and 51.5589 ms for my implementation.
But when I use something like this:
float result = sin(something);
I suddenly get 76.5621 ms for the standard sin() and 49.3675 ms for mine. I understand that it takes some time to assign the value to a variable, but why doesn't it add time to my sine too? Mine stays more or less the same, while the standard one increases dramatically.
EDIT:
My code for measuring:
ofstream file("result.txt", ios::trunc);
file << "Measured " << repeat << " rounds with " << callNum << " calls in each \n";
for (int i = 0; i < repeat; i++)
{
    auto start = chrono::steady_clock::now();
    // call the function here dattebayo!
    for (int o = 0; o < callNum; o++)
    {
        double g = sin((double)o);
    }
    auto end = chrono::steady_clock::now();
    auto difTime = end - start;
    double timeD = chrono::duration<double, milli>(difTime).count();
    file << i << ": " << timeD << " ms\n";
    sum += timeD;
}

In any modern compiler, the compiler knows standard functions such as sin, cos, printf("%s\n", str) and many more, and will either translate the call into a simpler form [a constant if the argument is constant; printf("%s\n", str); becomes puts(str);] or remove it completely [if it knows the function has no "side effects", in other words, it JUST calculates the return value and has no other effect on the system].
This often happens for standard functions even when the compiler is in low or even no optimisation modes.
You need to make sure that the result of your function is REALLY used, or the calls will be thrown away in an optimised build. Add the returned values together in the loop and use the sum afterwards...
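For example, something along these lines (names are illustrative; adapt the idea to your own measuring loop):

#include <cmath>
#include <iostream>

int main() {
    const int callNum = 1000000;
    double sum = 0.0;
    for (int o = 0; o < callNum; ++o)
        sum += std::sin(static_cast<double>(o));    // accumulate the results ...
    std::cout << "checksum: " << sum << "\n";       // ... and use the sum, so the calls cannot be removed as dead code
}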

Related

Cannot make sense of this error (ERROR: endpoints do not enclose a minimum (gsl: fsolver.c:126))

I'm working with this code: https://github.com/UCLA-TMD/Ogata. It performs a fast Fourier transform.
It integrates the function test, according to some predefined options that are arguments of the FBT constructor - FBT(Bessel order, some option that doesn't matter, number of function calls, estimate of where the function has its maximum). So far so good.
This code works fine with that test function, but that's not the function I actually need to use, so I switched to something like exp(x) just to test it, and no matter what I do I always get:
(It always compiles okay, but when I run the .o file, it gives me this)
gsl: fsolver.c:126: ERROR: endpoints do not enclose a minimum
Default GSL error handler invoked.
Aborted (core dumped)
At first I thought it could be a problem with the function's maximum value Q in FBT, but whenever I change it, it gives me the same error.
Would really appreciate any help.
double test(double x, double width) { return x*exp(-x/width); } // test function to transform; the extra parameter lets you send anything else to the function
int main(void)
{
    // FBT(Bessel order nu, option of function (always zero), number of function calls, rough estimate where the maximum of the function f(x) is).
    FBT ogata0 = FBT(0.0, 0, 10, 1.0); // Fourier transform with Jnu, nu=0.0 and N=10
    double qT = 1.;
    double width = 1.;
    // call ogata0 on the test function
    auto begin = std::chrono::high_resolution_clock::now();
    double res = ogata0.fbt(std::bind(test, std::placeholders::_1, width), qT);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << std::setprecision(30) << " FT( J0(x*qT) x*exp(-x) ) at qT= " << qT << std::endl;
    std::cout << std::setprecision(30) << "Numerical transformed = " << res << std::endl;
    auto overhead = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
    std::cout << "Calc time: " << overhead << " nanoseconds\n";
}

How do I optimize parameters in my code?

I have some code written in C++ that simulates a prefetcher for a CPU. In the code I have some definitions that look like this:
#define x 5
...
for(int i = 0; i < x; i++)
...
At the end of the simulation the simulator outputs the average access time which is a measure of how good the prefetcher did. The performance of the prefetcher depends on x and some other similar definitions.
I would like to have a program that changes x, recompiles the new code, runs it, looks at the value, and based on the change in simulated access time repeats the process.
Does anyone know of an easy way to do this that isn't manually changing values?
EDIT: I think I need to clarify that I do not want to have to program a learning algorithm since I have never done it and probably couldn't do it nearly as well as others.
I guess your current program looks something like this:
int main() {
#define x 5
    <do the simulation>
    cout << "x=" << x << " time=" << aat << endl;
}
Instead you might create a simulate function that takes x as an explicit parameter and returns the average access time ...
double simulate(int x) {
    <do simulation>
}
And call it from main:
int main() {
    int x = <initial x value>;
    while (<necessary>) {
        double aat = simulate(x);
        cout << "x=" << x << " time=" << aat << endl;
        x = <updated x according to some strategy>;
    }
}
This way the code that learns a good x lives in main.
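For concreteness, here is a compilable sketch of that structure; the simulation body and the update strategy are placeholders (my assumptions), not your actual simulator:

#include <iostream>

double simulate(int x) {
    // <do the simulation using the parameter x instead of the #define>
    return 100.0 / (1 + x) + 0.5 * x;       // dummy access-time model, for illustration only
}

int main() {
    int bestX = 1;
    double bestTime = 1e300;
    for (int x = 1; x <= 20; ++x) {         // trivial strategy: sweep a range of x
        double aat = simulate(x);
        std::cout << "x=" << x << " time=" << aat << "\n";
        if (aat < bestTime) { bestTime = aat; bestX = x; }
    }
    std::cout << "best x=" << bestX << " time=" << bestTime << "\n";
}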
But ... If you're writing a program to simulate CPU prefetching I can't help thinking that you know all this perfectly well already. I don't really understand why you were using the compiler to change a simulation parameter in the first place.

How much performance difference when using string vs char array?

I have the following code:
char fname[255] = {0};
snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
vs
std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
Which one performs better? Does the second one involve temporary creation? Is there any better way to do this?
Let's run the numbers:
2022 edit:
Using Quick-Bench with GCC 10.3 and compiling with C++20 (with some minor changes for constness) demonstrates that std::string is now faster, by almost 3x.
Original answer (2014)
The code (I used PAPI Timers)
main.cpp
#include <iostream>
#include <string>
#include <stdio.h>
#include "papi.h"
#include <vector>
#include <cmath>

#define TRIALS 10000000

class Clock
{
public:
    typedef long_long time;
    time start;
    Clock() : start(now()) {}
    void restart() { start = now(); }
    time usec() const { return now() - start; }
    time now() const { return PAPI_get_real_usec(); }
};

int main()
{
    int eventSet = PAPI_NULL;
    PAPI_library_init(PAPI_VER_CURRENT);
    if (PAPI_create_eventset(&eventSet) != PAPI_OK)
    {
        std::cerr << "Failed to initialize PAPI event" << std::endl;
        return 1;
    }
    Clock clock;
    std::vector<long_long> usecs;
    const char* baseLocation = "baseLocation";
    //std::string baseLocation = "baseLocation";
    char fname[255] = {};
    for (int i = 0; i < TRIALS; ++i)
    {
        clock.restart();
        snprintf(fname, 255, "%s_test_no.%d.txt", baseLocation, i);
        //std::string fname = baseLocation + "_test_no." + std::to_string(i) + ".txt";
        usecs.push_back(clock.usec());
    }
    long_long sum = 0;
    for (auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
    {
        sum += *vecIter;
    }
    double average = static_cast<double>(sum) / static_cast<double>(TRIALS);
    std::cout << "Average: " << average << " microseconds" << std::endl;
    // compute variance
    double variance = 0;
    for (auto vecIter = usecs.begin(); vecIter != usecs.end(); ++vecIter)
    {
        variance += (*vecIter - average) * (*vecIter - average);
    }
    variance /= static_cast<double>(TRIALS);
    std::cout << "Variance: " << variance << " microseconds" << std::endl;
    std::cout << "Std. deviation: " << sqrt(variance) << " microseconds" << std::endl;
    double CI = 1.96 * sqrt(variance) / sqrt(static_cast<double>(TRIALS));
    std::cout << "95% CI: " << average - CI << " usecs to " << average + CI << " usecs" << std::endl;
}
Play with the comments to get one way or the other.
10 million iterations of both methods on my machine with the compile line:
g++ main.cpp -lpapi -DUSE_PAPI -std=c++0x -O3
Using char array:
Average: 0.240861 microseconds
Variance: 0.196387 microseconds
Std. deviation: 0.443156 microseconds
95% CI: 0.240586 usecs to 0.241136 usecs
Using string approach:
Average: 0.365933 microseconds
Variance: 0.323581 microseconds
Std. deviation: 0.568842 microseconds
95% CI: 0.365581 usecs to 0.366286 usecs
So at least on MY machine, with MY code and MY compiler settings, character arrays came out about 34% faster than strings, using the following formula:
((time for string) - (time for char array)) / (time for string)
which gives the difference between the approaches as a percentage of the string time alone. My original write-up used the character array time as the reference point instead, which shows about a 52% slowdown when moving to string; that percentage was also correct, but I found it misleading.
I'll take any and all comments for how I did this wrong :)
2015 Edit
Compiled with GCC 4.8.4:
string
Average: 0.338876 microseconds
Variance: 0.853823 microseconds
Std. deviation: 0.924026 microseconds
95% CI: 0.338303 usecs to 0.339449 usecs
character array
Average: 0.239083 microseconds
Variance: 0.193538 microseconds
Std. deviation: 0.439929 microseconds
95% CI: 0.238811 usecs to 0.239356 usecs
So the character array approach remains significantly faster although less so. In these tests, it was about 29% faster.
The snprintf() version will almost certainly be quite a bit faster. Why? Simply because no memory allocation takes place. The new operator is surprisingly expensive, roughly 250ns on my system - snprintf() will have finished quite a bit of work in the meantime.
That is not to say that you should use the snprintf() approach: The price you pay is safety. It is just so easy to get things wrong with the fixed buffer size you are supplying to snprintf(), and you absolutely need to supply code for the case that the buffer is not large enough. So, only think about using snprintf() when you have identified this part of code to be really performance critical.
If your system provides it (glibc and the BSDs do), you may also think about trying asprintf() instead of snprintf(); it will malloc() the memory for you, giving you pretty much the same comfort as C++ strings. At least on my system, malloc() is quite a bit faster than the built-in new operator (don't ask me why, though).
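For illustration, a sketch of the asprintf() variant (baseLocation and i are taken from the question; asprintf() allocates the buffer itself, so it must be freed afterwards):

#include <stdio.h>
#include <stdlib.h>

int main() {
    const char* baseLocation = "baseLocation";
    int i = 42;

    char* fname = NULL;
    if (asprintf(&fname, "%s_test_no.%d.txt", baseLocation, i) == -1)
        return 1;      // allocation or formatting failed

    puts(fname);       // use the filename ...
    free(fname);       // ... then release the buffer
}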
Edit:
Just saw that you used filenames in your example. If filenames are your concern, forget about the performance of the string operations! Your code will spend virtually no time in them. Unless you perform on the order of 100000 such string operations per second, they are irrelevant to your performance.
If it's REALLY important, measure the two solutions. If not, use whichever you think makes the most sense given the data you have, company/private coding style standards, etc. Make sure you use an optimised build [with the same optimisation level you are going to use in the actual production build - not -O3 just because that is the highest if your production build uses -O1].
I expect that either will be pretty close if you only do a few. If you do several million, there may be a difference. Which is faster? I'd guess the second [1], but it depends on who wrote the implementation of snprintf and who wrote the std::string implementation. Both certainly have the potential to take a lot longer than you would expect from a naive view of how the function works (and possibly also to run faster than you'd expect).
[1] Because I have worked with printf, and it's not a simple function; it spends a lot of time parsing and interpreting the format string. It's not very efficient (and I have looked at the ones in glibc and such too, and they are not noticeably better).
On the other hand, std::string functions are often inlined since they are template implementations, which improves efficiency. The joker in the pack is the memory allocation that std::string is likely to perform. Of course, if baseLocation turns out to be rather large, you probably don't want to store the result in a fixed-size local array anyway, so that evens out in that case.
I would recommend using strcat in that case. It is by far the fastest method.
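The snippet itself was not included, so here is a hypothetical sketch of what a strcat-based version might look like (baseLocation and i are taken from the question; the buffers are assumed to be large enough):

#include <cstdio>
#include <cstring>

int main() {
    const char* baseLocation = "baseLocation";
    int i = 42;

    char num[16];
    std::snprintf(num, sizeof(num), "%d", i);  // convert the counter separately

    char fname[255];
    std::strcpy(fname, baseLocation);          // copy the base name
    std::strcat(fname, "_test_no.");           // append the fixed infix
    std::strcat(fname, num);                   // append the number
    std::strcat(fname, ".txt");                // append the extension

    std::puts(fname);
}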

How to zero a vector<bool>?

I have a vector<bool> and I'd like to zero it out. I need the size to stay the same.
The normal approach is to iterate over all the elements and reset them. However, vector<bool> is a specially optimized container that, depending on implementation, may store only one bit per element. Is there a way to take advantage of this to clear the whole thing efficiently?
bitset, the fixed-length variant, has the set function. Does vector<bool> have something similar?
There seem to be a lot of guesses but very few facts in the answers that have been posted so far, so perhaps it would be worthwhile to do a little testing.
#include <vector>
#include <iostream>
#include <algorithm>
#include <stdlib.h>
#include <time.h>
int seed(std::vector<bool> &b) {
    srand(1);
    for (int i = 0; i < b.size(); i++)
        b[i] = ((rand() & 1) != 0);
    int count = 0;
    for (int i = 0; i < b.size(); i++)
        if (b[i])
            ++count;
    return count;
}

int main() {
    std::vector<bool> bools(1024 * 1024 * 32);
    int count1 = seed(bools);
    clock_t start = clock();
    bools.assign(bools.size(), false);
    double using_assign = double(clock() - start) / CLOCKS_PER_SEC;

    int count2 = seed(bools);
    start = clock();
    for (int i = 0; i < bools.size(); i++)
        bools[i] = false;
    double using_loop = double(clock() - start) / CLOCKS_PER_SEC;

    int count3 = seed(bools);
    start = clock();
    size_t size = bools.size();
    bools.clear();
    bools.resize(size);
    double using_clear = double(clock() - start) / CLOCKS_PER_SEC;

    int count4 = seed(bools);
    start = clock();
    std::fill(bools.begin(), bools.end(), false);
    double using_fill = double(clock() - start) / CLOCKS_PER_SEC;

    std::cout << "Time using assign: " << using_assign << "\n";
    std::cout << "Time using loop: " << using_loop << "\n";
    std::cout << "Time using clear: " << using_clear << "\n";
    std::cout << "Time using fill: " << using_fill << "\n";
    std::cout << "Ignore: " << count1 << "\t" << count2 << "\t" << count3 << "\t" << count4 << "\n";
}
So this creates a vector, sets some randomly selected bits in it, counts them, and clears them (and repeats). The setting/counting/printing is done to ensure that even with aggressive optimization, the compiler can't/won't optimize out our code to clear the vector.
I found the results interesting, to say the least. First the result with VC++:
Time using assign: 0.141
Time using loop: 0.068
Time using clear: 0.141
Time using fill: 0.087
Ignore: 16777216 16777216 16777216 16777216
So, with VC++, the fastest method is what you'd probably initially think of as the most naive -- a loop that assigns to each individual item. With g++, the results are just a tad different though:
Time using assign: 0.002
Time using loop: 0.08
Time using clear: 0.002
Time using fill: 0.001
Ignore: 16777216 16777216 16777216 16777216
Here, the loop is (by far) the slowest method (and the others are basically tied -- the 1 ms difference in speed isn't really repeatable).
For what it's worth, in spite of this part of the test showing up as much faster with g++, the overall times were within 1% of each other (4.944 seconds for VC++, 4.915 seconds for g++).
Try
v.assign(v.size(), false);
Have a look at this link:
http://www.cplusplus.com/reference/vector/vector/assign/
Or the following:
std::fill(v.begin(), v.end(), false);
You are out of luck. std::vector<bool> is a specialization that apparently does not even guarantee contiguous memory or random access iterators (or even forward?!), at least based on my reading of cppreference -- decoding the standard would be the next step.
So write implementation-specific code, pray and use some standard zeroing technique, or do not use the type. I vote for option 3.
The received wisdom is that it was a mistake, and may become deprecated. Use a different container if possible. And definitely do not mess around with its internal guts, or rely on its packing. Check whether you have a dynamic bitset in your standard library, or roll your own wrapper around std::vector<unsigned char>.
I ran into this as a performance issue recently. I hadn't tried looking for answers on the web, but did find that assigning with the constructor was 10x faster using g++ -O3 (Debian 4.7.2-5) 4.7.2. I found this question because I was looking to avoid the additional malloc. Looks like the assign is optimized as well as the constructor, and about twice as good in my benchmark.
unsigned sz = v.size(); for (unsigned ii = 0; ii != sz; ++ii) v[ii] = false;
v = std::vector<bool>(sz, false); // 10x faster
v.assign(sz, false); // 20x faster
So, I wouldn't say to shy away from using the specialization of vector<bool>; just be very cognizant of the bit vector representation.
Use the std::vector<bool>::assign method, which is provided for this purpose.
If an implementation is specialized for bool, then assign is most likely also implemented appropriately.
If you're able to switch from vector<bool> to a custom bit vector representation, then you can use a representation designed specifically for fast clear operations, and get some potentially quite significant speedups (although not without tradeoffs).
The trick is to store an integer for each bit-vector entry, together with a single 'rolling threshold' value that determines which entries currently evaluate to true.
You can then clear the bit vector by just increasing the single threshold value, without touching the rest of the data (until the threshold overflows).
A more complete write up about this, and some example code, can be found here.
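To make the idea concrete, here is a minimal sketch of my own (an illustration of the technique, not the code from the write-up):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

class ThresholdBitVector {
public:
    explicit ThresholdBitVector(std::size_t n) : generation_(1), data_(n, 0) {}

    void set(std::size_t i)        { data_[i] = generation_; }
    bool test(std::size_t i) const { return data_[i] == generation_; }

    // "Clearing" is a single increment: no entry equals the new generation.
    void clear_all() {
        if (++generation_ == 0) {                      // rare overflow case
            std::fill(data_.begin(), data_.end(), 0);  // fall back to a real clear
            generation_ = 1;
        }
    }

private:
    std::uint32_t generation_;
    std::vector<std::uint32_t> data_;
};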
It seems that one nice option hasn't been mentioned yet:
auto size = v.size();
v.resize(0);
v.resize(size);
The STL implementer will supposedly have picked the most efficient means of zeroising, so we don't even need to know which particular method that might be. And this works with real vectors as well (think templates), not just the std::vector<bool> monstrosity.
There can be a minuscule added advantage for reused buffers in loops (e.g. sieves, whatever), where you simply resize to whatever will be needed for the current round, instead of to the original size.
As an alternative to std::vector<bool>, check out boost::dynamic_bitset (https://www.boost.org/doc/libs/1_72_0/libs/dynamic_bitset/dynamic_bitset.html). You can zero it out (i.e., set every element to false) by calling the reset() member function.
Like clearing, say, std::vector<int>, reset on a boost::dynamic_bitset can also compile down to a memset, whereas you probably won't get that with std::vector<bool>. For example, see https://godbolt.org/z/aqSGCi
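A tiny usage sketch, in case it helps (the size and indices are arbitrary):

#include <boost/dynamic_bitset.hpp>

int main() {
    boost::dynamic_bitset<> bits(1024);   // 1024 bits, all false
    bits.set(3);
    bits.set(100);
    bits.reset();                         // every bit back to false, size unchanged
    return bits.any();                    // 0: nothing is set any more
}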

Performance issues with C++ (using VC++ 2010): at runtime, my program seems to randomly wait for a while

I'm currently trying to code a certain dynamic programming approach for a vehicle routing problem. At a certain point, I have a partial route that I want to add to a min-max heap in order to keep the best 100 partial routes at a given stage. Most of the program runs smoothly, but when I actually want to insert a partial route into the heap, things tend to go a bit slow. That particular code is shown below:
clock_t insert_start, insert_finish, check1_finish, check2_finish;
insert_start = clock();
check2_finish = clock();
if (heap.get_vector_size() < 100) {
    check1_finish = clock();
    heap.insert(expansion);
    cout << "node added" << endl;
}
else {
    check1_finish = clock();
    if (expansion.get_cost() < heap.find_max().get_cost()) {
        check2_finish = clock();
        heap.delete_max();
        heap.insert(expansion);
        cout << "worst node deleted and better one added" << endl;
    }
    else {
        check2_finish = clock();
        cout << "cost too high check" << endl;
    }
}
number_expansions++;
cout << "check 1 takes " << check1_finish - insert_start << " ms" << endl;
cout << "check 2 takes " << check2_finish - check1_finish << "ms " << endl;
insert_finish = clock();
cout << "Inserting an expanded state into the heap takes " << insert_finish - insert_start << " clocks" << endl;
A typical output is this:
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 16 clocks
cost too high check
check1 takes 0 ms
check2 takes 0ms
Inserting an expanded state into the heap takes 0 clocks
I know it's hard to say much about the code when this block uses functions that are implemented elsewhere, but I'm flabbergasted as to why this sometimes takes less than a millisecond and sometimes up to 16 ms. The program should execute this block thousands of times, so these small hiccups slow things down enormously.
My only guess is that something happens with the vector in the heap class that stores all these states, but I reserve space for 100 items in the constructor using vector::reserve, so I don't see how this could still be a problem.
Thanks!
Preemption. Your program may be preempted by the operating system so that some other program can run for a bit.
Also, it's not 16 ms. It's 16 clock ticks: http://www.cplusplus.com/reference/clibrary/ctime/clock/
If you want ms, you need to do:
cout << "Inserting an expanded state into the heap takes "
<< (insert_finish - insert_start) * 1000 / CLOCKS_PER_SEC
<< " ms " << endl;
Finally, you're setting insert_finish after printing out the other results. Try setting it immediately after your if/else block. The cout command is a good time to get preempted by another process.
"My only guess is that something happens with the vector in the heap class that stores all these states but I reserve place for a 100 items in the constructor using vector::reserve so I don't see how this could still be a problem."
Are you using std::vector to implement it? Inserting into the middle of a std::vector takes linear time. Also, delete-max can take time if you are not using a sorted container.
I would suggest using a std::set or std::multiset. Insert, delete, and find all take O(log n).
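If it helps to see the shape of that, here is a small sketch (Route, its cost field, and the 100-element limit are placeholders based on the question, not the asker's actual classes):

#include <cstddef>
#include <iterator>
#include <set>

struct Route { double cost; /* ... rest of the partial route ... */ };
struct ByCost {
    bool operator()(const Route& a, const Route& b) const { return a.cost < b.cost; }
};

// Keep only the cheapest `limit` partial routes seen so far.
void add_route(std::multiset<Route, ByCost>& best, const Route& r, std::size_t limit = 100) {
    if (best.size() < limit) {
        best.insert(r);                         // O(log n)
    } else if (r.cost < std::prev(best.end())->cost) {
        best.erase(std::prev(best.end()));      // drop the current worst ...
        best.insert(r);                         // ... and insert the better one
    }
}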
Try measuring the time using QueryPerformanceCounter, because the clock function may not be very accurate. clock probably has the same granularity as the Windows scheduler - 10 ms on a single-CPU machine and 15 or 16 ms on a multicore one. QueryPerformanceCounter together with QueryPerformanceFrequency can give you nanosecond-scale resolution.
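For reference, a minimal sketch of what that might look like (Windows-only; the timed work is a placeholder):

#include <windows.h>
#include <iostream>

int main() {
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);   // ticks per second
    QueryPerformanceCounter(&t0);

    // ... the code you want to time goes here ...

    QueryPerformanceCounter(&t1);
    double ms = double(t1.QuadPart - t0.QuadPart) * 1000.0 / double(freq.QuadPart);
    std::cout << "elapsed: " << ms << " ms\n";
}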
It looks like you are measuring "wall time", not CPU time. Windows itself is not a realtime OS. Occasional large hiccups from high-priority things like device drivers are not at all uncommon.
On Windows if I'm manually trying to look for bottlenecks in code, I use RDTSC instead. Even better would be to not do it manually, but use a profiler.