I am trying to insert multiple items into a hashtable and measure the insertion times in milliseconds. Basically, it works like this (this function is a member of my hashtable class):
double benchmark(int amountOfInsertions){
    int valueToInsert;
    timeval tv_timeStart, tv_timeEnd;
    double totalTime = 0;
    double db_timeStart, db_timeEnd;

    for (int i = 0; i < amountOfInsertions; i++){
        valueToInsert = generateRandomVariable();

        gettimeofday(&tv_timeStart, NULL);
        insert(valueToInsert);
        gettimeofday(&tv_timeEnd, NULL);

        db_timeStart = tv_timeStart.tv_sec*1000 + tv_timeStart.tv_usec/1000.0;
        db_timeEnd = tv_timeEnd.tv_sec*1000 + tv_timeEnd.tv_usec/1000.0;
        totalTime += (db_timeEnd - db_timeStart);
    }
    return totalTime;
}
The problem is that the insertion times used to look like this, clearly showing times increasing as more items were inserted:
But now, I notice that the insertion times alternate between the same few values (around multiples of 15.625), creating extremely inaccurate results:
And it just started happening all of a sudden, even with old versions of my code that I know output correct times. Is it a particular problem with gettimeofday()? If not, what could it be?
This problem is so mysterious to me that I even wonder whether this is the right place or the right way to ask about it.
UPDATE: I've also tried with clock() and std::chrono::steady_clock, as well as measuring the time of the whole loop instead of each individual insertion (example below), and still got the same behaviour:
double benchmark(int amountOfInsertions){
    int valueToInsert;
    double totalTime = 0;

    steady_clock::time_point t1 = steady_clock::now();
    for (int i = 0; i < amountOfInsertions; i++){
        valueToInsert = generateRandomVariable();
        insert(valueToInsert);
    }
    steady_clock::time_point t2 = steady_clock::now();

    duration<double> time_span = duration_cast<duration<double>>(t2 - t1);
    totalTime = time_span.count()*1000;
    return totalTime;
}
I do not know what caused this sudden change in timer resolution for gettimeofday, but I understand that it should not be used to measure time anyway. Even the man page of gettimeofday says so.
Please use clock_gettime instead. Or if you can use fancy C++11 features: std::chrono::steady_clock
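For reference, a minimal sketch of the same per-insertion measurement using clock_gettime() with the monotonic clock (Linux; error checking omitted, and insert()/generateRandomVariable() are the question's own functions):
#include <time.h>

double benchmark(int amountOfInsertions){
    timespec tsStart, tsEnd;
    double totalTime = 0; // milliseconds

    for (int i = 0; i < amountOfInsertions; i++){
        int valueToInsert = generateRandomVariable();

        clock_gettime(CLOCK_MONOTONIC, &tsStart);  // unaffected by system time changes
        insert(valueToInsert);
        clock_gettime(CLOCK_MONOTONIC, &tsEnd);

        totalTime += (tsEnd.tv_sec  - tsStart.tv_sec) * 1000.0
                   + (tsEnd.tv_nsec - tsStart.tv_nsec) / 1.0e6;
    }
    return totalTime;
}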
If you want to truly benchmark this, you need to check which optimization flags you are using, whether something is optimized away, whether something is running in the background, whether context switches from hyperthreading are affecting you, and more. Then maybe use Celero or Hayai, depending on how precise you need this to be. Then perform the test at least 5 times and play around with the sample count in the test.
I have found that std::chrono is not the most reliable clock if you are benchmarking and trying to define a benchmarking test.
Related
I am programming a game using OpenGL GLUT code, and I am applying a game development technique that consists of measuring the time consumed by each iteration of the game's main loop, so I can use it to update the game scene in proportion to the time since the last update. To achieve this, I have this at the start of the loop:
void logicLoop () {
    float finalTime = (float) clock() / CLOCKS_PER_SEC;
    float deltaTime = finalTime - initialTime;
    initialTime = finalTime;
    ...
    // Here I move things using deltaTime value
    ...
}
The problem came when I added a bullet to the game. If the bullet does not hit any target in two seconds, it must be destroyed. Then, what I did was to keep a reference to the moment the bullet was created like this:
class Bullet: public GameObject {
    float birthday;
public:
    Bullet () {
        ...
        // Some initialization stuff
        ...
        birthday = (float) clock() / CLOCKS_PER_SEC;
    }
    float getBirthday () { return birthday; }
};
And then I added this to the logic, just after the finalTime and deltaTime measurement:
if (bullet != NULL) {
    if (finalTime - bullet->getBirthday() > 2) {
        world.remove(bullet);
        bullet = NULL;
    }
}
It looked nice, but when I ran the code, the bullet stayed alive far too long. Looking for the problem, I printed the value of (finalTime - bullet->getBirthday()) and saw that it increases really slowly, as if it were not a time measured in seconds.
Where is the problem? I thought that the result would be in seconds, so the bullet would be removed after two seconds.
This is a common mistake. clock() does not measure the passage of actual time; it measures how much time has elapsed while the CPU was running this particular process.
Other processes also take CPU time, so the two clocks are not the same. Whenever your operating system is executing some other process's code, including while this one is "sleeping", that time does not count toward clock(). And if your program is multithreaded on a system with more than one CPU, clock() may "double count" time!
Humans have no knowledge or perception of OS time slices: we just perceive the actual passage of actual time (known as "wall time"). Ultimately, then, you will see clock()'s timebase being different to wall time.
Do not use clock() to measure wall time!
You want something like gettimeofday() or clock_gettime() instead. In order to allay the effects of people changing the system time, on Linux I personally recommend clock_gettime() with the system's "monotonic clock", a clock that steps in sync with wall time but has an arbitrary epoch unaffected by people playing around with the computer's time settings. (Obviously switch to a portable alternative if needs be.)
This is actually discussed on the cppreference.com page for clock():
std::clock time may advance faster or slower than the wall clock, depending on the execution resources given to the program by the operating system. For example, if the CPU is shared by other processes, std::clock time may advance slower than wall clock. On the other hand, if the current process is multithreaded and more than one execution core is available, std::clock time may advance faster than wall clock.
Please get into the habit of reading documentation for all the functions you use, when you are not sure what is going on.
Edit: Turns out GLUT itself has a function you can use for this, which is mighty convenient. glutGet(GLUT_ELAPSED_TIME) gives you the number of wall milliseconds elapsed since your call to glutInit(). So I guess that's what you need here. It may be slightly more performant, particularly if GLUT (or some other part of OpenGL) is already requesting wall time periodically, and if this function merely queries that already-obtained time… thus saving you from an unnecessary second system call (which costs).
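A minimal sketch of the question's loop using glutGet(GLUT_ELAPSED_TIME) (wall milliseconds since glutInit()) instead of clock(), assuming <GL/glut.h> is already included; the static variable stands in for the question's initialTime:
void logicLoop () {
    static int previousMillis = glutGet(GLUT_ELAPSED_TIME);
    int currentMillis = glutGet(GLUT_ELAPSED_TIME);
    float deltaTime = (currentMillis - previousMillis) / 1000.0f; // seconds since last frame
    previousMillis = currentMillis;
    // ... move things using deltaTime ...
}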
If you are on Windows you can use QueryPerformanceFrequency / QueryPerformanceCounter, which gives pretty accurate time measurements.
Here's an example.
#include <Windows.h>
#include <iostream>   // for std::cout

using namespace std;

int main()
{
    LARGE_INTEGER freq = {0, 0};
    QueryPerformanceFrequency(&freq);

    LARGE_INTEGER startTime = {0, 0};
    QueryPerformanceCounter(&startTime);

    // STUFF.
    for(size_t i = 0; i < 100; ++i) {
        cout << i << endl;
    }

    LARGE_INTEGER stopTime = {0, 0};
    QueryPerformanceCounter(&stopTime);

    const double elapsed = ((double)stopTime.QuadPart - (double)startTime.QuadPart) / freq.QuadPart;
    cout << "Elapsed: " << elapsed << endl;
    return 0;
}
Consider the following code:
#include <iostream>
#include <chrono>

using Time = std::chrono::high_resolution_clock;
using us = std::chrono::microseconds;

int main()
{
    volatile int i, k;
    const int n = 1000000;

    for(k = 0; k < 200; ++k) {
        auto begin = Time::now();
        for (i = 0; i < n; ++i); // <--
        auto end = Time::now();
        auto dur = std::chrono::duration_cast<us>(end - begin).count();
        std::cout << dur << std::endl;
    }
    return 0;
}
I am repeatedly measuring the execution time of the inner for loop.
The results are shown in the following plot (y: duration, x: repetition):
What is causing the decreasing of the loop execution time?
Environment: Linux (kernel 4.2) on an Intel i7-2600, compiled using: g++ -std=c++11 main.cpp -O0 -o main
Edit 1
The question is not about compiler optimization or performance benchmarks.
The question is why the performance gets better over time.
I am trying to understand what is happening at run-time.
Edit 2
As proposed by Vaughn Cato, I have changed the CPU frequency scaling policy to "Performance". Now I am getting the following results:
It confirms Vaughn Cato's conjecture. Sorry for the silly question.
What you are probably seeing is CPU frequency scaling (throttling). The CPU goes into a low-frequency state to save power when it isn't being heavily used.
Just before running your program, the CPU clock speed is probably fairly low, since there is no big load. When you run your program, the busy loop increases the load, and the CPU clock speed goes up until you hit the maximum clock speed, decreasing your times.
If you run your program several times in a row, you'll probably see the times stay at a lower value after the first run.
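If changing the scaling governor is not an option, another way to reduce this effect is a short warm-up before the measured section, so the CPU has already ramped up to its full clock speed; a minimal sketch (the warmUp helper is illustrative, not part of the original code):
#include <chrono>

// Spin for roughly half a second so frequency scaling has ramped up
// before any measurements are taken.
static void warmUp()
{
    using namespace std::chrono;
    volatile unsigned long long sink = 0;
    const auto until = steady_clock::now() + milliseconds(500);
    while (steady_clock::now() < until)
        sink = sink + 1;
}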
In your original experiment, there are too many variables that can affect the measurements:
the use of your processor by other active processes (i.e. the scheduling of your OS)
the question whether your loop is optimized away or not
the access to and buffering of the console
the initial mode of your CPU (see the answer about throttling)
I must admit that I was very skeptical about your observations. I therefore wrote a small variant using a preallocated vector, to avoid I/O synchronisation effects:
volatile int i, k;
const int n = 1000000, kmax = 200, n_avg = 30;
std::vector<long> v(kmax, 0);

for(k = 0; k < kmax; ++k) {
    auto begin = Time::now();
    for (i = 0; i < n; ++i); // <-- remains thanks to volatile
    auto end = Time::now();
    auto dur = std::chrono::duration_cast<us>(end - begin).count();
    v[k] = dur;
}
I then ran it several times on ideone (which, given the scale of its use, we can assume keeps the processor under more or less constant load). Indeed, your observations seemed to be confirmed.
I guess that this could be related to branch prediction, which should improve thanks to the repetitive pattern.
However, I went on, updated the code slightly, and added a loop to repeat the experiment several times. Then I also started to get runs where your observation was not confirmed (i.e. at the end, the time was higher). But it may also be that the many other processes running on ideone influence the branch prediction in a different manner.
So in the end, concluding anything would require a more cautious experiment, on a machine running this benchmark (and only this benchmark) for a couple of hours.
I'm trying to make an LED blink to the beat of a certain song. The song has exactly 125 bpm.
The code that I wrote seems to work at first, but the longer it runs, the bigger the gap becomes between the LED flash and the beat it is supposed to match. The LED seems to blink a tiny bit too slowly.
I think that happens because lastBlink depends on the blink that happened right before it to stay in sync, instead of using one fixed initial value to sync to...
unsigned int bpm = 125;
int flashDuration = 10;
unsigned int lastBlink = 0;
for(;;) {
    if (getTickCount() >= lastBlink + 1000/(bpm/60)) {
        lastBlink = getTickCount();
        printf("Blink!\r\n");
        RS232_SendByte(cport_nr, 4); //LED ON
        delay(flashDuration);
        RS232_SendByte(cport_nr, 0); //LED OFF
    }
}
Add the value to lastBlink instead of re-reading getTickCount(), because getTickCount() might have advanced past the exact beat you wanted to wait for.
lastblink+=1000/(bpm/60);
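In context, a minimal sketch of the loop with that change (keeping the question's names; the period expression itself is left unchanged here):
for(;;) {
    if (getTickCount() >= lastBlink + 1000/(bpm/60)) {
        lastBlink += 1000/(bpm/60);  // advance by one period, don't re-read the tick count
        printf("Blink!\r\n");
        RS232_SendByte(cport_nr, 4); //LED ON
        delay(flashDuration);
        RS232_SendByte(cport_nr, 0); //LED OFF
    }
}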
Busy-waiting is bad, it spins the CPU for no good reason, and under most OS's it will lead to your process being punished -- the OS will notice that it is using up lots of CPU time and dynamically lower its priority so that other, less-greedy programs get first dibs on CPU time. It's much better to sleep until the appointed time(s) instead.
The trick is to dynamically calculate the amount of time to sleep until the next time to blink, based on the current system-clock time. (Simply delaying by a fixed amount of time means you will inevitably drift, since each iteration of your loop takes a non-zero and somewhat indeterminate time to execute).
Example code (tested under MacOS/X, probably also compiles under Linux, but can be adapted for just about any OS with some changes) follows:
#include <stdio.h>
#include <unistd.h>
#include <time.h>       // for clock_gettime() / struct timespec
#include <sys/times.h>

// unit conversion code, just to make the conversion more obvious and self-documenting
static unsigned long long SecondsToMillis(unsigned long secs) {return secs*1000;}
static unsigned long long MillisToMicros(unsigned long ms) {return ms*1000;}
static unsigned long long NanosToMillis(unsigned long nanos) {return nanos/1000000;}

// Returns the current absolute time, in milliseconds, based on the appropriate high-resolution clock
static unsigned long long getCurrentTimeMillis()
{
#if defined(USE_POSIX_MONOTONIC_CLOCK)
    // Nicer New-style version using clock_gettime() and the monotonic clock
    struct timespec ts;
    return (clock_gettime(CLOCK_MONOTONIC, &ts) == 0) ? (SecondsToMillis(ts.tv_sec)+NanosToMillis(ts.tv_nsec)) : 0;
#else
    // old-school POSIX version using times()
    static clock_t _ticksPerSecond = 0;
    if (_ticksPerSecond <= 0) _ticksPerSecond = sysconf(_SC_CLK_TCK);

    struct tms junk; clock_t newTicks = (clock_t) times(&junk);
    return (_ticksPerSecond > 0) ? (SecondsToMillis((unsigned long long)newTicks)/_ticksPerSecond) : 0;
#endif
}

int main(int, char **)
{
    const unsigned int bpm = 125;
    const unsigned int flashDurationMillis = 10;
    const unsigned int millisBetweenBlinks = SecondsToMillis(60)/bpm;
    printf("Milliseconds between blinks: %u\n", millisBetweenBlinks);

    unsigned long long nextBlinkTimeMillis = getCurrentTimeMillis();
    for(;;) {
        long long millisToSleepFor = nextBlinkTimeMillis - getCurrentTimeMillis();
        if (millisToSleepFor > 0) usleep(MillisToMicros(millisToSleepFor));

        printf("Blink!\r\n");
        //RS232_SendByte(cport_nr, 4); //LED ON
        usleep(MillisToMicros(flashDurationMillis));
        //RS232_SendByte(cport_nr, 0); //LED OFF

        nextBlinkTimeMillis += millisBetweenBlinks;
    }
}
I think the drift problem may be rooted in your using relative time delays by sleeping for a fixed duration rather than sleeping until an absolute point in time. The problem is threads don't always wake up precisely on time due to scheduling issues.
Something like this solution may work for you:
// for readability
using clock = std::chrono::steady_clock;
unsigned int bpm = 125;
int flashDuration = 10;
// time for entire cycle
clock::duration total_wait = std::chrono::milliseconds(1000 * 60 / bpm);
// time for LED off part of cycle
clock::duration off_wait = std::chrono::milliseconds(1000 - flashDuration);
// time for LED on part of cycle
clock::duration on_wait = total_wait - off_wait;
// when is next change ready?
clock::time_point ready = clock::now();
for(;;)
{
    // wait for time to turn light on
    std::this_thread::sleep_until(ready);
    RS232_SendByte(cport_nr, 4); // LED ON

    // reset timer for off
    ready += on_wait;

    // wait for time to turn light off
    std::this_thread::sleep_until(ready);
    RS232_SendByte(cport_nr, 0); // LED OFF

    // reset timer for on
    ready += off_wait;
}
If your problem is drifting out of sync rather than latency I would suggest measuring time from a given start instead of from the last blink.
start = now()
blinks = 0
period = 60 / bpm
while true
    if 0 < ((now() - start) - blinks * period)
        ledon()
        sleep(blinklength)
        ledoff()
        blinks++
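A rough C++ translation of that pseudocode, assuming <chrono> and <thread> are available (the LED calls are left as comments, standing in for the question's RS232_SendByte calls):
#include <chrono>
#include <thread>

void blinkLoop()
{
    using clock = std::chrono::steady_clock;
    const double bpm = 125.0;
    const auto period = std::chrono::duration<double>(60.0 / bpm); // seconds per beat
    const auto start = clock::now();
    long blinks = 0;

    for (;;) {
        // Blink only once the elapsed time has passed the next scheduled beat,
        // measured from the fixed start rather than from the previous blink.
        if (clock::now() - start > blinks * period) {
            // turn the LED on here (e.g. RS232_SendByte(cport_nr, 4))
            std::this_thread::sleep_for(std::chrono::milliseconds(10)); // flash duration
            // turn the LED off here (e.g. RS232_SendByte(cport_nr, 0))
            ++blinks;
        }
    }
}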
Since you didn't specify C++98/03, I'm assuming at least C++11, and thus <chrono> is available. This so far is consistent with Galik's answer. However I would set it up so as to use <chrono>'s conversion abilities more precisely, and without having to manually enter conversion factors, except to describe "beats / minute", or actually in this answer, the inverse: "minutes / beat".
using namespace std;
using namespace std::chrono;
using mpb = duration<int, ratio_divide<minutes::period, ratio<125>>>;
constexpr auto flashDuration = 10ms;
auto beginBlink = steady_clock::now() + mpb{0};
while (true)
{
    RS232_SendByte(cport_nr, 4); //LED ON
    this_thread::sleep_until(beginBlink + flashDuration);
    RS232_SendByte(cport_nr, 0); //LED OFF
    beginBlink += mpb{1};
    this_thread::sleep_until(beginBlink);
}
The first thing to do is specify the duration of a beat, which is "minutes/125". This is what mpb does. I've used minutes::period as a stand in for 60, just in an attempt to improve readability and reduce the number of magic numbers.
Assuming C++14, I can give flashDuration real units (milliseconds). In C++11 this would need to be spelled with this more verbose syntax:
constexpr auto flashDuration = milliseconds{10};
And then the loop: This is very similar in design to Galik's answer, but here I only increment the time to start the blink once per iteration, and each time, by precisely 60/125 seconds.
By delaying until a specified time_point, as opposed to a specific duration, one ensures that there is no round off accumulation as time progresses. And by working in units which exactly describe your required duration interval, there is also no round off error in terms of computing the start time of the next interval.
No need to traffic in milliseconds. And no need to compute how long one needs to delay. Only the need to symbolically compute the start time of each iteration.
Um...
Sorry to pick on Galik's answer, which I believe is the second best answer next to mine, but it exhibits a bug which my answer not only doesn't have, but is designed to prevent. I didn't notice it until I dug into it with a calculator, and it is subtle enough that testing might miss it.
In Galik's answer:
total_wait = 480ms; // this is exactly correct
off_wait = 990ms; // likely a design flaw
on_wait = -510ms; // certainly a mistake
And the total time that an iteration takes is on_wait + off_wait, which is 480ms, exactly equal to total_wait, making debugging very challenging.
In contrast my answer increments ready (beginBlink) only once, and by exactly 480ms.
My answer is more likely to be right for the simple reason that it delegates more of its computation to the <chrono> library. And in this particular case, that probability paid off.
Avoid manual conversions. Instead let the <chrono> library do them for you. Manual conversions introduce the possibility for error.
You should count the time spent in the process and subtract it from the flashDuration value.
The most obvious issue is that you're losing precision when you divide bpm/60. This always yields an integer (2) instead of 2.08333333...
Calling getTickCount() twice could also lead to some drift.
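Concretely, with bpm = 125 the integer expression 1000/(bpm/60) evaluates to 500 ms instead of the correct 480 ms per beat, so each beat comes around 20 ms late. Doing the arithmetic in floating point avoids the truncation (a small sketch using the question's names):
const double periodMs = 60000.0 / bpm;              // 480.0 ms per beat, no integer truncation
if (getTickCount() >= lastBlink + periodMs) { /* blink */ }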
In a function that updates all particles I have the following code:
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= _decayRate * deltaTime;
    }
}
This decreases the lifetime of the particle based on the time that passed.
It gets calculated every iteration, so if I have 10000 particles, that wouldn't be very efficient, because it doesn't need to be (it doesn't change anyway).
So I came up with this:
float lifeMin = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
{
    // check if active
    if (_particles[i].lifeTime > 0.0f)
    {
        _particles[i].lifeTime -= lifeMin;
    }
}
This calculates it once and stores it in a variable that gets read every iteration, so the CPU doesn't have to recalculate it each time, which should theoretically increase performance.
Would it run faster than the old code? Or does the release compiler do optimizations like this?
I wrote a program that compares both methods:
#include <time.h>
#include <iostream>

const unsigned int MAX = 1000000000;

int main()
{
    float deltaTime = 20;
    float decayRate = 200;
    float foo = 2041.234f;

    unsigned int start = clock();
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= decayRate * deltaTime;
    }
    std::cout << "Method 1 took " << clock() - start << "ms\n";

    start = clock();
    float calced = decayRate * deltaTime;
    for (unsigned int i = 0; i < MAX; i++)
    {
        foo -= calced;
    }
    std::cout << "Method 2 took " << clock() - start << "ms\n";

    int n;
    std::cin >> n;
    return 0;
}
Result in debug mode:
Method 1 took 2470ms
Method 2 took 2410ms
Result in release mode:
Method 1 took 0ms
Method 2 took 0ms
But that doesn't really work as a test. I know it doesn't do exactly the same thing as the particle code, but it gives an idea.
In debug mode, they take roughly the same time. Sometimes Method 1 is faster than Method 2 (especially with smaller iteration counts), sometimes Method 2 is faster.
In release mode, it takes 0 ms. A little weird.
I tried measuring it in the game itself, but there aren't enough particles to get a clear result.
EDIT
I tried to disable optimizations, and let the variables be user inputs using std::cin.
Here are the results:
Method 1 took 2430ms
Method 2 took 2410ms
It will almost certainly make no difference whatsoever, at least if you compile with optimization (and of course, if you're concerned with performance, you are compiling with optimization). The optimization in question is called loop invariant code motion, and it is universally implemented (and has been for about 40 years).
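To illustrate, loop invariant code motion effectively hoists the invariant product out of the loop, conceptually turning the question's first version into its second one (a simplified sketch, not actual compiler output; the name "hoisted" is illustrative):
// before: the product appears inside the loop
for (int i = 0; i < _maxParticles; i++)
    if (_particles[i].lifeTime > 0.0f)
        _particles[i].lifeTime -= _decayRate * deltaTime;

// after hoisting: the invariant expression is computed once
const float hoisted = _decayRate * deltaTime;
for (int i = 0; i < _maxParticles; i++)
    if (_particles[i].lifeTime > 0.0f)
        _particles[i].lifeTime -= hoisted;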
On the other hand, it may make sense to use the separate variable anyway, to make the code clearer. This depends on the application, but in many cases, giving a name to the result of an expression can make code clearer. (In other cases, of course, throwing in a lot of extra variables can make it less clear. It all depends on the application.)
In any case, for such things, write the code as clearly as possible first, and then, if (and only if) there is a performance problem, profile to see where it is, and fix that.
EDIT:
Just to be perfectly clear: I'm talking about this sort of code optimization in general. In the exact case you show, since you don't use foo, the compiler will probably remove it (and the loops) completely.
In theory, yes. But your loop is extremely simple and thus likely to be heavily optimized.
Try the -O0 option to disable all compiler optimizations.
The 0 ms release-mode runtime might be caused by the compiler statically computing the result.
I am pretty confident that any decent compiler will replace your loops with the following code:
foo -= MAX * decayRate * deltaTime;
and
foo -= MAX * calced ;
You can make MAX depend on some kind of input (e.g. a command-line parameter) to avoid that.
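For example, a small sketch of taking the loop bound from the command line so the compiler cannot fold the loops at compile time (the helper name is illustrative, and parsing is kept deliberately simple):
#include <cstdlib>

// Use the first command-line argument as the loop bound if one is given,
// otherwise fall back to the original compile-time MAX.
unsigned int loopBoundFromArgs(int argc, char *argv[])
{
    return (argc > 1) ? static_cast<unsigned int>(std::strtoul(argv[1], nullptr, 10))
                      : 1000000000u;
}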
I've made a small application that averages the numbers between 1 and 1000000. It's not hard to see (using a very basic algebraic formula) that the average is 500000.5 but this was more of a project in learning C++ than anything else.
Anyway, I made clock variables that were designed to find the number of clock steps required for the application to run. When I first ran it, it said that it took 3770000 clock steps, but every time that I've run it since then, it's taken "0.0" seconds...
I've attached my code at the bottom.
Either a.) It's saved the variables from the first time I ran it, and it's just running quickly to the answer...
or b.) something is wrong with how I'm declaring the time variables.
Regardless... it doesn't make sense.
Any help would be appreciated.
FYI (I'm running this through a Linux computer, not sure if that matters)
double avg (int arr[], int beg, int end)
{
    int nums = end - beg + 1;
    double sum = 0.0;
    for(int i = beg; i <= end; i++)
    {
        sum += arr[i];
    }
    //for(int p = 0; p < nums*10000; p ++){}
    return sum/nums;
}

int main (int argc, char *argv[])
{
    int nums = 1000000;//atoi(argv[0]);
    int myarray[nums];
    double timediff;

    //printf("Arg is: %d\n",argv[0]);
    printf("Nums is: %d\n",nums);

    clock_t begin_time = clock();
    for(int i = 0; i < nums; i++)
    {
        myarray[i] = i+1;
    }
    double average = avg(myarray, 0, nums - 1);
    printf("%f\n",average);
    clock_t end_time = clock();

    timediff = (double) difftime(end_time, begin_time);
    printf("Time to Average: %f\n", timediff);

    return 0;
}
You are measuring the I/O operation too (printf), that depends on external factors and might be affecting the run time. Also, clock() might not be as precise as needed to measure such a small task - look into higher resolution functions such as clock_get_time(). Even then, other processes might affect the run time by generating page fault interrupts and occupying the memory BUS, etc. So this kind of fluctuation is not abnormal at all.
On the machine I tested, Linux's clock call was only accurate to 1/100th of a second. If your code runs in less than 0.01 seconds, it will usually say zero seconds have passed. Also, I ran your program a total of 50 times in .13 seconds, so I find it suspicious that you claim it takes 2 seconds to run it once on your computer.
Your code incorrectly uses difftime (which expects time_t values, not the clock_t values that clock() returns), so it may display incorrect output even when clock says time did pass.
I'd guess that the first timing you got was with different code than that posted in this question, because I can't think of any way the code in this question could produce a time of 3770000.
Finally, benchmarking is hard, and your code has several benchmarking mistakes:
You're timing how long it takes to (1) fill an array, (2) calculate an average, (3) format the result string, and (4) make an OS call (slow) that prints said string in the right language/font/color/etc, which is especially slow.
You're attempting to time a task which takes less than a hundredth of a second, which is WAY too small for any accurate measurement.
Here is my take on your code, measuring that the average takes ~0.001968 seconds on this machine.