I was trying to create a gameloop with fps dependent on the speed of its iterations. To achieve this I wanted to use a platform specific timer that (in case of windows) used the timeGetTime function (https://learn.microsoft.com/en-us/windows/desktop/api/timeapi/nf-timeapi-timegettime) to calculate how much time has passed since the last iteration. But I found that the time it costs to call this function is already quite a lot (for a computer). Now I'm wondering if this is the right approach.
I created a simple test that looks like this:
Timer* timer = new Timer();
for (int i = 0; i < 60; i++)
    cout << timer->get_elt() << endl;
delete timer;
The timer class looks like this: (begin is a DWORD)
Timer::Timer()
{
begin = timeGetTime();
}
int Timer::get_elt()
{
return timeGetTime() - begin;
}
Not very interesting, but here is an example of the result:
0 0 1 3 4 14 15 15 15 16 16 17 17 17 17 18 19 19 19 19 20 20 20 20 20 21 21 21 21 21 22 22 22 22 22 22 23 23 23 25 38 39 39 55 56 56 66 68 68 69 71 71 72 73 73 73 73 73 74 74
I was expecting this to take about 10 milliseconds at most, but on average it took about 64.
What surprised me most about it was how erratic the results were. Sometimes it prints the same number up to 7 times, whereas at other times there are gaps of 12 milliseconds between iterations. I realize this is partly because the timer is not accurate, but still. As far as I know your PC should execute this program as fast as it possibly can, is that even true?
If you want to run your game at say 60 fps, you'd have about 16 milliseconds for every loop, and if calling the timer alone takes about 2 milliseconds on average every time, and you still need to process input, update, and render, how is that even possible?
So what should I do here, is timeGetTime something you could use in a gameloop (it's been suggested a lot), or should I think of another function?
I would suggest using QueryPerformanceCounter instead:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904(v=vs.85).aspx
The timers from the Windows Multimedia API are a good choice for animation, games, etc. They have the greatest precision on the Windows platform. Qt also uses these timers and qualifies them as precise:
http://doc.qt.io/qt-5/qt.html#TimerType-enum
On Windows, Qt will use Windows's Multimedia timer facility (if
available) for Qt::PreciseTimer and normal Windows timers for
Qt::CoarseTimer and Qt::VeryCoarseTimer.
I need to extract a data structure, a specific number of bytes wide, from the memory of an application I am debugging, preferably as a series of hex pairs. I want to get this data from the Command or Immediate window in the Visual Studio debugger. I could achieve this in WinDbg via the db command, but I am having trouble finding the corresponding command for Visual Studio. Debug.Print is insufficient, as it stops printing as soon as it encounters a null character.
I know such a command exists as I have used it before, but I can't for the life of me find it. This is what I get for not writing things down.
I was able to find the answer to this after digging through some documentation. The command I wanted was Debug.ListMemory, which is aliased to the d command. The command to print bytes in hex pairs is specifically db /Count:[number of bytes to print] [memory address].
>db /Count:1686 0x0000021f7102d4d0
0x0000021F7102D4D0 48 72 2f 50 73 36 68 75 4e 6c 59 44 44 56 33 33
0x0000021F7102D4E0 38 78 37 4f 55 65 6c 62 6c 6f 51 78 77 66 4e 68
0x0000021F7102D4F0 35 73 4e 35 42 68 4d 67 54 7a 6e 35 6d 36 52 41
...
Assuming that p is a pointer to the array of bytes, you can enter a watch expression like this:
(p + start_pos),[items_count]
I discovered a strange behavior in QDateTime of Qt 4.8 regarding fromMSecsSinceEpoch. The following code does not produce the result I would expect:
assert(
QDateTime::fromMSecsSinceEpoch(
std::numeric_limits<qint64>::max()
).isValid() == true
);
assert(
QDateTime::fromMSecsSinceEpoch(
std::numeric_limits<qint64>::max()
).toMSecsSinceEpoch() == std::numeric_limits<qint64>::max()
);
While the first assertion is true, the second fails. The returned result from Qt is -210866773624193.
The doc for QDateTime::fromMSecsSinceEpoch(qint64 msecs) clearly states:
There are possible values for msecs that lie outside the valid range of QDateTime, both negative and positive. The behavior of this function is undefined for those values.
However, there is no explicit statement about the valid range.
I found this Qt bug report about an issue regarding timezones in Qt 5.5.1, 5.6.0 and 5.7.0 Beta.
I am not sure whether this is a similar bug, or if the value I provided to QDateTime::fromMSecsSinceEpoch(qint64 msecs) is simply invalid.
What is (or rather should be) the maximum value that can be passed to this function and yields correct behavior?
std::numeric_limits<qint64>::max() is 9 223 372 036 854 775 807 ms, or 9 223 372 036 854 775 s, or 2 562 047 788 015 hours, or 106 751 991 167 days, or 292 471 208 years: that's far beyond the year 11 million, the end of the valid range of QDateTime.
From the doc, valid dates start from January 2nd, 4713 BCE, and go until QDate::toJulianDay() overflows: 2^31 days (the maximum value for a signed 32-bit integer) yields nearly 5 900 000 years. That's 185 542 587 187 200 000 ms (counted from January 2nd, 4713 BCE, not from the epoch), "little" more than 2^57.
EDIT:
After discussion in the comments, you checked the Qt 4.8 sources and found that fromMSecsSinceEpoch() uses QDate(1970, 1, 1).addDays(ddays) internally, where the number of days is calculated directly from the msecs parameter.
Since ddays is of type int here, this would overflow for values larger than 2^31.
I'm trying to speed up the OpenCV SIFT algorithm with OpenMP on an Intel® Core™ i5-6500 CPU @ 3.20GHz × 4. You can find the code in sift.cpp.
The most expensive part is the descriptor computation, in particular:
static void calcDescriptors(const std::vector<Mat>& gpyr, const std::vector<KeyPoint>& keypoints,
Mat& descriptors, int nOctaveLayers, int firstOctave )
{
int d = SIFT_DESCR_WIDTH, n = SIFT_DESCR_HIST_BINS;
for( size_t i = 0; i < keypoints.size(); i++ )
{
KeyPoint kpt = keypoints[i];
int octave, layer;
float scale;
unpackOctave(kpt, octave, layer, scale);
CV_Assert(octave >= firstOctave && layer <= nOctaveLayers+2);
float size=kpt.size*scale;
Point2f ptf(kpt.pt.x*scale, kpt.pt.y*scale);
const Mat& img = gpyr[(octave - firstOctave)*(nOctaveLayers + 3) + layer];
float angle = 360.f - kpt.angle;
if(std::abs(angle - 360.f) < FLT_EPSILON)
angle = 0.f;
calcSIFTDescriptor(img, ptf, angle, size*0.5f, d, n, descriptors.ptr<float>((int)i));
}
}
The serial version of this function takes 52 ms on average.
This loop has high granularity: it executes 604 times (which is keypoints.size()). The main time-consuming component inside the loop is calcSIFTDescriptor, which takes most of the cycle time: 105 µs on average, though it often takes 200 µs or 50 µs instead.
However, we are incredibly lucky: there is no dependency between loop iterations, so we can just add:
#pragma omp parallel for schedule(dynamic,8)
and obtain an initial speedup. The dynamic option is used since it seems to give slightly better performance than static (I don't know why).
The problem is that it's really unstable and doesn't scale. This is the time needed to compute the function in parallel mode:
25ms 43ms 32ms 15ms 27ms 53ms 21ms 24ms
As you can see, the optimal speedup for a quad-core system (15 ms) is reached only once. Most of the time we get half of it: 25 ms on a quad-core is only half of the theoretical optimum.
Why does this happen? How can we improve this?
UPDATE:
As suggested in the comments, I tried a bigger dataset. Using a huge image, the serial version takes 13574 ms to compute the descriptors, while the parallel version takes 3704 ms on the same quad-core as before. Much better: even if it's not the best theoretical result, it actually scales well. But the problem remains, since the previous results were obtained from a typical image.
UPDATE 1: as suggested in the comments, I tried to benchmark without any interval between executions, in a "hot mode" (see the comments for more details). Better results are achieved more frequently, but there is still a lot of variation. These are the times (in ms) for 100 runs in hot mode:
43 42 14 26 14 43 13 26 15 51 15 20 14 40 34 15 15 31 15 22 14 21 17 15 14 27 14 16 14 22 14 22 15 15 14 43 16 16 15 28 14 24 14 36 15 32 13 21 14 23 14 15 13 26 15 35 13 32 14 36 14 34 15 40 28 14 14 15 15 35 15 22 14 17 15 23 14 24 17 16 14 35 14 29 14 25 14 32 14 28 14 34 14 30 22 14 15 24 14 31
You can see a lot of good results (14 ms, 15 ms) but a lot of horrible results too (>40 ms). The average is 22 ms. Notice that, by contrast, there is at most 4 ms of variation in the sequential mode:
52 54 52 52 51 52 52 53 53 52 53 51 52 53 53 54 53 53 53 53 54 53 54 54 53 53 53 52 53 52 51 52 52 53 54 54 54 55 55 55 54 54 54 53 53 52 52 52 51 52 54 53 54 54 54 55 54 54 52 55 52 52 52 51 52 51 52 52 51 51 52 52 53 53 53 53 55 54 55 54 54 54 55 52 52 52 51 51 52 51 51 51 52 53 53 54 53 54 53 55
UPDATE 2:
I've noticed that the per-CPU utilization during the "hot mode" benchmarking is quite random, and it never reaches more than 80%, as shown in the image below:
By contrast, the image below shows the CPU utilization while I compile OpenCV with make -j4. As you can see, it is more stable and stays at almost 100%:
I think the variation in the first image is normal, since we execute the same short program many times, which is more unstable than one big program. What I don't understand is why we never reach more than 80% CPU utilization.
I strongly suggest using a performance tool such as Paraver (http://www.bsc.es/paraver), TAU (http://www.cs.uoregon.edu/research/tau/home.php), Vampir (https://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/projekte/vampir) or even Intel's VTune (https://software.intel.com/en-us/intel-vtune-amplifier-xe).
These tools will help you understand where the threads spend their cycles. With them, you can find out whether the application is unbalanced (either by IPC or by instructions), whether there is a limitation due to memory bandwidth or false sharing, among many other issues.
I am using XBee DigiMesh 2.4 API-2 and Raspberry Pi. I broadcast a frame from one node to another.
Frame to transmit:
7e 0 12 10 1 0 0 0 0 0 0 ff ff ff fe 0 0 41 6c 65 78 69
Frame received in the other node:
7e 0 10 90 0 7d 33 a2 0 40 91 57 26 ff fe c2 41 6c 65 78 1e
The byte that is bothering me is c2. It should be 02. Why does it appear this way?
What is more, checksum is not correct (I read how the checksum should be calculated in API 2 mode).
With byte 0x02 it should be 0xe3, and with byte 0xc2 it should be 0x23. I tried to obtain the result 0x1e in many ways, but I never got this value.
When I broadcast the packet in the opposite direction (from the second node to the first one), the same problems appear.
Both XBees are configured at 9600 baud, no parity, as is the Raspberry Pi UART.
----- Edit: I found the answer regarding the C2 byte. C2 is a bit field: C2 = 1100 0010.
Bits 7 and 6 are 11, which here means it is DigiMesh. Bit 1 is set, so it is a broadcast packet.
https://dl.dropboxusercontent.com/u/318853/XBee%20900.PNG
I am still looking for the reason for this checksum.
You can simplify your code by using API mode 1 and eliminating the need to escape and unescape values as you send and receive them. It really isn't that difficult to have your code figure out framing and ignore 0x7E in the middle of a frame: If you see a 0x7E followed by an invalid length, keep looking. If your frame has a bad checksum, skip the 0x7E and look for the next one.
If you absolutely must use escaping, ensure that the length value and checksum in your frame don't include the escaping bytes, and that you're properly escaping the necessary bytes as you send them.
On the receiving end, unescape the bytes and then calculate the checksum.
I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using `ofstream`. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The `_write$UNIX2003` (where can I find out what this is?) appears to be coming from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and `_write$UNIX2003` goes away. However, my application does not speed up, as measured by total time; the whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except I see no visible increase in performance (0.1 seconds on a 10 second calculation). The code is essentially:
ofstream outfile("out.txt");
for (/* each step */) {
    SliceStep::DoStep();
    outfile << result;
}
outfile.close();
Update: I'm timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should give you a rough idea of where your time is spent.
If the results are still not conclusive you need a better testcase:
Use a larger problem
Do a warmup before measuring. Do some loops and start any measurement afterwards (in the same process).
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
For example, printing 100000 lines to the console with Python:
for i in xrange(100000):
print i
To console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python test.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 probably refers to the write POSIX system call, which writes your output to a file descriptor. I/O is very slow compared to almost anything else, so it makes sense that your program spends a lot of time there if you are writing a fair bit of output.
I'm not sure why your program wouldn't speed up when you remove the output, but I can't really make a guess on only the information you've given. It would be nice to see some of the code, or even the perftools output when the cout statement is removed.
Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the doc, you can display the call graph at statement or address granularity. That should tell you what you need to know.