gperftools not showing call graph results - profiling

I've got gperftools installed and collecting data, and the data looks reasonable so far. I see one node that gets sampled a lot, but I'm interested in the callers to that node and I don't see them. I've also tried callgrind/kcachegrind and feel like I'm missing something. Here's a snippet of the output when using --text:
Total: 1844 samples
573 31.1% 31.1% 573 31.1% US_strcpy
185 10.0% 41.1% 185 10.0% US_strstr
167 9.1% 50.2% 167 9.1% US_strlen
63 3.4% 53.6% 63 3.4% PS_CompressTable
58 3.1% 56.7% 58 3.1% LX_LexInternal
51 2.8% 59.5% 51 2.8% US_CStrEql
47 2.5% 62.0% 47 2.5% 0x40472984
40 2.2% 64.2% 40 2.2% PS_DoSets
38 2.1% 66.3% 38 2.1% LX_ProcessCatRange
So I'm interested in seeing the callers to US_strcpy, but I don't seem to have any? I do get a nice call graph from kcachegrind for 0x40472984 (still trying to match that to a symbol)

There are several ways:
a) pprof --web or kcachegrind will show you the callers nicely if they were captured correctly. It is sometimes useful to run pprof --traces (only with the github.com/google/pprof version), which is somewhat like the low-tech method Mike mentioned above.
b) if the data really is unavailable, you have a problem with stack-trace capturing and/or symbolization. In that case, build gperftools with libunwind and build all of your program with debug info.
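For reference, a minimal sketch of capturing a profile that pprof can symbolize; the names main.cpp, myapp and myapp.prof are placeholders, and pprof may be installed as google-pprof on some systems:
g++ -g -O2 -o myapp main.cpp -lprofiler    # keep debug info so pprof can resolve symbols
CPUPROFILE=myapp.prof ./myapp              # libprofiler writes samples to myapp.prof on exit
pprof --web ./myapp myapp.prof             # interactive call graph, including callers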

Related

Is timeGetTime a good way to measure time for a game loop?

I was trying to create a game loop with FPS dependent on the speed of its iterations. To achieve this I wanted to use a platform-specific timer that (in the case of Windows) uses the timeGetTime function (https://learn.microsoft.com/en-us/windows/desktop/api/timeapi/nf-timeapi-timegettime) to calculate how much time has passed since the last iteration. But I found that the time it takes to call this function is already quite long (for a computer). Now I'm wondering if this is the right approach.
I created a simple test that looks like this:
Timer* timer = new Timer();
for (int i = 0; i < 60; i++)
    cout << timer->get_elt() << endl;
delete timer;
The timer class looks like this: (begin is a DWORD)
Timer::Timer()
{
    begin = timeGetTime();
}
int Timer::get_elt()
{
    return timeGetTime() - begin;
}
Not very interesting, but here is an example of the result:
0 0 1 3 4 14 15 15 15 16 16 17 17 17 17 18 19 19 19 19 20 20 20 20 20 21 21 21 21 21 22 22 22 22 22 22 23 23 23 25 38 39 39 55 56 56 66 68 68 69 71 71 72 73 73 73 73 73 74 74
I was expecting this to take about 10 milliseconds at most, but on average it took about 64.
What surprised me most was how erratic the results were. Sometimes it prints the same number up to 7 times, whereas at other times there are gaps of 12 milliseconds between iterations. I realize this is also because the timer is not accurate, but still. As far as I know, your PC should execute this program as fast as it possibly can; is that even true?
If you want to run your game at say 60 fps, you'd have about 16 milliseconds for every loop, and if calling the timer alone takes about 2 milliseconds on average every time, and you still need to process input, update, and render, how is that even possible?
So what should I do here: is timeGetTime something you could use in a game loop (it's been suggested a lot), or should I think of another function?
I would suggest using QueryPerformanceCounter instead:
https://msdn.microsoft.com/en-us/library/windows/desktop/ms644904(v=vs.85).aspx
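A minimal sketch of such a timer (assuming Windows; the class and function names here are illustrative, not part of the question's code):
#include <windows.h>
#include <iostream>

class HighResTimer
{
public:
    HighResTimer()
    {
        QueryPerformanceFrequency(&freq);   // ticks per second, fixed at system boot
        QueryPerformanceCounter(&begin);
    }
    double elapsed_ms() const
    {
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        return (now.QuadPart - begin.QuadPart) * 1000.0 / freq.QuadPart;
    }
private:
    LARGE_INTEGER freq;
    LARGE_INTEGER begin;
};

int main()
{
    HighResTimer timer;
    for (int i = 0; i < 60; i++)
        std::cout << timer.elapsed_ms() << std::endl;
}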
The timers from the Windows Multimedia API are a good choice for animation, games, etc.
They have the greatest precision on the Windows platform.
Qt also uses these timers and qualifies them as precise ones.
http://doc.qt.io/qt-5/qt.html#TimerType-enum
On Windows, Qt will use Windows's Multimedia timer facility (if
available) for Qt::PreciseTimer and normal Windows timers for
Qt::CoarseTimer and Qt::VeryCoarseTimer.
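As an illustration only (assuming Qt 5; the per-tick lambda stands in for real game-loop work), a timer can request that precision like this:
#include <QCoreApplication>
#include <QElapsedTimer>
#include <QTimer>
#include <QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    QElapsedTimer clock;
    clock.start();

    QTimer frameTimer;
    frameTimer.setTimerType(Qt::PreciseTimer);   // backed by the multimedia timer facility on Windows
    QObject::connect(&frameTimer, &QTimer::timeout, [&]() {
        // per-frame work would go here; just print the elapsed time for illustration
        qDebug() << "elapsed ms:" << clock.elapsed();
    });
    frameTimer.start(16);                        // roughly 60 ticks per second

    return app.exec();
}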

CFLint tool for ColdFusion (CFLint vs. Fortify)

I have scanned ColdFusion code using the CFLint jar CFLint-1.3.0-all.jar from the command line as
java -jar <jar path> -folder <mylocalColdfusioncodeFolder>
It generates a cflint-result.html file in the corresponding folder.
In the report, I found none of the cross-site scripting and DOM-related issues reported by the Fortify Audit Workbench tool. CFLint basically reports language-specific issues because it mainly runs on CFParser.
When I ran the command below to list the rules used for the scan, I found they are all language-specific rules.
java -jar CFLint-1.3.0-all.jar -rules gives a list of rules as
The Supported rules to check against the cfm code :
-----------------------------------------------------
1 ComplexBooleanExpressionChecker
2 GlobalLiteralChecker
3 CFBuiltInFunctionChecker
4 CreateObjectChecker
5 CFDumpChecker
6 FunctionTypeChecker
7 ArrayNewChecker
8 LocalLiteralChecker
9 SelectStarChecker
10 TooManyFunctionsChecker
11 QueryParamChecker
12 FunctionLengthChecker
13 OutputParmMissing
14 WriteDumpChecker
15 CFExecuteChecker
16 ComponentLengthChecker
17 GlobalVarChecker
18 CFModuleChecker
19 CFIncludeChecker
20 CFDebugAttributeChecker
21 ComponentDisplayNameChecker
22 ArgVarChecker
23 NestedCFOutput
24 VarScoper
25 FunctionHintChecker
26 ArgumentNameChecker
27 TooManyArgumentsChecker
28 SimpleComplexityChecker
29 TypedQueryNew
30 CFInsertChecker
31 StructKeyChecker
32 BooleanExpressionChecker
33 VariableNameChecker
34 MethodNameChecker
35 AbortChecker
36 ComponentNameChecker
37 UnusedArgumentChecker
38 StructNewChecker
39 PackageCaseChecker
40 CFAbortChecker
41 ComponentHintChecker
42 ArgumentTypeChecker
43 CFUpdateChecker
44 IsDebugModeChecker
45 ArgDefChecker
46 UnusedLocalVarChecker
47 CFSwitchDefaultChecker
48 ArgumentHintChecker
49 CFCompareVsAssignChecker
-----------------------------------------------------
And I found that CFLint does not raise errors for cross-site scripting (XSS) attacks. When I ran the same ColdFusion code folder through the Fortify tool (Audit Workbench), I got issues like
Cross-Site Scripting: Reflected
, Cross-Site Scripting: DOM
, Unreleased resource
, Dynamic Code Evaluation: Code Injection
, Hardcoded Password
, Sql Injection
, Path Manipulation
, log forging and privacy violation with the tags cfdocument
, cfdirectoryexists
, cfcookie
, cflog
, cffile.....
Can you please clarify whether CFLint scans for XSS issues or only checks rules specific to the ColdFusion language?
CFLint is only concerned with ColdFusion code. It is a linter, not a security scanner, so it does not look for XSS vulnerabilities. You are mixing up your tools and their purposes.
A linter scans for code issues, not security issues.

Valid range for QDateTime::fromMSecsSinceEpoch

I discovered a strange behavior in QDateTime of Qt 4.8 regarding fromMSecsSinceEpoch. The following code does not produce the result I would expect:
assert(
    QDateTime::fromMSecsSinceEpoch(
        std::numeric_limits<qint64>::max()
    ).isValid() == true
);
assert(
    QDateTime::fromMSecsSinceEpoch(
        std::numeric_limits<qint64>::max()
    ).toMSecsSinceEpoch() == std::numeric_limits<qint64>::max()
);
While the first assertion is true, the second fails. The returned result from Qt is -210866773624193.
The doc for QDateTime::fromMSecsSinceEpoch(qint64 msecs) clearly states:
There are possible values for msecs that lie outside the valid range of QDateTime, both negative and positive. The behavior of this function is undefined for those values.
However, there is no explicit statement about the valid range.
I found this Qt bug report about an issue regarding timezones in Qt 5.5.1, 5.6.0 and 5.7.0 Beta.
I am not sure whether this is a similar bug, or if the value I provided to QDateTime::fromMSecsSinceEpoch(qint64 msecs) is simply invalid.
What is (or rather should be) the maximum value that can be passed to this function and yields correct behavior?
std::numeric_limits<qint64>::max() ms yields 9 223 372 036 854 775 807 ms, or 9 223 372 036 854 775 s, or 2 562 047 788 015 hours, or 106 751 991 167 days, or 292 471 208 years: that's far beyond the year 11 million in the valid range of QDateTime.
From the doc, valid dates start from January 2nd, 4713 BCE, and go until QDate::toJulianDay() overflow: 2^31 days (max value for signed integer) yields nearly 5 000 000 years. That's 185 542 587 187 200 000 ms (from January 2nd, 4713 BCE, not from Epoch), "little" more than 2^57.
EDIT:
After the discussion in the comments, you checked the Qt 4.8 sources and found that fromMSecsSinceEpoch() uses QDate(1970, 1, 1).addDays(ddays) internally, where the number of days is computed directly from the msecs parameter.
Since ddays is of type int here, it overflows for values larger than 2^31.
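A small self-contained sketch (plain C++, not the actual Qt internals) of why that overflows: the day count derived from the maximum qint64 millisecond value does not fit in a 32-bit int.
#include <cstdint>
#include <iostream>
#include <limits>

int main()
{
    const int64_t msecs = std::numeric_limits<int64_t>::max();
    const int64_t msPerDay = 86400000;              // 24 * 60 * 60 * 1000
    const int64_t days64 = msecs / msPerDay;        // about 106,751,991,167 days
    const int days32 = static_cast<int>(days64);    // far outside the range of a 32-bit int, so this truncates
    std::cout << days64 << " days as 64-bit, " << days32 << " after conversion to int\n";
}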

gperftools CPU profiler cannot see the symbols when profiling a file created by an ARM device

I want to profile a C++ application that runs on an ARM device.
I ran my app and I profiled it using ProfilerStart("googleProfBL.prof"), so the file is generated.
When I open the file from the ARM device on my local computer I get this:
./pprof --text --add_lib=libraryIwanttoDebug.so BinaryThatLoadsThatLibrary googleProfBL.prof
Using local file /home/genius/PresControler/src-build-target/deploy/NavStartup.
Using local file ../traces/googleProfBL.prof.
Warning: address ffffffffffffffff is longer than address length 8
Warning: address ffffffffffffffff is longer than address length 8
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Total: 5347 samples
258 4.8% 4.8% 258 4.8% 0x76d4c276
144 2.7% 7.5% 144 2.7% 0x76da2cc4
126 2.4% 9.9% 126 2.4% 0x5d0f8284
114 2.1% 12.0% 114 2.1% 0x76d27386
64 1.2% 13.2% 64 1.2% 0x76dba2dc
53 1.0% 14.2% 53 1.0% 0x76dba1f4
...
The .so library is compiled in debug mode (it is not stripped), so I do not know why I am not getting the symbols.
I tried this:
./pprof --text --add_lib=aFileOfTheLibrary.o BinaryThatLoadsThatLibrary googleProfBL.prof
Looks like I got a couple of symbols.
Using local file /home/genius/PresControler/src-build-target/deploy/NavStartup.
Using local file ../traces/googleProfBL.prof.
Warning: address ffffffffffffffff is longer than address length 8
Warning: address ffffffffffffffff is longer than address length 8
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Total: 5347 samples
258 4.8% 4.8% 258 4.8% 0x76d4c276
144 2.7% 7.5% 144 2.7% 0x76da2cc4
126 2.4% 9.9% 126 2.4% 0x5d0f8284
114 2.1% 12.0% 114 2.1% 0x76d27386
64 1.2% 13.2% 64 1.2% 0x76dba2dc
53 1.0% 14.2% 53 1.0% 0x76dba1f4
50 0.9% 15.1% 50 0.9% 0x76dbf1bc
34 0.6% 15.8% 34 0.6% 0x72eae1b4
30 0.6% 16.3% 30 0.6% 0x76d8a32a
30 0.6% 16.9% 30 0.6% 0x76d8e2c0
..
0 0.0% 100.0% 7 0.1% std::forward_as_tuple <- I couldn't see that before!!!
I tried --add_lib for every .o I have, but I do not get any more symbols. Why do I not get the symbols? Does it have anything to do with the fact that I am inspecting the results on an Intel machine while they were generated on an ARM device? How could I fix that? Any help?
Thank you!
I get much more information now!
I was leaving my application by pressing Ctrl+C, so the file somehow got corrupted...
I did a test calling ProfilerStop() before pressing Ctrl+C and it worked (of course also using --lib_prefix to point at where the .so files are).
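For illustration, a minimal sketch of one way to make sure the profile is flushed even when the process is stopped with Ctrl+C: trap SIGINT and call ProfilerStop() there. The profile name matches the question; strictly speaking ProfilerStop() is not guaranteed to be async-signal-safe, so treat this as a debugging convenience rather than production code.
#include <csignal>
#include <cstdlib>
#include <gperftools/profiler.h>

static void onInterrupt(int)
{
    ProfilerStop();           // flush samples to googleProfBL.prof before exiting
    std::_Exit(0);
}

int main()
{
    std::signal(SIGINT, onInterrupt);
    ProfilerStart("googleProfBL.prof");

    // ... long-running work being profiled ...

    ProfilerStop();           // normal shutdown path
    return 0;
}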
I still got these warnings:
Warning: address ffffffffffffffff is longer than address length 8
Warning: address ffffffffffffffff is longer than address length 8
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
Hexadecimal number > 0xffffffff non-portable at ./pprof line 4475.
If someone knows why I am getting them (I assume it is because I am analyzing code generated on another device), please let me know.

What exactly does C++ profiling (google cpu perf tools) measure?

I'm trying to get started with Google Perf Tools to profile some CPU-intensive applications. It's a statistical calculation that dumps each step to a file using ofstream. I'm not a C++ expert, so I'm having trouble finding the bottleneck. My first pass gives these results:
Total: 857 samples
357 41.7% 41.7% 357 41.7% _write$UNIX2003
134 15.6% 57.3% 134 15.6% _exp$fenv_access_off
109 12.7% 70.0% 276 32.2% scythe::dnorm
103 12.0% 82.0% 103 12.0% _log$fenv_access_off
58 6.8% 88.8% 58 6.8% scythe::const_matrix_forward_iterator::operator*
37 4.3% 93.1% 37 4.3% scythe::matrix_forward_iterator::operator*
15 1.8% 94.9% 47 5.5% std::transform
13 1.5% 96.4% 486 56.7% SliceStep::DoStep
10 1.2% 97.5% 10 1.2% 0x0002726c
5 0.6% 98.1% 5 0.6% 0x000271c7
5 0.6% 98.7% 5 0.6% _write$NOCANCEL$UNIX2003
This is surprising, since all the real calculation occurs in SliceStep::DoStep. The "_write$UNIX2003" (where can I find out what this is?) appears to be coming from writing the output file. Now, what confuses me is that if I comment out all the outfile << "text" statements and run pprof, 95% is in SliceStep::DoStep and _write$UNIX2003 goes away. However, my application does not speed up, as measured by total time. The whole thing speeds up by less than 1 percent.
What am I missing?
Added:
The pprof output without the outfile << statements is:
Total: 790 samples
205 25.9% 25.9% 205 25.9% _exp$fenv_access_off
170 21.5% 47.5% 170 21.5% _log$fenv_access_off
162 20.5% 68.0% 437 55.3% scythe::dnorm
83 10.5% 78.5% 83 10.5% scythe::const_matrix_forward_iterator::operator*
70 8.9% 87.3% 70 8.9% scythe::matrix_forward_iterator::operator*
28 3.5% 90.9% 78 9.9% std::transform
26 3.3% 94.2% 26 3.3% 0x00027262
12 1.5% 95.7% 12 1.5% _write$NOCANCEL$UNIX2003
11 1.4% 97.1% 764 96.7% SliceStep::DoStep
9 1.1% 98.2% 9 1.1% 0x00027253
6 0.8% 99.0% 6 0.8% 0x000274a6
This looks like what I'd expect, except that I see no visible increase in performance (0.1 second on a 10-second calculation). The code is essentially:
ofstream outfile("out.txt");
for (/* each step */) {
    SliceStep::DoStep();
    outfile << result;
}
outfile.close();
Update: I am timing using boost::timer, starting where the profiler starts and ending where it ends. I do not use threads or anything fancy.
From my comments:
The numbers you get from your profiler say that the program should be around 40% faster without the print statements.
The runtime, however, stays nearly the same.
Obviously one of the measurements must be wrong. That means you have to do more and better measurements.
First I suggest starting with another easy tool: the time command. This should give you a rough idea of where your time is spent.
If the results are still not conclusive, you need a better test case:
Use a larger problem
Do a warmup before measuring. Do some loops and start any measurement afterwards (in the same process).
Tiristan: It's all in user. What I'm doing is pretty simple, I think... Does the fact that the file is open the whole time mean anything?
That means the profiler is wrong.
Printing 100000 lines using Python, i.e.:
for i in xrange(100000):
    print i
results in the following when written to the console:
time python print.py
[...]
real 0m2.370s
user 0m0.156s
sys 0m0.232s
Versus:
time python test.py > /dev/null
real 0m0.133s
user 0m0.116s
sys 0m0.008s
My point is:
Your internal measurements and time show you do not gain anything from disabling output. Google Perf Tools says you should. Who's wrong?
_write$UNIX2003 is probably referring to the write POSIX system call, which outputs to the terminal. I/O is very slow compared to almost anything else, so it makes sense that your program is spending a lot of time there if you are writing a fair bit of output.
I'm not sure why your program wouldn't speed up when you remove the output, but I can't really make a guess based only on the information you've given. It would be nice to see some of the code, or even the perftools output when the cout statements are removed.
Google perftools collects samples of the call stack, so what you need is to get some visibility into those.
According to the doc, you can display the call graph at statement or address granularity. That should tell you what you need to know.
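For example, with gperftools' pprof that might look like this (the binary and profile file names are placeholders):
pprof --gv --lines ./myapp myapp.prof          # call graph with one node per source line
pprof --text --addresses ./myapp myapp.prof    # flat listing at address granularity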