Huge difference in dotTrace and Stopwatch function timings - profiling

I'm profiling the performance of a couple of functions in my C# application. I was using the .NET Stopwatch to time a function over 20,000 calls, and the timing worked out to around 2.8 ms per call.
However, when using dotTrace in line-by-line mode, I find that 20,000 calls to my function take 249,584 ms, which is ~12.5 ms per call.
Now, the function is attached to a dispatch timer, so the Stopwatch was located inside the function and was not registering the call itself, like so:
private static Stopwatch myStop = new Stopwatch();

private void MyFunction(object obj, EventArgs e)
{
    myStop.Start();
    // Code here
    myStop.Stop();
    Console.WriteLine("Time elapsed for View Update: {0}", myStop.Elapsed);
    myStop.Reset();
}
However, I find it hard to believe that the call was taking 10 milliseconds on average.
Is there anything else that could be affecting the timing of the profiler or Stopwatch? Is the dispatch timer event affecting the timing that much?
I've looked through some of the JetBrains forums and wasn't able to find anything related to this, but I'm sure I could have looked harder and will continue to do so. I do realize that the Stopwatch is unreliable in some ways, but I didn't think it could be off by this much.
It should be noted that this is the first time I've profiled code in C# or .Net.

Short answer: line-by-line profiling has the biggest overhead of any profiling type.
For line-by-line profiling, dotTrace (and other profilers) will insert calls to a profiler function, let's call it GetTime(), which calculates the time spent since the previous call, sums it, and writes it somewhere in the snapshot.
So your function is not so fast and simple anymore.
Without profiling, your code can look like this:
myStop.Start();
var i = 5;
i++;
myStop.Stop();
And if you run it under the profiler, it will look like this:
dotTrace.GetTime();
myStop.Start();
dotTrace.GetTime();
var i = 5;
dotTrace.GetTime();
i++;
dotTrace.GetTime();
myStop.Stop();
So the 12.5 ms you get includes all these profiler API calls, which distorts the absolute function time a lot. Line-by-line profiling is mostly meant for comparing the relative times of statements. So if you want to accurately measure absolute function times, you should use the Sampling profiling type.
For more information about profiling types, you can refer to the dotTrace Profiling Types and Comparison of Profiling Types help pages.

Related

Is it possible to mark tests as taking long time in googletest runs

Most of my tests finish quickly; the time taken is unnoticeable. But a few of them take a few seconds. I would like to print a hint to the user:
TEST(something, thing) {
    std::cout << "This might take a few seconds\n";
    ASSERT_EQ(expected_result, long_computation());
}
This doesn't blend in well with what is printed. Is there a feature for this in googletest? I couldn't find anything related. Is there any way to make googletest understand it, print a hint to the user, and even report an error in case the test runs too long? Or is there any plugin that does this? Thanks. Something like:
TEST(something, thing, max_time: 3 seconds) {
    ASSERT_EQ(expected_result, long_computation());
}
I'm not sure anything like this exists, but I believe you can get at least some of the features you want.
You can measure the execution time of your test yourself and add statements like ASSERT_GT(time_limit, measured_time); a sketch is shown after this list.
Consider using https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#logging-additional-information for writing the timings or "long"/"fast" attributes.
Consider using https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests for running only fast or only long tests, e.g. run only fast tests (provided you named all long tests like *Long): ./foo_test --gtest_filter=-*Long
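For the first suggestion, here is a minimal sketch (expected_result and long_computation() are assumed from the question, and the 3-second budget is hypothetical); the *Long suffix in the test name also makes it filterable as described above:

#include <chrono>
#include <string>
#include <gtest/gtest.h>

TEST(something, thingLong) {
    const auto limit = std::chrono::seconds(3);          // hypothetical time budget
    const auto start = std::chrono::steady_clock::now();

    ASSERT_EQ(expected_result, long_computation());

    const auto elapsed = std::chrono::steady_clock::now() - start;
    // Attach the measured time to the test's record (shows up in the XML report).
    RecordProperty("elapsed_ms", std::to_string(
        std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count()));
    // Fail the test if it exceeded the budget.
    ASSERT_GT(limit, elapsed);
}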

Callgrind: Profile a specific part of my code

I'm trying to profile (with Callgrind) a specific part of my code by removing noise and computation that I don't care about.
Here is an example of what I want to do:
for (int i = 0; i < maxSample; ++i) {
    // Prepare data to be processed...
    // Method to be profiled with these data
    // Post operation on the data
}
My use case is a regression test: I want to make sure that the method in question is still fast enough (something like less than 10% extra instructions since the last implementation).
This is why I'd like to have cleaner output from Callgrind.
(I need the for loop to process a significant amount of data and get a good estimate of the behavior of the method I want to profile.)
My first try was to change the code to:
for (int i = 0; i < maxSample; ++i) {
    // Prepare data to be processed...
    CALLGRIND_START_INSTRUMENTATION;
    // Method to be profiled with these data
    CALLGRIND_STOP_INSTRUMENTATION;
    // Post operation on the data
}
CALLGRIND_DUMP_STATS;
Adding the Callgrind macros to control the instrumentation. I also added the --instr-atstart=no option to be sure that I profile only the part of the code I want...
Unfortunately, with this configuration, when I launch my executable with Callgrind, it never ends... It is not a question of slowness, because a fully instrumented run lasts less than one minute.
I also tried
for (int i = 0; i < maxSample; ++i) {
    // Prepare data to be processed...
    CALLGRIND_TOGGLE_COLLECT;
    // Method to be profiled with these data
    CALLGRIND_TOGGLE_COLLECT;
    // Post operation on the data
}
CALLGRIND_DUMP_STATS;
(or the --toggle-collect="myMethod" option)
But Callgrind gave me a log without any calls (KCachegrind is white as snow :( and shows zero instructions...).
Did I use the macros/options correctly? Any idea what I need to change in order to get the expected result?
I finally managed to solve this... It was a configuration issue:
I kept the code
for (int i = 0; i < maxSample; ++i) {
    // Prepare data to be processed...
    CALLGRIND_TOGGLE_COLLECT;
    // Method to be profiled with these data
    CALLGRIND_TOGGLE_COLLECT;
    // Post operation on the data
}
CALLGRIND_DUMP_STATS;
But I ran Callgrind with --collect-atstart=no (and without --instr-atstart=no!) and it worked perfectly, in a reasonable time (~1 min).
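For reference, the working command line then looks something like this (./myBenchmark is a hypothetical name for the benchmark executable):
$ valgrind --tool=callgrind --collect-atstart=no ./myBenchmark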
The issue with START/STOP instrumentation was that Callgrind dumps a file (callgrind.out.#number) at each iteration (each STOP), so it was really, really slow... (after 5 minutes I had only 5,000 runs of a 300,000-iteration benchmark... unsuitable for a regression test).
The toggle-collect option is very picky about how you specify the method to use as the trigger. You actually need to specify its argument list as well, and even the whitespace needs to match! Use the method name exactly as it appears in the callgrind output. For instance, I am using this invocation:
$ valgrind \
    --tool=callgrind \
    --collect-atstart=no \
    "--toggle-collect=ctrl_simulate(float, int)" \
    ./swaag
Please observe:
The double quotes around the option.
The argument list including parentheses.
The whitespace after the comma character.

Trying to print the registers' values from the stack using a pin tool

I am trying to print out the stack in different routines using a Pin tool. I am able to get all of the routines, but I am a little confused about how to get the addresses stored in the registers on the stack of that routine.
What I have is this:
VOID SETRTN_CONTEXT(CONTEXT * ctxt)
{
    ADDRINT reg_address;
    PIN_SaveContext(ctxt, &m_ctxt);
    reg_address = PIN_GetContextReg(&m_ctxt, REG_STACK_PTR);
}
and in another function I have this piece of code that calls that function:
for (rtn = SEC_RtnHead(sec); RTN_Valid(rtn); rtn = RTN_Next(rtn))
{
    RTN_Open(rtn);
    RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)SETRTN_CONTEXT,
                   IARG_CONST_CONTEXT, IARG_THREAD_ID, IARG_END);
    RTN_Close(rtn);
}
I am a little confused about when the routine calls that function, since I am only getting one result, and I get it after attaching with Pin and waiting a couple of seconds.
Any pinheads that might help me on this one? I understand that I need the context from a routine in order to get the registers but I cannot find any function that returns the context as an object...
In your RTN_InsertCall you pass the thread ID (IARG_THREAD_ID), but your SETRTN_CONTEXT function declaration doesn't receive it... you might want to fix that.
Also, in your analysis routine SETRTN_CONTEXT, you're not actually saving anything external to the application. I could be wrong if m_ctxt is a global variable that you're manipulating elsewhere, but how could that be sound unless you did it every time the analysis routine ran, and in a thread-safe way?
Clearly, you want to write the information to some file or output. I recommend using some kind of XML tool; this makes it easy to parse, and if you write your pintools smartly, you can swap the output format by obeying some interface contract.
Also, to clarify your confusion: you insert the analysis routine to run before every single function in a particular image; every time a function in that image is called, your SETRTN_CONTEXT runs.
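Putting those points together, a minimal sketch of the analysis routine might look like this (plain text output rather than XML for brevity; the output file name and the lock are hypothetical, and PIN_InitLock(&OutLock) is assumed to be called in main() before PIN_StartProgram()):

#include <fstream>
#include "pin.H"

static std::ofstream TraceFile("rtn_stack.out");  // hypothetical trace file
static PIN_LOCK OutLock;

// The parameter list must match the IARGs in RTN_InsertCall:
// IARG_CONST_CONTEXT -> const CONTEXT*, IARG_THREAD_ID -> THREADID.
VOID SETRTN_CONTEXT(const CONTEXT *ctxt, THREADID tid)
{
    ADDRINT sp = PIN_GetContextReg(ctxt, REG_STACK_PTR);
    PIN_GetLock(&OutLock, tid + 1);   // keep concurrent writes from interleaving
    TraceFile << "tid " << tid << " sp 0x" << std::hex << sp << std::dec << "\n";
    PIN_ReleaseLock(&OutLock);
}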

Memory leak checking using Instruments on Mac

I've just been pulling my hair out trying to make Instruments cough up my deliberately constructed memory leaks. My test example looks like this:
#include <cstdlib>
#include <unistd.h>

class Leaker
{
public:
    char *_array;
    Leaker()
    {
        _array = new char[1000];
    }
    ~Leaker()
    {
    }
};

void *leaker()
{
    void *p = malloc(1000);
    int *pa = new int[2000];
    {
        Leaker l;
        Leaker *pl = new Leaker();
    }
    return p;
}

int main(int argc, char **argv)
{
    for (int i = 0; i < 1000; ++i) {
        leaker();
    }
    sleep(2); // Needed to give Instruments a chance to poll memory
    return 0;
}
Basically, Instruments never found the obvious leaks. I was going nuts as to why, but then discovered "sec Between Auto Detections" in the "Leaks Configuration" panel under the Leaks panel. I dialed it back as low as it would go, which was 1 second, placed the sleep(2) in my code, and voila: leaks found!
As far as I'm concerned, a leak is a leak, regardless of whether it happens 30 minutes into an app or 30 milliseconds. In my case, I stripped the test case back to the above code, but my real application is a command-line application with no UI or anything and it runs very quickly; certainly less than the default 10 second sample interval.
Ok, so I can live with a couple of seconds upon exit of my app in instrumentation mode, but what I REALLY want is to simply have Instruments snapshot memory on exit, then do whatever it needs over time while the app is running.
So... the question is: Is there a way to make Instruments snapshot memory on exit of an application, regardless of the sampling interval?
Cheers,
Shane
Instruments, in Leaks mode, can be really powerful for leak tracing, but I've found that it's more biased towards event-based GUI apps than command-line programs (particularly those which exit after a short time). There used to be a CHUD API where you could programmatically control aspects of the instrumentation, but the last time I tried it the frameworks were no longer provided as part of the SDK. Perhaps some of this is now replaced with DTrace.
Also, ensure you're up to date with Xcode as there were some recent improvements in this area which might make it easier to do what you need. You could also keep the short delay before exit but make it conditional on the presence of an environment variable, and then set that environment variable in the Instruments launch properties for your app, so that running outside Instruments doesn't have the delay.
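A minimal sketch of that conditional delay, assuming the leaker() function from the question and a hypothetical LEAK_CHECK_DELAY variable set in the Instruments launch environment:

#include <cstdlib>
#include <unistd.h>

int main(int argc, char **argv)
{
    for (int i = 0; i < 1000; ++i) {
        leaker();
    }
    // Only pay the delay when profiling; normal runs exit immediately.
    if (std::getenv("LEAK_CHECK_DELAY") != NULL) {
        sleep(2);  // give Leaks a chance to take a final sample
    }
    return 0;
}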
Most unit testing code executes the desired code paths and exits. Although this is perfectly normal for unit testing, it creates a problem for the leaks tool, which needs time to analyze the process memory space. To fix this problem, you should make sure your unit-testing code does not exit immediately upon completing its tests. You can do this by putting the process to sleep indefinitely instead of exiting normally.
https://developer.apple.com/library/ios/documentation/Performance/Conceptual/ManagingMemory/Articles/FindingLeaks.html
I've just decided to leave the 2 second delay during my debug+leaking build.

unable to accumulate time using gprof - the gnu profiler

I am running Cygwin on Windows and using the latest version of gprof to profile my code. My problem is that the flat profile shows zero seconds for each of the functions in my code. I even tried looping the functions (a for loop with a million iterations), but gprof is unable to accumulate any time. Please help. Here is one of my sample functions:
bool is_adjacent(const char* a, const char* b)
{
    // The outer loop only repeats the work so gprof has something to measure.
    for (long long iter = 0; iter <= 1000000; iter++) {
        string line1 = "qwertyuiop";
        string line2 = "asdfghjkl";
        string line3 = "zxcvbnm";
        string line4 = "1234567890";
        string::size_type pos = line1.find(*a);
        if (pos != string::npos) {
            if ((line1[pos++] == *b) || ((pos != 0) && (line1[pos--] == *b)))
                return true;
            else return false;
        }
        pos = line2.find(*a);
        if (pos != string::npos) {
            if ((line2[pos++] == *b) || ((pos != 0) && (line2[pos--] == *b)))
                return true;
            else return false;
        }
        pos = line3.find(*a);
        if (pos != string::npos) {
            if ((line3[pos++] == *b) || ((pos != 0) && (line3[pos--] == *b)))
                return true;
            else return false;
        }
        pos = line4.find(*a);
        if (pos != string::npos) {
            if ((line4[pos++] == *b) || ((pos != 0) && (line4[pos--] == *b)))
                return true;
            else return false;
        }
    }
    return false;  // *a was not found in any row
}
I have that problem from time to time, especially in heavily threaded code.
You can use Valgrind with the callgrind tool (--tool=callgrind), which will at least give you a more detailed view of how much time is spent per function call. There's also a KDE tool called KCachegrind to better visualize the output (and especially the call graph). I don't know if you can install it on Cygwin, though.
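As a rough sketch (./foo is a hypothetical name for your executable; Callgrind writes its data to callgrind.out.<pid>):
$ valgrind --tool=callgrind ./foo
$ kcachegrind callgrind.out.<pid>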
If your overall goal is to find and remove performance problems, you might consider this.
I suspect it will show that essentially 100% of the CPU time is being spent in find and string-compare, leaving almost 0% for your code. That's what happens when only the program counter is sampled.
If you sample the call stack, you will see that the lines of code that invoke find and string-compare will be displayed on the stack with a frequency equal to the time they are responsible for.
That's the glory of gprof.
P.S. You could also figure this out by single-stepping the code at the disassembly level.
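One low-tech way to take such stack samples, as an illustration rather than a gprof feature, is to interrupt the program in a debugger a handful of times and read the backtraces (./foo is again a hypothetical executable name):
$ gdb ./foo
(gdb) run
^C            # interrupt at a random moment while it is running
(gdb) bt      # callers that show up in most samples account for most of the time
(gdb) continue
Repeat the interrupt/backtrace cycle a few times and look for lines that keep appearing.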
What version of gprof are you using? Some old versions have this exact bug.
Run gprof --version and tell us the results.