Benchmarking Code Runtime with Trace32

I have an embedded system with code that I'd like to benchmark. In this case, there's a single line I want to know the time spent on (it's the creation of a new object that kicks off the rest of our application).
I'm able to open Trace->Chart->Symbols and see the time taken for the region selected with my cursor, but this is cumbersome and not as accurate as I'd like. I've also found Perf->Function Runtime, but I'm benchmarking the assignment of a new object, not any particular function call (new is called in multiple places, not just at the line of interest).
Is there a way to view the real-world time taken on a line of code with Trace32? Going further than a single line: would there be a way to easily benchmark the time between two breakpoints?

The solution by codehearts, which uses the RunTime commands, is just fine if you don't have a real-time trace. It works with any Lauterbach tool and any target CPU.
However, if you have a real-time trace (e.g. a CPU with ETM and Lauterbach PowerTrace hardware), I recommend using the command Trace.STATistic.AddressDURation <start-addr> <end-addr> instead. This command opens a window which shows the average time between the two addresses. You get the best results if you execute the code between the two addresses several times.
If you are using an ARM Cortex CPU that supports cycle-accurate timing information (usually all Cortex-A, Cortex-R and Cortex-M7), you can improve the accuracy of the result dramatically with the setting ETM.TImeMode.CycleAccurate (together with ETM.CLOCK <core-frequency>).
If you are using a Lauterbach CombiProbe or uTrace (and you can't use ETM.TImeMode.CycleAccurate), I recommend the setting Trace.PortFilter.ON. (By default the port filter is set to PACK, which allows more data and program flow to be recorded, but with slightly worse timing accuracy.)

Opening the Misc->Runtime window shows you the total time taken since "laststart." By setting a breakpoint on the first line of your code block and another after the last line, you can see the time taken from the first breakpoint to the second under the "actual" column.

Related

Can I use temporary breakpoints to implement line step of a remote target

I'm working on an implementation of a gdb server. The server talks with gdb using the RSP protocol and drives a CPU model. While developing the line-step function (with range-step mode enabled), I found that my CPU model needs some time to finish even a single instruction step. The reason is that the CPU model runs in a separate process, so I have to pass each packet (instruction step) through IPC. With thousands of instructions in a source line, this takes far too long compared with running that line normally.
Can I ask gdb to use temporary breakpoints (set on any instruction that could leave the range corresponding to the source line) to assist with the step functionality? Does gdb really know where to set the required breakpoints? If not, is there a good way to deal with this problem?
Thank you for your time!
Ted CH
You say that you are using range-stepping mode, which would imply your gdb server already supports the vCont packet, and the r action. See this page for all the details.
The r action gives a range start and end, and you are free to step until the program counter leaves the specified range.
Your server is free to implement this as you like, so you could place a temporary breakpoint, then resume execution until the breakpoint is hit. But you need to think about both control flow and traps, either of which might cause a thread to leave the range without passing through the end point.
If passing each single step through to your simulator is slow, could you not just pass the vCont range information directly through to the simulator, and have the simulator stop as soon as the program counter leaves the range? Surely this would be quicker than having multiple round trips between simulator and server.
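To make this concrete, here is a minimal sketch (not gdb's or any real server's code) of a range-step loop for the r action; read_pc() and step_one() are hypothetical stand-ins for whatever interface your CPU model exposes, and the same loop can live inside the simulator itself to avoid per-instruction IPC:

```cpp
// Minimal sketch of handling a vCont "r<start>,<end>" range-step action in a
// gdb server that drives a CPU model. The tiny in-process "model" below is a
// placeholder for the real simulator behind IPC.
#include <cstdint>
#include <cstdio>

static uint64_t pc = 0x1000;              // stand-in for the model's program counter

static uint64_t read_pc()  { return pc; } // assumed: query the CPU model
static void     step_one() { pc += 4; }   // assumed: execute exactly one instruction

// Keep stepping until the program counter leaves [start, end).
// A real server must also stop on traps, signals, and breakpoints hit inside
// the range, and report the final stop reason back to gdb.
static void handle_range_step(uint64_t start, uint64_t end) {
    do {
        step_one();
    } while (read_pc() >= start && read_pc() < end);
}

int main() {
    handle_range_step(0x1000, 0x1020);
    std::printf("stopped at pc=0x%llx\n", (unsigned long long)read_pc());
}
```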

Why does my program run faster on first launch than on next launches?

I have been working for 2.5 years on a personal flight sim project in my leisure time, written in C++ and using OpenGL on a Windows 7 PC.
I recently had to move to Windows 10. The hardware is exactly the same. I reinstalled Code::Blocks.
It turns out that on the first launch of my project after the system starts, performance is OK, similar to what I used to see with Windows 7. But the second, third, and all subsequent launches give me lower performance, with significantly less fluidity in frame rate compared to the first run, noticeable by eye. This never happened with Windows 7.
Any time I start my system, the first run is fast and the next ones are slower.
I had a look at the Task Manager while doing some runs. The first run is handled by one of the 4 cores of my CPU (Core i5-6500) at approximately 85%. For the next runs, the load is spread across the 4 cores. During those slower runs on 4 cores, I tried to modify the affinity and direct my program to only one core, without significant improvement in performance. The selected core was working at full load, though.
My C++ code doesn't explicitly use any threading functions at this stage. From my modest programmer's point of view, there is only one main thread, run in main(). In the Task Manager, I can see that some 10 to 14 threads are alive when my program runs. I guess (wrongly?) that they are implicitly created by the use of joysticks, TrackIR or other communication tasks with the GPU...
Could it come from memory not being correctly freed when my program stops? I thought Windows would free it properly, even if I forgot some 'delete' after using 'new'.
Has anyone encountered a similar situation? Any explanation coming to your minds?
Any suggestions to better understand these facts? Obviously, my ultimate goal is to have a consistent performance level regardless of the number of launches.
Screenshot of the second run as viewed in Task Manager:
Screenshot of the first run as viewed in Task Manager:
Well, I got problems when switching to Windows 10 for clients at my work too. Here are a few I encountered, all because Windows 10 has changed the scheduling of processes, creating a lot of issues like:
1. Lock-free (blockless) thread synchronization techniques from older Windows versions no longer work.
A well-placed Sleep() sometimes helps. By the way, similar problems were encountered when switching from Windows 2000 to Windows XP.
2. Huge slowdowns and frequent freezes of a few seconds in older single-threaded apps.
Usually setting the affinity to a single core solves this. You can do this in Task Manager just to check, and if it helps you can do it in code too. A sketch of how to do it with the WinAPI follows after this list; see also:
Cache size estimation on your system?
3. Messed-up driver timings causing zombie processes, even total freezes and/or BSODs.
I deal with USB in my work and it's sometimes a nightmare on Windows 10. On top of that, Windows 10 tends to force the wrong drivers onto devices (like graphics cards, custom USB systems, etc ...).
4. Apps are automatically frozen/closed if they do not respond to their WndProc in time.
In Windows 10 the timeout is much, much smaller than in older versions. If this is your case, you can try running in compatibility mode for an older Windows version (set in the icon properties on the desktop), although that does not help with #1 and #2, or change the app's code to speed up its response. For example, in VCL you can call ProcessMessages from inside blocking code to remedy this... Or you can use threads for the heavy lifting ... just be careful with rendering and the WinAPI, as accessing some WinAPI functions (any window/visual related stuff) from outside the main thread causes havoc ...
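As referenced in item #2, here is a minimal sketch (my own, not taken from the linked answer) of pinning a process to a single core with the Win32 API:

```cpp
// Sketch: restrict the current process to a single logical CPU, as suggested
// for item #2 above. Use SetThreadAffinityMask(GetCurrentThread(), 1) instead
// if you only want to pin one thread.
#include <windows.h>

int main() {
    // Mask 0x1 = logical CPU 0 only.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 0x1)) {
        // GetLastError() tells you why the call failed.
        return 1;
    }
    // ... run the timing-sensitive workload here ...
    return 0;
}
```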
On top of all this, old IDEs (especially for MCUs) don't work properly anymore, and new ones are usually much worse to work with (or unusable because of missing functionality that was present in older versions), so I stayed faithful to Windows 7 for development purposes.
If none of the above helps, then try logging the times some of your tasks need ... it might show you which part of the code is the problem. I usually do this using a timing graph like this:
Both the x and y axes are time, each task has its own color and row in the graph, the graph scrolls in time (to the left in my case), and the time scale is changeable. The numbers show the actual and max (or sliding average) values ...
This way I can see whether some task is taking too much time or even overlapping its next execution. Peaks are also nicely visible, and it all runs at runtime without any debug tools, which might otherwise change the behavior of the execution.
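A minimal sketch (my own, not the answerer's tool) of logging per-task durations that could feed such a graph:

```cpp
// Log "task-name duration" pairs so they can later be plotted as a timing
// graph: one row/color per task, time along the axis.
#include <chrono>
#include <cstdio>

struct TaskTimer {
    const char* name;
    std::chrono::steady_clock::time_point start;

    explicit TaskTimer(const char* n)
        : name(n), start(std::chrono::steady_clock::now()) {}

    ~TaskTimer() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        // Append to a simple log; plot or eyeball it afterwards.
        std::fprintf(stderr, "%s %lld us\n", name, static_cast<long long>(us));
    }
};

void render_frame() {
    TaskTimer t("render_frame");   // duration is logged on scope exit
    // ... work for this task ...
}

int main() {
    for (int i = 0; i < 3; ++i) render_frame();
}
```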

Viewing the contents of a variable in debug mode? (Other than with break points)

I'm writing a program which calculates the Mandelbrot Set (and then renders it in OpenGL under Windows) in order to utilise parallel programming techniques.
I'm supposed to demonstrate the use of threads, mutexes, and semaphores; so at the moment I'm calculating the set using multiple threads (splitting the set up horizontally), timing each thread, then adding the time to a total (the total is a global variable protected by a mutex).
I'd like to be able to view the total in debug mode - is there any relatively simple way to do this, other than rendering the total in the OpenGL window, or checking the contents of the variable with break points?
If you're on Windows you could use OutputDebugString and view the results with a tool called DebugView. The downside is that it will print each value on a new line instead of updating it in place (which I guess is what you would prefer).
If you want to view a value that will be updated in-place, you could probably use Performance Counters, but it's much more of a hassle: First, your program would have to implement a provider. And second, you'll have to write another program (a consumer) to track this counter and display it. But if you want maximum flexibility, this API is great, since it means many programs can observe the provider's counters, and they can, for example, be logged to a file and replayed or turned into a graph.
The easiest way is to output a message to the debug stream and then view it using your IDE.
Under Windows you can use:
OutputDebugString(LPCTSTR lpOutputString);
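For example, a minimal sketch (assuming the question's total lives in a mutex-protected global, here called g_total and g_total_mutex) of formatting the value and sending it to the debug stream, where the IDE's output window or DebugView can show it:

```cpp
#include <windows.h>
#include <cstdio>
#include <mutex>

double g_total = 0.0;        // placeholder for the accumulated timing total
std::mutex g_total_mutex;    // placeholder for the mutex protecting it

void report_total() {
    std::lock_guard<std::mutex> lock(g_total_mutex);
    char buf[64];
    std::snprintf(buf, sizeof(buf), "total = %f\n", g_total);
    OutputDebugStringA(buf); // ANSI variant, so a plain char buffer works
}
```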
You should be able to read the global variable from the debugger. Have you tried?

Accurate benchmark under windows

I am writing a program that needs to run a set of executables and find their execution times.
My first approach was just to run a process, start a timer and see the difference between the start time and the moment when process returns exit value.
Unfortunately, this program will not run on a dedicated machine, so many other processes/threads can greatly change the execution time.
I would like to get the time in milliseconds/clock ticks that was actually given to the process by the OS. I hope that Windows stores that information somewhere, but I cannot find anything useful on MSDN.
Sure, one solution is to run the process multiple times and calculate the average time, but I want to avoid that.
Thanks.
You can take a look at the GetProcessTimes API.
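A minimal sketch of what that might look like (applied here to the current process for brevity; in your tool you would pass the child's handle returned by CreateProcess):

```cpp
// Read the kernel and user CPU time actually charged to a process with
// GetProcessTimes, independent of other system load.
#include <windows.h>
#include <cstdio>

static double filetime_to_ms(const FILETIME& ft) {
    ULARGE_INTEGER v;
    v.LowPart  = ft.dwLowDateTime;
    v.HighPart = ft.dwHighDateTime;
    return v.QuadPart / 10000.0;   // FILETIME counts 100 ns units
}

int main() {
    FILETIME creation, exitTime, kernel, user;
    if (GetProcessTimes(GetCurrentProcess(), &creation, &exitTime, &kernel, &user)) {
        std::printf("kernel: %.3f ms, user: %.3f ms\n",
                    filetime_to_ms(kernel), filetime_to_ms(user));
    }
    return 0;
}
```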
The "High-Performance Counter" might be what you're looking for.
I've used QueryPerformanceCounter/QueryPerformanceFrequency for high-resolution timing in stuff like 3D programming where the stock functionality just doesn't cut it.
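For reference, a small sketch of the QueryPerformanceCounter/QueryPerformanceFrequency pattern; note that it measures elapsed wall-clock time, not the CPU time the OS actually gave the process:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);   // ticks per second
    QueryPerformanceCounter(&start);
    // ... code being benchmarked goes here ...
    QueryPerformanceCounter(&end);
    double ms = 1000.0 * (end.QuadPart - start.QuadPart) / freq.QuadPart;
    std::printf("elapsed: %.3f ms\n", ms);
    return 0;
}
```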
You could also try the RDTSC x86 instruction.

How to use gprof to profile a daemon process without terminating it gracefully?

I need to profile a daemon written in C++, but gprof says it needs the process to terminate in order to produce gmon.out. I'm wondering if anyone has ideas on how to get gmon.out with Ctrl-C? I want to find the hot spots for CPU cycles.
Need to profile a daemon written in C++; gprof says it needs the process to terminate to produce gmon.out.
That fits the normal practice of debugging daemon processes: provision a switch (e.g. a command-line option) which forces the daemon to run in the foreground.
I'm wondering if anyone has ideas on how to get gmon.out with Ctrl-C?
I'm not aware of such options.
Though in the case of gmon, a call to exit() should suffice: if, for example, you intend to test processing of, say, 100K messages, you can add a counter in the code that is incremented on every processed message. When the counter exceeds the limit, simply call exit().
You can also try adding a handler for some unused signal (like SIGUSR1 or SIGUSR2) and calling exit() from there. Though I do not have personal experience with this and cannot be sure that gmon would work properly in that case.
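A minimal sketch of that signal-handler idea (with the same caveat that I cannot guarantee gprof behaves well when the process is terminated this way):

```cpp
// Install a SIGUSR1 handler that calls exit(), giving the gprof runtime a
// chance to write gmon.out via its atexit hook. Calling exit() from a signal
// handler is not strictly async-signal-safe, but is usually acceptable for a
// one-off profiling run.
#include <csignal>
#include <cstdlib>
#include <unistd.h>

extern "C" void dump_profile_and_quit(int) {
    std::exit(0);   // runs atexit handlers, where gmon.out is written
}

int main() {
    std::signal(SIGUSR1, dump_profile_and_quit);
    for (;;) {
        // ... daemon main loop; trigger the dump with: kill -USR1 <pid> ...
        pause();    // wait for signals
    }
}
```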
I want to find the hot spots for CPU cycles.
My usual practice is to create a test application, using the same source code as the daemon but a different main(), in which I simulate the precise scenario I need to debug or test (often many scenarios, selected with a command-line switch). For this purpose, I normally create a static library containing the whole module - except the file with main() - and link the test application against the static library. (That helps keep the Makefiles tidy.)
I prefer the separate test application to hacks inside the code since, especially in the case of performance testing, I can sometimes bypass or reduce calls to expensive I/O (or DB accesses), which often skews the profiler's sampling and renders the output useless.
As a first suggestion, I would say you might try another tool. If the performance of the daemon is not an issue in your test, you could give valgrind a try. It is a wonderful tool; I really love it.
If you want to make the daemon go as fast as possible, you can use lsstack with this technique. It will show you what's taking time that you can remove. If you're looking for hot spots, you are probably looking for the wrong thing. Typically there are function calls that are not absolutely needed, and those don't show up as hot spots, but they do show up on stackshots.
Another good option is RotateRight/Zoom.