I have been doing some profiling of a physics application I wrote, and I've noticed when I profile it, it runs faster and perhaps smoother than without the profiler. Note that I am NOT running the program in the debug configuration or with the debugger attached.
I measured the difference, and I found program runs ~50% faster under the profiler. I don't consider this a duplicate because the other question doesn't make it clear whether he/she was running it with the debugger attached, and the top answer assumes that's the case (And the 20x speedup strongly indicates it would be the correct answer in most cases).
Another answer suggests a "heisenburg" bug, but that's kind of a catchall hypothesis (I'm still going to investigate down this line).
Is it possible that Visual Studio does something that prevents other applications from interfering with my application's compute or memory resources (in order to get a "fairer" result)?
Visual Studio's "CPU Usage" profiler appears to disregard laptop power usage settings, so if you run an application on a laptop that is trying to conserve battery power, it will run slower than if you run the profiler on it.
I discovered this when I got home from work- I noticed the speed difference had disappeared. On a hunch, I unplugged my laptop and tried the test several more times. The speed difference returned. What's more, under the profiler, the application runs at about the same speed plugged in or not.
I was not able to find any sources on this, but I'll be happy to edit them in if someone can find some.
If you use threading in your code, this can be caused due to the System Timer Resolution in Windows.
Default windows timer resolution is 15.6ms
When you are running the profiler, this is reduced to 1ms and your program run fast.
Checkout this answer
Related
I have been working for 2.5 years on a personal flight sim project on my leisure time, written in C++ and using Opengl on a windows 7 PC.
I recently had to move to windows 10. Hardware is exactly the same. I reinstalled Code::blocks.
It turns out that on first launch of my project after the system start, performance is OK, similar to what I used to see with windows 7. But, the second, third, and all next launches give me lower performance, with significant less fluidity in frame rate compared to the first run, detectable by eye. This never happened with windows 7.
Any time I start my system, first run is fast, next ones are slower.
I had a look at the task manager while doing some runs. The first run is handled by one of the 4 cores of my CPU (iCore5-6500) at approximately 85%. For the next runs, the load is spread accross the 4 cores. During those slower runs on 4 cores, I tried to modify the affinity and direct my program to only one core without significant improvement in performance. The selected core was working at full load, though.
My C++ code doesn't explicitly use any thread function at this stage. From my modest programmer's point of view, there is only one main thread run in the main(). On the task manager, I can see that some 10 to 14 threads are alive when my program runs. I guess (wrongly?) that they are implicitly created by the use of joysticks, track ir or other communication task with GPU...
Could it come from memory not being correctly freed when my program stops? I thought windows would free it properly, even if I forgot some 'delete' after using 'new'.
Has anyone encountered a similar situation? Any explanation coming to your minds?
Any suggestion to better understand these facts? Obviously, my ultimate goal is to have a consistent performance level whatever the number of launches.
trying to upload screenshots of second run as viewed by task manager:
trying to upload screenshots of first run as viewed by task manager:
Well I got a problems when switching to win10 for clients at my work too here few I encountered all because Windows10 has changed up scheduling of processes creating a lot of issues like:
older windowses blockless thread synchronizations techniques not working anymore
well placed Sleep() helps sometimes. Btw. similar problems was encountered when switching from w2k to wxp.
huge slowdowns and frequent freezes for few seconds on older single threaded apps
usually setting affinity to single core solves this. You can do this also in task manager just to check and if helps then you can do this in code too. Here an example on how to do it with winapi:
Cache size estimation on your system?
messed up drivers timings causing zombies processes even total freeze and or BSOD
I deal with USB in my work and its a nightmare sometimes on win10. On top of all this Win10 tends to enforce wrong drivers on devices (like gfx cards, custom USB systems etc ...)
auto freeze close app if it does not respond the wndproc in time
In Windows10 the timeout is much much smaller than in older versions. If the case You can try running in compatibility mode (set in icon properties on desktop) for older windows (however does not work for #1,#2), or change the apps code to speed up response. For example in VCL you can call Process Messages from inside of blocking code to remedy this... Or you can use threads for the heavy lifting ... just be careful with rendering and using winapi as accessing some winapi (any window/visual related stuff) functions from outside main thread causes havoc ...
On top of all this old IDEs (especially for MCUs) don't work properly anymore and new ones are usually much worse (or unusable because of lack of functionality that was present in older versions) to work with so I stayed faith full to Windows7 for developer purposes.
If none of above helps then try to log the times some of your task did need ... it might show you which part of code is the problem. I usually do this using timing graph like this:
both x,y axises are time and each task has its own color and row in graph. the graph is scrolling in time (to the left side in my case) and has changeable time scale. The numbers are showing actual and max (or sliding avg) value ...
This way I can see if some task is not taking too much time or even overlaps its next execution, peaks are also nicely visible and all runs during runtime without any debug tools which might change the behavior of execution.
I created a program that sends a bunch of text in to the COM port on the speed 9600. In debug mode the program sends all the data and than the COM port is closed. Bit if I create a installer project and than install it on the same machine it doesn't send the last symbols. It closes the port earlier than all data is transmitted. So my question is: Is the debug exe file slower or is it somehow slowed down by the IDE (Visual Studio)?
Also adding a Sleep(100); between the last transition command and the port closing line the problem disappeared.
Your observations correspond with a program written incorrectly, showing signs of a bug that is currently only detectable with a release build.
Release builds tend to run faster, as they are created at an enhanced optimisation level and with various debugging utilities turned off. They are designed for maximum production performance at the expense of development-time debuggability. Creating an installation package evidently created a release build (which makes sense).
This increased performance in turn affects the timing of your program. If you were accidentally relying on there being a long duration before the port was closed, giving your program enough time to transmit all its data only by chance, then when this process speeds up your bug becomes observable. There is no longer enough time for the data to get through. Adding a Sleep simulates the slower execution of the debug build, thus almost certainly confirming the existence of a timing bug.
This is good news! You have strong evidence of where the bug is and the form it takes. Now all you have to do is fix it!
basically, I have a native C++ application which obviously crashed from time to time during code development. After fixing some bugs it runs rather smoothly. However, I noticed a rather weird behaviour which is very confusing to me.
It is a CMake-based project and I have an option of packaging the application into a zip file. When I run the executable from a freshly extracted zip - the application runs normally (and performance is high). However, say, if it crashes couple of times - the next start the behaviour is the same, but everything runs terribly slow. The test case I have (albeit manual) is very straightforward and I can repeat it easily.
Once the application is slowed down all I have to do is to rename the folder the application is located in and the performance rises again. Until it has crashed couple of times.
Obviously, this is not a problem once I get rid of the crashes. However, it seems to me that once the system (Win8.1 x64) detects persistent crashes it runs it in some different mode(?). And it looks like the app that I run from the IDE (MSVS2013) has received this "penalty". So it runs terribly slow even in release mode, let alone debug.
So I have a couple of questions to the community. First, have you ever observed such behaviour? Second, is the reason for slowdown indeed some system-wide information stored upon application crash? And finally, do you have an idea how to fix it?
Note: I could probably move the source folder to a different location, but it would require quite a lengthy rebuild step. Moreover, I'm just curious about what is going on. Also, system restart and full rebuild in the same folder didn't help.
Some context: the application uses Qt4 (also, Qt plugin system), MITK, ITK, VTK, CTK, OpenCASCADE, glew and related libraries all build from source on this machine.
Cheers,
Rostislav.
Pretty much a year after asking this question, and having the problem "self-resolve", I think I found the answer and decided to post it here in case someone would run into similar problem.
It seems to me that my application was put by the system into the fault-tolerant heap (FTH) after several crashes. Some more information on FTH can be found on MSDN. It might make sense to disable the FTH on the developer machine (at least temporarily) when debugging a particularly nasty crash.
The actual question: what are the differences between running in Visual C++ 2010 (both release and debug mode) and standalone, excluding anything that couldn't cause the problem stated at the bottom of this post?
I have a very specific problem with my program, so I am not going to post the code. If you would like to know details of the problem to gain some understanding, I post them at the bottom. Instead, I ask: what are the differences in running in visual c++ and standalone?
I am using Visual C++ 2010, and my program is using SFML 2.0. It runs fine in studio (I am calling it studio because it is easier to type), but when run standalone on some computers only a bug will occur within the program, to do with movement being delayed but other parts not. I cannot find any links between the computer specs of the users who test it and whether they work or not.
All dlls and such are included (at least, I think they are - the program runs fine, as detailed at the bottom - maybe some users have some framework installed?). On my computer, it works all the time in studio but not standalone. On my netbook, it works, on my older computer, it doesn't. The only project settings I changed with vc++ directories and linker settings.
Problem details:
All the sounds sound slightly different and there is a possible high pitched background noise. Character movement, enemy movement and space bar delay which is supposed to be 0.5 seconds is increased to about 5 seconds but "generating pixels" (an aspect in the game with the arrow keys isn't.
I'm stressing again that it only happens on SOME COMPUTERS, mine included, but always works when run within studio. A number of people have tested the game, and some reported the problem and others different.
[NOT THE MAIN POINT OF THE QUESTION BUT HERE IS AN EXTRA BIT: If you are so willing to help me that you wish to look through the entire source code, it's here: http://dl.dropbox.com/u/53835113/EVERYTHING.zip [note: ignore the massive amount of bugs and memory leaks other than the problem described :P]]
EDIT: THE POINT ISN'T THAT I WANT MY PROGRAM DEBUGGED! That is an optional extra. The question: what are the differences in running in visual c++ and standalone?
More edits: Found that the problem is due to the frame time not being correct. Not sure if this provides some insight.
Thanks to anyone who tries to help me!
GetFrameTime() in SFML was removed in a recent version because it returns the time since last frame rather than the time since last update. As a result, the it may return 0.
short version:
Is there a good time based sampling profiler for Linux?
long version:
I generally use OProfile to optimize my applications. I recently found a shortcoming that has me wondering.
The problem was a tight loop, spawning c++filt to demangle a c++ name. I only stumbled upon the code by accident while chasing down another bottleneck. The OProfile didn't show anything unusual about the code so I almost ignored it but my code sense told me to optimize the call and see what happened. I changed the popen of c++filt to abi::__cxa_demangle. The runtime went from more than a minute to a little over a second. About a x60 speed up.
Is there a way I could have configured OProfile to flag the popen call? As the profile data sits now OProfile thinks the bottle neck was the heap and std::string calls (which BTW once optimized dropped the runtime to less than a second, more than x2 speed up).
Here is my OProfile configuration:
$ sudo opcontrol --status
Daemon not running
Event 0: CPU_CLK_UNHALTED:90000:0:1:1
Separate options: library
vmlinux file: none
Image filter: /path/to/executable
Call-graph depth: 7
Buffer size: 65536
Is there another profiler for Linux that could have found the bottleneck?
I suspect the issue is that OProfile only logs its samples to the currently running process. I'd like it to always log its samples to the process I'm profiling. So if the process is currently switched out (blocking on IO or a popen call) OProfile would just place its sample at the blocked call.
If I can't fix this, OProfile will only be useful when the executable is pushing near 100% CPU. It can't help with executables that have inefficient blocking calls.
Glad you asked. I believe OProfile can be made to do what I consider the right thing, which is to take stack samples on wall-clock time when the program is being slow and, if it won't let you examine individual stack samples, at least summarize for each line of code that appears on samples, the percent of samples the line appears on. That is a direct measure of what would be saved if that line were not there. Here's one discussion. Here's another, and another. And, as Paul said, Zoom should do it.
If your time went from 60 sec to 1 sec, that implies every single stack sample would have had a 59/60 probability of showing you the problem.
Try Zoom - I believe it will let you profile all processes - it would be interesting to know if it highlights your problem in this case.
I wrote this a long time ago, only because I couldn't find anything better: https://github.com/dicej/profile
I just found this, too, though I haven't tried it: https://github.com/oliver/ptrace-sampler
Quickly hacked up trivial sampling profiler for linux: http://vi-server.org/vi/simple_sampling_profiler.html
It appends backtrace(3) to a file on SIGUSR1, and then converts it to annotated source.
After trying everything suggested here (except for the now-defunct Zoom, which is still available as huge file from dropbox), I found that NOTHING does what Mr. Dunlavey recommends. The "quick hacks" listed above in some of the answers wouldn't build for me, or didn't work for me either. Spent all day trying stuff... and nothing could find fseek as a hotspot in an otherwise simple test program that was I/O bound.
So I coded up yet another profiler, this time with no build dependencies, based on GDB, so it should "just work" for almost any debuggable code. A single CPP file.
https://github.com/jasonrohrer/wallClockProfiler
It automates the manual process suggested by Mr. Dunlavey, interrupting the target process with GDB periodically and harvesting a stack trace, and then printing a report at the end about which stack traces are the most common. Those are your true wall-clock hotspots. And it actually works.