I am profiling my C++ code on a purpose-built machine running Linux. The machine is re-installed from time to time, which is outside of my control, but I suspect that the file system gradually fills up during the two weeks or so between cleanups, and that this impacts my profiling measurements.
I have noticed that my profiling results get worse over time and then return to normal when the machine is cleaned up. This led to further investigation, and I can see that std::fopen takes ~10 times longer to execute the day before a cleanup compared to the day after.
Is it expected that std::fopen execution time depends on what is stored in the file system? The file I'm opening is located in a directory which is always empty when I start my test case. Could there be some search involved regardless when running std::fopen, or why else would the execution time vary so much?
std::fopen after cleanup : 0.000098141 seconds
std::fopen before cleanup: 0.000940125 seconds
I'm running a reasonably recent GCC. The machine has an ARM architecture.
I'm profiling a binary on CentOS 7.6 using VTune. I have yet to find the function (in the VTune output) which is creating tens of thousands of symbolic file system links, and there is another one reading many such links, which I also cannot find. I used both:
basic hotspot analysis
locks and waits
Does the "basic hotspot analysis" measure only user CPU time and exclude system CPU time?
Where can one find the actual time spent (not CPU time) inside a function?
Hotspot analysis includes system CPU time too. To be clear, the total time (real CPU time) is the combination of the time the CPU spends executing the program itself and the time it spends performing system calls in the kernel on the program's behalf.
Actual time spent inside a function: after running a hotspots analysis, go to the Bottom-up tab in the VTune GUI to view the execution times of all functions in the user code. Double-click any function to view its execution time.
I have been working for 2.5 years on a personal flight sim project in my leisure time, written in C++ and using OpenGL on a Windows 7 PC.
I recently had to move to Windows 10. The hardware is exactly the same. I reinstalled Code::Blocks.
It turns out that on the first launch of my project after system start, performance is OK, similar to what I used to see with Windows 7. But the second, third, and all subsequent launches give me lower performance, with significantly less fluidity in the frame rate compared to the first run, detectable by eye. This never happened with Windows 7.
Every time I start my system, the first run is fast and the next ones are slower.
I had a look at the Task Manager while doing some runs. The first run is handled by one of the 4 cores of my CPU (Core i5-6500) at approximately 85%. For the next runs, the load is spread across the 4 cores. During those slower runs on 4 cores, I tried to modify the affinity and direct my program to only one core, without significant improvement in performance. The selected core was working at full load, though.
My C++ code doesn't explicitly use any threading at this stage. From my modest programmer's point of view, there is only one main thread running in main(). In the Task Manager, I can see that some 10 to 14 threads are alive when my program runs. I guess (wrongly?) that they are implicitly created by the use of joysticks, TrackIR, or other communication tasks with the GPU...
Could it come from memory not being correctly freed when my program stops? I thought Windows would free it properly, even if I forgot some 'delete' after using 'new'.
Has anyone encountered a similar situation? Any explanation coming to your minds?
Any suggestions to better understand these facts? Obviously, my ultimate goal is to have a consistent performance level regardless of the number of launches.
Well, I ran into problems when switching clients to Windows 10 at my work too. Here are a few I encountered, all because Windows 10 changed process scheduling, which creates a lot of issues like:
lock-free thread synchronization techniques from older Windows versions no longer working
a well-placed Sleep() sometimes helps. Btw, similar problems were encountered when switching from W2K to WXP.
huge slowdowns and frequent freezes of a few seconds in older single-threaded apps
usually, setting the affinity to a single core solves this. You can do this in the Task Manager just to check, and if it helps, you can do it in code too with the WinAPI (see also the linked answer "Cache size estimation on your system?").
messed-up driver timings causing zombie processes, even total freezes and/or BSODs
I deal with USB in my work and it's sometimes a nightmare on Win10. On top of all this, Win10 tends to force the wrong drivers onto devices (like GFX cards, custom USB systems, etc.)
auto freeze/close of an app if it does not respond to its wndproc in time
In Windows 10 the timeout is much, much smaller than in older versions. If this is your case, you can try running in compatibility mode (set in the icon's properties on the desktop) for an older Windows (however, that does not help for #1 and #2), or change the app's code to speed up its response. For example, in VCL you can call ProcessMessages from inside the blocking code to remedy this... Or you can use threads for the heavy lifting... just be careful with rendering and the WinAPI, as accessing some WinAPI functions (any window/visual-related stuff) from outside the main thread causes havoc...
On top of all this, old IDEs (especially for MCUs) don't work properly anymore, and new ones are usually much worse to work with (or unusable because of missing functionality that was present in older versions), so I stayed faithful to Windows 7 for development purposes.
If none of the above helps, then try to log the times some of your tasks need... it might show you which part of the code is the problem. I usually do this using a timing graph like this:
both the x and y axes are time, and each task has its own color and row in the graph. The graph scrolls in time (to the left in my case) and has a changeable time scale. The numbers show the actual and max (or sliding average) values...
This way I can see whether some task is taking too much time or even overlapping its next execution; peaks are also nicely visible, and all of it runs at runtime without any debug tools, which might change the behavior of the execution.
I created a program that sends a bunch of text to the COM port at speed 9600. In debug mode the program sends all the data and then the COM port is closed. But if I create an installer project and then install it on the same machine, it doesn't send the last symbols: it closes the port before all the data has been transmitted. So my question is: is the debug exe file slower, or is it somehow slowed down by the IDE (Visual Studio)?
Also, after adding a Sleep(100); between the last transmission command and the line that closes the port, the problem disappeared.
Your observations are consistent with a program written incorrectly: signs of a bug that is currently only detectable in a release build.
Release builds tend to run faster, as they are created at an enhanced optimisation level and with various debugging utilities turned off. They are designed for maximum production performance at the expense of development-time debuggability. Creating an installation package evidently created a release build (which makes sense).
This increased performance in turn affects the timing of your program. If you were accidentally relying on there being a long duration before the port was closed, giving your program enough time to transmit all its data only by chance, then when this process speeds up your bug becomes observable. There is no longer enough time for the data to get through. Adding a Sleep simulates the slower execution of the debug build, thus almost certainly confirming the existence of a timing bug.
This is good news! You have strong evidence of where the bug is and the form it takes. Now all you have to do is fix it!
Since our build on the build server is getting slower and slower, I tried to find out what the cause could be. It seems to be mostly hanging in the disk IO operations on the .tlog files, since there is no CPU load and yet the build hangs. Even for a project containing only 10 cpp files, it generates ~5500 rows in the CL.read.1.tlog file.
The suspicious thing is that the file contains the same headers over and over, especially Boost headers, which make up about 90% of the file.
Is it really expected behavior that these files are so big and have redundant content, or is the problem maybe triggered by our source code? Could cyclic includes or too many header includes cause this problem?
Update 1
After all the comments I'll try to clarify some more details here.
We are only using Boost by including the headers and linking the already-compiled libs; we are not compiling Boost itself.
Yes, an SSD is always a nice improvement, but the build server is hosted by our IT and no SSDs are available there (see the points below).
I checked some performance counters, especially via perfmon, during compilation. Whereas the CPU and memory load are negligible most of the time, the disk IO counters and queue sizes are high the whole time. Disk Activity - Highest Active Time is constantly at 100%, and if I sort by Total (B/sec), the list is full of tlog files which read/write a lot of data to disk.
Even if 5500 lines of tlog seem okay in some cases, I wonder why the exact same Boost headers are contained over and over. Here is a log file where I censored our own headers.
There is no antivirus influence. I stopped it for my investigations, since we know that it influences our compilation even more.
On my local developer machine with an SSD it takes ~16min to build our whole solution, whereas on our build server with a "slower" disk it takes ~2hrs. CPU and memory are comparable. The 5500-line file was just an example from a single project within a solution of 20-30 projects. We have a project with ~30MB tlog files of ~60,000 lines; this project alone takes half of the compilation time.
Of course there is some basic CPU load on the machine during compilation, but it is not comparable to other developer machines with SSDs.
Our .NET solution with 45 projects finishes in 12min (including a setup project with WiX).
As developer machines with SSDs see a reduction from 2hrs to 16min with a comparable CPU/memory configuration, my assumption for the bottleneck was always the hard disk. Checking for disk-related operations led me to the tlog files, since they caused the highest disk activity according to perfmon.
I have an embedded system with code that I'd like to benchmark. In this case, there's a single line I want to know the time spent on (it's the creation of a new object that kicks off the rest of our application).
I'm able to open Trace->Chart->Symbols and see the time taken for the region selected with my cursor, but this is cumbersome and not as accurate as I'd like. I've also found Perf->Function Runtime, but I'm benchmarking the assignment of a new object, not of any particular function call (new is called in multiple places, not just the line of interest).
Is there a way to view the real-world time taken on a line of code with Trace32? Going further than a single line: would there be a way to easily benchmark the time between two breakpoints?
The solution by codehearts, which uses the RunTime commands, is just fine if you don't have a real-time trace. It works with any Lauterbach tool and any target CPU.
However, if you have a real-time trace (e.g. a CPU with ETM and Lauterbach PowerTrace hardware), I recommend using the command Trace.STATistic.AddressDURation <start-addr> <end-addr> instead. This command opens a window which shows the average time between the two addresses. You get the best results if you execute the code between the two addresses several times.
If you are using an ARM Cortex CPU which supports cycle-accurate timing information (usually all Cortex-A, Cortex-R, and Cortex-M7), you can improve the accuracy of the result dramatically by using the setting ETM.TImeMode.CycleAccurate (together with ETM.CLOCK <core-frequency>).
If you are using a Lauterbach CombiProbe or uTrace (and you can't use ETM.TImeMode.CycleAccurate), I recommend the setting Trace.PortFilter.ON. (By default the port filter is set to PACK, which allows more data and program flow to be recorded, but with slightly worse timing accuracy.)
Opening the Misc->Runtime window shows you the total time taken since "laststart." By setting a breakpoint on the first line of your code block and another after the last line, you can see the time taken from the first breakpoint to the second under the "actual" column.