C++ Code is running slower over time

I have a long C++ program consisting of thousands of lines of code, several classes, functions, etc. The program basically reads info from an input file, runs an algorithm, and writes the results to output files.
Today I realized that the run time of the program changes drastically from run to run. So I did a small test: I restarted my computer, closed everything else I could, and ran the code 5 times in a row with the same input file. The run times were 50, 80, 130, 180, and 190 seconds, respectively.
My first guess in this situation is non-deleted dynamic memory. But I use dynamic arrays in only two places in the whole code, and I am sure I delete those arrays.
Do you guys have any explanation for this? I am using Visual Studio 2010 on a Windows 7 computer.

Beware of running programs from within the Visual Studio debugger, as the low-fragmentation heap (LFH) memory allocator is disabled in that case. Try the software from outside of VS.
I have seen cases where tasks that normally take seconds to complete take hours, just by running from within Visual Studio.
Above all, if you still don't know what is going on, divide and conquer. Instrument the app to see the run times of its subsystems, or just place debug timers in various areas to see where execution time changes drastically, and drill down from there. If it is a memory-allocator issue, you will normally see large run times while freeing the arrays.
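To make the "debug timers" suggestion concrete, here is a minimal scoped-timer sketch; the ScopedTimer name and the subsystem labels are made up for this example, not taken from the original program.

```cpp
#include <chrono>
#include <cstdio>

// Minimal RAII timer for instrumenting subsystems; prints the elapsed time
// for the enclosing scope when it is destroyed.
struct ScopedTimer {
    const char* label;
    std::chrono::steady_clock::time_point start;

    explicit ScopedTimer(const char* l)
        : label(l), start(std::chrono::steady_clock::now()) {}

    ~ScopedTimer() {
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%s: %lld ms\n", label, static_cast<long long>(ms));
    }
};

int main() {
    { ScopedTimer t("read input");    /* read the input file here */ }
    { ScopedTimer t("run algorithm"); /* run the algorithm here */ }
    { ScopedTimer t("write output");  /* write the result files here */ }
}
```

Compare the per-subsystem numbers across the five runs; whichever section's time grows from run to run is where to drill down.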

Your code runs in an environment, which includes the state of the operating system, disk, network, time, memory, other processes launched, etc.
Executing the same code in the same environment will give the same result, every time.
Now, you're getting different results (execution times). If you're running the same executable repeatedly, then something is changing in the surrounding environment.
Now, the most obvious question is: is your code causing a change to the outside environment? A simple example would be: it reads in a file, changes the data, and writes it back out to the same file.
You know your code. Just use this approach to isolate any effects your code may be having on its environment and you'll find the reason.

Related

Why does my logging library cause performance tests to run faster?

I have spent the past year developing a logging library in C++ with performance in mind. To evaluate performance I developed a set of benchmarks to compare my code with other libraries, including a base case that performs no logging at all.
In my last benchmark I measure the total running time of a CPU-intensive task while logging is active and when it is not. I can then compare the time to determine how much overhead my library has. This bar chart shows the difference compared to my non-logging base case.
As you can see, my library ("reckless") adds negative overhead (unless all 4 CPU cores are busy). The program runs about half a second faster when logging is enabled than when it is disabled.
I know I should try to isolate this down to a simpler case rather than asking about a 4000-line program. But there are so many avenues for what to remove, and without a hypothesis I will just make the problem go away when I try to isolate it. I could probably spend another year just doing this. I'm hoping that the collective expertise of Stack Overflow will make this a much more shallow problem, or that the cause will be obvious to someone with more experience than me.
Some facts about my library and the benchmarks:
The library consists of a front-end API that pushes the log arguments onto a lock-free queue (Boost.Lockfree) and a back-end thread that performs the string formatting and writes the log entries to disk. (A rough sketch of this split appears after these notes.)
The timing is based on simply calling std::chrono::steady_clock::now() at the beginning and end of the program, and printing the difference.
The benchmark is run on a 4-core Intel CPU (i7-3770K).
The benchmark program computes a 1024x1024 Mandelbrot fractal and logs statistics about each pixel, i.e. it writes about one million log entries.
The total running time is about 35 seconds for the single worker-thread case. So the speed increase is about 1.5%.
The benchmark produces an output file (this is not part of the timed code) that contains the generated Mandelbrot fractal. I have verified that the same output is produced when logging is on and off.
The benchmark is run 100 times (with all the benchmarked libraries, this takes about 10 hours). The bar chart shows the average time and the error bars show the interquartile range.
Source code for the Mandelbrot computation
Source code for the benchmark.
Root of the code repository and documentation.
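For context, here is a minimal sketch of the front-end/back-end split described in the first point above. It is not the reckless code; it just pushes plain integers through a boost::lockfree::queue to a consumer thread that stands in for the formatting/writing back end.

```cpp
#include <boost/lockfree/queue.hpp>
#include <atomic>
#include <cstdio>
#include <thread>

int main() {
    // Front end pushes "log arguments" (here just ints) onto a lock-free queue;
    // a single back-end thread pops them and "formats and writes" them.
    boost::lockfree::queue<int> queue(1024);
    std::atomic<bool> done(false);

    std::thread backend([&queue, &done] {
        int value;
        for (;;) {
            while (queue.pop(value))
                std::printf("backend wrote entry %d\n", value);
            if (done.load(std::memory_order_acquire))
                break;
            std::this_thread::yield();
        }
        while (queue.pop(value))  // drain entries pushed just before 'done' was set
            std::printf("backend wrote entry %d\n", value);
    });

    for (int i = 0; i < 100; ++i)       // front end: cheap push, no formatting here
        while (!queue.push(i))
            std::this_thread::yield();

    done.store(true, std::memory_order_release);
    backend.join();
}
```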
My question is, how can I explain the apparent speed increase when my logging library is enabled?
Edit: This was solved after trying the suggestions given in comments. My log object is created on line 24 of the benchmark test. Apparently when LOG_INIT() touches the log object it triggers a page fault that causes some or all pages of the image buffer to be mapped to physical memory. I'm still not sure why this improves the performance by almost half a second; even without the log object, the first thing that happens in the mandelbrot_thread() function is a write to the bottom of the image buffer, which should have a similar effect. But, in any case, clearing the buffer with a memset() before starting the benchmark makes everything more sane. Current benchmarks are here
Other things that I tried are:
Running it under the oprofile profiler. I was never able to get it to register any time in the locks, even after enlarging the job so that it ran for about 10 minutes. Almost all the time was in the inner loop of the Mandelbrot computation, but maybe I would interpret the results differently now that I know about the page faults. I didn't think to check whether the image write was taking a disproportionate amount of time.
Removing the locks. This did have a significant effect on performance, but the results were still weird, and in any case I couldn't make that change in any of the multithreaded variants.
Comparing the generated assembly code. There were differences, but the logging build was clearly doing more work; nothing stood out as an obvious performance killer.
When uninitialised memory is first accessed, page faults will affect timing.
So, before your first call to std::chrono::steady_clock::now(), initialise the memory by running memset() on your sample_buffer.
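A minimal sketch of that fix (the buffer name and size are placeholders, not the actual benchmark code): touch the whole buffer once so the first-access page faults are paid before the first timestamp rather than inside the timed region.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
    const std::size_t kBufferSize = 4 * 1024 * 1024;   // placeholder size
    unsigned char* sample_buffer = new unsigned char[kBufferSize];

    // Touch every page up front so the first-access page faults happen here,
    // not inside the timed region below.
    std::memset(sample_buffer, 0, kBufferSize);

    auto start = std::chrono::steady_clock::now();
    // ... run the benchmarked work against sample_buffer ...
    auto stop = std::chrono::steady_clock::now();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    std::printf("elapsed: %lld ms\n", static_cast<long long>(ms));

    delete[] sample_buffer;
}
```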

C++, always running processes or invoked executable files?

I'm working on a project made up of several separate processes (services). Some services are called every second, some others every minute, and some may not be called for days. (There are also services that are called at random times, with no exact information about when they will be called.)
I see two approaches for developing the project: make the services always-running processes that communicate via interprocess messaging, or write separate C++ programs and run the executable files only when I need them.
I have two questions that I couldn't find a suitable answer to.
Is there any way I could calculate an approximate threshold that would help me decide when to use which approach?
How much faster are always-running processes? (I mean compared with the OS initializing and running an executable file each time.)
Edit 1: As mentioned in the comments and in Mats Petersson's answer, the answer to my questions depends heavily on the environment, so here is more detail about those conditions.
OS: CentOS 6.3
The services are small (normally under 1000 lines of code) and use no additional resources (such as a database).
I don't think anyone can answer your two direct questions, as it depends on many factors, such as what OS, what secondary storage, how large the application is, and what the application does (loading the contents of a database with a million entries takes much longer than having int x = 73; as the only initialization outside main).
There is overhead with both approaches, and assuming there isn't enough memory to hold EVERYTHING in RAM at all times (modern OSes will use RAM as disk cache or for other caching rather than keep old, crusty application code that isn't running, so eventually your application code will be swapped out if it's not being run), you are going to have approximately the same amount of disk I/O with either solution.
For me, "having memory available" trumps other things, so executing a process when it's needed is better than leaving it running in the expectation that it will be reused some time later. The only exceptions are if the executable takes a long time to start (in other words, it's large and has a complex startup procedure) AND it is run fairly frequently (at the very least several times per minute), or if you have hard real-time requirements, so the extra delay of starting the process is significantly worse than the penalty of holding it in memory (but bear in mind that holding it in memory isn't REALLY holding it in memory, since the contents will be swapped out to disk anyway if it isn't being used).
Starting a process that was run recently is typically served from cache, so it's less of an issue. Also, if the application uses shared libraries (.so, .dll or .dylib depending on the OS) that are genuinely shared, the load time will normally be shorter if that shared library is already in memory.
Both Linux and Windows (and I expect OS X) are optimised to load a program much faster the second time it executes in short succession, because things are cached. So frequent calling of the executable will definitely work in your favour.
I would start by "execute every time", and if you find that this is causing a problem, redesign the programs to stay around.
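One way to put a number on the threshold question above is to measure how long a fork+exec of a small binary actually takes on the target CentOS box and compare that with how often each service is called. A rough sketch, where /bin/true is just a stand-in for one of the small service executables:

```cpp
#include <chrono>
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    const int kRuns = 100;
    auto start = std::chrono::steady_clock::now();

    for (int i = 0; i < kRuns; ++i) {
        pid_t pid = fork();
        if (pid == 0) {
            // Child: replace ourselves with the stand-in service binary.
            execl("/bin/true", "true", static_cast<char*>(0));
            _exit(127);                      // only reached if exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);            // wait for the "service" to finish
    }

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("average launch+exit cost: %lld microseconds\n",
                static_cast<long long>(us / kRuns));
}
```

If the measured per-launch cost is a few milliseconds and a service runs at most once a second, the start-up overhead is probably negligible; if the cost is large relative to how often the service is called, keeping that service resident starts to pay off.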

How to debug code causing computer to crash

I have written a program to control several scientific instruments, which ends up going through several thousand loop iterations as it runs. All told, this tends to take about half an hour.
I have run into a weird bug/issue: roughly every other time I run the program, it freezes the computer and I have to hard-restart. When I do just a small number of loops to test the program I never have problems; it's only on full data runs that it crashes on and off.
Is there any way to trace the error if it only occurs intermittently? Is there any way to catch what the error is before the computer freezes? Could it be related to running the code in debug mode rather than release?
I am using Visual C++ 2013 on a Win 7 64-bit machine. All the various includes are the 64-bit versions. I can post the code if that would be helpful, but I must warn that it is very long. Thanks
With the test procedure being so long, maybe the best "in-house" way to deal with it is to write the needed debug info to a file at almost every step the program takes.
Be sure to flush or close the file after each write, otherwise you will probably lose your data when the machine freezes.
Sure, it will take a lot more time, but if you are lucky and the error shows some regularity, after a couple of checks you can switch to conditional breakpoints and debug the usual way.
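A minimal sketch of such a breadcrumb logger (the file name and step labels are placeholders). std::endl flushes the stream's buffer on every write; for a hard machine freeze you may additionally want an OS-level flush such as FlushFileBuffers for the very last lines.

```cpp
#include <fstream>
#include <string>

// Writes one line per step and flushes immediately, so the trace survives
// most crashes and shows how far the run got before the freeze.
class TraceLog {
public:
    explicit TraceLog(const std::string& path)
        : out_(path.c_str(), std::ios::out | std::ios::app) {}

    void step(const std::string& what) {
        out_ << what << std::endl;   // std::endl flushes the stream buffer
    }

private:
    std::ofstream out_;
};

int main() {
    TraceLog trace("run_trace.log");
    for (int i = 0; i < 10000; ++i) {
        trace.step("loop " + std::to_string(i) + ": start");
        // ... talk to the instruments here ...
        trace.step("loop " + std::to_string(i) + ": done");
    }
}
```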
If I had to bet on the cause, I would say it is a memory leak.
Hope this helps

Profiling a dll plugin

I want to profile a dll plugin in C++. I have access to the source (being the author/maintainer) and can modify it (if needed, for instrumentation).
What I don't have is the source/symbols/etc of the host program which is calling the dll. I only have the headers needed to build the plugin.
The dll is invoked upon action from the client.
What is the best way to proceed with profiling the code? It is not realistic to "wrap" an executable around the dll, and it would not be useful anyway: since the plugin calls some functions from the host AND I need to profile those paths, a wrapper would skew the results.
EDIT after Kieren Johnston's comment: Ideally I would like to hook into the loaded dll just like the debugger is able to (attaching to the running host process and placing a breakpoint somewhere in the dll as needed). Is it possible? If not, I will need to ask another question to ask why :-)
I am using the TFS edition of Visual Studio 2010.
Bonus points for providing suggestions/answers for the same task under AIX (ah, the joys of multiple environments!).
This is possible albeit a little annoying.
1. Deploy your plug-in DLL to where the host application needs it to be.
2. Launch your host application and verify that it is using your plug-in.
3. Create a new Performance Session.
4. Add the host EXE as a target in the Session from step 3.
5. Select Sampling or Instrumentation for your Session.
6. Launch the profiling session.
During all this, keep your plug-in solution loaded and VS should find the symbols for your plug-in automatically.
Not sure about VS2010, but in older versions you debug the dll by specifying the exe that runs it.
Let's split the problem into two parts: 1) locating what you might call "bottlenecks", and 2) measuring the overall speedup you get by fixing each one.
(2) is easy, right? All you need is an outer timer.
That leaves (1). If you're like most people, you think that finding the "bottlenecks" cannot be done without some kind of precision timing of the parts of the program.
Not so, because most of the time the things you need to fix to get the most speedup are not things you can detect that way.
They are not necessarily bad algorithms, or slow functions, or hotspots.
They are distributed things being done by perfectly innocent-looking well-designed code, that just happen to present huge speedup opportunity if coded in a different way.
Here's an example where a reasonably well written program had its execution time reduced from 48 seconds to 20, 17, 13, 10, 7, 4, 2.1, and finally 1.1, over 8 iterations.**
That's a compound speedup factor of over 40x.
The speedup factor you can get is different in every different program - some can get less, some can get more, depending on how close they are to optimal in the first place.
There's no mystery of how to do this.
The method was random pausing.
(It's an alternative to using a profiler. Profilers measure various things, and give you various clues that may or may not be helpful, but they don't reliably tell you what the problem is.)
** The speedup factors achieved, per iteration, were 2.38, 1.18, 1.31, 1.30, 1.43, 1.75, 1.90, 1.91. Another way to put it is the percent time reduced in each iteration: 58%, 15%, 24%, 23%, 30%, 43%, 48%, 48%. I get a hard time from profiler fans because the method is so manual, but they never talk about the speedup results.
(Maybe that will change.)

Random Complete System Unresponsiveness Running Mathematical Functions

I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).
After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).
That's all it does. This program will cause certain PCs to choke and die. The mouse begins to stutter, the task manager takes > 2 min to show, ctrl+alt+del is unresponsive, running programs are slow.... the works.
I've done literally everything I can think of to optimize the program, and have triple-checked all objects.
What I've done:
Tried different (less intensive) hashing algorithms.
Switched all allocations to nedmalloc instead of the default new operator
Switched from std::map to unordered_set, found the performance to still be abysmal, so I switched again to Google's dense_hash_map.
Converted all objects to store pointers to objects instead of the objects themselves.
Caching all Read and Write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4MB into a buffer and read 16k chunks from there instead. Same for all write operations - they are coalesced into 4MB blocks before being written to disk. (A sketch of this read-coalescing appears after this list.)
Run extensive profiling with Visual Studio 2010, AMD Code Analyst, and perfmon.
Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
Set the thread priority to THREAD_PRIORITY_IDLE
Added a Sleep(100) call after every loop.
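As a hedged illustration of the read-coalescing item above, here is a sketch that uses a portable std::ifstream instead of the ReadFile/WriteFile calls the program actually makes; the class name, file name, and sizes are illustrative only.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// One 4 MB disk read at a time, served out as 16 KB chunks for hashing.
class ChunkedReader {
public:
    static const std::size_t kBlockSize = 4 * 1024 * 1024;  // one disk read
    static const std::size_t kChunkSize = 16 * 1024;        // one hash unit

    explicit ChunkedReader(const char* path)
        : in_(path, std::ios::binary), block_(kBlockSize), pos_(0), filled_(0) {}

    // Copies the next chunk into 'chunk'; returns the bytes delivered (0 at EOF).
    std::size_t next_chunk(std::vector<char>& chunk) {
        if (pos_ == filled_) {                          // buffer exhausted: refill
            in_.read(block_.data(), static_cast<std::streamsize>(kBlockSize));
            filled_ = static_cast<std::size_t>(in_.gcount());
            pos_ = 0;
            if (filled_ == 0)
                return 0;                               // end of file
        }
        std::size_t n = filled_ - pos_;
        if (n > kChunkSize)
            n = kChunkSize;
        chunk.assign(block_.begin() + pos_, block_.begin() + pos_ + n);
        pos_ += n;
        return n;
    }

private:
    std::ifstream in_;
    std::vector<char> block_;
    std::size_t pos_, filled_;
};

int main() {
    ChunkedReader reader("input.bin");                  // hypothetical input file
    std::vector<char> chunk;
    while (std::size_t n = reader.next_chunk(chunk)) {
        (void)n;
        // ... hash 'chunk', record <chunkID, hash>, coalesce writes the same way ...
    }
}
```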
Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.
Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard pagefaults (and only ~30k pagefaults in the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.
The machines I've tested it on run Windows XP - Windows 7, x86 and x64 versions included. None have less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.
I'm at a loss as to what to do next. I don't know what's causing it - I'm torn between CPU and memory as the culprit. CPU, because without the sleep and under different thread priorities the system performance changes noticeably. Memory, because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.
What's really weird? Obviously, the NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system to this sort of extreme poor performance!?)..... but when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout) it performs excellently even on poor machines with little RAM and weaker CPUs.
What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?
Any advice would be most welcome.
I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.
When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?
What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.
Some things to check:
Antivirus software. These often scan files as they're opened to check for viruses. Is your delay occurring before any data is read by the application?
General system performance. Does copying the file using Explorer also show this problem?
Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the files, then one that just hashes random blocks of RAM (i.e. remove the disk I/O part), and see if any particular step is problematic. If you can get a profiler, use it as well to see if there are any slow spots in your code.
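For the last point, a minimal "hash-only" stage might look like the sketch below. FNV-1a is just a stand-in hash so the test is self-contained; it is not the hash the program actually uses.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in 64-bit FNV-1a hash, only here to keep the stage test self-contained.
static std::uint64_t fnv1a(const unsigned char* data, std::size_t len) {
    std::uint64_t h = 14695981039346656037ULL;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main() {
    const std::size_t kChunk = 16 * 1024;
    std::vector<unsigned char> block(kChunk, 0xA5);   // fake data, no disk I/O at all

    auto start = std::chrono::steady_clock::now();
    std::uint64_t sink = 0;
    for (int i = 0; i < 100000; ++i)                  // roughly 1.6 GB of hashing work
        sink ^= fnv1a(block.data(), block.size());

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("hash-only stage: %lld ms (sink=%llu)\n",
                static_cast<long long>(ms),
                static_cast<unsigned long long>(sink));
}
```

If the hash-only stage behaves, add the read stage back, then the write stage, and see which addition makes the machine start to stutter.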
EDIT
More ideas. Perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can have the GDI lock. Is this a GUI app, or just a simple console app?
You also mentioned RtlEnterCriticalSection. This is a costly operation, and it can hang the system quite easily, e.g. with mismatched Enters and Leaves. Are you multi-threading at all? Is the slowdown due to race conditions between threads?
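On the mismatched Enters and Leaves point, here is a sketch of a RAII guard that makes that particular bug hard to write; these are plain Win32 calls, but the class name and the spin count are illustrative.

```cpp
#include <windows.h>

// Ensures every EnterCriticalSection is paired with a LeaveCriticalSection,
// even on early returns or exceptions.
class CsLock {
public:
    explicit CsLock(CRITICAL_SECTION& cs) : cs_(cs) { EnterCriticalSection(&cs_); }
    ~CsLock() { LeaveCriticalSection(&cs_); }
private:
    CsLock(const CsLock&);              // non-copyable
    CsLock& operator=(const CsLock&);
    CRITICAL_SECTION& cs_;
};

CRITICAL_SECTION g_mapLock;

void store_result(/* chunk id, hash */) {
    CsLock lock(g_mapLock);             // Leave happens automatically at scope exit
    // ... update the shared map here ...
}

int main() {
    // A spin count can reduce kernel transitions under contention
    // (see the spin-count question further down).
    InitializeCriticalSectionAndSpinCount(&g_mapLock, 4000);
    store_result();
    DeleteCriticalSection(&g_mapLock);
    return 0;
}
```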
XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.
I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:
What non-MSVC libraries, if any, are you linking to?
Can your program be modified (#ifdef'd) to run without a GUI? Does the problem occur?
You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?
Is your program perhaps doing something else that locks a limited resource? (ProcExp can show you what handles are being acquired ... of course you might have difficulty with ProcExp not responding:-[)
Are you sure CriticalSections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx) ... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista
Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS?
How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?
I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.
It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.
In my case, I installed and used the Intel C++ Compiler, and even with all optimizations disabled I did not see the full-system hang that I was experiencing with the Visual Studio 2005 - 2010 compilers on this library.
I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.
It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get the stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump when it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.
Added: You say it performs well on OSX and Linux, and poorly on Windows. I assume the ratio of completion time is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in the comment, but it is a key point. The program is waiting for something, and you need to find out what. It could be any of the things people mentioned, but it is not random.
Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.
Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.
Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.