profiling and performance issues - c++

After observing some performance issues in my program, I decided to run a profiling session. The results seem to indicate that something like 87% of samples taken were somehow related to my Update() function.
In this function, I am going through a list of A*, where sizeof(A) equals 72, and deleting them after processing.
void Update()
{
//...
for(auto i = myList.begin(); i != myList.end(); i++)
{
A* pA = *i;
//Process item before deleting it.
delete pA;
}
myList.clear();
//...
}
where myList is a std::list<A*>. On average, I am calling this function anywhere from 30 to 60 times per second while the list contains an average of 5 items. That means I'm deleting anywhere from 150 to 300 A objects per second.
Would calling delete this many times be enough to cause a performance issue in most cases? Is there any way to track down exactly where in the function the problem is occuring? Is delete generally considered an expensive operation?

Very difficult to tell, since you brush over what is probably the bulk of the work done in the loop and give no hint as to what A is...
If A is a simple collection of data, particularly primitives then the deletion is almost certainly not the culprit. You can test the theory by splitting your update function in two - update and uninit. Update does all the processing, uninit deletes the object and clears the list.
If only update is slow, then it's the processing. If only uninit is slow, then it's the deletion. If both are slow then memory fragmentation is probably the culprit.
As others have pointed out in the comments, std::vector may give you a performance increase. But be careful since it may also cause performance problems elsewhere depending on how you build the data structure.

You could have a look at tcmalloc from gperftools (Google Performance Tools). gperftools also contains a profiler (both libraries only need to be linked in, very easy). tcmalloc keeps a memory pool for small objects and re-uses this memory when possible.
The profiler can be used for cpu and heap profiling.

Totally easy to tell what's going on.
Do yourself a favor and use this method.
It's been analyzed to the nth degree, and is very effective.
In a nutshell, if 87% of time is in Update, then if you just stop it a few times with Ctrl-C or whatever, the probability is 87% each time that you will catch it in the act.
You will not only see that it's in Update. You will see where in Update, and what it's doing. If it is in the process of delete, or accessing the data structure, you will see that. You will also see, further down the stack, the reason why that operation takes time.

Related

Vector re-declaration versus insertions in looping operations - C++

I have an option to either create and destroy a vector on every call to func() and push elements in each iteration, as shown in Example A OR fixed the initialization and only overwrite old values in each iteration, as shown in Example B.
Example A:
void func ()
{
std::vector<double> my_vec(5, 0.0);
for ( int i = 0; i < my_vec.size(); i++) {
my_vec.push_back(i);
// do something
}
}
while (condition) {
func();
}
Example B:
void func (std::vector<double>& my_vec)
{
for ( int i = 0; i < my_vec.size(); i++) {
my_vec[i] = i;
// do something
}
}
while (condition) {
std::vector<double> my_vec(5, 0.0);
func(myVec);
}
Which of the two would be computationally inexpensive. The size of the array won't be more than 10.
I still suspect that the question that was asked is not the question that was intended, but it occurred to me that the main point of my answer would likely not change. If the question gets updated, I can always edit this answer to match (or delete it, if it turns out to be inapplicable).
De-prioritize optimizations
There are various factors that should affect how you write your code. Among the desirable goals are space optimization, time optimization, data encapsulation, logic encapsulation, readability, robustness, and correct functionality. Ideally, all of these goals would be achievable in every piece of code, but that is not particularly realistic. Much more likely is a situation where one or more of these goals must be sacrificed in favor of the others. When that happens, optimizations should typically yield to everything else.
That is not to say that optimizations should be ignored. There are plenty of optimizations that rarely obstruct the higher-priority goals. These range from the small, such as passing by const reference instead of by value, to the large, such as choosing the logarithmic algorithm instead of the exponential one. However, the optimizations that do interfere with the other goals should be postponed until after your code is reasonably complete and functioning correctly. At that point, a profiler should be used to determine where the bottlenecks actually are. Those bottlenecks are the only places where other goals should yield to optimizations, and only if the profiler confirms that the optimizations achieved their goals.
For the question being asked, this means that the main concern should not be computational expense, but encapsulation. Why should the caller of func() need to allocate space for func() to work with? It should not, unless a profiler identified this as a performance bottleneck. And if a profiler did that, it would be much easier (and more reliable!) to ask the profiler if the change helps than to ask Stack Overflow.
Why?
I can think of two major reasons to de-prioritize optimizations. First, the "sniff test" is unreliable. While there might be a few people who can identify bottlenecks by looking at code, there are many, many more who merely think they can. Second, that's why we have optimizing compilers. It is not unheard of for someone to come up with this super-clever optimization trick only to discover that the compiler was already doing it. Keep your code clean and let the compiler handle the routine optimizations. Only step in when the task demonstrably exceeds the compiler's capabilities.
See also: premature-optimization
Choosing an optimization
OK, suppose the profiler did identify construction of this small, 10-element array as a bottleneck. The next step is to test an alternative, right? Almost. First you need an alternative, and I would consider a review of the theoretical benefits of various alternatives to be useful. Just keep in mind that this is theoretical and that the profiler gets the final say. So I'll go into the pros and cons of the alternatives from the question, as well as some other alternatives that might bear consideration. Let's start from the worst options, working our way to the better ones.
Example A
In Example A, a vector is created with 5 elements, then elements are pushed onto the vector until i meets or exceeds the vector's size. Seeing how i and the vector's size are both increased by one each iteration (and i starts smaller than the size), this loop will run until the vector grows large enough to crash the program. That means probably billions of iterations (despite the question's claim that the size will not exceed 10).
Easily the most computationally expensive option. Don't do this.
Example B
In example B, a vector is created for each iteration of the outer while loop, which is then accessed by reference from within func(). The performance cons here include passing a parameter to func() and having func() access the vector indirectly through a reference. There are no performance pros as this does everything the baseline (see below) would do, plus some extra steps.
Even though a compiler might be able to compensate for the cons, I see no reason to try this approach.
Baseline
The baseline I'm using is a fix to Example A's infinite loop. Specifically, replace "my_vec.push_back(i);" with Example B's "my_vec[i] = i;". This simple approach is along the lines of what I would expect for the initial assessment by the profiler. If you cannot beat simple, stick with it.
Example B*
The text of the question presents an inaccurate assessment of Example B. Interestingly, the assessment describes an approach that has the potential to improve on the baseline. To get code that matches the textual description, move Example B's "std::vector<double> my_vec(5, 0.0);" to the line immediately before the while statement. This has the effect of constructing the vector only once, rather than constructing it with each iteration.
The cons of this approach are the same as those of Example B as originally coded. However, we now pick up a gain in that the vector's constructor is called only once. If construction is more expensive than the indirection costs, the result should be a net improvement once the while loop iterates often enough. (Beware these conditions: that's a significant "if" and there is no a priori guess as to how many iterations is "enough".) It would be reasonable to try this and see what the profiler says.
Get some static
A variant on Example B* that helps preserve encapsulation is to use the baseline (the fixed Example A), but precede the declaration of the vector with the keyword static. This brings in the benefit of constructing the vector only once, but without the overhead associated with making the vector a parameter. In fact, the benefit could be greater than in Example B* since construction happens only once per program execution, rather than each time the while loop is started. The more times the while loop is started, the greater this benefit.
The main con here is that the vector will occupy memory throughout the program's execution. Unlike Example B*, it will not release its memory when the block containing the while loop ends. Using this approach in too many places would lead to memory bloat. So while it is reasonable to profile this approach, you might want to consider other options. (Of course if the profiler calls this out as the bottleneck, dwarfing all others, the cost is small enough to pay.)
Fix the size
My personal choice for what optimization to try here would be to start from the baseline and switch the vector to std::array<10,double>. My main motivation is that the needed size won't be more than 10. Also relevant is that the construction of a double is trivial. Construction of the array should be on par with declaring 10 variables of type double, which I would expect to be negligible. So no need for fancy optimization tricks. Just let the compiler do its thing.
The expected possible benefit of this approach is that a vector allocates space on the heap for its storage, which has an overhead cost. The local array would not have this cost. However, this is only a possible benefit. A vector implementation might already take advantage of this performance consideration for small vectors. (Maybe it does not use the heap until the capacity needs to exceed some magic number, perhaps more than 10.) I would refer you back to earlier when I mentioned "super-clever" and "compiler was already doing it".
I'd run this through the profiler. If there's no benefit, there is likely no benefit from the other approaches. Give them a try, sure, since they're easy enough, but it would probably be a better use of your time to look at other aspects to optimize.

Optimisation linked list

Silly question maybe - but, what would be faster:
-Deleting an item from a linked list every time an item is gone (whatever that item might be)
-Just marking the record as dead and overwriting it after x amount of time or conditions.
Would I not use less cpu time by avoiding all the removing and inserting rather than just overwriting.
Only way to find out is to use a profiler.
Keep in mind that there are other factors which are very important that you didn't specify in your question. For example, it makes a big difference if you know which element to delete, i.e. you have the pointer to it ready, or if you have to iterate over the linked list to find it. In that case, a vector could actually be several times faster than any linked list (depending on the number of elements) due to caching.
Overwriting is going to be faster. If deleting involves freeing allocated memory from the heap, then overwriting will definitely be faster. Memory allocation is relatively slow. Plus, over time your list memory would get fragmented which might cause cache misses. Even freeing memory that goes back into a memory manager of some sort would be slower since you have the overhead of managing the nodes of the list.

optimizing `std::vector operator []` (vector access) when it becomes a bottleneck

gprof says that my high computing app spends 53% of its time inside std::vector <...> operator [] (unsigned long), 32% of which goes to one heavily used vector. Worse, I suspect that my parallel code failing to scale beyond 3-6 cores is due to a related memory bottleneck. While my app does spend a lot of time accessing and writing memory, it seems like I should be able (or at least try) to do better than 52%. Should I try using dynamic arrays instead (size remains constant in most cases)? Would that be likely to help with possible bottlenecks?
Actually, my preferred solution would be to solve the bottleneck and leave the vectors as is for convenience. Based on the above, are there any likely culprits or solutions (tcmalloc is out)?
Did you examine your memory access pattern itself? It might be inefficient - cache unfriendly.
Did you try to use raw pointer while array accessing?
// regular place
for (int i = 0; i < arr.size(); ++i)
wcout << arr[i];
// In bottleneck
int *pArr = &arr.front();
for (int i = 0; i < arr.size(); ++i)
wcout << pArr[i];
I suspect that gprof prevents functions to be inlined. Try to use another profiling method. std::vector operator [] cannot be bottleneck because it doesn't differ much from raw array access. SGI implementaion is shown below:
reference operator[](size_type __n) { return *(begin() + __n); }
iterator begin() { return _M_start; }
You cannot trust gprof for high-speed code profiling, you should instead use a passive profiling method like oprofile to get the real picture.
As an alternative you could profile by manual code alteration (e.g. calling a computation 10 times instead of one and checking how much the execution time increases). Note that this is however going to be influenced by cache issues so YMMV.
The vector class is heavily liked and provides a certain amount of convenience, at the expense of performance, which is fine when you don't particularly need performance.
If you really need performance, it won't hurt you too much to bypass the vector class and go directly to a simple old hand-made array, whether statically or dynamically allocated. Then 1) the time you currently spend indexing should essentially disappear, speeding up your app by that amount, and 2) you can move on to whatever the "next big thing" is that takes time in your app.
EDIT:
Most programs have a lot more room for speedup than you might suppose. I made a walk-through project to illustrate this. If I can summarize it really quickly, it goes like this:
Original time is 2.7 msec per "job" (the number of "jobs" can be varied to get enough run-time to analyze it).
First cut showed roughly 60% of time was spent in vector operations, including indexing, appending, and removing. I replaced with a similar vector class from MFC, and time decreased to 1.8 msec/job. (That's a 1.5x or 50% speedup.)
Even with that array class, roughly 40% of time was spent in the [] indexing operator. I wanted it to index directly, so I forced it to index directly, not through the operator function. That reduced time to 1.5 msec/job, a 1.2x speedup.
Now roughly 60% of the time is adding/removing items in arrays. An additional fraction was spent in "new" and "delete". I decided to chuck the arrays and do two things. One was to use do-it-yourself linked lists, and to pool used objects. The first reduced time to 1.3 msec (1.15x). The second reduced it to 0.44 msec (2.95x).
Of that time, I found that about 60% of the time was in code I had written to do indexing into the list (as if it were an array). I decided that could be done instead just by having a pointer directly into the list. Result: 0.14 msec (3.14x).
Now I found that nearly all the time was being spent in a line of diagnostic I/O I was printing to the console. I decided to get rid of that: 0.0037 msec (38x).
I could have kept going, but I stopped.
The overall time per job was reduced by a compounded factor of about 700x.
What I want you to take away is if you need performance bad enough to deviate from what might be considered the accepted ways of doing things, you don't have to stop after one "bottleneck".
Just because you got a big speedup doesn't mean there are no more.
In fact the next "bottleneck" might be bigger than the first, in terms of speedup factor.
So raise your expectations of speedup you can get, and go for broke.

Multithreading not taking advantage of multiple cores?

My computer is a dual core core2Duo. I have implemented multithreading in a slow area of my application but I still notice cpu usage never exceeds 50% and it still lags after many iterations. Is this normal? I was hopeing it would get my cpu up to 100% since im dividing it into 4 threads. Why could it still be capped at 50%?
Thanks
See What am I doing wrong? (multithreading)
for my implementation, except I fixed the issue that that code was having
Looking at your code, you are making a huge number of allocations in your tight loop--in each iteration you dynamically allocate two, two-element vectors and then push those back onto the result vector (thus making copies of both of those vectors); that last push back will occasionally cause a reallocation and a copy of the vector contents.
Heap allocation is relatively slow, even if your implementation uses a fast, fixed-size allocator for small blocks. In the worst case, the general-purpose allocator may even use a global lock; if so, it will obliterate any gains you might get from multithreading, since each thread will spend a lot of time waiting on heap allocation.
Of course, profiling would tell you whether heap allocation is constraining your performance or whether it's something else. I'd make two concrete suggestions to cut back your heap allocations:
Since every instance of the inner vector has two elements, you should consider using a std::array (or std::tr1::array or boost::array); the array "container" doesn't use heap allocation for its elements (they are stored like a C array).
Since you know roughly how many elements you are going to put into the result vector, you can reserve() sufficient space for those elements before inserting them.
From your description we have very little to go on, however, let me see if I can help:
You have implemented a lock-based system but you aren't judiciously using the resources of the second, third, or fourth threads because the entity that they require is constantly locked. (this is a very real and obvious area I'd look into first)
You're not actually using more than a single thread. Somehow, somewhere, those other threads aren't even fired up or initialized. (sounds stupid but I've done this before)
Look into those areas first.

C++ map performance - Linux (30 sec) vs Windows (30 mins) !

I need to process a list of files. The processing action should not be repeated for the same file. The code I am using for this is -
using namespace std;
vector<File*> gInputFileList; //Can contain duplicates, File has member sFilename
map<string, File*> gProcessedFileList; //Using map to avoid linear search costs
void processFile(File* pFile)
{
File* pProcessedFile = gProcessedFileList[pFile->sFilename];
if(pProcessedFile != NULL)
return; //Already processed
foo(pFile); //foo() is the action to do for each file
gProcessedFileList[pFile->sFilename] = pFile;
}
void main()
{
size_t n= gInputFileList.size(); //Using array syntax (iterator syntax also gives identical performance)
for(size_t i=0; i<n; i++){
processFile(gInputFileList[i]);
}
}
The code works correctly, but...
My problem is that when the input size is 1000, it takes 30 minutes - HALF AN HOUR - on Windows/Visual Studio 2008 Express. For the same input, it takes only 40 seconds to run on Linux/gcc!
What could be the problem? The action foo() takes only a very short time to execute, when used separately. Should I be using something like vector::reserve for the map?
EDIT, EXTRA INFORMATION
What foo() does is:
1. it opens the file
2. reads it into memory
3. closes the file
4. the contents of the file in memory is parsed
5. it builds a list of tokens; I'm using a vector for that.
Whenever I break the program (while running the program with the 1000+ files input set): the call-stack shows that the program is in the middle of a std::vector add.
In the Microsoft Visual Studio, there's a global lock when accessing the Standard C++ Library to protect from multi threading issue in Debug builds. This can cause big performance hits. For instance, our full test code runs on Linux/gcc in 50 minutes, whereas it needs 5 hours on Windows VC++2008. Note that this performance hit does not exist when compiling in Release mode, using the non-debug Visual C++ runtime.
I would approach it like any performance problem. This means: profiling. MSVC has a built-in profiler, by the way, so it may be a good chance to get familiar with it.
Break into the program using the debugger at a random time, and the chances are very high that the stack trace will tell you where it's spending the time.
I very very strongly doubt that your performance problem is coming from the STL containers.
Try to eliminate (comment out) the call to foo(pFile) or any other method which touches the filesystem. Although running foo(pFile) once may appear fast, running it on 1000 different files (especially on Windows filesystems, in my experience) could turn out to be much slower (e.g. because of filesystem cache behaviour.)
EDIT
Your initial post was claiming that BOTH debug and release builds were affected. Now you are withdrawing that claim.
Be aware that in DEBUG builds:
the STL implementation performs
extra checks and assertions
heap
operations (memory allocation etc.)
perform extra checks and assertions;
moreover, under debug builds the
low-fragmentation heap is
disabled (up to a 10x overall
slowdown in memory allocation)
no code optimizations are performed,
which may result in further STL
performance degradation (STL relying many a time heavily on inlining, loop unwinding etc.)
With 1000 iterations you are probably not affected by the above (not at the outer loop level at least) unless you use STL/the heap heavily INSIDE foo().
I would be astounded if the performance issues you are seeing have anything at all to do with the map class. Doing 1000 lookups and 1000 insertion should take a combined time on the order of microseconds. What is foo() doing?
Without knowing how the rest of the code fits in, I think the overall idea of caching processed files is a little flaky.
Try removing duplicates from your vector first, then process them all.
Try commenting each block or major operation to determine which part actually caused the difference in execution time in Linux and Windows. I also don't think it would be because of the STL map. The problem may be inside foo(). It may be in some file operation as it is the only thing I could think of that would be costly in this case.
You may insert clock() calls in between operations to get an idea of the execution time.
You say that when you break, you find yourself inside vector::add. You don't have a vector::add in the code you've shown us, so I suspect it's inside the foo function. Without seeing that code, it's going to be difficult to say what's up.
You might have inadvertently created a Shlemiel the Painter algorithm.
You can improve things somewhat if you ditch your map and partition your vector instead. This implies reordering the input files list. It also means you have to find a way of quickly determining if a file has been processed already, possibly by holding a flag in the File class. If it's ok to reorder the files list and if you can store that dirty flag in the File object then you can improve performance from O(n log m) to O(n), for n total files and m processed files.
#include <algorithm>
#include <functional>
// ...
vector<File*>::iterator end(partition(inputfiles.begin(), inputfiles.end(),
not1(mem_fun(&File::is_processed))));
for_each(inputfiles.begin(), end, processFile);
If you can't reorder the files list or if you can't change the File object then you can switch the map with a vector and shadow each file in the input files list with a flag in the second vector at the same index. This will cost you O(n) space but will give you O(1) check for dirty state.
vector<File*> processed(inputfiles.size(), 0);
for( vector<File*>::size_type i(0); i != inputfiles.size(); ++i ) {
if( processed[i] != 0 ) return; // O(1)
// ...
processed[i] = inputfiles[i]; // O(1)
}
But be careful: You're dealing with two distinct pointers pointing at the same address, and that's the case for each pair of pointers in the two containers. Make sure one and only one pointer owns the pointee.
I don't expect either of these to yield a solution for that performance hit, but nevertheless.
If you are doing most of your work in linux then I strongly strongly suggest you only ever compile to release mode in windows. That makes life much easier, especially considering all the windows inflexible library handling headaches.