A carelessly written template here, some excessive inlining there - it's all too easy to write bloated code in C++. In principle, refactoring to reduce that bloat isn't too hard. The problem is tracing the worst offending templates and inlines - tracing those items that are causing real bloat in real programs.
With that in mind, and because I'm certain that my libraries are a bit more bloat-prone than they should be, I was wondering if there's any tools that can track down those worst offenders automatically - i.e. identify those items that contribute most (including all their repeated instantiations and calls) to the size of a particular target.
I'm not much interested in performance at this point - it's all about the executable file size.
Are there any tools for this job, usable on Windows, and fitting with either MinGW GCC or Visual Studio?
EDIT - some context
I have a set of multiway-tree templates that act as replacements for the red-black tree standard containers. They are written as wrappers around non-typesafe non-template code, but they were also written a long time ago, as a "will better cache friendliness boost real performance?" experiment. The point being, they weren't really written for long-term use.
Because they support some handy tricks, though (search based on custom comparisons/partial keys, efficient subscripted access, search for smallest unused key) they ended up being in use just about everywhere in my code. These days, I hardly ever use std::map.
Layered on top of those, I have some more complex containers, such as two-way maps. On top of those, I have tree and digraph classes. On top of those...
Using map files, I could track down whether non-inline template methods are causing bloat. That's just a matter of finding all the instantiations of a particular method and adding the sizes. But what about unwisely inlined methods? The templates were, after all, meant to be thin wrappers around non-template code, but historically my ability to judge whether something should be inlined or not hasn't been very reliable. The bloat impact of those template inlines isn't so easy to measure.
I have some idea which methods are heavily used, but that's the well-known optimization-without-profiling mistake.
Check out Symbol Sort. I used it a while back to figure out why our installer had grown by a factor of 4 in six months (it turns out the answer was static linking of the C runtime and libxml2).
Map file analysis
I ran into a problem like this some time ago, and I ended up writing a custom tool which analysed the map file (the Visual Studio linker can be instructed to produce one). The tool's output was:
a list of functions sorted descending by code size, listing only the first N
a list of source files sorted descending by code size, listing only the first N
Parsing the map file is relatively easy (a function's code size can be computed as the difference between its address and the address on the following line); the hardest part is probably handling mangled names in a reasonable way. You might find some ready-to-use libraries for both of these; I did this a few years ago and do not know the current situation.
Here is a short excerpt from a map file, so that you know what to expect:
Address Publics by Value Rva+Base Lib:Object
0001:0023cbb4 ?ApplyScheme@Input@@QAEXPBVParamEntry@@@Z 0063dbb4 f mainInput.obj
0001:0023cea1 ?InitKeys@Input@@QAEXXZ 0063dea1 f mainInput.obj
0001:0023cf47 ?LoadKeys@Input@@QAEXABVParamEntry@@@Z 0063df47 f mainInput.obj
Symbol Sort
As posted in Ben Staub's answer, Symbol Sort is a ready-to-use command-line utility (it comes with complete C# source) which does all of this, the only difference being that it analyses pdb/exe files rather than map files.
So what I'm reading based on your question and your comments is that the library is not actually too large.
The only tool you need to determine this is a command shell, or Windows File Explorer. Look at the file size. Is it so big that it causes real, actual problems (unacceptable download times, won't fit in memory on the target platform, that kind of thing)?
If not, then you should worry about code readability and maintainability and nothing else. And the tool for that is your eyes. Read the code, and take the actions needed to make it more readable if necessary.
If you can point to an actual reason why the executable size is a problem, please edit that into your question, as it is important context.
However, assuming the file size is actually a problem:
Inlined functions are generally not a problem, because the compiler, and no one else, chooses which functions to inline. Simply marking something inline does not inline the actual generated code. The compiler inlines if it determines the trade-off between larger code and less indirection to be worth it. If a large function is called often, it will not be inlined everywhere, because that would dramatically grow the code size, which would itself hurt performance.
If you're worried that inlined functions cause code bloat, simply compile with the "optimize for size" flag. Then the compiler will restrict inlining to the cases where it doesn't affect executable size noticeably.
For finding out which symbols are biggest, parse the map file as @Suma suggested.
But really, you said it yourself when you mentioned "the well-known optimization-without-profiling mistake."
The very first act of profiling you need to do is to ask: is the executable size actually a problem? In the comments you said that you "have a feeling", which, in a profiling context, is useless, and can be translated as "no, the executable size is not a problem".
Profile. Gather data and identify trouble spots. Before worrying about how to bring down the executable size, find out what the executable size is, and identify whether or not that is actually a problem. You haven't done that yet. You read in a book that "code bloat is a problem in C++", and so you assume that code bloat is a problem in your program. But is it? Why? How do you determine that it is?
http://www.sikorskiy.net/prj/amap/index.html
This is a wonderful GUI tool for analysing the sizes of object files within a lib/library, based on the map file produced by the Visual Studio compiler. Just feed it the map file generated alongside your dll/exe: it analyses the file and reports which functions occupy how much space, with filtering and dynamic size display. You can also sort on size. Check the screenshots on the page above.
Basically, you are looking for costly things that you don't need. Suppose there is some category of functions that you don't need taking some large percent of the space, like 20%. Then if you picked 20 random bytes out of the image size, on the average 4 of them (20 * 20%) will be in that category, and you will be able to see them. So basically, you take those samples, look at them, and if you see an obvious pattern of functions that you don't really need, then remove them. Then do it again because other categories of routines that used less space are now taking a higher percentage.
So I agree with Suma that parsing the map file is a good start. Then I would write a routine to walk through it, and every 5% of the way (space-wise) print the routine I am in. That way I get 20 samples. Often I find that a large chunk of object space results from a very small number (like 1) of lines of source code that I could easily have done another way.
You are also worried about too much inlining making functions larger than they could be. To figure that out, I would take each of those sample, and since it represents a specific address in a specific function, I would trace that back to the line of code it is in. That way, I can tell if it is in an expanded function. This is a bit of work, but doable.
A similar problem is how to find tumors when disks get full. The same idea there is to walk the directory tree, adding up the file sizes. Then you walk it again, and as you pass each 5% point, you print out the path of the file you are in. This tells you not only if you have large files, it tells you if you have large numbers of small files, and it doesn't matter how deeply they are buried or how widely they are scattered. When you clean out one category of files that you don't need, you can do it again to get the next category, and so on.
Good luck.
Your question seems to tend towards run-time rather than compile-time bloat.
However, if compile-time bloat (plus binary bloat resulting from inefficient compilation) is relevant, then I have to mention clang tool IWYU.
Since IWYU will likely manage to toss out quite a number of #includes in your code, it should also manage to reduce binary bloat. At least in my own environment I can certainly confirm a useful reduction in build time.
Related
I was reading through the String searching algorithm wikipedia article, and it made me wonder what algorithm strstr uses in Visual Studio? Should I try and use another implementation, or is strstr fairly fast?
Thanks!
The implementation of strstr in Visual Studio is not known to me, and I am not sure it is known to anyone outside the vendor. However, I found these interesting sources and an example implementation. The latter shows that the algorithm runs in worst-case quadratic time with respect to the size of the searched string; the aggregate behaviour should be better than that, and that worst case is the algorithmic limit for non-stochastic solutions.
What is actually the case is that, depending on the size of the input, different algorithms may be used, mainly optimized to the metal. However, one cannot really bet on that. If you are doing DNA sequencing, strstr and family are very important, and you will most probably have to write your own customized version. Standard implementations are usually optimized for the general case, but on the other hand the people working on compilers and runtimes know their stuff. At any rate, you should not bet your own skills against the pros.
But really, all this discussion about development time is hurting the effort to write good software. Before you embark on this task, be certain that the benefit of a custom strstr outweighs the effort that will be needed to maintain and tune it for your specific case.
As others have recommended: Profile. Perform valid performance tests.
Without profile data, you could be optimizing a part of the code that runs only 20% of the time - a poor return on investment.
Development costs are the prime concern with modern computers, not execution time. The best use of time is to develop the program to operate correctly with few errors before entering System Test. This is where the focus should be. Also due to this reasoning, most people don't care how Visual Studio implements strstr as long as the function works correctly.
Be aware that there is a line or point where a linear search outperforms other searches. This line depends on the size of the data or the search criteria. For example, linear search using a processor with branch prediction and a large instruction cache may outperform other techniques for small and medium data sizes. A more complicated algorithm may have more branches that cause reloading of the instruction cache or data cache (wasting execution time).
Another method for optimizing your program is to make the data organization easier for searching. For example, making the string small enough to fit into a cache line. This also depends on the quantity of searching. For a large amount of searches, optimizing the data structure may gain some performance.
In summary, optimize if and only if the program is not working correctly, the User is complaining about speed, it is missing timing constraints or it doesn't fit in the allocated memory. Next step is then to profile and optimize the areas where most of the time is spent. Any other optimization is futile.
The C++ standard refers to the C standard for the description of what strstr does. The C standard doesn't seem to put any restrictions on the complexity, so pretty much any algorithm that finds the first instance of the substring would be compliant.
Thus different implementations may choose different algorithms. You'd have to look at your particular implementation to determine which it uses.
The simple, brute-force approach is likely O(m×n) where m and n are the lengths of the strings. If you need better than that, you can try other libraries, like Boost, or implement one of the sub-linear searches yourself.
I'm planning on writing a game with C++, and it will be extremely CPU-intensive (pathfinding, genetic algorithms, neural networks, ...)
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(let this top section of this question be side information, I don't want it to restrict the main question, but it would be nice if you could give me side notes as well)
Is it worth it to learn how to work with ASM, so I can make ASM calls in C++,
can it give me a significant/notable performance advantage?
In what situations should I use it?
Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.
If you're not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.
Short answer: it depends, most likely you won't need it.
Don't start optimizing prematurely. Write code that is also easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....
A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.
Don't get ahead of yourself.
I've posted a sourceforge project showing how a simulation program was massively speeded up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use is not to employ a profiler.
Rather I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.
"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your solution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.
If you need to ask, then you don't need it.
I sunk about a month of full time into a native C++ equation parser. It works, except it is slow (between 30-100 times slower than a hard-coded equation). What can I change to make it faster?
I read everything I could find on efficient code. In broad strokes:
The parser converts a string equation expression into a list of "operation" objects.
An operation object has two function pointers: a "getSource" and an "evaluate".
To evaluate an equation, all I do is a for loop on the operation list, calling each function in turn.
There isn't a single if / switch encountered when evaluating an equation - all conditionals are handled by the parser when it originally assigned the function pointers.
I tried inlining all the functions to which the function pointers point - no improvement.
Would switching from function pointers to functors help?
How about removing the function pointer framework, and instead creating a full set of derived "operation" classes, each with its own virtual "getSource" and "evaluate" functions? (But doesn't this just move the function pointers into the vtable?)
I have a lot of code. Not sure what to distill / post. Ask for some aspect of it, and ye shall receive.
In your post you don't mention that you have profiled the code. This is the first thing I would do if I were in your shoes. It'll give you a good idea of where the time is spent and where to focus your optimization efforts.
It's hard to tell from your description whether the slowness includes parsing, or just the interpretation time.
The parser, if you write it as recursive-descent (LL1) should be I/O bound. In other words, the reading of characters by the parser, and construction of your parse tree, should take a lot less time than it takes to simply read the file into a buffer.
The interpretation is another matter.
The speed differential between interpreted and compiled code is usually 10-100 times slower, unless the basic operations themselves are lengthy.
That said, you can still optimize it.
You could profile, but in such a simple case, you could also just single-step the program, in the debugger, at the level of individual instructions.
That way, you are "walking in the computer's shoes" and it will be obvious what can be improved.
Whenever I'm doing what you're doing, that is, providing a language to the user, but I want the language to have fast execution, what I do is this:
I translate the source language into a language I have a compiler for, and then compile it on-the-fly into a .dll (or .exe) and run that.
It's very quick, and I don't need to write an interpreter or worry about how fast it is.
The very first thing is: Profile what actually went wrong. Is the bottleneck in parsing or in evaluation? valgrind offers some tools that can help you here.
If it's in parsing, boost::spirit might help you. If it's in evaluation, remember that virtual functions can be pretty slow to call. I've had pretty good experiences with recursive boost::variants.
You know, building an expression recursive descent parser is really easy, the LL(1) grammar for expressions is only a couple of rules. Parsing then becomes a linear affair and everything else can work on the expression tree (while parsing basically); you'd collect the data from the lower nodes and pass it up to the higher nodes for aggregation.
This would avoid altogether the function/class pointers used to determine the call path at runtime, relying instead on plain recursion (or you can build an iterative LL parser if you wish).
It seems that you're using a quite complicated data structure (as I understand it, a syntax tree with pointers etc.). Thus, walking through pointer dereference is not very efficient memory-wise (lots of random accesses) and could slow you down significantly. As Mike Dunlavey proposed, you could compile the whole expression at runtime using another language or by embedding a compiler (such as LLVM). For what I know, Microsoft .Net provides this feature (dynamic compilation) with Reflection.Emit and Linq.Expression trees.
This is one of those rare times that I'd advise against profiling just yet. My immediate guess is that the basic structure you're using is the real source of the problem. Profiling the code is rarely worth much until you're reasonably certain the basic structure is reasonable, and it's mostly a matter of finding which parts of that basic structure can be improved. It's not so useful when what you really need to do is throw out most of what you have, and basically start over.
I'd advise converting the input to RPN. To execute this, the only data structure you need is a stack. Basically, when you get to an operand, you push it on the stack. When you encounter an operator, it operates on the items at the top of the stack. When you're done evaluating a well-formed expression, you should have exactly one item on the stack, which is the value of the expression.
Just about the only thing that will usually give better performance than this is to do like @Mike Dunlavey advised, and just generate source code and run it through a "real" compiler. That is, however, a fairly "heavy" solution. If you really need maximum speed, it's clearly the best solution -- but if you just want to improve what you're doing now, converting to RPN and interpreting that will usually give a pretty decent speed improvement for a small amount of code.
I have recently started making a C++ Wrapper of GTK+ (Nothing special, just wrapping everything into C++ classes for easy development, to be used in-house) and to cause minimum performance bloat to the already slow Gtk+ I used inline functions almost everywhere. Take a look at a few class functions...
class Widget : public Object
{
public: // A few sample functions. gwidget is the internal GTK+ widget.
void Show(void) { gtk_widget_show(GTK_WIDGET(gwidget)); }
void ShowNow(void) { gtk_widget_show_now(GTK_WIDGET(gwidget)); }
void Hide(void) { gtk_widget_hide(GTK_WIDGET(gwidget)); }
void ShowAll(void) { gtk_widget_show_all(GTK_WIDGET(gwidget)); }
public: // the internal Gtk+ widget.
GtkWidget* gwidget;
};
And although there is almost no performance overhead, and the startup time and memory usage are exactly the same, the file size has increased dramatically: a C GTK+ sample window compiles to 6.5 kb, while a sample window using my wrapper compiles to 22.5 kb. So I need a little advice. Should I continue using inline functions? I want my apps to be efficient, and I can compromise a little on file size - I could accept it even if a 6.5 kb C GTK+ program grew to 400-500 kb using my wrapper, but NOT MORE. I don't want my wrapper to produce HUGE EXEs like wxWidgets or MFC. So is it worthwhile to use inline functions, or should I use normal ones?
NOTE: all my functions only take up one or sometimes two lines and are not big as you can see in the example.
I strongly suspect you're comparing apples to oranges.
Did you compile with exactly the same compiler and the same compiler flags, making an application with the exact same functionality?
If so, disassemble the executable, and see what the extra code is.
My guess is that it's some one-off library code used to support C++ features that were previously unused.
But as always, don't guess, measure.
You've got a single data point. That doesn't tell you much.
We could be looking at a 350% increase in file size in all cases, or we could be looking at a fixed 16kb overhead. You need to find out which it is.
So get some more data points. Extend your application. Make it open ten windows instead of one, or otherwise add extra functionality. Is "your" version three times as big in that case too? Or is it 16kb larger? Or somewhere in between? Get some more data points, and you'll be able to see how the file size scales.
But most likely, you're worrying over nothing, for several reasons:
the C++ compiler treats inline as a hint. You are making it easy for the compiler to inline the function, but the decision rests with the compiler itself, and it tries to make the application fast. If the file size starts growing out of control, that will slow down your code, and so your compiler will try to optimize more towards a smaller file size.
you are looking at a couple of kilobytes, in an age of terabyte hard drives. If this has the potential to become a problem, then you should be able to provoke that problem in a test case. If you can't write a test that causes more than 16kb growth in file size, then it's not worth worrying about.
if file size does become a problem, compilers typically have an "optimize for size" flag.
large executables typically gain their size because they contain a lot of data and resources. The code itself is very rarely the problem (unless you go completely bonkers with template metaprogramming)
The size difference is more likely due to the libraries which are pulled in for C++; your C equivalent has less library overhead.
If all your wrapper code follows the pattern above, there will be very little to no bloat from it.
Your solution looks to me like a good, worthwhile way of implementing the wrapper.
To cause minimum performance bloat to the already slow Gtk+ I used inline functions almost everywhere
Compilers are quite good at knowing when to inline and when not to inline functions. The inline keyword does not actually mean that the function will be inlined, only that definitions of that function in different translation units don't break the One Definition Rule (ODR).
Regarding the code size, it will depend on the compiler options and a few other things, but for the code that you are presenting there should not be any effect: if the function is inlined, the call to one function is simply replaced with a call to the other, since these are one-liners. Note that many compilers do create the out-of-line function and leave it in the binary even if all uses are inlined; you might want to look at the compiler/linker documentation on how to remove those, but even if they are created, the size of the project should not be affected by much.
If you are willing to allow your project to grow from 6.5kb to 400kb, you should be more than fine.
It depends on
how big your functions are
how often you use them
compiler settings (which you can influence)
compiler implementation (which you can't)
etc. etc. In other words, don't assume, measure. Do a representative chunk of your project with and without inline functions and see what the effect is in your particular circumstances. Any predictions you receive across the internet will be educated guesses at best.
If your functions are only one or two lines then it's very unlikely that they'll increase the size of your resulting binary by any considerable amount. In fact, they could make it smaller if the code itself is smaller than the overhead code of a function call.
Note that either way, the overhead would be negligible. Just removing a single call to std::sort or an instantiation of std::map would offset any bloat. If you care about code size, small inlined functions are the least of your worries.
The examples that you give should incur no code increase at the call site. An inline function that does just one call and nothing else should be optimized away by any decent compiler.
You'd have to investigate where the real increase is. Look at the assembler that is produced; this usually gives a good view on the overhead.
There is no easy answer. It depends on many factors:
the complexity of the function
the CPU architecture and memory model
implementation procedure calling conventions
the parameter passing mechanism, including how the object accesses its this pointer
Size is not the only efficiency to consider. On many modern CPUs, executing any kind of branch or call instruction may stall the CPU for the equivalent time of many instructions. Often replacing a call instruction with a few instructions from the function body is a big time gain. It can also be a code size advantage since CPU registers may already have function parameters in them so they would not need to be pushed or moved.
Don't sweat little optimizations until there is a known space or speed issue. Then look for the 10% fix which affects 90% of the problem.
Congratulations, you are reinventing GTKmm, the official C++ bindings for GTK+.
Each time I write another one of my small C++ toy programs, I come across the need for a small, easy-to-use options/parameter class. Here is what it should be able to do:
accept at least ints, doubles, string parameters
easy way to add new options
portable, small and fast
free
read options from a file and/or command line
upper and lower bounds for parameters
and all the other neat useful things I am not thinking of right now
What I want to do is pass a pointer to this class to the builder and all of my strategy objects, so they can read the parameters of the algorithm I am running (e.g. which algorithm, maximum number of iterations etc.)
Can someone point me to an implementation that achieves at least some of these things?
Boost.Program_options is pretty slick. I think it does all the things on your list, apart from maybe bounds validation. But even then, you can provide custom validators pretty easily.
Update: As @stefan rightly points out in a comment, this also fails on "small"! It adds quite a significant chunk to your binary if you statically link it in.
You might want to consider storing your configuration in JSON format. While reading JSON from the command-line is slightly awkward, it's still perfectly doable and even reasonably legible. Other than that you get a whole lot of flexibility, including nested configuration options, facilities for deserializing complicated data types etc.
There are numerous libraries for de-serializing JSON into C++, see e.g. this discussion and comparison of a few of them. Some are small, many are fast (although you don't actually need them to be fast - configuration data is very small), most are very portable. A long list and some benchmark results (although not a feature comparison) can be found here; some of these libraries might actually be geared towards use for reading configuration options, although that's just a wild guess.