So I'm thinking of using the Eigen matrix library for a project I'm doing (a 2D space simulator). I just went ahead and profiled some code with Eigen::Vector2d and with bare arrays, and I noticed that assigning values to elements was about 10x faster with bare arrays, and computing dot products was about 40x faster.
Here is my profiling code if you want to check it out; basically it's ~4.065 s against ~0.110 s.
Obviously bare arrays are much more efficient at dot products and assigning stuff. So why use the Eigen library (or any other library, Eigen just seemed the fastest)? Is it stability? Complicated maths that would be hard to code by yourself efficiently?
The real win for these libraries is the built-in SIMD vectorization.
It looks like Eigen doesn't enable that by default and you need to turn it on with a define or compiler switch. (EDIT: I misread the link; it's enabled if Eigen detects that the compiler supports it, but on some compilers you still need to enable the relevant instruction sets, so it may or may not be on by default with your compiler.)
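For what it's worth, a minimal example (a sketch assuming GCC and an eigen.cpp test file like the one benchmarked below) is to compile with optimizations on and let the compiler use whatever SIMD instructions the host CPU supports:
g++ -O3 -march=native -I/usr/include/eigen3/ eigen.cpp -o eigen_simd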
(Not to mention the fact that they are typically more thoroughly tested than a home rolled solution, and enable all sorts of complicated/interesting stuff that's a real bear to code by hand)
There are a number of reasons to opt for standard library code.
Better portability. An individual developer may not have considered (or may not have access to) multiple platforms.
Better reliability. (as mentioned by Donnie) A library is usually more thoroughly tested.
Better developer mobility. It is easier to work on other people's code if they are using standard library components.
Avoids reinventing the wheel. You want to avoid a situation where each developer develops the same component in their own way.
A custom implementation can go stale quickly. There's only so long you will be able to keep updating and supporting your own version of the library; the standard library is likely to have far more ongoing support effort behind it.
Better "external" support. Consider the C++ STL library for instance. You will find plenty of resources from people who are not the original developers. Also, textbooks will cover standard library components, which helps new users and students to learn them without any burden to the developer.
PS/Disclaimer: My apologies, I don't know about the Eigen library. The above points are from a more general perspective regarding standard library.
I just had a look at your benchmarking and got the following results:
g++ -I/usr/include/eigen3/ eigen.cpp -o eigen
g++ -O3 -I/usr/include/eigen3/ eigen.cpp -o eigen_opt
g++ -I/usr/include/eigen3/ matrix.cpp -o matrix
g++ -O3 -I/usr/include/eigen3/ matrix.cpp -o matrix_opt
./eigen 3.10s user 0.00s system 99% cpu 3.112 total
./eigen_opt 0.00s user 0.00s system 0% cpu 0.001 total
./matrix 0.06s user 0.00s system 96% cpu 0.058 total
./matrix_opt 0.00s user 0.00s system 0% cpu 0.001 total
Eigen really is not fast unless you switch on the compiler optimizations. I also suspect that in the -O3 case the compiler performs some optimization that defeats the benchmark itself, for example removing loops whose results are never used. You might want to look into it.
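One way to guard against that (just a sketch, not the original benchmark code) is to make sure the result of every iteration is actually consumed, e.g. by accumulating into a volatile variable and printing it, so the loop cannot be deleted as dead code:
#include <cstdio>
#include <Eigen/Dense>

int main() {
    Eigen::Vector2d a(1.0, 2.0), b(3.0, 4.0);
    volatile double sink = 0.0;              // volatile forces the compiler to keep every store
    for (int i = 0; i < 10000000; ++i)
        sink = sink + a.dot(b);              // the dot product result is now observable
    std::printf("%f\n", static_cast<double>(sink));
    return 0;
}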
I think this removes one of your points for not using a library: speed. Once that criterion is out of the way, there is NO reason I can think of not to use an existing library, other than wanting to do something for academic purposes or wanting to write your own library. Whenever I see a library or other code that implements its own matrix and vector classes these days, I try to avoid it if possible. With Eigen around I even have much less need of Matlab...
What is the difference between -fprofile-use and -fauto-profile?
Here's what the docs say:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
-fprofile-use
-fprofile-use=path
Enable profile feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
If path is specified, GCC looks at the path to find the profile feedback data files. See -fprofile-dir.
and underneath that
-fauto-profile
-fauto-profile=path
Enable sampling-based feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
path is the name of a file containing AutoFDO profile information. If omitted, it defaults to fbdata.afdo in the current directory.
(The list of optimizations in the [...] for -fauto-profile is longer.)
I stumbled into this thread by a path I can't even remember and am learning this stuff as I go along. But I don't like seeing an unanswered question if I could learn something from it! So I got reading.
Feedback-Directed Optimisation
As GCC say, both of these are modes of applying Feedback-Directed Optimisation. By running the program and profiling what it does, how it does it, how long it spends in which functions, etc. - we may facilitate extra, directed optimisations from the resulting data. Results from the profiler are 'fed forward' to the optimiser. Next, presumably, you can take your profile-optimised binary and profile that, then compile another FDO'd version, and so on... hence the feedback part of the name.
The real answer, the difference between these two switches, isn't very clearly documented, but it's there if we look a little further.
-fprofile-use
Firstly, your quote for -fprofile-use only really states that it requires -fprofile-generate, an option that isn't very well documented: the reference from -use just tells you to read the page you're already on, where in all cases, -generate is only mentioned but never defined. Useful! But! We can refer to the answers to this question: How to use profile guided optimizations in g++?
As that answer states, and the piece of GCC's documentation in question here gently indicates... -fprofile-generate causes instrumentation to be added to the output binary. As that page summarises, an instrumented executable has stuff added to facilitate extra checks or insights during its runtime.
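In practice the cycle looks something like this (a sketch with made-up file names; pick whatever optimization level you normally use):
g++ -O2 -fprofile-generate myprog.cpp -o myprog
./myprog
g++ -O2 -fprofile-use myprog.cpp -o myprog
The instrumented run writes out .gcda data files, which the second compilation then reads to guide its optimizations.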
(The other form of instrumentation I know - and the one I've used - is the compiler add-on library UBSan, which I use via GCC's -fsanitize=undefined option. This catches bits of Undefined Behaviour at runtime. GCC with this on has revealed UB I might've otherwise taken ages to find - and made me wonder how my programs ran at all! Clang can use this library too, and maybe other compilers.)
-fauto-profile
In contrast, -fauto-profile works differently. The key distinction is hinted at, if not stated clearly, in the synopsis you quoted for it:
path is the name of a file containing AutoFDO profile information.
This mode handles profiling and subsequent optimisations using AutoFDO. To Google we go: AutoFDO. The first few lines don't explain it as succinctly as they could, and I think the best summary is buried rather far down the page:
The major difference between AutoFDO [-fauto-profile] and FDO [-fprofile-use] is that AutoFDO profiles on optimized binary instead of instrumented binary. This makes it very different in handling cloned functions.
How does it do this? -fauto-profile requires you to provide profiling files written out by the Linux kernel's profiler, Perf, converted to the AutoFDO format. Perf, rather than adding instrumentation, uses hardware features of the CPU and kernel-level features of the OS to profile various statistics about a program while it's running:
perf is powerful: it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. [...] Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.
So, that lets it profile an optimised program, rather than an instrumented one. We might reasonably presume this is more representative of how your program would react in the real world - and so can facilitate gathering more useful profiling data and applying more effective optimisations as a result.
An example of how to do the legwork of tying all this together and getting -fauto-profile to do something with your program is summarised here: Feedback directed optimization with GCC and Perf
(Maybe now that I learned all this, I'll try these options out some day!)
underscore_d gives an in-depth insight into the differences.
Here is my take on it.
Perform internal profiling by compiling initially with -fprofile-generate, which integrates the profiler into the binary for the data-collection run. Execute the binary for 10 minutes, or whatever time you think covers enough activity for the profiler to record, then recompile with -fprofile-use (plus -fprofile-correction if it is a multi-threaded application). The internal profiler causes a significant performance hit (25% in my case) which does not reflect the real-world behaviour of the non-profiled binary, so it could make the profile less accurate; but if all activity in the profiler-enabled run scales with the performance penalty, I guess it should not matter.
Alternatively you can use the perf tool (more error-prone and more effort), which is specific to your kernel (the kernel may also need to be built to support profiling, tracing, etc.), to create the profiling data. This could be considered external profiling, and it has negligible impact on application performance while profiling. You run it on a binary that you compile normally. I cannot find any studies comparing the two approaches.
perf record -e br_inst_retired:near_taken -b -o perf.data your_program.unstripped [program parameters]
then without stripping the binary, convert the profiling data into something GCC understands...
create_gcov --binary=your_program.unstripped --profile=perf.data --gcov=profile.afdo
Then recompile the application using -fauto-profile. Be aware that version-specific issues between perf and AutoFDO/create_gcov are known to exist. I referred to https://blog.wnohang.net/index.php/2015/04/29/feedback-directed-optimization-with-gcc-and-perf/ for detailed information on this alternative profiling method.
-fprofile-use and -fauto-profile both enable many optimization options by default, in my case the unwanted -funroll-loops, which I knew had a negative impact on performance in my application. If you're the pedantic type, you can test option combinations by including the disabling counterpart in the compile flags, in my case -fno-unroll-loops.
Using internal profiling with my program (after stripping the binary) reduced the size by 25% compared to the original stripped, non-profiled binary. However, I only observed sub-percent performance gains, and the work-output fluctuations reported by the program log (it's a cryptocurrency miner) became more erratic, instead of the gradual rise and fall between peaks and troughs in hash rate that I saw originally.
Overall, a stab in the dark.
So I want to distribute my GCC-compiled application with backtrace logging for critical errors. Yet it is quite a performance-critical application, so I wonder whether the -g and -rdynamic gcc flags slow down execution (and, if so, whether they do so a lot)? I would also like to give my users maximum performance, so I compile with optimization flags like "-flto" and "-mtune", and that makes me wonder whether the flags would conflict and whether the resulting backtraces would be madness?
Although introducing debug symbols does not affect performance by itself, your application can still end up far behind in terms of possible performance. What I mean by that is that, in general, it would be a bad idea to use -g and -O3 simultaneously. Therefore, if your application is performance-critical but at the same time really needs a good level of debuggability, it would be reasonable to find some balance between the two. The latest versions of GCC provide the -Og flag for this:
Optimize debugging experience. -Og enables optimizations that do not
interfere with debugging. It should be the optimization level of
choice for the standard edit-compile-debug cycle, offering a
reasonable level of optimization while maintaining fast compilation
and a good debugging experience.
I think it would be a good idea to test your application with this flag, to see whether the performance is indeed better than with bare -g while the debugging experience stays intact.
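For example (a sketch; substitute your own sources and output name):
g++ -Og -g myapp.cpp -o myapp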
Once again, do not neglect reading the official GCC documentation. LTO is a relatively new feature in GCC and, as a result, some of its parts are still experimental and not meant for production. For example, here is a direct extract:
Link-time optimization does not work well with generation of debugging
information. Combining -flto with -g is currently experimental and
expected to produce wrong results.
Not so long ago I had mixed experience with LTO. Sometimes it works well, sometimes the project doesn't even compile, not to mention that there could also be subtle runtime issues. Summarizing all of it, I would not recommend using LTO, especially in your situation.
NOTE: Performance gain from LTO usually varies from 0% to 3%, and it heavily depends on the underlying application. Without profiling, you cannot tell whether it is even reasonable to employ LTO for your situation as it might deliver more troubles than benefits.
Flags like -march and -mtune usually perform optimizations at a very low level - the instruction level for the target processor architecture. Thus, I wouldn't expect them to interfere with debugging. Nevertheless, you are welcome to test this yourself with your application.
-g has no impact whatsoever on performance. -rdynamic will increase the size of the dynamic symbol table in the main executable, which might slow down dynamic linking. My best guess is that the slow-down will be very small but possibly measurable (nonzero) with precise measurement/profiling tools.
My project currently has a library that is statically linked (compiled with gcc and archived with ar), and I am trying to profile my entire project with gprof, so I would also like to profile this statically linked library. Is there any way of going about doing this?
Gprof requires that you provide -pg to GCC for compilation and -pg to the linker. However, ar complains when -pg is added to the list of flags for it.
I haven't used gprof in a long time, but is -pg even a valid argument to ar? Does profiling work if you compile all of the objects with -pg, then create your archive without -pg?
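As a sketch of that approach (file names made up): compile the objects with -pg, archive them without any profiling flag, and pass -pg again at link time; running the program then writes gmon.out for gprof to read.
gcc -pg -c foo.c bar.c
ar rcs libfoo.a foo.o bar.o
gcc -pg main.c -L. -lfoo -o main
./main
gprof ./main gmon.out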
If you can't get gprof to work, gperftools contains a CPU profiler which I think should work very well in this case. You don't need to compile your application with any special flags, and you don't need to try to change how your static library is linked.
Before starting, there are two tradeoffs involved with using gperftools that you should be aware of:
gperftools is a sampling profiler. As such, your results won't be 100%
accurate, but they should be really good. The big upside to using a
sampling profiler is that it won't really slow your application down.
In multithreaded applications, in my experience, gperftools will only
profile the main thread. The only way I've been able to successfully
profile worker threads is by adding profiling code to my application.
With that said, profiling the main thread shouldn't require any code
changes.
There are lots of different ways to use gperftools. My preferred way is to load the gperftools library with $LD_PRELOAD, specify a logging destination with $CPUPROFILE, and maybe bump up the sample frequency with $CPUPROFILE_FREQUENCY before starting my application up. Something like this:
export LD_PRELOAD=/usr/lib/libprofiler.so
export CPUPROFILE=/tmp/prof.out
export CPUPROFILE_FREQUENCY=10000
./my_application
This will write a bunch of profiling information to /tmp/prof.out. You can run a post-processing script to convert this file into something human readable. There are lots of supported output formats -- my preferred one is callgrind:
google-pprof --callgrind /path/to/my_application /tmp/prof.out > callgrind.dat
kcachegrind callgrind.dat &
This should provide a nice view of where your program is spending its time.
If you're interested, I spent some time over the weekend learning how to use gperftools to profile I/O bound applications, and I documented a lot of my findings here. There's a lot of overlap with what you're trying to do, so maybe it will be helpful.
test.c:
int main()
{
return 0;
}
I haven't used any flags (I am a newbie to gcc), just the command:
gcc test.c
I have used the latest TDM build of GCC on win32.
The resulting executable is almost 23KB, way too big for an empty program.
How can I reduce the size of the executable?
Don't follow its suggestions, but for amusement's sake, read this 'story' about making the smallest possible ELF binary.
How can I reduce its size?
Don't. You're just wasting your time.
Use the -s flag to strip symbols (gcc -s).
By default some standard libraries (e.g. the C runtime) are linked with your executable. Check out the -nostdlib, -nostartfiles and -nodefaultlibs options for details. Link options are described here.
For a real program, a second option is to try the optimization options, e.g. -Os (optimize for size); see the example command below.
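For example, combining both suggestions (results vary by toolchain):
gcc -Os -s test.c -o test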
Give up. On x86 Linux, gcc 4.3.2 produces a 5K binary. But wait! That's with dynamic linking! The statically linked binary is over half a meg: 516K. Relax and learn to live with the bloat.
And they said Modula-3 would never go anywhere because of a 200K hello world binary!
In case you wonder what's going on, the GNU C library is structured so as to include certain features whether your program depends on them or not. These features include such trivia as malloc and free, dlopen, some string processing, and a whole bucketload of stuff that appears to have to do with locales and internationalization, although I can't find any relevant man pages.
Creating small executables for programs that require minimum services is not a design goal for glibc. To be fair, it has also been not a design goal for every run-time system I've ever worked with (about half a dozen).
Actually, if your code does nothing, is it even fair that the compiler still creates an executable? ;-)
Well, on Windows any executable would still have a size, although it can be reasonably small. With the old MS-DOS system, a complete do-nothing application would be just a couple of bytes. (I think four bytes, to use the 21h interrupt to close the program.) Then again, those applications were loaded straight into memory.
When the EXE format became more popular, things changed a bit. Now executables had additional information about the process itself, like the relocation of code and data segments plus some checksums and version information.
The introduction of Windows added another header to the format, to tell MS-DOS that it couldn't execute the executable since it needed to run under Windows. And Windows would recognize it without problems.
Of course, the executable format was also extended with resource information, like bitmaps, icons and dialog forms and much, much more.
A do-nothing executable would nowadays be between 4 and 8 kilobytes in size, depending on your compiler and every method you've used to reduce its size. It would be at a size where UPX would actually result in bigger executables! Additional bytes in your executable might be added because you added certain libraries to your code. Especially libraries with initialized data or resources will add a considerable number of bytes. Adding debug information also increases the size of the executable.
But while this all makes a nice exercise in reducing size, you could wonder whether it's practical to keep worrying about the bloatedness of applications. Modern hard disks divide files up into segments, and for really large disks the difference is very small. However, the amount of trouble it takes to keep the size as small as possible will slow down development speed, unless you're an expert developer who is used to these optimizations. These kinds of optimizations don't tend to improve performance, and considering the average disk space of most systems, I don't see why it would be practical. (Still, I do optimize my own code in similar ways, but then again, I am experienced with these optimizations.)
Interested in the EXE header? It starts with the letters MZ, for "Mark Zbikowski". The first part is the old-style MS-DOS header for executables and is used as a stub telling MS-DOS that the program is not an MS-DOS executable. (In the binary, you can find the text 'This program cannot be run in DOS mode.', which is basically all it does: display that message.) Next is the PE header, which Windows will recognise and use instead of the MS-DOS header. It starts with the letters PE, for Portable Executable. After this second header comes the executable itself, divided into several blocks of code and data. The header contains special relocation tables which tell the OS where to load a specific block. And if you can keep this to a limit, the final executable can be smaller than 4 KB, but 90% of it would then be header information with no functionality.
I like the way the DJGPP FAQ addressed this many many years ago:
In general, judging code sizes by looking at the size of "Hello" programs is meaningless, because such programs consist mostly of the startup code. ... Most of the power of all these features goes wasted in "Hello" programs. There is no point in running all that code just to print a 15-byte string and exit.
What is the purpose of this exercise?
Even with as low-level a language as C, there's still a lot of setup that has to happen before main can be called. Some of that setup is handled by the loader (which needs certain information), some is handled by the code that calls main. And then there's probably a little bit of library code that any normal program would have to have. At the least, there are probably references to the standard libraries, if they are in DLLs.
Examining the binary size of the empty program is a worthless exercise in and of itself. It tells you nothing. If you want to learn something about code size, try writing non-empty (and preferably non-trivial) programs. Compare programs that use standard libraries with programs that do everything themselves.
If you really want to know what's going on in that binary (and why it's so big), then find out the executable format, get a binary dump tool, and take the thing apart.
What does 'size a.out' tell you about the size of the code, data, and bss segments? The majority of the code is likely to be the start up code (classically crt0.o on Unix machines) which is invoked by the o/s and does set up work (like sorting out command line arguments into argc, argv) before invoking main().
Run strip on the binary to get rid of the symbols. With gcc version 3.4.4 (cygming special) I drop from 10k to 4K.
You can try linking a custom runtime (the part that calls main) to set up your runtime environment. All programs use the same runtime setup that comes with gcc, but for your executable you don't need initialized data or zeroed memory. That means you could get rid of unused library functions like memset/memcpy and reduce the CRT0 size. When looking for information on this, look at GCC in embedded environments; embedded developers are generally the only people who use custom runtime environments.
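As a very rough sketch of the idea (Linux x86-64 only, hypothetical file name; a real embedded runtime replacement is more involved), you can drop the standard startup files and provide your own _start that simply exits via the raw system call:
/* tiny.c - no CRT and no libc; just exit(0) via the x86-64 exit syscall */
void _start(void) {
    __asm__ volatile(
        "mov $60, %%rax\n\t"   /* syscall number for exit */
        "xor %%rdi, %%rdi\n\t" /* exit status 0 */
        "syscall"
        : : : "rax", "rdi");
}
gcc -nostdlib -nostartfiles -static -Os tiny.c -o tiny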
The rest is overhead for the OS that loads the executable. You are not going to save much there unless you tune it by hand.
Using GCC, compile your program using -Os rather than one of the other optimization flags (-O2 or -O3). This tells it to optimize for size rather than speed. Incidentally, it can sometimes make programs run faster than the speed optimizations would have, if some critical segment happens to fit more nicely. On the other hand, -O3 can actually induce code-size increases.
There might also be some linker flags telling it to leave out unused code from the final binary.
I really hate using STL containers because they make the debug version of my code run really slowly. What do other people use instead of STL that has reasonable performance for debug builds?
I'm a game programmer and this has been a problem on many of the projects I've worked on. It's pretty hard to get 60 fps when you use STL containers for everything.
I use MSVC for most of my work.
EASTL is a possibility, but still not perfect. Paul Pedriana of Electronic Arts did an investigation of various STL implementations with respect to performance in game applications, the summary of which is found here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2271.html
Some of these adjustments are being reviewed for inclusion in the C++ standard.
And note that even EASTL doesn't optimize for the non-optimized case. I had an Excel file with some timings a while back but I think I've lost it; for element access it was something like:
            debug  release
STL           100       10
EASTL          10        3
array[i]        3        1
The most success I've had was rolling my own containers. You can get those down to near array[x] performance.
My experience is that well-designed STL code runs slowly in debug builds because the optimizer is turned off. STL containers emit a lot of calls to constructors and operator=, which (if they are lightweight) get inlined or removed in release builds.
Also, Visual C++ 2005 and up has checking enabled for STL in both release and debug builds. It is a huge performance hog for STL-heavy software. It can be disabled by defining _SECURE_SCL=0 for all your compilation units. Please note that having different _SECURE_SCL status in different compilation units will almost certainly lead to disaster.
You could create a third build configuration with checking turned off and use that to debug with performance. I recommend you keep a debug configuration with checking on, though, since it's very helpful for catching erroneous array indices and things like that.
If you're running Visual Studio, you may want to consider the following:
#define _SECURE_SCL 0
#define _HAS_ITERATOR_DEBUGGING 0
That's just for iterators; what type of STL operations are you performing? You may want to look at optimizing your memory operations, i.e., using resize() to make room for several elements at once instead of using push/pop to insert elements one at a time.
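For instance (a sketch; whether it helps depends on your access pattern), sizing the vector once up front and assigning by index avoids the repeated reallocation that inserting one element at a time can trigger:
#include <cstddef>
#include <vector>

int main() {
    std::vector<float> v;
    v.resize(1024);                            // one allocation, elements value-initialized
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = static_cast<float>(i) * 0.5f;   // no reallocation inside the loop
    return 0;
}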
For big, performance critical applications, building your own containers specifically tailored to your needs may be worth the time investment.
I'm talking about real game development here.
I'll bet your STL uses a checked implementation for debug. This is probably a good thing, as it will catch iterator overruns and such. If it's that much of a problem for you, there may be a compiler switch to turn it off. Check your docs.
If you're using Visual C++, then you should have a look at this:
http://channel9.msdn.com/shows/Going+Deep/STL-Iterator-Debugging-and-Secure-SCL/
and the links from that page, which cover the various costs and options of all the debug-mode checking which the MS/Dinkumware STL does.
If you're going to ask such a platform dependent question, it would be a good idea to mention your platform, too...
Check out EASTL.
MSVC uses a very heavyweight implementation of checked iterators in debug builds, which others have already discussed, so I won't repeat it (but start there)
One other thing that might be of interest to you is that your "debug build" and "release build" probably involve changing (at least) four settings which are only loosely related.
Generating a .pdb file (cl /Zi and link /DEBUG), which allows symbolic debugging. You may want to add /OPT:ref to the linker options; the linker drops unreferenced functions when not making a .pdb file, but with /DEBUG mode it keeps them all (since the debug symbols reference them) unless you add this explicitly.
Using a debug version of the C runtime library (probably MSVCR*D.dll, but it depends on what runtime you're using). This boils down to /MT or /MTd (or something else if not using the dll runtime)
Turning off the compiler optimizations (/Od)
setting the preprocessor #defines DEBUG or NDEBUG
These can be switched independently. The first costs nothing in runtime performance, though it adds size. The second makes a number of functions more expensive, but has a huge impact on malloc and free; the debug runtime versions are careful to "poison" the memory they touch with values that make uninitialized-data bugs obvious. I believe with the MSVCP* STL implementations it also eliminates all the allocation pooling that is usually done, so that a leak shows exactly the block you'd expect and not some larger chunk of memory that it's been sub-allocating; that means more calls to malloc, on top of them being much slower. The third, well, that one does lots of things (this question has some good discussion of the subject). Unfortunately, it's needed if you want single-stepping to work smoothly. The fourth affects lots of libraries in various ways, but most notably it compiles in or eliminates assert() and friends.
So you might consider making a build with some lesser combination of these selections. I make a lot of use of builds that have symbols (/Zi and link /DEBUG) and asserts (/DDEBUG), but are still optimized (/O1 or /O2 or whatever flags you use), with stack frame pointers kept for clear backtraces (/Oy-) and using the normal runtime library (/MT). This performs close to my release build and is semi-debuggable (backtraces are fine; single-stepping is a bit wacky at the source level, though assembly level works fine of course). You can have however many configurations you want; just clone your release one and turn on whatever parts of the debugging seem useful.
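As a concrete (hypothetical) example, such a hybrid configuration might look like this on the MSVC command line:
cl /O2 /Zi /Oy- /MT /DDEBUG myapp.cpp /link /DEBUG /OPT:REF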
Sorry, I can't leave a comment, so here's an answer: EASTL is now available at github: https://github.com/paulhodge/EASTL
Ultimate++ has its own set of containers - not sure if you can use them separately from the rest of the library: http://www.ultimatepp.org/
What about the ACE library? It's an open-source object-oriented framework for concurrent communication software, but it also has some container classes.
Check out Data Structures and Algorithms with Object-Oriented Design Patterns in C++
By Bruno Preiss
http://www.brpreiss.com/
Qt has reimplemented most C++ standard library stuff with different interfaces. It looks pretty good, but the commercially licensed version can be expensive.
Edit: Qt has since been released under the LGPL, which usually makes it possible to use it in a commercial product without buying the commercial version (which also still exists).
STL containers should not run "really slowly" in debug or anywhere else. Perhaps you're misusing them. You're not running against something like ElectricFence or Valgrind in debug, are you? They slow down anything that does lots of allocations.
All the containers can use custom allocators, which some people use to improve performance - but I've never needed to use them myself.
There is also the ETL, https://www.etlcpp.com/. This library is aimed especially at time-critical (deterministic) applications.
From the webpage:
The ETL is not designed to completely replace the STL, but complement
it. Its design objective covers four main areas.
Create a set of containers where the size or maximum size is determined at compile time. These containers should be largely
equivalent to those supplied in the STL, with a compatible API.
Be compatible with C++ 03 but implement as many of the C++ 11 additions as possible.
Have deterministic behaviour.
Add other useful components that are not present in the standard library.
The embedded template library has been designed for lower resource
embedded applications. It defines a set of containers, algorithms and
utilities, some of which emulate parts of the STL. There is no dynamic
memory allocation. The library makes no use of the heap. All of the
containers (apart from intrusive types) have a fixed capacity allowing
all memory allocation to be determined at compile time. The library is
intended for any compiler that supports C++03.
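A minimal sketch of what that looks like in code (assuming the library's etl::vector header; the capacity is a template parameter, so nothing touches the heap):
#include <etl/vector.h>

int main() {
    etl::vector<int, 8> v;      // fixed capacity of 8, no dynamic allocation
    for (int i = 0; i < 8; ++i)
        v.push_back(i);
    return v.size() == 8 ? 0 : 1;
}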