My project currently has a statically linked library (compiled with gcc and archived with ar), and I am trying to profile the whole project with gprof, including this library. Is there any way to do this?
Gprof requires that you provide -pg to GCC for compilation and -pg to the linker. However, ar complains when -pg is added to the list of flags for it.
I haven't used gprof in a long time, but is -pg even a valid argument to ar? Does profiling work if you compile all of the objects with -pg, then create your archive without -pg?
If you can't get gprof to work, gperftools contains a CPU profiler which I think should work very well in this case. You don't need to compile your application with any special flags, and you don't need to try to change how your static library is linked.
Before starting, there are two tradeoffs involved with using gperftools that you should be aware of:
gperftools is a sampling profiler. As such, your results won't be 100%
accurate, but they should be really good. The big upside to using a
sampling profiler is that it won't really slow your application down.
In multithreaded applications, in my experience, gperftools will only
profile the main thread. The only way I've been able to successfully
profile worker threads is by adding profiling code to my application.
With that said, profiling the main thread shouldn't require any code
changes.
There are lots of different ways to use gperftools. My preferred way is to load the gperftools library with $LD_PRELOAD, specify a logging destination with $CPUPROFILE, and maybe bump up the sample frequency with $CPUPROFILE_FREQUENCY before starting my application up. Something like this:
export LD_PRELOAD=/usr/lib/libprofiler.so
export CPUPROFILE=/tmp/prof.out
export CPUPROFILE_FREQUENCY=10000
./my_application
This will write a bunch of profiling information to /tmp/prof.out. You can run a post-processing script to convert this file into something human readable. There are lots of supported output formats -- my preferred one is callgrind:
google-pprof --callgrind /path/to/my_application /tmp/prof.out > callgrind.dat
kcachegrind callgrind.dat &
This should provide a nice view of where your program is spending its time.
If you're interested, I spent some time over the weekend learning how to use gperftools to profile I/O bound applications, and I documented a lot of my findings here. There's a lot of overlap with what you're trying to do, so maybe it will be helpful.
I have a build system that uses the long-standing LTO support in clang via the -flto flag.
The ThinLTO support added to LLVM (https://clang.llvm.org/docs/ThinLTO.html) looks interesting, but I'm a little puzzled about the decision to launch std::thread::hardware_concurrency parallel processing threads in the context of a build system that already runs concurrent jobs.
If you have a build system that is already launching a thread per core and running a mix of compile and link jobs, does it still make sense for the linker to assume that it should use all cores, or even more than one?
Or does it make sense instead to reduce ThinLTO's backend parallelism to 1 with the flags documented at https://clang.llvm.org/docs/ThinLTO.html#controlling-backend-parallelism? Are there any advantages to ThinLTO over plain old LTO when the parallelism has been removed?
ThinLTO can actually greatly improve build times for large projects, among its other benefits. The cache is not designed only for incremental builds: it's part and parcel of how the multi-threaded link stage works and is meant to speed up symbol lookups. How much ThinLTO shortens your build times depends on your project and build system.
I found a very good video that goes over some details of the design of ThinLTO, its use cases, and some ways it has been implemented successfully:
https://www.youtube.com/watch?v=p9nH2vZ2mNo&list=WL&index=51&t=2812s
The corresponding Google Research paper is also a very interesting (if heavy) read:
https://research.google/pubs/pub47584/
For a lighter and more casual take, this blog post was also helpful:
http://blog.llvm.org/2016/06/thinlto-scalable-and-incremental-lto.html
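For the backend-parallelism question specifically, here is a rough sketch of how a build system might cap ThinLTO at one backend thread per link job. It assumes clang with lld as the linker; --thinlto-jobs is the lld spelling from the ThinLTO page linked in the question, and other linkers use different flags. The THINLTO_JOBS variable is just a name I made up.

```shell
# Hypothetical build-system snippet: one ThinLTO backend thread per link job,
# leaving parallelism to the build system's own job scheduler.
THINLTO_JOBS=1
LDFLAGS="-flto=thin -fuse-ld=lld -Wl,--thinlto-jobs=${THINLTO_JOBS}"
export LDFLAGS

# The link step would then look something like:
#   clang++ $LDFLAGS main.o util.o -o app
echo "$LDFLAGS"
```

Whether 1 is the right value depends on how many link jobs your build actually runs at once; measuring a few settings is probably more reliable than guessing.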
How do I figure out which part of my D code takes a long time to compile?
I tried to use valgrind, but the method names were not very insightful: 87% of the time was spent in <cycle 7>, and 40% of the time in _D4ddmd5lexer5Lexer4scanMFPS4ddmd6tokens5TokenZv
I'm looking for something like this: 40% of the time was spent on xy.d; of that, 80% went to compiling various instantiations of template xyz, and the reason is that it used memcpy 99% of the time.
I'm interested in profiling both DMD and LDC.
As the D compiler front end is written in D, profiling it with conventional tools will be rather hard compared to something like C++. I have had some success using tools like gdb and valgrind on Linux and tools like VisualD on Windows; Mac users are kind of SOL.
You have five other options:
Stop trying to find the specific function in the compiler and turn to common knowledge about the problem (see below)
Use a tool like https://github.com/CyberShadow/DBuildStat. It doesn't give you the exact answer you're asking about, but if you're trying to get a large project to compile faster it's better than nothing.
Use the -v flag to try and see which parts of your program take a while. Granted, this is a very brute force approach and can take you a while.
Modify the makefile of the DMD front end to use the -profile switch. Every time you run DMD you will get a profile file with a lot of information. Granted, I don't think this has ever been tried. Your mileage may vary.
Try to ask the LDC team about this on their Github issues page. IIRC they made a patched version for profiling that they used for the Weka.io codebase.
When I say turn to common knowledge, I mean that your slow compilation is likely due to a few common problems. For example, when an SQL query is taking too long, my first reaction is not to try to profile the MySQL server code. Here are a couple of the most common issues:
CTFE, while it speeds up your runtime, is slow, especially if you're doing recursive templates like allSatisfy or you're using functions like ctRegex. If you're doing heavy CTFE and you want faster compiles at the price of possibly slower code, consider switching them to run-time calls.
DMD doesn't (yet) ignore symbols that aren't used in your program, meaning that if you import a module, code-gen will happen for all of the functions in the module. This is true even for selective imports. If you don't use them, the linker will prune the functions from the resulting executable, but the compiler still took time to compile them. Avoid package-wide imports like import std.algorithm; or import std.range;. Instead, import from the specific submodule you need, like import std.algorithm.iteration : map;.
Situation
I have a program (written in Fortran) which consists of:
A set of core routines, used every time the program is run.
A large collection of alternate routines, only one of which is used for each run, selected by the user at the start.
The user may reasonably select different alternatives for subsequent runs.
Most of the building time is spent compiling the alternatives, which is frustrating when I know only one will be used each time. Most of the run time is spent in the alternative routine, which is short but called many times.
Idea
Compile all the core routines to a native executable and all the alternatives to an LLVM bitcode library. At runtime, only the selected alternative is automatically compiled and linked. This would hopefully save a lot of build time and slow down the run only marginally.
Questions
Is this even possible and if so, how?
Is it a good idea? Are there better ways of achieving a similar result?
What is the difference between -fprofile-use and -fauto-profile?
Here's what the docs say:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
-fprofile-use
-fprofile-use=path
Enable profile feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
If path is specified, GCC looks at the path to find the profile feedback data files. See -fprofile-dir.
and underneath that
-fauto-profile
-fauto-profile=path
Enable sampling-based feedback-directed optimizations, and the following optimizations which are generally profitable only with profile feedback available: [...]
path is the name of a file containing AutoFDO profile information. If omitted, it defaults to fbdata.afdo in the current directory.
(The list of optimizations in the [...] for -fauto-profile is longer.)
I stumbled into this thread by a path I can't even remember and am learning this stuff as I go along. But I don't like seeing an unanswered question if I could learn something from it! So I got reading.
Feedback-Directed Optimisation
As the GCC docs say, both of these are modes of applying Feedback-Directed Optimisation. By running the program and profiling what it does, how it does it, how long it spends in which functions, etc., we can enable extra, directed optimisations based on the resulting data. Results from the profiler are 'fed forward' to the optimiser. Next, presumably, you can take your profile-optimised binary and profile that, then compile another FDO'd version, and so on... hence the feedback part of the name.
The real answer, the difference between these two switches, isn't very clearly documented, but it's available if we just look a little further.
-fprofile-use
Firstly, your quote for -fprofile-use only really states that it requires -fprofile-generate, an option that isn't very well documented: the reference from -use just tells you to read the page you're already on, where in all cases, -generate is only mentioned but never defined. Useful! But! We can refer to the answers to this question: How to use profile guided optimizations in g++?
As that answer states, and the piece of GCC's documentation in question here gently indicates... -fprofile-generate causes instrumentation to be added to the output binary. As that page summarises, an instrumented executable has stuff added to facilitate extra checks or insights during its runtime.
(The other form of instrumentation I know - and the one I've used - is the compiler add-on library UBSan, which I use via GCC's -fsanitize=undefined option. This catches bits of Undefined Behaviour at runtime. GCC with this on has revealed UB I might've otherwise taken ages to find - and made me wonder how my programs ran at all! Clang can use this library too, and maybe other compilers.)
-fauto-profile
In contrast, -fauto-profile is different. The key distinction is hinted, if not clearly, in the synopsis you quoted for it:
path is the name of a file containing AutoFDO profile information.
This mode handles profiling and subsequent optimisations using AutoFDO. To Google we go: AutoFDO. The first few lines don't explain this as succinctly as possible, and I think the best summary is buried rather far down the page:
The major difference between AutoFDO [-fauto-profile] and FDO [-fprofile-use] is that AutoFDO profiles on optimized binary instead of instrumented binary. This makes it very different in handling cloned functions.
How does it do this? -fauto-profile requires you to provide profiling files written out by the Linux kernel's profiler, Perf, converted to the AutoFDO format. Perf, rather than adding instrumentation, uses hardware features of the CPU and kernel-level features of the OS to profile various statistics about a program while it's running:
perf is powerful: it can instrument CPU performance counters, tracepoints, kprobes, and uprobes (dynamic tracing). It is capable of lightweight profiling. [...] Performance counters are CPU hardware registers that count hardware events such as instructions executed, cache-misses suffered, or branches mispredicted. They form a basis for profiling applications to trace dynamic control flow and identify hotspots.
So, that lets it profile an optimised program, rather than an instrumented one. We might reasonably presume this is more representative of how your program would react in the real world - and so can facilitate gathering more useful profiling data and applying more effective optimisations as a result.
An example of how to do the legwork of tying all this together and getting -fauto-profile to do something with your program is summarised here: Feedback directed optimization with GCC and Perf
(Maybe now that I learned all this, I'll try these options out some day!)
underscore_d gives an in-depth insight into the differences.
Here is my take on it.
Internal profiling means compiling initially with -fprofile-generate, which builds the profiler into the binary for the data collection run. Execute the binary for 10 minutes, or however long you think covers enough activity for the profiler to record. Then recompile with -fprofile-use instead, adding -fprofile-correction if it is a multi-threaded application. Running with the internal profiler causes a significant performance hit (25% in my case), which does not reflect the real-world behavior of the binary without the profiler and so could result in less accurate profiling. But if all activity in the profiler-enabled binary scales with the performance penalty, I guess it should not matter.
Alternatively, you can use the perf tool (more error-prone and more effort), which is specific to your kernel (the kernel may also need to be built with support for profiling, tracing, etc.), to create the profiling data. This could be considered external profiling, and it has negligible impact on application performance while profiling. You run it on the binary that you compile normally. I cannot find any studies comparing the two.
perf record -e br_inst_retired:near_taken -b -o perf.data *your_program.unstripped -program -parameters*
then without stripping the binary, convert the profiling data into something GCC understands...
create_gcov --binary=your_program.unstripped --profile=perf.data --gcov=profile.afdo
Then recompile the application with -fauto-profile=profile.afdo (without a path, GCC looks for fbdata.afdo in the current directory). Issues specific to particular versions of perf and AutoFDO/create_gcov are known to exist. I referred to https://blog.wnohang.net/index.php/2015/04/29/feedback-directed-optimization-with-gcc-and-perf/ for detailed information on this alternative profiling method.
-fprofile-use and -fauto-profile both enable many optimization options by default. In my case that included the unwanted -funroll-loops, which I knew had a negative impact on performance in my application. If you're the pedantic type, you can test option combinations by including the disabling counterpart in the compile flags, in my case -fno-unroll-loops.
With internal profiling, the stripped binary was 25% smaller than the original stripped binary built without the profiler. However, I only observed sub-percent performance gains, and the work output fluctuations reported in the program log (it's a cryptocurrency miner) became more erratic, instead of rising and falling gradually between peaks and troughs in hash rate as before.
Overall, a stab in the dark.
I'm thinking about adding code to my application that would gather diagnostic information for later examination. Is there any C++ library created for such purpose? What I'm trying to do is similar to profiling, but it's not the same, because gathered data will be used more for debugging than profiling.
EDIT:
Platform: Linux
Diagnostic information to gather: information resulting from application logic, various asserts and statistics.
You might also want to check out libcwd:
Libcwd is a thread-safe, full-featured debugging support library for C++
developers. It includes ostream-based debug output with custom debug
channels and devices, powerful memory allocation debugging support, as well
as run-time support for printing source file:line number information
and demangled type names.
List of features
Tutorial
Quick Reference
Reference Manual
Also, another interesting logging library is pantheios:
Pantheios is an Open Source C/C++ Logging API library, offering an
optimal combination of 100% type-safety, efficiency, genericity
and extensibility. It is simple to use and extend, highly-portable (platform
and compiler-independent) and, best of all, it upholds the C tradition of you
only pay for what you use.
I tend to use logging for this purpose. Log4cxx works like a charm.
If debugging is what you're doing, perhaps use a debugger. GDB scripts are pretty easy to write up and use. Maintaining them in parallel to your code might be challenging.
Edit - Appending Anecdote:
The software I maintain includes a home-grown instrumentation system. Macros are used to queue log messages, and configuration options control which classes of messages are logged and at what level of detail. A thread processes the logging queue, flushing messages to file and rotating files as they become too large (which they commonly do). The system provides a lot of detail, but all too often it produces huge files that our support engineers must wade through for hours to find anything useful.
Now, I've only used GDB to diagnose bugs a few times, but for those issues it had a few nice advantages over the logging system. GDB scripting allowed me to gather new instrumentation data without adding new instrumentation lines and deploying a new build of my software to the client. GDB can generate messages from third-party libraries (needed to debug into openssl at one point). GDB adds no run-time impact to the software when not in use. GDB does a pretty good job of printing the contents of objects; the code-level logging system requires new macros to be written when new objects need to have their states logged.
One of the drawbacks was that the gdb scripts I generated had no explicit relationship to the source code; the source file and the gdb script were developed independently. Ideally, changes to the source file should impact and update the gdb script. One thought is to put specially-formatted comments in code and have a scripting language make a pass on the source files to generate the debugger script file for the source file. Finally, have the makefile execute this script during the build cycle.
It's a fun exercise to think about the potential of using GDB for this purpose, but I must admit that there are probably better code-level solutions out there.
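To make the comment-driven idea above a bit more concrete, here is a minimal sketch using awk. The //@gdb-log marker syntax and the file names are entirely made up; a real version would need to handle more cases (multiple expressions per line, conditions, and so on).

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical source file; each "//@gdb-log <expr>" marker asks for <expr>
# to be printed whenever execution reaches that line.
cat > worker.cpp <<'EOF'
int process(int count) {
    int total = count * 2;  //@gdb-log count
    return total;           //@gdb-log total
}
EOF

# For every marker, emit a breakpoint at file:line that prints the
# expression and then resumes the program without stopping.
awk '
match($0, /\/\/@gdb-log +/) {
    expr = substr($0, RSTART + RLENGTH)
    printf "break %s:%d\ncommands\nsilent\nprint %s\ncontinue\nend\n", FILENAME, FNR, expr
}' worker.cpp > worker.gdb

cat worker.gdb
```

The generated file could then be loaded with something like gdb -x worker.gdb ./my_application, and a makefile rule could regenerate it whenever the source changes.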
If you run your application on Linux, you can use "ulimit" to have a core file generated when your application crashes (or hits assert(false), or receives kill -6). Later, you can debug it with gdb (gdb -c core_file binary_file) and analyze the stack.
Regards.
P.S. For profiling, use gprof.
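A small sketch of the ulimit step mentioned above. Raising the soft limit to whatever the hard limit allows avoids an error on systems where "unlimited" is not permitted; binary_file and core_file are placeholders, and where the core file ends up depends on your system's /proc/sys/kernel/core_pattern setting.

```shell
# Raise the soft core-size limit to the hard limit for this shell session.
ulimit -S -c "$(ulimit -H -c)"

# Show the limit now in effect (0 would mean core dumps are still disabled).
ulimit -S -c

# Then, in the same shell:
#   ./binary_file                    # crashes and leaves a core file
#   gdb -c core_file binary_file     # load the core, then type "bt"
```

The limit only applies to the current shell and the processes it starts, so it has to be set in the same session that launches the application.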