How can I benchmark the performance of C++ code? [closed]

I am starting to study algorithms and data structures seriously, and I am interested in learning how to compare the performance of the different ways I can implement them.
For simple tests, I can take the time before and after something runs, run that thing 10^5 times, and average the running times. I can parametrize the input by size, or sample random inputs, and get a list of running times vs. input size. I can output that as a CSV file and feed it into pandas.
I am not sure what caveats I might be missing. I am also not sure what to do about measuring space complexity.
I am learning to program in C++. Are there humane tools to achieve what I am trying to do?

Benchmarking code is not easy. What I found most useful was the Google Benchmark library. Even if you are not planning to use it, it is worth reading through some of its examples. It has many options for parametrizing tests, writing results to a file, and even estimating the Big-O complexity of your algorithm (to name just a few of them). If you are at all familiar with the Google Test framework, I would recommend using it. It also lets you keep compiler optimizations under control, so you can be sure that your code wasn't optimized away.
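For instance, a minimal sketch of a parametrized benchmark with Big-O estimation might look like the following; the BM_SortVector function and its sorting workload are placeholders for whatever you actually want to measure, not part of the library:

    #include <algorithm>
    #include <random>
    #include <vector>
    #include <benchmark/benchmark.h>

    // Hypothetical benchmark: sort a random vector whose size comes from the range argument.
    static void BM_SortVector(benchmark::State& state) {
        std::mt19937 rng(42);
        std::vector<int> data(state.range(0));
        for (auto _ : state) {
            state.PauseTiming();   // keep data generation out of the measurement
            std::generate(data.begin(), data.end(), [&] { return static_cast<int>(rng()); });
            state.ResumeTiming();
            std::sort(data.begin(), data.end());
            benchmark::DoNotOptimize(data.data());   // keep the result from being optimized away
        }
        state.SetComplexityN(state.range(0));        // lets the library fit a Big-O curve
    }
    BENCHMARK(BM_SortVector)->RangeMultiplier(2)->Range(1 << 10, 1 << 18)->Complexity();

    BENCHMARK_MAIN();

Running the resulting binary prints per-size timings and an estimated complexity. Note that PauseTiming/ResumeTiming add some overhead of their own, so for very fast loop bodies you may prefer to generate the data outside the timed loop.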
There is also a great talk about benchmarking code from CppCon 2015: Chandler Carruth, "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!". It covers many of the mistakes you can make (and it also uses Google Benchmark).

It is operating system and compiler specific (so implementation specific). You could use profiling tools, you could use timing tools, etc.
On Linux, see time(1), time(7), perf(1), gprof(1), pmap(1), mallinfo(3) and proc(5) and about Invoking GCC.
See also this. In practice, be sure that your runs last long enough (e.g. at least one second of process time).
Be aware that optimizing compilers can drastically transform your program. See Matt Godbolt's CppCon 2017 talk, "What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid".
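If you want to keep the simple before/after timing approach from the question, a minimal sketch with std::chrono could look like this; work() is just a placeholder for whatever you are measuring, and the volatile sink is one crude way to stop the optimizer from deleting the loop:

    #include <chrono>
    #include <cstdint>
    #include <iostream>

    // Placeholder for the code under test.
    static std::uint64_t work(std::uint64_t n) {
        std::uint64_t acc = 0;
        for (std::uint64_t i = 0; i < n; ++i) acc += i * i;
        return acc;
    }

    int main() {
        constexpr int repetitions = 100000;
        volatile std::uint64_t sink = 0;   // consuming the result defeats dead-code elimination

        const auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < repetitions; ++i)
            sink = sink + work(1000);
        const auto stop = std::chrono::steady_clock::now();

        const std::chrono::duration<double, std::nano> per_call =
            (stop - start) / static_cast<double>(repetitions);
        std::cout << "average: " << per_call.count() << " ns per call\n";
    }

Prefer steady_clock over system_clock for measuring intervals, and remember that this measures wall-clock time, including anything else the machine happens to be doing.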

From an architecture point of view, you can also benchmark your C++ code using tools such as Intel Pin or perf. You can use these tools to study how architecture-dependent your code is. For example, you can compile your code at different optimization levels and check the IPC/CPI, cache accesses and load-store accesses. You can even check whether your code is suffering a performance hit due to library functions. These tools are powerful and can give you huge insights into your code.
You can also try disassembling your code, studying where it spends most of its time, and optimizing that. You can look at different techniques for keeping frequently accessed data in the cache and thus maintaining a high hit rate.
Say you realize that your code is heavily dominated by loops. You can run it with different loop bounds and compare the metrics in the two cases. For example, set the loop bound to 100,000 and measure the desired performance metric X, then set the loop bound to 200,000 and measure the metric Y. Now calculate Y - X. This gives you much better insight into the behaviour of the loops, because by subtracting the two metrics you have effectively removed the static effects of the code.
Say you run your code 10 times with different input sizes. You could compute the runtime per unit of input size, sort this new metric in ascending order, remove the first and last values (to drop outliers), and then take the average. Finally, compute the coefficient of variation to understand how the run times behave.
On a side note, more often than not we use the terms 'average' and 'arithmetic mean' carelessly. Look at the metric you plan to average and consider whether a harmonic, arithmetic or geometric mean is appropriate. For example, taking the arithmetic mean of rates gives incorrect answers, and simply taking the arithmetic mean of two events which do not occur equally often in time can also give incorrect results; use a weighted arithmetic mean instead.
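As a small illustration of the trimmed-mean and coefficient-of-variation idea, here is a sketch in C++; the function names are made up for this example, and the trimmed mean assumes at least three samples:

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    // Drop the smallest and largest value, then average the rest (needs >= 3 samples).
    double trimmed_mean(std::vector<double> samples) {
        std::sort(samples.begin(), samples.end());
        const double sum = std::accumulate(samples.begin() + 1, samples.end() - 1, 0.0);
        return sum / static_cast<double>(samples.size() - 2);
    }

    // Coefficient of variation: standard deviation divided by the mean.
    double coefficient_of_variation(const std::vector<double>& samples) {
        const double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
        double sq = 0.0;
        for (double x : samples) sq += (x - mean) * (x - mean);
        return std::sqrt(sq / samples.size()) / mean;
    }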

Related

Manual SIMD code affordability [closed]

We're running a project that is highly computationally intensive, and right now we're letting the compiler do SSE optimizations. However, we're not sure we are getting the best performance for the code.
My question is, I understand, broad, but I can't find many suggestions about this: is writing manual SIMD code affordable or, in other terms, worth the effort?
Affordability here means a rough cost-benefit estimate, for instance speedup / development_time, or any other measure that is reasonable in the context of project development.
To lessen the scope:
we profiled the code and we know the computation is the most heavy part
we have a C++ code that can easily make use of Boost.SIMD and similar libraries
affordability should not take care of code readability, we assume we're confident with SSE/AVX
the code is currently multithreaded (OpenMP)
our compiler is Intel's icc
I quite agree with Paul R, and just wanted to add that, IMO, in most cases intrinsics/asm optimizations are not worth the effort. In most cases those optimizations are marketing driven, i.e. squeezing performance out of a specific platform just to get (in most cases) slightly better numbers.
Nowadays it is almost impossible to gain an order of magnitude in performance just by rewriting your C/C++ code in asm. In most cases it is a matter of memory/cache access and of methods/algorithms (i.e. parallelization), as Paul has noted.
The first thing you should try is to analyze your code with hardware performance counters (with the free perf tool or Intel VTune) and understand the real bottlenecks. For example, memory access during the computation, not the computation itself, is in fact the most common bottleneck. Manual vectorization of such code does not help, since the CPU stalls on memory anyway.
Such analysis is always worth the effort, since you come to understand your code and the CPU architecture better.
The next thing you should try is to optimize your code. There are a variety of methods: better data structures, cache-friendly memory access patterns, better algorithms, etc. For example, the order in which you declare the fields of a structure can have a significant performance impact in some cases, because the structure might have padding holes and occupy two cache lines instead of one. Another example is false sharing, where you ping-pong the same cache line between CPUs; simple cache-line alignment might give you an order of magnitude better performance. A sketch of both issues follows below.
Those optimizations are always worth the effort, since they will benefit your low-level code as well.
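A rough illustration of the layout and false-sharing points above; the field names are invented, the sizes shown are typical rather than guaranteed, and 64 bytes is a common but not universal cache-line size:

    // Poor layout: padding holes can push the struct across two cache lines.
    struct Sample {
        char   flag;    // 1 byte + 7 bytes padding
        double value;   // 8 bytes
        char   tag;     // 1 byte + 7 bytes padding
    };                  // typically 24 bytes

    // Same fields reordered: less padding, better chance of staying in one cache line.
    struct PackedSample {
        double value;
        char   flag;
        char   tag;
    };                  // typically 16 bytes

    // False-sharing fix: give each thread's counter its own cache line.
    struct alignas(64) PerThreadCounter {
        long count = 0;
    };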
Then you should try to help your compiler. For example, by default the compiler vectorizes/unrolls the inner loop, but it might be better to vectorize/unroll the outer loop. You do this with #pragma hints, and sometimes it is worth the effort; one possible spelling is sketched below.
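Since the project already uses OpenMP, the loop-level hints could simply be OpenMP pragmas. This sketch is illustrative (the function and its array layout are made up), not specific to icc:

    // Hint the compiler: parallelize across rows, vectorize the inner loop.
    void scale_rows(float* a, const float* s, int rows, int cols) {
        #pragma omp parallel for
        for (int r = 0; r < rows; ++r) {
            #pragma omp simd
            for (int c = 0; c < cols; ++c)
                a[r * cols + c] *= s[r];
        }
    }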
The last thing you should try is to rewrite already highly optimized C/C++ code using intrinsics/asm. There might be some reasons for that, such as better instruction interleaving (so your CPU pipelines are always busy) or the use of special CPU instructions (e.g. for encryption). The number of cases where intrinsics/asm are genuinely justified is negligible, and they are always platform-dependent.
So, with no further details about your code/algorithms it is hard to guess whether it makes sense in your case, but I would bet on no. Better to spend the effort on analysis and platform-independent optimizations, or have a look at OpenCL or similar frameworks if you really need that compute power. Finally, consider investing in better CPUs: the effect of such an investment is predictable and immediate.
You need to do a cost-benefit analysis, e.g. if you can invest say X months of effort at a cost of $Y getting your code to run N times faster, and that translates to either a reduction in hardware costs (e.g. fewer CPUs in an HPC context), or reduced run-time which in some way equates to a cost benefit, then it's a simple exercise in arithmetic. (Note however that there are some intangible long-term costs, e.g. SIMD-optimized code tends to be more complex, more error-prone, less portable, and harder to maintain.)
If the performance-critical part of your code (the hot 10%) is vectorizable then you may be able to get an order of magnitude speed-up (less for double precision float, more for narrower data types such as 16 bit fixed point).
Note that this kind of optimisation is not always just a simple matter of converting scalar code to SIMD code - you may have to think about your data structures and your cache/memory access pattern.
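If profiling does show a hot, vectorizable loop, the hand-written version does not have to be large. As a sketch only (it assumes the arrays do not alias, that n is a multiple of 4, and it ignores tail handling and alignment for brevity), an SSE intrinsics element-wise add looks like this:

    #include <immintrin.h>

    // c[i] = a[i] + b[i], four floats per iteration; n must be a multiple of 4 here.
    void add_sse(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);   // unaligned loads keep the example simple
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
        }
    }

A loop this simple is exactly the kind the compiler usually vectorizes on its own, which is part of the cost-benefit argument above.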

Find the number of floating point registers programmatically in C++ [closed]

I am working on parallel algorithm optimization (sparse matrix) and on register blocking. I want to find the number and type of registers (specifically floating point registers, and then others) available on the machine, in order to tune my code based on the available registers and make it platform independent. Is there any way to do this in C++?
In general, compilers do know this sort of stuff (and how to best use it), so I'm slightly surprised that you think you can outsmart the compiler. Unless I have very high domain knowledge and start writing assembly code, I very rarely outsmart the compiler.
Since writing assembly code is highly unportable, I don't think that counts as a solution for optimizing the code using knowledge of how many registers there are, etc. It is also very difficult to know how the compiler uses registers. Take int x = y + z; as a simple example: how many registers does it take? It depends on the compiler. It could use none, one, two, three, four, five or six without being below optimal register usage; it all depends on how the compiler decides to deal with things, the machine architecture, where and how variables are stored, and so on. The same principle applies to the number of floating point registers if we change int to double. There is no obvious way to tell how many registers this statement uses (although I suspect no more than three; it could also be zero or one, depending on what the compiler decides to do).
It is probably possible to do some clever tricks if you know the processor architecture and how the compiler deals with certain types of code, but that also assumes the compiler doesn't change its behaviour in the next release. And if you know which processor architecture it is, then you also know the number of registers of the various kinds...
I am afraid there is no easy portable solution.
There are many factors that could influence the optimal block size for a given computer. One way to discover a good configuration is by automatically running a series of benchmarks, and using the results to tune your code at runtime.
Another approach is to automatically tweak the source code based on the results of some benchmarks. This is what Automatically Tuned Linear Algebra Software (ATLAS) does.
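A minimal sketch of the run-a-benchmark-and-pick-the-winner idea; the actual register-blocked kernel is passed in by the caller, and the candidate list is assumed to be non-empty:

    #include <chrono>
    #include <functional>
    #include <vector>

    // 'run' executes the blocked kernel once with the given block size; it is
    // supplied by the caller and is the expensive part being tuned.
    int pick_block_size(const std::vector<int>& candidates,
                        const std::function<void(int)>& run) {
        using clock = std::chrono::steady_clock;
        int best = candidates.front();
        auto best_time = clock::duration::max();
        for (int block : candidates) {
            const auto start = clock::now();
            run(block);
            const auto elapsed = clock::now() - start;
            if (elapsed < best_time) {
                best_time = elapsed;
                best = block;
            }
        }
        return best;
    }

In practice you would run each candidate several times and keep the median, but the structure stays the same.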

Fortran vs C++, does Fortran still hold any advantage in numerical analysis these days? [closed]

With the rapid development of C++ compilers, especially the Intel ones, and the ability to apply SIMD functions directly in C/C++ code, does Fortran still hold any real advantage in the world of numerical computation?
I am from an applied maths background; my job involves a lot of numerical analysis, computation and optimisation, with strictly defined performance requirements.
I hardly know anything about Fortran. I have some experience with C/CUDA/Matlab (if you consider the latter a programming language to begin with), and my daily work involves analysing very large data (e.g. a 10 GB matrix). The program seems to spend at least 2/3 of its time on memory access (which is why I offload some of its work to the GPU). Do you think it might be worth the effort to try Fortran for at least some performance-critical parts of my code?
Given the complexity and the work involved, I will only go down that route if there is a significant performance benefit. Thanks in advance.
Fortran has strict aliasing semantics compared to C++ and has been aggressively tuned for numerical performance for decades. Algorithms that use the CPU to work on arrays of data often have the potential to benefit from a Fortran implementation.
The programming languages shootout should not be taken too seriously, but of the 15 benchmarks, Fortran ranks #1 for speed on four of them (for Intel Q6600 one core), more than any other single language. You can see that the benchmarks where Fortran shines are the heavily numerical ones:
spectral norm 27% faster
fasta 67% faster
mandelbrot 56% faster
pidigits 18% faster
Counterexample:
k-nucleotide 500% slower (this benchmark focuses heavily on more sophisticated data structures and string processing, which is not Fortran's strength)
You can also see a summary page "how many times slower" that shows that out of all implementations, the Fortran code is on average closest to the fastest implementation for each benchmark -- although the quantile bars are much larger than for C++, indicating Fortran is unsuited for some tasks that C++ is good at, but you should know that already.
So the questions you will need to ask yourself are:
Is the speed of this function so critical that reimplementing it in Fortran is worth my time?
Is performance so important that my investment in learning Fortran will pay off?
Is it possible to use a library like ATLAS instead of writing the code myself?
Answering these questions would require detailed knowledge of your code base and business model, so I can't answer those. But yes, Fortran implementations are often faster than C++ implementations.
Another factor in your decision is the amount of sample code and the quantity of reference implementations available. Fortran's strong history means that there is a wealth of numerical code available for download and even with a trip to the library. As always you will need to sift through it to find the good stuff.
The complete and correct answer to your question is, "yes, Fortran does hold some advantages".
C++ also holds some, different, advantages. So do Python, R, etc etc. They're different languages. It's easier and faster to do some things in one language, and some in others. All are widely used in their communities, and for very good reasons.
Anything else, in the absence of more specific questions, is just noise and language-war-bait, which is why I've voted to close the question and hope others will too.
Fortran is just naturally suited to numerical programming. You tend to have a large number of values in such programs, typically arranged in arrays, and arrays are first-class citizens in Fortran. It is often pretty straightforward to translate numerical kernels from Matlab into Fortran.
Regarding potential performance advantages, see the other answers, which cover this quite nicely. The bottom line is that you can probably create highly efficient numerical applications with most compiled languages today, but you might have to jump through some hoops to get there. Fortran was carefully designed so that the compiler can recognize most opportunities for optimization, thanks to the language's features. Of course you can also write arbitrarily slow code in any compiled language, including Fortran.
In any case you should pick the tool that suits the task. Fortran suits numerical applications; C suits system-related development. As a final remark, learning the basics of Fortran is not hard, and it is always worthwhile to look at other languages; it gives you a different view on the problems you want to solve.
Also worth mentioning is that Fortran is a lot easier to master than C++. In fact, Fortran has a shorter language spec than plain C, and its syntax is arguably simpler. You can pick it up very quickly.
Meaning that if you are only interested in learning C++ or Fortran to solve a single specific problem you have at the moment (say, to speed up the bottlenecks in something you wrote in a prototyping language), Fortran might give you a better return on investment.
Fortran code is generally better for matrix and vector operations, but you can get similar performance from C/C++ code by passing hints to the compiler so that it produces similar-quality vector instructions. One option that gave me a good boost was telling the compiler not to assume memory aliasing among input variables that are array objects. That way, the compiler can aggressively unroll and pipeline the inner loop for ILP, overlapping loads and stores across loop iterations with the right prefetches.
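For example, with GCC, Clang and MSVC the no-aliasing hint can be spelled with the non-standard __restrict qualifier; this small axpy kernel is just an illustration:

    // Promising the compiler that y and x never overlap lets it vectorize and
    // software-pipeline this loop much more aggressively.
    void axpy(double* __restrict y, const double* __restrict x, double a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }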

Genetic programming in c++, library suggestions? [closed]

I'm looking to add some genetic algorithms to an operations research project I have been involved in. Currently we have a program that aids in optimizing some scheduling, and we want to add some heuristics in the form of genetic algorithms. Are there any good libraries for generic genetic programming/algorithms in C++, or would you recommend I just code my own?
I should add that while I am not new to C++, I am fairly new to doing this sort of mathematical optimization work in it, as the group I worked with previously tended to use a proprietary optimization package.
We have a fitness function that is fairly computationally intensive to evaluate and we have a cluster to run this on so parallelized code is highly desirable.
So is C++ a good language for this? If not, please recommend some other ones, as I am willing to learn another language if it makes life easier.
thanks!
I would recommend rolling your own. 90% of the work in a GP is coding the genotype, how it gets operated on, and the fitness calculation. These are the parts that change for every problem/project. The actual evolutionary-algorithm part is usually quite simple (see the sketch below).
There are several GP libraries out there ( http://en.wikipedia.org/wiki/Symbolic_Regression#Implementations ). I would use these as examples and references though.
C++ is a good choice for GP because they tend to be very computationally intensive. Usually, the fitness function is the bottleneck, so it's worthwhile to at least make this part compiled/optimized.
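To give a feel for how small the evolutionary loop itself is, here is a bare-bones sketch (no crossover, just truncation selection and mutation); the genome, mutation and fitness are toy placeholders, and in a real project those, plus the parallel fitness evaluation you want for the cluster, are where the effort goes:

    #include <algorithm>
    #include <iostream>
    #include <random>
    #include <vector>

    using Genome = std::vector<double>;

    // Toy fitness: maximize the negated sum of squares (optimum at all zeros).
    double fitness(const Genome& g) {
        double s = 0.0;
        for (double x : g) s += x * x;
        return -s;
    }

    int main() {
        std::mt19937 rng(1);
        std::uniform_real_distribution<double> init(-1.0, 1.0);
        std::normal_distribution<double> noise(0.0, 0.1);

        const int pop_size = 50, genome_len = 10, generations = 200;
        std::vector<Genome> pop(pop_size, Genome(genome_len));
        for (auto& g : pop)
            for (auto& x : g) x = init(rng);

        for (int gen = 0; gen < generations; ++gen) {
            // Sort by fitness, best first (the evaluations here are what you would parallelize).
            std::sort(pop.begin(), pop.end(),
                      [](const Genome& a, const Genome& b) { return fitness(a) > fitness(b); });
            // Replace the worse half with mutated copies of the better half.
            for (int i = pop_size / 2; i < pop_size; ++i) {
                pop[i] = pop[i - pop_size / 2];
                for (auto& x : pop[i]) x += noise(rng);
            }
        }
        std::cout << "best fitness: " << fitness(pop.front()) << "\n";
    }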
I use GAUL.
It's a C library with everything you need:
parallelism via pthread/fork/OpenMP/MPI
various crossover / mutation functions
non-GA optimisation: hill climbing, Nelder-Mead simplex, simulated annealing, tabu search, ...
Why build your own library when such powerful tools already exist?
I haven't used this personally yet, but the Age Layered Population Structure (ALPS) method has been used to generate human competitive results and has been shown to outperform several popular methods in finding optimal solutions in rough fitness landscapes. Additionally, the link contains source code in C++ FTW.
I have had similar problems. I had a complicated problem where defining a solution in terms of a fixed-length vector was not desirable, and even a variable-length vector did not look attractive. Most of the libraries focus on cases where the cost function is cheap to calculate, which did not match my problem. Lack of parallelism is another of their pitfalls, and expecting the user to allocate memory for the library's use adds insult to injury. My cases were even more complicated because most of the libraries check the nonlinear constraints before evaluation, whereas I needed to check them during or after the evaluation, based on its result. It is also undesirable to have to evaluate a solution to calculate its cost and then recompute the solution in order to present it; in most cases I had to write the cost function twice, once for the GA and once for presentation.
Having run into all of these problems, I eventually designed my own openGA library, which is now mature.
This library is based on C++ and distributed under the free Mozilla Public License 2.0. That guarantees that using the library does not restrict your project, and it can be used for commercial or non-commercial purposes for free, without asking for any permission. Not all libraries are transparent in this sense.
It supports three modes of single objective, multiple objective (NSGA-III) and Interactive Genetic Algorithm (IGA).
The solution is not mandated to be a vector. It can be any structure with any customized design containing any optional values with variable length. This feature makes this library suitable for Genetic Programming (GP) applications.
C++11 is used. Templates allow flexibility in the design of the solution structure.
The standard library is enough to use this library. There is no dependency beyond that. The entire library is also a single header file for ease of use.
The library supports parallelism by default unless you turn it off. If you have an N-core CPU, the number of threads is set to N by default. You can change the settings. You can also choose whether the solution evaluations are distributed equally between threads or assigned to whichever thread has finished its job and is currently idle.
The solution evaluation is separated from the calculation of the final cost. This means your evaluation function can simulate the system and keep a lot of information, and your cost function is called later to report the cost based on that evaluation. The evaluation results are kept for later use by the user, so you do not need to recalculate them.
You can reject a solution at any time during the evaluation. No waste of time. In fact, the evaluation and constraint check are integrated.
The GA-assist feature helps you produce the C++ code base from the information you provide.
If these features match what you need, I recommend having a look at the user manual and the examples of openGA.
The number of readers and citations of the related publication, as well as its GitHub stars, is increasing, and its usage keeps growing.
I suggest you have a look at the MATLAB optimization toolkit - it comes with GAs out of the box, you only have to code the fitness function (and possibly a function to generate the initial population), and I believe MATLAB has some C++ interoperability, so you could code your functions in C++. I am using it for my experiments, and a very nice feature is that you get all sorts of charts out of the box as well.
That said - if your aim is to learn about genetic algorithms, you're better off coding one yourself, but if you just want to run experiments, MATLAB and C++ (or even just MATLAB) is a good option.

Recommended Open Source Profilers [closed]

I'm trying to find open source profilers rather than using one of the commercial profilers I would have to pay $$$ for. When I searched on SourceForge, I came across these four C++ profilers that I thought were quite promising:
Shiny: C++ Profiler
Low Fat Profiler
Luke Stackwalker
FreeProfiler
I'm not sure which one of the profilers would be the best one to use in terms of learning about the performance of my program. It would be great to hear some suggestions.
You could try Windows Performance Toolkit. Completely free to use. This blog entry has an example of how to do sample-based profiling.
Valgrind (And related tools like cachegrind, etc.)
Google performance tools
There's more than one way to do it.
Don't forget the no-profiler method.
Most profilers assume you need 1) high statistical precision of timing (lots of samples), and 2) low precision of problem identification (functions & call-graphs).
Those priorities can be reversed. I.e. the problem can be located to the precise machine address, while cost precision is a function of the number of samples.
Most real problems cost at least 10%, where high precision is not essential.
Example: If something is making your program take 2 times as long as it should, that means there is some code in it that costs 50%. If you take 10 samples of the call stack while it is being slow, the precise line(s) of code will be present on roughly 5 of them. The larger the program is, the more likely the problem is a function call somewhere mid-stack.
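To make that arithmetic concrete: if a line of code is on the stack 50% of the time, the expected number of hits in 10 samples is 10 × 0.5 = 5, and the probability of it showing up on at least two of the samples is 1 - (1 + 10)/2^10 ≈ 0.99, so even a handful of samples is very unlikely to miss it.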
It's counter-intuitive, I know.
NOTE: xPerf is nearly there, but not quite (as far as I can tell). It takes samples of the call stack and saves them - that's good. Here's what I think it needs:
It should only take samples when you want them. As it is, you have to filter out the irrelevant ones.
In the stack view it should show specific lines or addresses at which calls take place, not just whole functions. (Maybe it can do this, I couldn't tell from the blog.)
If you click to get the butterfly view, centered on a single call instruction, or leaf instruction, it should show you not the CPU fraction, but the fraction of stack samples containing that instruction. That would be a direct measure of the cost of that instruction, as a fraction of time. (Maybe it can do this, I couldn't tell.)
So, for example, even if an instruction were a call to file-open or something else that idles the thread, it still costs wall clock time, and you need to know that.
NOTE: I just looked over Luke Stackwalker, and the same remarks apply. I think it is on the right track but needs UI work.
ADDED: Having looked over Luke Stackwalker more carefully, I'm afraid it falls victim to the assumption that measuring functions is more important than locating statements. On each sample of the call stack it updates the function-level timing info, but all it does with the line-number info is keep track of the minimum and maximum line numbers in each function, which drift farther apart the more samples it takes. So it basically throws away the most important information: the line-number information. That information matters because, if you decide to optimize a function, you need to know which lines in it need work, and those lines were on the stack samples (before they were discarded).
One might object that if the line-number information were retained, it would run out of storage quickly. Two answers: 1) only so many lines show up on the samples, and they show up repeatedly; 2) not that many samples are needed. The need for high statistical precision of measurement has always been assumed, but never justified.
I suspect other stack samplers, like xPerf, have similar issues.
It's not open source, but AMD CodeAnalyst is free. It also works on Intel CPUs despite the name. There are versions available for both Windows (with Visual Studio integration) and Linux.
Of those listed, I found Luke Stackwalker to work best - I liked its GUI and it was easy to get running.
Another similar tool is Very Sleepy - similar functionality, its sampling seems more reliable, and its GUI is perhaps a little harder to use (not as graphical).
After spending some more time with them, I found one quite important drawback. While both try to sample at 1 ms resolution, in practice they do not achieve it, because their sampling method (StackWalk64 on the attached process) is far too slow. For my application it takes something like 5-20 ms to get a call stack. Not only does this make your results imprecise, it also skews them, as short call stacks are walked faster and therefore tend to get more hits.
We use LtProf and have been happy with it. Not open source, but only $$, not $$$ :-)