C++ Optimization on negative integers

Let's say we have a negative integer, say
int a;
Is there a faster implementation of
-a?
Do I have to do some bitwise operation on this?

There's almost certainly nothing faster than the machine code NEG instruction that your compiler will most likely turn this into.
If there were, I'm sure the compiler would use it.
For a two's-complement number, you could NOT it and add 1, but that's almost certainly going to be slower. I'm not entirely certain that the C/C++ standards mandate the use of two's complement (they may; I haven't checked).
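For illustration, a minimal sketch of that identity, assuming a two's-complement machine (true of essentially every mainstream CPU):

    #include <cassert>
    #include <climits>

    int main() {
        int a = -42;
        // On a two's-complement machine, -a == ~a + 1.
        // Both sides overflow for a == INT_MIN, so this is illustration only.
        assert(a != INT_MIN && -a == ~a + 1);
    }

Any decent compiler turns -a into a single NEG (or equivalent) anyway, so there is nothing to gain from doing this by hand.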
I think this question belongs with those that attempt to rewrite strcpy() et al to get more speed. Those people naively assume that the C library strcpy() isn't already heavily optimized by using special machine code instructions (rather than a simplistic loop that would be most people's first attempt).
Have you run performance tests which seem to indicate that your negations are taking an overly long time?
<subtle-humor-or-what-my-wife-calls-unfunny>
A NEG on a 486 (state of the art the last time I had to worry about clock cycles) takes 3 clock cycles for the memory version (the register-only version takes 1); I'm assuming later chips will be similar. On a 3 GHz CPU, that means you can do a billion of these every second. Is that not fast enough?
</subtle-humor-or-what-my-wife-calls-unfunny>

Have you ever heard the phrase "premature optimization"? If you've optimized all of your code, and this is the only thing left, fine. If not, you're wasting your time.

To clarify Pax's statement:
C++ compilers are not mandated to use two's complement, except in one case: when you convert a signed type to an unsigned type, a negative value must convert to the value its two's complement representation would have (the conversion is defined modulo 2^N).
In short, there is no faster way than -a; even if there were, it would not be portable.
Keep in mind as well that premature optimization is evil. Profile your code first and then work on the bottlenecks.
See The C++ Programming Language, 3rd Ed., section C.6.2.1.
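A small sketch of that one mandated case (the printed value assumes a 32-bit unsigned int):

    #include <iostream>

    int main() {
        int a = -1;
        // Signed-to-unsigned conversion is defined modulo 2^N, which is
        // exactly the two's complement bit pattern of the negative value.
        unsigned int u = static_cast<unsigned int>(a);
        std::cout << u << '\n';  // 4294967295 with a 32-bit unsigned int
    }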

Negating a number is a very simple operation in terms of CPU hardware. I'm not aware of a processor that takes any longer to do negation than to do any bitwise operation - and that includes some 30 year old processors.
Just curious, what led you to ask this question? It certainly wasn't because you detected a bottleneck.

Perhaps you should think about optimizing your algorithms more so than little things like this. If this is the last thing left to optimize, your code is as fast as it's going to get.

All good answers.
If (-a) makes a difference, you've already done some really aggressive performance tuning.
Performance tuning a program is like getting water out of a wet sponge. As a program is first written, it is pretty wet. With a little effort, you can wring some time out of it. With more effort you can dry it out some more.
If you're really persistent you can get it down to where you have to put it in the hot sun to get the last few molecules of time out of it.
That's the level at which (-a) might make a difference.

Are you seeing a performance issue with negating numbers? I have a hard time thinking that most compilers would do a bitwise op against integers to negate them.

Related

Manual SIMD code affordability [closed]

We're running a project that is highly computationally intensive, and right now we're letting the compiler do SSE optimizations. However, we're not sure we are getting the best performance for the code.
My question is, I understand, broad, but I can't find much advice on this: is writing manual SIMD code affordable, or in other words, worth the effort?
Affordability means, here, a rough estimate of cost benefit, as for instance speedup / development_time, or any other measure that is reasonable in the context of project development.
To lessen the scope:
we profiled the code and we know the computation is the heaviest part
we have C++ code that can easily make use of Boost.SIMD and similar libraries
affordability should not factor in code readability; assume we're comfortable with SSE/AVX
the code is currently multithreaded (OpenMP)
our compiler is Intel's icc
Quite agree with Paul R, and I just want to add that IMO in most cases intrinsics/asm optimizations are not worth the effort. In most cases those optimizations are marketing-driven: we juice the performance on a specific platform just to get (in most cases) slightly better numbers.
Nowadays it is almost impossible to gain an order of magnitude of performance just by rewriting your C/C++ code in asm. In most cases it comes down to memory/cache access and methods/algorithms (i.e. parallelization), as Paul has noted.
The first thing you should try is to analyze your code with hardware performance counters (with the free "perf" tool or Intel VTune) and understand the real bottlenecks. For example, memory access during the computation is in fact the most common bottleneck, not the computation itself. Manual vectorization of such code does not help, since the CPU stalls on memory anyway.
Such analysis is always worth the effort, since you come to understand your code and the CPU architecture better.
The next thing you should try is to optimize your code. There are a variety of methods: optimize data structures, use cache-friendly memory access patterns, pick better algorithms, etc. For example, the order in which you declare fields in a structure can have a significant performance impact in some cases, because the structure may contain holes and occupy two cache lines instead of one. Another example is false sharing, where you ping-pong the same cache line between CPUs; simple cache-line alignment can give you an order of magnitude better performance.
Those optimizations are always worth the effort, since they will benefit your low-level code as well.
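To illustrate the field-ordering point, a small sketch (sizes assume a typical 64-bit ABI with 8-byte doubles):

    #include <iostream>

    // Poor ordering: padding inflates the struct.
    struct Loose {
        char   flag;   // 1 byte + 7 bytes padding
        double value;  // 8 bytes
        char   tag;    // 1 byte + 7 bytes padding
    };                 // typically 24 bytes

    // Largest members first: same fields, less padding.
    struct Tight {
        double value;  // 8 bytes
        char   flag;   // 1 byte
        char   tag;    // 1 byte + 6 bytes padding
    };                 // typically 16 bytes

    int main() {
        std::cout << sizeof(Loose) << ' ' << sizeof(Tight) << '\n';
    }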
Then you should try to help your compiler. For example, by default the compiler vectorizes/unrolls the inner loop, but it might be better to vectorize/unroll the outer loop. You do this with #pragma hints, and sometimes it is worth the effort.
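As a minimal sketch of such a hint, using OpenMP 4.0's simd pragma (the project already uses OpenMP and icc; which pragmas exist and what they do is compiler-dependent, so treat this as illustrative):

    // Compile with e.g. icc -qopenmp or gcc -fopenmp.
    void scale(float* a, const float* b, int n) {
        #pragma omp simd          // ask the compiler to vectorize this loop
        for (int i = 0; i < n; ++i)
            a[i] = 2.0f * b[i];
    }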
The last thing you should try is to rewrite already highly optimized C/C++ code using intrinsics/asm. There might be some reasons for that, such as better instruction interleaving (so your CPU pipelines are always busy) or the use of special CPU instructions (e.g. for encryption). The number of cases where intrinsics/asm is genuinely justified is negligible, and they are always platform-dependent.
So, with no further details about your code/algorithms it is hard to guess whether it makes sense in your case, but I would bet no. Better to spend the effort on analysis and platform-independent optimizations, or to have a look at OpenCL or similar frameworks if you really need that computation power. Lastly, invest in better CPUs: the effect of such an investment is predictable and immediate.
You need to do a cost-benefit analysis, e.g. if you can invest say X months of effort at a cost of $Y getting your code to run N times faster, and that translates to either a reduction in hardware costs (e.g. fewer CPUs in an HPC context), or reduced run-time which in some way equates to a cost benefit, then it's a simple exercise in arithmetic. (Note however that there are some intangible long-term costs, e.g. SIMD-optimized code tends to be more complex, more error-prone, less portable, and harder to maintain.)
If the performance-critical part of your code (the hot 10%) is vectorizable then you may be able to get an order-of-magnitude speed-up (less for double-precision float, more for narrower data types such as 16-bit fixed point).
Note that this kind of optimisation is not always just a simple matter of converting scalar code to SIMD code - you may have to think about your data structures and your cache/memory access pattern.
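For a feel of what hand-written SIMD looks like, here is a hedged sketch that sums floats four lanes at a time with SSE intrinsics (it assumes n is a multiple of 4 and uses unaligned loads for simplicity):

    #include <immintrin.h>

    float sum_sse(const float* data, int n) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));
        // Horizontal add of the four lanes.
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }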

When should I use ASM calls?

I'm planning on writing a game in C++, and it will be extremely CPU-intensive (pathfinding, genetic algorithms, neural networks, ...)
So I've been thinking about how to tackle this situation best so that it would run smoothly.
(Treat this top section as side information; I don't want it to restrict the main question, but side notes are welcome as well.)
Is it worth learning how to work with ASM so I can make ASM calls from C++?
Can it give me a significant/notable performance advantage?
In what situations should I use it?
Almost never:
You only want to be using it once you've profiled your C++ code and have identified a particular section as a bottleneck.
And even then, you only want to do it once you've exhausted all C++ optimization options.
And even then, you only want to be using ASM for tight, inner loops.
And even then, it takes quite a lot of effort and skill to beat a C++ compiler on a modern platform.
If you're not an experienced assembly programmer, I doubt you will be able to optimize assembly code better than your compiler can.
Also note that assembly is not portable. If you decide to go this way, you will have to write different assembly for all the architectures you decide to support.
Short answer: it depends; most likely you won't need it.
Don't start optimizing prematurely. Write code that is easy to read and to modify. Separate logical sections into modules. Write something that is easy to extend.
Do some profiling.
You can't tell where your bottlenecks are unless you profile your code. 99% of the time you won't get that much performance gain by writing asm. There's a high chance you might even worsen your performance. Optimizers nowadays are very good at what they do. If you do have a bottleneck, it will most probably be because of some poorly chosen algorithm or at least something that can be remedied at a high-level.
My suggestion is, even if you do learn asm, which is a good thing, don't do it just so you can optimize.
Profile profile profile....
A legitimate use case for going low-level (although sometimes a compiler can infer it for you) is to make use of SIMD instructions such as SSE. I would assume that at least some of the algorithms you mention will benefit from parallel processing.
However, you don't need to write actual assembly, instead you can simply use intrinsic functions. See, e.g. this.
Don't get ahead of yourself.
I've posted a SourceForge project showing how a simulation program was massively sped up (over 700x).
This was not done by assuming in advance what needed to be made fast.
It was done by "profiling", which I put in quotes because the method I use does not employ a profiler.
Rather, I rely on random pausing, a method known and used to good effect by some programmers.
It proceeds through a series of iterations.
In each iteration a large source of time-consumption is identified and fixed, resulting in a certain speedup ratio.
As you proceed through multiple iterations, these speedup ratios multiply together (like compound interest).
That's how you get major speedup.
If, and only if, you get to a point where some code is taking a large fraction of time, and it doesn't contain any function calls, and you think you can write assembly code better than the compiler does, then go for it.
P.S. If you're wondering, the difference between using a profiler and random pausing is that profilers look for "bottlenecks", on the assumption that those are localized things. They look for routines or lines of code that are responsible for a large percent of overall time.
What they miss is problems that are diffuse.
For example, you could have 100 routines, each taking 1% of time.
That is, no bottlenecks.
However, there could be an activity being done within many or all of those routines, accounting for 1/3 of the time, that could be done better or not at all.
Random pausing will see that activity with a small number of samples, because you don't summarize, you examine the samples.
In other words, if you took 9 samples, on average you would notice the activity on 3 of them.
That tells you it's big.
So you can fix it and get your 3/2 speedup ratio.
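A toy illustration of that arithmetic: if an activity consumes a fraction p of the run time, each random pause lands on it with probability p, so even a handful of samples exposes it:

    #include <iostream>
    #include <random>

    int main() {
        const double p = 1.0 / 3.0;  // diffuse activity taking 1/3 of time
        const int samples = 9;
        std::mt19937 rng(12345);
        std::bernoulli_distribution on_activity(p);
        int seen = 0;
        for (int i = 0; i < samples; ++i)
            if (on_activity(rng)) ++seen;
        std::cout << seen << " of " << samples << " samples hit it\n";
        // Expected: about 3 of 9. Removing the activity gives a
        // 1 / (1 - p) = 3/2 speedup ratio, as described above.
    }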
"To understand recursion, you must first understand recursion." That quote comes to mind when I consider my response to your question, which is "until you understand when to use assembly, you should never use assembly." After you have completely implemented your soution, extensively profiled its performance and determined precise bottlenecks, and experimented with several alternative solutions, then you can begin to consider using assembly. If you code a single line of assembly before you have a working and extensively profiled program, you have made a mistake.
If you need to ask, then you don't need it.

How to make a program with large calculations faster

I'm implementing a compression algorithm. Thing is, it is taking a second for a 20 KiB file, and that's not acceptable. I think it's slow because of the calculations.
I need suggestions on how to make it faster. I have some tips already, like shifting bits instead of multiplying, but given the complexity of the program I really want to be sure which changes actually help. I also welcome suggestions concerning compiler options; I've heard there is a way to make the program do mathematical calculations faster.
Common operations are:
the pow(...) function from the math library
large number % 2
multiplying large numbers
Edit: the program has no floating point numbers
The question of how to make things faster should not be put to other people here, but to a profiler in your environment. Use the profiler to determine where most of the time is spent; that will hint at which operations need improving, and if you then don't know how to improve them, ask about those specific operations. It is almost impossible to say what you need to change without knowing your original code, and the question does not provide enough information. For the pow(...) function: what are its arguments, is the exponent fixed, how much precision do you need, can you replace it with something that yields a similar result? For "large number": how large is large, and what is the number here, an integer or floating point?
Your question is very broad; without enough information to give you concrete advice, we have to make do with a general roadmap.
What platform, what compiler? What is "large number"? What have you done already, what do you know about optimization?
Test a release build with optimization (/Ox /LTCG in Visual C++, -O3 IIRC for gcc)
Measure where time is spent - disk access, or your actual compression routine?
Is there a better algorithm, and code flow? The fastest operation is the one not executed.
for 20K files, the memory working set should not be an issue (unless your compression requires large data structures), so code optimizations are indeed the next step
a modern compiler implements a lot of optimizations already, e.g. replacing a division by a power-of-two constant with a bit shift
pow is very slow for native integers
if your code is well written, you may try to post it; maybe someone's up to the challenge.
Hints:
1) modulo 2 depends only on the last bit, so n % 2 can be computed as n & 1 (for non-negative n; the compiler usually does this for you).
2) power functions can be implemented in O(log n) time, where n is the exponent (the math library should be fast enough, though); a minimal sketch follows below. Also, for fast power you may check this out
If nothing works, just check if there exists some fast algorithm.
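A minimal sketch of exponentiation by squaring, the O(log n) method hinted at above (integer-only, matching the edit that says the program has no floating point):

    #include <cstdint>

    // Computes base^exp with O(log exp) multiplications; overflow is the
    // caller's problem, as with any fixed-width integer power.
    std::uint64_t ipow(std::uint64_t base, unsigned exp) {
        std::uint64_t result = 1;
        while (exp > 0) {
            if (exp & 1)      // lowest bit set: fold in this factor
                result *= base;
            base *= base;     // square for the next bit
            exp >>= 1;
        }
        return result;
    }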

What are the functions in the standard library that can be implemented faster with programming hacks? [closed]

I have recently read an article about fast sqrt calculation, so I have decided to ask the SO community and its experts to help me find out which STL algorithms or mathematical calculations can be implemented faster with programming hacks.
It would be great if you could give examples or links.
Thanks in advance.
System library developers have more concerns than just performance in mind:
Correctness and standards compliance: Critical!
General use: No optimisations are introduced, unless they benefit the majority of users.
Maintainability: Good hand-written assembly code can be faster, but you don't see much of it. Why?
Portability: Decent libraries should be portable to more than just Windows/x86/32bit.
Many optimisation hacks that you see around violate one or more of the requirements above.
In addition, optimisations that become useless or even break when the next generation of CPUs comes around the corner are not welcome.
If you don't have profiler evidence on it being really useful, don't bother optimising the system libraries. If you do, work on your own algorithms and code first, anyway...
EDIT:
I should also mention a couple of other all-encompassing concerns:
The cost/effort to profit/result ratio: Optimisations are an investment. Some of them are seemingly-impressive bubbles. Others are deeper and more effective in the long run. Their benefits must always be considered in relation to the cost of developing and maintaining them.
The marketing people: No matter what you think, you'll end up doing whatever they want - or think they want.
Probably all of them can be made faster for a specific problem domain.
Now the real question is, which ones should you hack to make faster? None, until the profiler tells you to.
Several of the algorithms in <algorithm> can be optimized for vector<bool>::[const_]iterator. These include:
find
count
fill
fill_n
copy
copy_backward
move // C++0x
move_backward // C++0x
swap_ranges
rotate
equal
I've probably missed some. But all of the above algorithms can be optimized to work on many bits at a time instead of just one bit at a time (as would a naive implementation).
This is an optimization that I suspect is sorely missing from most STL implementations. It is not missing from this one:
http://libcxx.llvm.org/
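To see why this matters, here is a hedged sketch of the idea behind an optimized count: process 64 bits per iteration instead of one bool per iteration (written against a raw word array for clarity, not against vector<bool> internals):

    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::size_t count_set_bits(const std::vector<std::uint64_t>& words) {
        std::size_t total = 0;
        for (std::uint64_t w : words)
            total += std::bitset<64>(w).count();  // typically one popcount
        return total;
    }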
This is where you really need to listen to project managers and MBAs. What you're suggesting is re-implementing parts of the STL and/or the standard C library. There is an associated cost in terms of implementation time and maintenance burden, so you shouldn't do it unless you really, genuinely need to, as John points out. The rule is simple: is the calculation you're doing slowing you down (i.e. are you CPU-bound)? If not, don't create your own implementation just for the sake of it.
Now, if you're really interested in fast maths, there are a few places you can start. The GNU multi-precision library (GMP) implements many algorithms from modern computer arithmetic and seminumerical algorithms, all about doing maths on arbitrary-precision integers and floats insanely fast. Its authors optimise in assembly per build platform; it is about as fast as you can get in single-core mode. This is the most general case I can think of for optimised maths, i.e. one that isn't specific to a certain domain.
Bringing my first and second paragraphs together with what thkala has said: GMP/MPIR have optimised assembly versions per CPU architecture and OS they support. Really. It's a big job, but it is what makes those libraries so fast on the specific, small subset of problems they target.
Sometimes domain-specific enhancements can be made. This is about understanding the problem in question. For example, when doing arithmetic in Rijndael's finite field GF(2^8) you can, knowing that the field has characteristic 2 and that its elements are polynomials with 8 binary coefficients, store each element in a uint8_t and implement addition and subtraction as xor. How does this work? If you add or subtract two elements, each pair of coefficients is either zero or one. If both are zero or both are one, the result is zero; if they differ, the result is one. Term by term, that is exactly xor across an 8-bit binary string, where each bit represents a term of the polynomial. Multiplication is also relatively efficient. You can bet that Rijndael was designed to take advantage of this kind of result.
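A sketch of both operations under those assumptions (the constant 0x1B encodes Rijndael's reduction polynomial x^8 + x^4 + x^3 + x + 1; this is illustrative, not hardened crypto code):

    #include <cstdint>

    // Addition and subtraction in GF(2^8) are the same operation: xor.
    std::uint8_t gf_add(std::uint8_t a, std::uint8_t b) {
        return static_cast<std::uint8_t>(a ^ b);
    }

    // Multiplication: shift-and-xor, reducing by the field polynomial
    // whenever the degree reaches 8.
    std::uint8_t gf_mul(std::uint8_t a, std::uint8_t b) {
        std::uint8_t p = 0;
        for (int i = 0; i < 8; ++i) {
            if (b & 1) p ^= a;          // add a for this term of b
            bool overflow = a & 0x80;   // would shifting produce x^8?
            a <<= 1;
            if (overflow) a ^= 0x1B;    // reduce modulo the AES polynomial
            b >>= 1;
        }
        return p;
    }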
That's a very specific result. It depends entirely on what you're doing. I can't imagine many STL functions are optimised purely for CPU speed, because amongst other things the STL provides collections via templates (which are about memory), file access (which is about storage), exception handling, and so on. In short, being really fast is a narrow subset of what the STL does and aims to achieve. Also note that optimisation has different aspects: if your app is heavy on IO, for example, you are IO-bound, and a massively efficient square-root calculation isn't really helpful, since "slowness" really means waiting on the disk/OS/your file-parsing routine.
In short, you as a developer of an STL library are trying to build an "all round" library for many different use cases.
But, since these things are always interesting, you might well be interested in bit twiddling hacks. I can't remember where I saw that, but I've definitely stolen that link from somebody else on here.
Almost none. The standard library is designed the way it is for a reason.
Taking sqrt, which you mention as an example, the standard library version is written to be as fast as possible, without sacrificing numerical accuracy or portability.
The article you mention is really beyond useless. There are some good articles floating around the 'net, describing more efficient ways to implement square roots. But this article isn't among them (it doesn't even measure whether the described algorithms are faster!) Carmack's trick is slower than std::sqrt on a modern CPU, as well as being less accurate.
It was used in a game something like 12 years ago, when CPUs had very different performance characteristics. It was faster then, but CPUs have changed, and today it's both slower and less accurate than the CPU's built-in sqrt instruction.
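For reference, the bit hack in question (the 0x5f3759df trick) looks roughly like this; note that it approximates 1/sqrt(x), and the memcpy dance avoids the undefined type-punning of the original:

    #include <cstdint>
    #include <cstring>

    // Approximate 1/sqrt(x). Slower and less accurate than 1.0f/std::sqrt(x)
    // on modern CPUs; shown only to ground the discussion above.
    float fast_inv_sqrt(float x) {
        float half = 0.5f * x;
        std::uint32_t i;
        std::memcpy(&i, &x, sizeof i);
        i = 0x5f3759df - (i >> 1);        // bit-level initial guess
        std::memcpy(&x, &i, sizeof x);
        x = x * (1.5f - half * x * x);    // one Newton-Raphson step
        return x;
    }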
You can implement a square root function which is faster than std::sqrt without losing accuracy, but then you lose portability, as it'll rely on CPU features not present on older CPU's.
Speed, accuracy, portability: choose any two. The standard library tries to balance all three, which means that the speed isn't as good as it could be if you were willing to sacrifice accuracy or portability, and accuracy is good, but not as good as it could be if you were willing to sacrifice speed, and so on.
In general, forget any notion of optimizing the standard library. The question you should be asking is whether you can write more specialized code.
The standard library has to cover every case. If you don't need that, you might be able to speed up the cases that you do need. But then it is no longer a suitable replacement for the standard library.
Now, there are no doubt parts of the standard library that could be optimized. The C++ IOStreams library in particular comes to mind: it is often naively, and very inefficiently, implemented. The C++ committee's technical report on C++ performance has an entire chapter dedicated to exploring how IOStreams could be implemented to be faster.
But that's I/O, where performance is often considered to be "unimportant".
For the rest of the standard library, you're unlikely to find much room for optimization.

Comparison of performance between Scala etc. and C/C++/Fortran?

I wonder if there is any reliable comparison of performance between "modern" multithreading-specialized languages, e.g. Scala, and "classic" lower-level languages like C, C++, and Fortran using parallel libraries such as MPI, POSIX threads, or OpenMP.
Any links and suggestions welcome.
Given that Java, and, therefore, Scala, can call external libraries, and given that those highly specialized external libraries will do most of the work, then the performance is the same as long as the same libraries are used.
Other than that, any such comparison is essentially meaningless. Scala code runs on a virtual machine which has run-time optimization. That optimization can push long-running programs towards greater performance than programs compiled with those other languages -- or not. It depends on the specific program written in each language.
Here's another non-answer: go to your local supercomputer centre and ask what fraction of the CPU load is consumed by each language you are interested in. This will only give you a proxy answer to your question: it tells you what the people who are concerned with high performance on such machines use when tackling the kinds of problems they tackle. But it's as instructive as any other answer you are likely to get for such a broad question.
PS The answer will be that Fortran, C and C++ consume well in excess of 95% of the CPU cycles.
I'd view such comparisons as a fraction. The numerator is a constant (around 0.00001, I believe). The denominator is the number of threads multiplied by the number of logical processors.
IOW, for a single thread, the comparison has about a one chance in a million of meaning something. For a quad core processor running an application with (say) 16 threads, you're down to one chance in 64 million of a meaningful result.
In short, there are undoubtedly quite a few people working on it, but the chances of even a single result from any of them providing a result that's useful and meaningful is still extremely low. Worse, even if one of them really did mean something, it would be almost impossible to find, and even more difficult to verify to the point that you actually knew it meant something.