SSE gives no speedup for C++ number crunching - c++

I have a heavy number-crunching program that does image processing. It is mostly convolutions. It is written in C++ and compiled with MinGW GCC 4.8.1. I run it on a laptop with an Intel Core i7-4900MQ (which supports SSE up to SSE4.2 and AVX2).
When I tell GCC to use SSE optimisations (with -march=native -mfpmath=sse -msse2), I see no speedup compared to using the default x87 FPU.
When I use doubles instead of floats, there is no slowdown.
My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?

My understanding is that SSE should give me a 2x speedup when using floats instead of double. Am I mistaken?
Yes, you are.
The compiler is only as good as your code; remember that. If you didn't design your algorithm with vectorization in mind, the compiler is powerless. It is not as easy as "turn the switch on and enjoy a 100% performance boost".
First of all, compile your code with -ftree-vectorizer-verbose=N to see what was actually vectorized by the compiler.
N is the verbosity level; set it to 5 to see all available output (more info can be found here).
Also, you may want to read about GCC's vectorizer.
And keep in mind that for performance-critical sections of code, using SSE/AVX intrinsics directly (brilliantly documented here) may be the best option.
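As a minimal sketch of the kind of loop GCC 4.8's vectorizer can report on (the function and file names are made up for illustration, and getting the float reduction itself vectorized may additionally require -ffast-math so GCC is allowed to reassociate the sum):

// conv1d.cpp - a 1-D convolution whose outer iterations are independent,
// which makes it a candidate for auto-vectorization.
// Build roughly like:
//   g++ -O3 -march=native -mfpmath=sse -ftree-vectorizer-verbose=5 -c conv1d.cpp
void conv1d(const float* in, const float* kernel, float* out, int n, int k) {
    for (int i = 0; i + k <= n; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j)
            acc += in[i + j] * kernel[j];  // reduction over j; independent across i
        out[i] = acc;
    }
}

The vectorizer report will tell you which of these loops it touched and, if it gave up, why.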

There is no code and no description of the test procedure, but it can generally be explained this way:
It's not only a question of being CPU-bound; the program can also be bound by memory speed.
Image processing usually has a large working set that exceeds the cache of your non-Xeon CPU. Eventually the CPU starves for data, which means the overall throughput is bounded by memory speed.
You may be using an algorithm that is not friendly for vectorization.
Not every algorithm benefits from being vectorized. There are many conditions that have to be met: flow dependencies, memory layout, and so on.
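As an illustration of the flow-dependency point, here is a minimal sketch (the function names are made up):

// A loop-carried (flow) dependency: each iteration needs the result of the
// previous one, so the compiler cannot vectorize it as written.
void prefix_sum(float* a, int n) {
    for (int i = 1; i < n; ++i)
        a[i] += a[i - 1];
}

// No dependency between iterations: each element is computed independently,
// so this loop is a straightforward vectorization candidate.
void scale(float* a, int n, float s) {
    for (int i = 0; i < n; ++i)
        a[i] *= s;
}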

Related

Compiling the C++ code with the processor flag makes the code slower (intel compiler)

I am doing some profiling and performance is important to me (even 5%). The processor is an Intel Xeon Platinum 8280 ("Cascade Lake") on Frontera. I compile my code with the -Ofast flag, in Release mode. When I add -march=cascadelake, the timing gets worse (5-6%) in my test case. The same is true if I use -xCORE-AVX512 instead of -march. I am using icpc 19.1.1.217. Can anyone please explain why? Also, what compilation flags do you suggest for better performance?
Edit 1: I am solving a linear system, which consists of different operations, such as dot products and matrix-vector products. So it would be hard for me to provide reproducible code, but I can say that there are multiple loops in my code to which the compiler can apply auto-vectorization. I have used Intel optimization reports on the critical loops in my code, and the reports mentioned potential speedups of at least 1.75x for them (for some of the loops it was over a 5x potential speedup).
I have also used aligned_alloc(64, size) to allocate memory with 64-byte alignment, since this processor supports AVX-512. Also, I round the size up to a multiple of 64 (a sketch of this pattern follows these notes).
I have added OpenMP support to my code and have parallelized some loops, but for these experiments that I am reporting, I am using only 1 OpenMP thread.
I have tried -mavx2, and I got the same result as -xCORE-AVX512.
I have used -O3 instead of -Ofast. I did not get any speed-up.
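For reference, roughly the allocation pattern described above (a minimal sketch; the helper name alloc_aligned_doubles is made up):

#include <cstdlib>  // std::aligned_alloc (C++17) and std::free

// Allocate count doubles with 64-byte alignment for AVX-512, rounding the
// byte count up to a multiple of 64 as aligned_alloc formally requires.
double* alloc_aligned_doubles(std::size_t count) {
    std::size_t bytes = count * sizeof(double);
    std::size_t rounded = (bytes + 63) / 64 * 64;
    return static_cast<double*>(std::aligned_alloc(64, rounded));
}
// Release with std::free().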

Should I trust profiling inside or outside of callgrind for a function that calls glibc's sin()?

I'm working on an audio library in which the sine of a number needs to be calculated within a very tight loop. Various levels of inaccuracy in the results might be tolerable for the user depending on their goals and environment, so I'm providing the ability to pick between a few sine approximations with differing accuracy and speed characteristics. One of these shows as ~31% faster than glibc's sin() when running under callgrind, but ~2% slower when running outside of it if the library is compiled with -O3 and ~25% slower if compiled with -Ofast. Should I trust callgrind or the "native" results, in terms of designing the library's interface?
My gut instinct is to distrust callgrind and go with the wall-clock results, because that's what really matters in the end anyway. However, I'm worried that what I'm seeing is caused by something particular about my processor (i7-7700k), compiler (gcc 10.2.0) or other aspects of my environment (Arch Linux, kernel v5.9.13) that might not carry over for other users. Is there any chance that callgrind is showing me something "generally true", even if it's not quite true for me specifically?
The relative performance differences of the in-library sine implementations stay the same in and outside of callgrind; only the apparent performance of glibc's sin() differs. These patterns hold with variable amounts of work and across repeated runs. Interestingly, with -O1 the relative performance differences are comparable inside and outside of callgrind, but not with -O0, -O2, -O3, or -Ofast.
The input to glibc's sin() is in many ways a good case for it: it's a double that is always <= 2π, and is never subnormal, NaN, or infinite. This makes me wonder if the glibc sin() might be calling my CPU's fsin instruction some of the time, as Intel's documentation says it's reasonably accurate for arguments < ~3π/4 (see Intel 64 and IA-32 Architectures Developer's Manual: Vol. 1, pg. 8-22). If that is the case, it seems possible that the behavior of the Valgrind VM would have significantly different performance characteristics for that instruction, since in theory less attention might be paid to it during development than more frequently-used instructions. However, I've read the C source for the current Linux x86-64 implementation of sin() in glibc and I don't remember anything like that, nor do I see it in the callgrind disassembly (it seems to be doing its work "manually" using general-purpose AVX instructions). I've heard that glibc used to use fsin years ago, but my understanding is that they stopped because of its accuracy issues.
The only place I've found discussion of anything along the lines of what I'm seeing is an old thread on the GCC mailing list, but although it was interesting to look over I didn't notice anything there that clarified this (and I'd be wary about taking information from 2012 at face value anyway).
When you run a program under Callgrind or any other tool of the Valgrind family, it is disassembled on the fly. The intermediate representation is then instrumented, and translated back to the native instruction set.
The profiling figures that Callgrind and Cachegrind give you are figures for the simplified processors they are modeling. As they don't have a detailed model of a modern CPU's pipeline, their results will not accurately reflect differences of actual performance (they can capture effects on the order of "this function executes 3x more instructions than the other function", but not "this instruction sequence can be executed with higher instruction-level parallelism").
One of the most important things when computing sin-like functions in a loop is allowing the computation to be vectorized: on x86, SSE2 offers a 2x vectorization factor for double and 4x for float. The compiler can achieve that more easily if you have inlinable, branchless approximate functions, although it is possible with a new enough Glibc and GCC too (but you need to pass a large subset of the -ffast-math flags to GCC to achieve it).
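For example, a minimal sketch of the kind of inlinable, branchless approximation meant here (the coefficients are just the Taylor terms, so it is only reasonable for |x| up to roughly pi/2 and does no range reduction; a real implementation would use minimax coefficients):

#include <cstddef>

// Branchless odd polynomial: x - x^3/6 + x^5/120 - x^7/5040, in Horner form.
static inline float sin_approx(float x) {
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f/6.0f + x2 * (1.0f/120.0f + x2 * (-1.0f/5040.0f))));
}

// Because sin_approx is inlinable and has no branches, a loop like this is
// easy for GCC/Clang to auto-vectorize at -O3.
void sin_buffer(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = sin_approx(in[i]);
}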
If you haven't seen it already: Arm's optimized-routines repository has a number of modern vectorizable implementations of several functions, including sin/cos in both single and double precision.
P.S. sin should never return a zero result for a tiny but non-zero argument. When x is close to zero, sin(x) and x differ by less than x*x*x, so as you approach zero, x itself becomes the closest representable number to sin(x).

AVX equivalent for _mm_movelh_ps

Since there is no AVX version of _mm_movelh_ps, I usually use _mm256_shuffle_ps(a, b, 0x44) for AVX registers as a replacement. However, I remember reading in other questions that swizzle instructions without a control integer (like _mm256_unpacklo_ps or _mm_movelh_ps) should be preferred if possible (for a reason I don't know). Yesterday, it occurred to me that another alternative might be the following:
_mm256_castpd_ps(_mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Also, if that is truly the case, it would be nice if somebody could explain in simple words (I have very limited understanding of assembly and microarchitecture) why one should prefer instructions without a control integer.
Thanks in advance
Additional note:
Clang actually optimizes the shuffle to vunpcklpd: https://godbolt.org/z/9XFP8D
So it seems that my idea is not too bad. However, GCC and ICC create a shuffle instruction.
Avoiding an immediate saves 1 byte of machine-code size; that's all. It's at the bottom of the list for performance considerations, but all else equal shuffles like _mm256_unpacklo_pd with an implicit "control" are very slightly better than an immediate control byte for that reason.
(But taking the control operand in another vector, as vpermilps can and vpermd requires, is usually worse, unless you have some weird front-end bottleneck in a long-running loop and can load the shuffle control outside the loop. That's not very plausible, and at that point you'd have to be writing asm by hand to care that much about code size/alignment; in C++ that's still not something you can really control directly.)
Since the casts are supposed to be no-ops, is this better/equal/worse than using _mm256_shuffle_ps regarding performance?
Ice Lake has 2/clock vshufps vs. 1/clock vunpcklpd, according to testing by uops.info on real hardware, running on port 1 or port 5. Definitely use _mm256_shuffle_ps. The trivial extra code-size cost probably doesn't actually hurt at all on earlier CPUs, and is probably worth it for the future benefit on ICL, unless you're sure that port 5 won't be a bottleneck.
Ice Lake has a 2nd shuffle unit on port 1 that can handle some common XMM and in-lane YMM shuffles, including vpshufb and apparently some 2-input shuffles like vshufps. I have no idea why it doesn't just decode vunpcklpd as a vshufps with that control vector, or otherwise manage to run that shuffle on port 1. We know the shuffle HW itself can do the shuffle so I guess it's just a matter of control hardware to set up implicit shuffles, mapping an opcode to a shuffle control somehow.
Other than that, it's equal or better on older AVX CPUs; no CPUs have penalties for using PD shuffles between other PS instructions. The only difference on any existing CPUs is code size. Old CPUs like K8 and Core 2 had faster pd shuffles than ps, but no CPUs with AVX have shuffle units with that weakness. Also, AVX's non-destructive (3-operand) encoding levels out any differences in which operand has to be the destination.
As you can see from the Godbolt link, there are zero extra instructions before/after the shuffle. The "cast" intrinsics aren't doing any conversion, just a reinterpret to keep the C++ type system happy, because Intel decided to have separate types for __m256 vs. __m256d (vs. __m256i) instead of having one generic YMM type. They chose not to have separate uint8x16 vs. uint32x4 vectors the way ARM did, though; for integer SIMD there is just __m256i.
So there's no need for compilers to emit extra instructions for casts, and in practice that's true; they don't introduce extra vmovaps/apd register copies or anything like that.
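Put together, the two forms look like this (a sketch; the wrapper names are made up), both producing [a0, a1, b0, b1] within each 128-bit lane:

#include <immintrin.h>

// Preferred: vshufps, which Ice Lake can run on port 1 or port 5.
static inline __m256 movelh_ps_shuffle(__m256 a, __m256 b) {
    return _mm256_shuffle_ps(a, b, 0x44);
}

// Alternative: the casts are free, but vunpcklpd is port-5-only on Ice Lake.
static inline __m256 movelh_ps_unpack(__m256 a, __m256 b) {
    return _mm256_castpd_ps(
        _mm256_unpacklo_pd(_mm256_castps_pd(a), _mm256_castps_pd(b)));
}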
If you're using clang you can just write it conveniently and let clang's shuffle optimizer emit vunpcklpd for you. Or in other cases, do whatever it's going to do anyway; sometimes it makes worse choices than the source, often it does a good job.
Clang gets this wrong with -march=icelake-client, still using vunpcklpd even if you write _mm256_shuffle_ps. (Or depending on surrounding code, might optimize that shuffle into part of something else.)
Related bug report.

C++ techniques for reducing CPU instruction sizes?

Each CPU instruction consumes a number of bytes. The smaller the size, the more instructions can be held in the CPU cache.
What techniques are available when writing C++ code which allow you to reduce CPU instruction sizes?
One example could be reducing the number of far jumps (jumps to code at distant addresses). Because the offset is a smaller number, it can be encoded in fewer bytes and the overall instruction is smaller.
I thought GCC's __builtin_expect may reduce jump instruction sizes by putting unlikely instructions further away.
I think I have seen somewhere that it's better to use an int32_t rather than an int16_t because it is the native CPU integer size and therefore produces more efficient CPU instructions.
Or is this something which can only be done whilst writing assembly?
Now that we've all fought over micro/macro optimization, let's try to help with the actual question.
I don't have a full, definitive answer, but you might be able to start here. GCC has some macro hooks for describing performance characteristics of the target hardware. You could theoretically set up a few key macros to help gcc favor "smaller" instructions while optimizing.
Based on very limited information from this question and its one reply, you might be able to get some gain from the TARGET_RTX_COSTS costs hook. I haven't yet done enough follow up research to verify this.
I would guess that hooking into the compiler like this will be more useful than any specific C++ idioms.
Please let us know if you manage any performance gain. I'm curious.
If a processor has various length (multi-byte) instructions, the best you can do is to write your code to help the compiler make use of the smaller instruction sizes.
Get the Code Working Robustly & Correctly First
Debugging optimized code is more difficult than debugging code that is not optimized: in unoptimized builds, the symbols the debugger uses line up with the source code better. During optimization, the compiler can eliminate or rearrange code, which puts the executable out of sync with the source listing.
Know Your Assembly Instructions
Not all processors have variable-length instructions. Become familiar with your processor's instruction set. Find out which instructions are small (one byte) versus multi-byte.
Write Code to Use Small Assembly Instructions
Help out your compiler and write your code to take advantage of the smaller instructions.
Print out the assembly language code to verify that the compiler uses the small instructions.
Change your code if necessary to help out the compiler.
There is no guarantee that the compiler will use small instructions. The compiler emits instructions that it thinks will have the best performance according to the optimization settings.
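For example, one way to check what the compiler actually emits is to dump the assembly for a small translation unit and compare the encodings at different optimization levels (a minimal sketch; the file and function names are made up):

// size_check.cpp - a tiny function whose generated code you can inspect.
// Emit assembly:   g++ -Os -S size_check.cpp      (optimize for size)
// Or disassemble:  g++ -O2 -c size_check.cpp && objdump -d size_check.o
int sum_bytes(const unsigned char* p, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += p[i];  // compare the instruction encodings -Os vs -O2 produce
    return total;
}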
Write Your Own Assembly Language Function
After generating the assembly language source code, you are now better equipped to replace the high-level language version with an assembly language version. You have the freedom to use small instructions.
Beware the Jabberwocky
Smaller instructions may not be the best solution in all cases. For example, Intel processors have block instructions (which perform operations on blocks of data). These block instructions perform better than loops of small instructions. However, the block instructions take up more bytes than the smaller instructions.
The processor will fetch as many bytes as necessary, depending on the instruction, into its instruction cache. If you can write loops or code that fits into the cache, the instruction sizes become less of a concern.
Also, many processors will use large instructions to communicate with other processors, such as a floating point processor. Reducing the floating point math in your program may reduce the quantity of these instructions.
Trim the Code Tree & Reduce the Branches
In general, branching slows down processing. Branches are changes of execution to a new location, such as loops and function calls. Processors love data instructions, because they don't have to reload the instruction pipeline. Increasing the amount of data instructions and reducing the quantity of branches will improve performance, usually regardless of the instruction sizes.
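As a small sketch of steering the hot path (tying in the __builtin_expect idea from the question; __builtin_expect is GCC/Clang-specific, and C++20 offers [[likely]]/[[unlikely]] as a portable alternative):

// Hint that the error path is cold so the compiler can move it out of line
// and keep the hot path as a short fall-through with fewer taken branches.
int process(int value) {
    if (__builtin_expect(value < 0, 0)) {  // expected to be rare
        return -1;                         // cold error path
    }
    return value * 2;                      // hot path
}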

How to measure read/cycle or instructions/cycle?

I want to thoroughly measure and tune my C/C++ code to perform better with caches on a x86_64 system. I know how to measure time with a counter (QueryPerformanceCounter on my Windows machine) but I'm wondering how would one measure the instructions per cycle or reads/write per cycle with respect to the working set.
How should I proceed to measure these values?
Modern processors (i.e., those that are not very constrained and are less than some 20 years old) are superscalar, i.e., they execute more than one instruction at a time (given suitable instruction ordering). The latest x86 processors translate the CISC instructions into internal RISC-like operations, reorder them, and execute the result; they even have several register banks, so instructions using "the same registers" can be executed in parallel. There isn't any reasonable way to define the "time an instruction's execution takes" today.
Current CPUs are much faster than memory (a main-memory access typically costs as much as a few hundred instructions), so they are all heavily dependent on caches for performance. And then you have all kinds of funny effects from cores sharing (or not sharing) parts of the cache, ...
Tuning code for maximal performance starts with the software architecture, goes on to program organization, algorithm and data structure selection (here a modicum of cache/virtual-memory awareness is useful too), careful programming, and (as the most extreme measure to squeeze out the last 2% of performance) considerations like the ones you mention (and the other favorite, "rewrite in assembly"). The ordering is that way because the earlier levels give more performance for the same cost. Measure before digging in; programmers are notoriously unreliable at finding bottlenecks. And consider the cost of reorganizing code for performance: the work itself, convincing yourself that the complex code is correct, and maintenance. Given the relative costs of computers and people, extreme performance tuning rarely makes sense (perhaps for heavily travelled code paths in popular operating systems, or for common code paths generated by a compiler, but almost nowhere else).
If you are really interested in where your code is hitting cache and where it is hitting memory, and the processor is less than about 10-15 years old in its design, then there are performance counters in the processor. You need driver level software to access these registers, so you probably don't want to write your own tools for this. Fortunately, you don't have to.
There are tools like VTune from Intel, CodeAnalyst from AMD, and oprofile for Linux (which works with both AMD and Intel processors).
There is a whole range of different counters that record the number of instructions actually completed, the number of cycles the processor spends waiting, and so on. You can also get counts of things like "number of memory reads", "number of cache misses", "number of TLB misses", and "number of FPU instructions".
The next, trickier part is of course trying to fix these sorts of issues, and as mentioned in another answer, programmers aren't always good at tweaking these sorts of things. It's certainly time-consuming, not to mention that what works well on processor model X will not necessarily run fast on model Y (there were some tuning tricks for the early Pentium 4 that work VERY badly on AMD processors; if, on the other hand, you tune that code for AMD processors of that age, you get code that runs well on the same-generation Intel processors too!).
You might be interested in the rdtsc x86 instruction, which reads a relative number of cycles.
See http://www.fftw.org/cycle.h for an implementation to read the counter in many compilers.
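A minimal sketch of reading the counter via GCC/Clang's __rdtsc() intrinsic (only meaningful for comparing variants on the same machine, for the reasons listed below):

#include <x86intrin.h>  // __rdtsc() with GCC/Clang; MSVC provides it in <intrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    volatile double sink = 0.0;
    std::uint64_t start = __rdtsc();
    for (int i = 0; i < 1000000; ++i)
        sink = sink + i * 0.5;             // the workload being measured
    std::uint64_t end = __rdtsc();
    std::printf("elapsed reference cycles: %llu\n",
                (unsigned long long)(end - start));
}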
However, I'd suggest simply measuring using QueryPerformanceCounter. It is rare that the actual number of cycles is important; to tune code you typically only need to be able to compare relative time measurements, and rdtsc has many pitfalls (though they are probably not applicable to the situation you described):
On multiprocessor systems, there is not a single coherent cycle counter value.
Modern processors often adjust their frequency, changing the relationship between elapsed time and elapsed cycles.