C++ Centralizing SIMD usage - c++

I have a library and many projects that depend on it. I want to optimize certain procedures inside the library using SIMD extensions. However, it is important for me to stay portable, so the SIMD usage should be abstracted away from the user.
I should say up front that I don't want to use some other great library that does the trick. I actually want to understand whether what I want is possible, and to what extent.
My very first idea was to have a "vector" wrapper class, so that the usage of SIMD is transparent to the user, and a "scalar" vector class that could be used if no SIMD extension is available on the target machine.
The naive thought came to my mind to use the preprocessor to select one vector class out of many, depending on the target the library is compiled for. So one scalar vector class, one with SSE (basically something like this: http://fastcpp.blogspot.de/2011/12/simple-vector3-class-with-sse-support.html), and so on, all with the same interface.
This gives me good performance, but it means I would have to compile the library for every SIMD ISA that I use. I would rather evaluate the processor capabilities dynamically at runtime and select the "best" implementation available.
So my second guess was to have a general "vector" class with abstract methods. The "processor evaluator" function would then return instances of the optimal implementation. Obviously this would lead to ugly code, but the pointer to the vector object could be stored in a smart-pointer-like container that just delegates the calls to the vector object. Actually I would prefer this method because of its abstraction, but I'm not sure whether calling the virtual methods will kill the performance that I gain using SIMD extensions.
The last option that I figured out would be to optimize whole routines and select the optimal one at runtime. I don't like this idea so much because it forces me to implement whole functions multiple times. I would prefer to do this once; using my idea of the vector class, I would like to write something like this, for example:
void Memcopy(void *dst, void *src, size_t size)
{
vector v;
for(int i = 0; i < size; i += v.size())
{
v.load(src);
v.store(dst);
dst += v.size();
src += v.size();
}
}
I assume here that "size" is a correct value so that no overlapping happens. This example should just show what I would prefer to have. The size method of the vector object would, for example, just return 4 if SSE is used and 1 if the scalar version is used.
Is there a proper way to implement this using only runtime information without losing too much performance? Abstraction is more important to me than performance, but as this is a performance optimization I wouldn't include it if it did not speed up my application.
I also found this on the web: http://compeng.uni-frankfurt.de/?vc
It's open source, but I don't understand how the correct vector class is chosen.

Your idea will only compile to efficient code if everything inlines at compile time, which is incompatible with runtime CPU dispatching. For v.load(), v.store(), and v.size() to actually be different at runtime depending on the CPU, they'd have to be actual function calls, not single instructions. The overhead would be killer.
If your library has functions that are big enough to work without being inlined, then function pointers are great for dispatching based on runtime CPU detection. (e.g. make multiple versions of memcpy, and pay the overhead of runtime detection once per call, not twice per loop iteration.)
This shouldn't be visible in your library's external API/ABI, unless your functions are mostly so short that the overhead of an extra (direct) call/ret matters. In the implementation of your library functions, put each sub-task that you want to make a CPU-specific version of into a helper function. Call those helper functions through function pointers.
Start with your function pointers initialized to versions that will work on your baseline target. e.g. SSE2 for x86-64, scalar or SSE2 for legacy 32-bit x86 (depending on whether you care about Athlon XP and Pentium III), and probably scalar for non-x86 architectures. In a constructor or library init function, do a CPUID and update the function pointers to the best version for the host CPU. Even if your absolute baseline is scalar, you could make your "good performance" baseline something like SSSE3, and not spend much/any time on SSE2-only routines. Even if you're mostly targeting SSSE3, some of your routines will probably end up only requiring SSE2, so you might as well mark them as such and let the dispatcher use them on CPUs that only do SSE2.
Updating the function pointers shouldn't even require any locking. Any calls that happen from other threads before your constructor is done setting function pointers may get the baseline version, but that's fine. Storing a pointer to an aligned address is atomic on x86. If it's not atomic on any platform where you have a version of a routine that needs runtime CPU detection, use C++ std::atomic (with memory_order_relaxed stores and loads, not the default sequential consistency, which would trigger a full memory barrier on every load). It matters a lot that there's minimal overhead when calling through the function pointers, and it doesn't matter what order different threads see the changes to the function pointers. They're write-once.
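The dispatch scheme described above can be sketched roughly as follows. This is only an illustration under assumptions: the names CopyFn, copy_baseline, copy_sse2_placeholder, g_copy, init_dispatch and lib_copy are made up for the example, the "SSE2" routine is a placeholder rather than real intrinsics code, and CPU detection uses the GCC/Clang builtin __builtin_cpu_supports on x86 instead of a hand-written CPUID call.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper signature; all names in this sketch are illustrative.
using CopyFn = void (*)(void* dst, const void* src, std::size_t n);

// Baseline version: must run on every CPU the library supports.
static void copy_baseline(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
}

// Stand-in for a SIMD version; a real one would use intrinsics and live in
// a translation unit compiled with the matching -m flags.
static void copy_sse2_placeholder(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Starts at the baseline, so calls made before init still work.
static std::atomic<CopyFn> g_copy{copy_baseline};

// Call once from a library init function or a global constructor.
void init_dispatch() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse2"))
        g_copy.store(copy_sse2_placeholder, std::memory_order_relaxed);
#endif
}

// Public entry point: one relaxed load plus one indirect call per use.
void lib_copy(void* dst, const void* src, std::size_t n) {
    assert(dst != nullptr && src != nullptr);
    g_copy.load(std::memory_order_relaxed)(dst, src, n);
}
```

The detection cost is paid once in init_dispatch; each call to lib_copy only pays for the pointer load and the indirect call, and the pointer never appears in the library's external API.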
x264 (the heavily-optimized open source h.264 video encoder) uses this technique extensively, with arrays of function pointers. See x264_mc_init_mmx(), for example. (That function handles all CPU dispatching for Motion Compensation functions, from MMX to AVX2). I assume libx264 does the CPU dispatching in the "encoder init" function. If you don't have a function that users of your library are required to call, then you should look into some kind of mechanism for running global constructor / init functions when programs using your library start up.
If you want this to work with very C++ey code (C++ish? Is that a word?), i.e. templated classes & functions, the program using the library will probably have to do the CPU dispatching, and arrange to get baseline and multiple CPU-requirement versions of functions compiled.

I do exactly this with a fractal project. It works with vector sizes of 1, 2, 4, 8, and 16 for float and 1, 2, 4, 8 for double. I use a CPU dispatcher at run-time to select the following instructions sets: SSE2, SSE4.1, AVX, AVX+FMA, and AVX512.
The reason I use a vector size of 1 is to test performance. There is already a SIMD library that does all this: Agner Fog's Vector Class Library. He even includes example code for a CPU dispatcher.
The VCL emulates hardware such as AVX on systems that only have SSE (or even AVX512 on SSE). It just implements AVX twice (or four times for AVX512), so in most cases you can just use the largest vector size you want to target.
//#include "vectorclass.h"
void Memcopy(void *dst, void *src, size_t size)
{
Vec8f v; //eight floats using AVX hardware or AVX emulated with SSE twice.
for(int i = 0; i < size; i +=v.size())
{
v.load(src);
v.store(dst);
dst += v.size();
src += v.size();
}
}
(However, writing an efficient memcpy is complicated. For large sizes you should consider non-temporal stores, and on IVB and above use rep movsb instead.) Notice that the code is identical to what you asked for, except that I changed the word vector to Vec8f.
Using the VCL, a CPU dispatcher, templating, and macros, you can write your code/kernel so that it looks nearly identical to scalar code, without source code duplication for every different instruction set and vector size. It's your binaries which will be bigger, not your source code.
I have described CPU dispatchers several times. You can also see an example using templating and macros for a dispatcher here: alias of a function template
Edit: Here is an example of part of my kernel to calculate the Mandelbrot set for a set of pixels equal to the vector size. At compile time I set TYPE to float, double, or doubledouble and N to 1, 2, 4, 8, or 16. The type doubledouble is described here which I created and added to the VCL. This produces Vector types of Vec1f, Vec4f, Vec8f, Vec16f, Vec1d, Vec2d, Vec4d, Vec8d, doubledouble1, doubledouble2, doubledouble4, doubledouble8.
template<typename TYPE, unsigned N>
static inline intn calc(floatn const &cx, floatn const &cy, floatn const &cut, int32_t maxiter) {
floatn x = cx, y = cy;
intn n = 0;
for(int32_t i=0; i<maxiter; i++) {
floatn x2 = square(x), y2 = square(y);
floatn r2 = x2 + y2;
booln mask = r2<cut;
if(!horizontal_or(mask)) break;
add_mask(n,mask);
floatn t = x*y; mul2(t);
x = x2 - y2 + cx;
y = t + cy;
}
return n;
}
So my SIMD code for several different data types and vector sizes is nearly identical to the scalar code I would use. I have not included the part of my kernel which loops over each super-pixel.
My build file looks something like this
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass kernel.cpp -okernel_sse2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse4.1 -Ivectorclass kernel.cpp -okernel_sse41.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx -Ivectorclass kernel.cpp -okernel_avx.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel.cpp -okernel_avx2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma -Ivectorclass kernel_fma.cpp -okernel_fma.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx512f -mfma -Ivectorclass kernel.cpp -okernel_avx512.o
g++ -m64 -Wall -Wextra -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass frac.cpp vectorclass/instrset_detect.cpp kernel_sse2.o kernel_sse41.o kernel_avx.o kernel_avx2.o kernel_avx512.o kernel_fma.o -o frac
Then the dispatcher looks something like this
int iset = instrset_detect();
fp_float1 = NULL;
fp_floatn = NULL;
fp_double1 = NULL;
fp_doublen = NULL;
fp_doublefloat1 = NULL;
fp_doublefloatn = NULL;
fp_doubledouble1 = NULL;
fp_doubledoublen = NULL;
fp_float128 = NULL;
fp_floatn_fma = NULL;
fp_doublen_fma = NULL;
if (iset >= 9) {
fp_float1 = &manddd_AVX512<float,1>;
fp_floatn = &manddd_AVX512<float,16>;
fp_double1 = &manddd_AVX512<double,1>;
fp_doublen = &manddd_AVX512<double,8>;
fp_doublefloat1 = &manddd_AVX512<doublefloat,1>;
fp_doublefloatn = &manddd_AVX512<doublefloat,16>;
fp_doubledouble1 = &manddd_AVX512<doubledouble,1>;
fp_doubledoublen = &manddd_AVX512<doubledouble,8>;
}
else if (iset >= 8) {
fp_float1 = &manddd_AVX<float,1>;
fp_floatn = &manddd_AVX2<float,8>;
fp_double1 = &manddd_AVX2<double,1>;
fp_doublen = &manddd_AVX2<double,4>;
fp_doublefloat1 = &manddd_AVX2<doublefloat,1>;
fp_doublefloatn = &manddd_AVX2<doublefloat,8>;
fp_doubledouble1 = &manddd_AVX2<doubledouble,1>;
fp_doubledoublen = &manddd_AVX2<doubledouble,4>;
}
....
This sets function pointers to each of the possible datatype/vector combinations for the instruction set found at runtime. Then I can call whichever function I'm interested in.

Thanks Peter Cordes and Z boson. With both of your replies I came to a solution that satisfies me.
I chose Memcopy as an example simply because everyone knows it, and because of its beautiful simplicity (but also slowness) when implemented naively, in contrast to SIMD optimizations, which are often hard to read but of course much faster.
I now have two classes (more are possible, of course), a scalar vector and an SSE vector, both with inline methods. To the user I show something like:
typedef void(*MEM_COPY_FUNC)(void *, const void *, size_t);
extern MEM_COPY_FUNC memCopyPointer;
I declare my function something like this, as Z boson pointed out:
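template<typename VectorType>
void MemCopyTemplate(void *pDest, const void *prc, size_t size)
{
VectorType v;
byte *pDst, *pSrc;
uint32 mask;
pDst = (byte *)pDest;
pSrc = (byte *)prc;
// GetSize() is assumed to return the vector width in bytes (a power of two)
mask = v.GetSize() - 1;
// copy single bytes until the remaining size is a multiple of the vector width
while(size & mask)
{
*pDst++ = *pSrc++;
size--;
}
// copy the rest one vector at a time
while(size)
{
v.Load(pSrc);
v.Store(pDst);
pDst += v.GetSize();
pSrc += v.GetSize();
size -= v.GetSize();
}
}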
And at runtime, when the library is loaded, I use CPUID to do either
memCopyPointer = MemCopyTemplate<ScalarVector>;
or
memCopyPointer = MemCopyTemplate<SSEVector>;
as you both suggested. Thanks a lot.

Related

Function pointer performance; slower on a single call than multiple calls?

I am interested in the execution speed of a function called through a pointer. I found initially that calling a function pointer through a pointer passed in as a parameter is slower than calling a locally declared function pointer. Please see the following code; you can see I have two function calls, both of which ultimately execute a lambda through a function pointer.
#include <chrono>
#include <iostream>
using namespace std;
__attribute__((noinline)) int plus_one(int x) {
return x + 1;
}
typedef int (*FUNC)(int);
#define OUTPUT_TIME(msg) std::cout << "Execution time (ns) of " << msg << ": " << std::chrono::duration_cast<chrono::nanoseconds>(t_end - t_start).count() << std::endl;
#define START_TIMING() auto const t_start = std::chrono::high_resolution_clock::now();
#define END_TIMING(msg) auto const t_end = std::chrono::high_resolution_clock::now(); OUTPUT_TIME(msg);
auto constexpr g_count = 1000000;
__attribute__((noinline)) int speed_test_no_param() {
int r;
auto local_lambda = [](int a) {
return plus_one(a);
};
FUNC f = local_lambda;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_no_param");
return r;
}
__attribute__((noinline)) int speed_test_with_param(FUNC &f) {
int r;
START_TIMING();
for (auto i = 0; i < g_count; ++i)
r = f(100);
END_TIMING("speed_test_with_param");
return r;
}
int main() {
int ret = 0;
auto main_lambda = [](int a) {
return plus_one(a);
};
ret += speed_test_no_param();
FUNC fp = main_lambda;
ret += speed_test_with_param(fp);
return ret;
}
Built on Ubuntu 20.04 with:
g++ -ggdb -ffunction-sections -O3 -std=c++17 -DNDEBUG=1 -DRELEASE=1 -c speed_test.cpp -o speed_test.o && g++ -o speed_test -Wl,-gc-sections -Wl,--start-group speed_test.o -Wl,--rpath='$ORIGIN' -Wl,--end-group
The results were not surprising; for any given number of runs, we see that the version without the parameter is clearly the fastest. Here is just one run; all of the many times I have run, this yields the same result:
Execution time (ns) of speed_test_no_param: 74
Execution time (ns) of speed_test_with_param: 1173849
When I dug into the assembly, I found what I believe is the reason for this. The code for speed_test_no_param() is:
0x000055555555534b call 0x555555555310 <plus_one(int)>
... whereas the code for speed_test_with_param is more complicated; a fetch of the address of the lambda, then a jump to the plus_one function:
0x000055555555544e call QWORD PTR [rbx]
...
0x0000555555555324 jmp 0x555555555310 <plus_one(int)>
(On compiler explorer at https://godbolt.org/z/b4hqYx7Eo. Different compiler but similar assembly; timing code commented out.)
What I didn't expect though is that when I reduce the number of calls down to 1 from 1000000 (auto constexpr g_count = 1), the results are flipped with the parameter version being the fastest:
Execution time (ns) of speed_test_no_param: 61
Execution time (ns) of speed_test_with_param: 31
I have also run this many times; the parameter version is always the fastest.
I do not understand why this is; I don't now believe a call through a parameter is slower than a local variable due to this conflicting evidence, but looking at the assembly suggests it really should be.
Can someone please explain?
UPDATE
As per the comment below, ordering matters. When I call speed_test_with_param() first, speed_test_no_param() is the fastest of the two! Yet when I call speed_test_no_param() first, speed_test_with_param() is the fastest! Any explanation to this would be greatly appreciated!
With multiple loop iterations in the C++ source, the fast version is only doing one in asm, because you gave the optimizer enough visibility to prove that's equivalent.
Why ordering matters with just one iteration: probably warm-up effects in the library code for std::chrono. See "Idiomatic way of performance evaluation?".
Can you confirm that my suspicion that the call without the parameter technically should be the fastest, because with the parameter involves a memory read to find the location to call?
Much more significant is whether the compiler can constant-propagate the function pointer and see what function is being called; notice how speed_test_with_param has an actual loop that calls g_count times, but speed_test_no_param can see it's calling plus_one. Clang sees through the local lambda and the noinline to notice it has no side-effects, so it only calls it once.
It doesn't inline, but it still does inter-procedural optimization. With GCC, you could block that by using __attribute__((noipa)). GCC's noclone attribute can also stop it from making a copy of the function with constant-propagation into it, but noipa is I think stronger. noinline isn't sufficient for benchmarking stuff that becomes trivial to optimize when the compiler can see everything. But I don't think clang has anything like that.
You can make functions opaque to the optimizer by putting them in separate source files and not using -flto or another whole-program option such as gcc -fwhole-program.
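As a rough illustration of the attribute approach mentioned above, the sketch below marks the callee with GCC's noipa attribute where it is available (GCC 8 and later) and falls back to noinline elsewhere, which is weaker. The macro name BENCH_OPAQUE and the functions plus_one_opaque and call_many are made up for this example.

```cpp
#include <cassert>

// noipa is a GCC attribute; other compilers get the weaker noinline, or nothing.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define BENCH_OPAQUE __attribute__((noipa))
#elif defined(__GNUC__)
#define BENCH_OPAQUE __attribute__((noinline))
#else
#define BENCH_OPAQUE
#endif

// With noipa the caller may not assume anything about this function's body.
BENCH_OPAQUE int plus_one_opaque(int x) { return x + 1; }

// The loop whose call overhead is being measured.
int call_many(int count) {
    assert(count >= 0);
    int r = 0;
    for (int i = 0; i < count; ++i)
        r = plus_one_opaque(100);
    return r;
}
```

Checking the generated assembly is still the only way to be sure the loop really calls the function count times.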
The only reason store/reload is involved with the function pointer is because you passed it by reference for no reason, even though it's just a single pointer. If you pass it by value (https://godbolt.org/z/WEvvsvoxb) you can see call rbx in the loop.
Apparently clang couldn't hoist the load because it wasn't sure the caller's function-pointer wouldn't be modified by the call, because it was making a stand-alone version of speed_test_with_param that would work with any caller and any arg, not just the one main passes. So constprop didn't happen.
An indirect call can mispredict more easily, and yes store/reload adds a few cycles more latency before the prediction can be checked.
So yes, in general you'd expect it to be slower when the function to be called is a function-pointer arg, not a compile-time-constant fptr initialized within the calling function where the compiler can see the definition of what it's calling even if you artificially limit it.
If it becomes a call some_name instead of call rbx, that's still faster even if it does still have to loop like you were trying to make it.
(Microbenchmarking is hard, especially when you're trying to benchmark a C++ concept which can optimize differently depending on context; you have to know enough about compilers, optimization, and assembly to realize what makes the difference and what you're actually measuring. There isn't a meaningful answer to some questions, like "how fast or slow is the + operator?", even if you limit it to integers, because it can optimize away with constants, or vectorize, or not depending on how it's used.)
You're benchmarking a single iteration, which subjects you to cache effects and other warmup costs. The entire reason we normally run benchmarks several times is to amortize out these kinds of effects.
Caching refers to the memory hierarchy: your actual RAM is significantly slower than your CPU (and disk even more so), so to speed things up your CPU has a cache (often, multiple caches) which stores the most recently accessed bits of memory. The first time you start your program, it will need to be loaded from disk into RAM; thereafter, it will need to be loaded from RAM into the CPU caches. Uncached memory accesses can be orders of magnitude slower than cached memory accesses. As your program runs, various bits of code and data will be loaded from RAM and cached; hence, subsequent executions of the same bit of code will often be faster than the first execution.
Other effects can include things like lazy dynamic linking and lazy initializations, wherein certain functions will perform extra work the first time they're called (for example, resolving dynamic library loads or initializing static data). These can all contribute to the first iteration being slower than subsequent iterations.
To address these issues, always make sure to run your benchmarks multiple times - and when possible, run your entire benchmark suite a few times in one process and take the lowest (fastest) run.
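A minimal sketch of that advice, assuming a hypothetical helper named best_of_ns: it runs the workload once untimed as a warm-up, then times several repetitions and reports the fastest one.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical helper: one untimed warm-up call, then `runs` timed
// repetitions; returns the fastest repetition in nanoseconds.
template <typename Work>
long long best_of_ns(Work work, int runs) {
    assert(runs > 0);
    work();  // warm-up: fills caches, triggers lazy linking and initialization
    long long best = std::numeric_limits<long long>::max();
    for (int i = 0; i < runs; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        long long ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        if (ns < best) best = ns;
    }
    return best;
}
```

Taking the minimum rather than the mean discards runs that were disturbed by interrupts or scheduling; the workload itself still has to be something the optimizer cannot remove.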

Why std::for_each is faster than __gnu_parallel::for_each

I'm trying to understand why std::for_each which runs on single thread is ~3 times faster than __gnu_parallel::for_each in the example below:
Time =0.478101 milliseconds
vs
Time =0.166421 milliseconds
Here is the code I'm using to benchmark:
#include <iostream>
#include <chrono>
#include <vector>
#include <parallel/algorithm>
//The struct I'm using for timing
struct TimerAvrg
{
std::vector<double> times;
size_t curr=0,n;
std::chrono::high_resolution_clock::time_point begin,end;
TimerAvrg(int _n=30)
{
n=_n;
times.reserve(n);
}
inline void start()
{
begin= std::chrono::high_resolution_clock::now();
}
inline void stop()
{
end= std::chrono::high_resolution_clock::now();
double duration=double(std::chrono::duration_cast<std::chrono::microseconds>(end-begin).count())*1e-6;
if ( times.size()<n)
times.push_back(duration);
else{
times[curr]=duration;
curr++;
if (curr>=times.size()) curr=0;}
}
double getAvrg()
{
double sum=0;
for(auto t:times)
sum+=t;
return sum/double(times.size());
}
};
int main( int argc, char** argv )
{
float sum=0;
for(int alpha = 0; alpha <5000; alpha++)
{
TimerAvrg Fps;
Fps.start();
std::vector<float> v(1000000);
std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
Fps.stop();
sum = sum + Fps.getAvrg()*1000;
}
std::cout << "\rTime =" << sum/5000<< " milliseconds" << std::endl;
return 0;
}
This is my configuration:
gcc version 7.3.0 (Ubuntu 7.3.0-21ubuntu1~16.04)
Intel® Core™ i7-7600U CPU @ 2.80GHz × 4
htop to check if the program is running in single or multiple threads
g++ -std=c++17 -fomit-frame-pointer -Ofast -march=native -ffast-math -mmmx -msse -msse2 -msse3 -DNDEBUG -Wall -fopenmp benchmark.cpp -o benchmark
The same code does not compile with gcc 8.1.0. I get this error message:
/usr/include/c++/8/tr1/cmath:1163:20: error: ‘__gnu_cxx::conf_hypergf’ has not been declared
using __gnu_cxx::conf_hypergf;
I already checked a couple of posts, but either they're very old or not about the same issue.
My questions are:
Why is it slower in parallel?
Am I using the wrong functions?
cppreference says that GCC does not support the Standardization of Parallelism TS (marked in red in the table), yet my code is running in parallel!?
Your function [](auto v){ v=0;} is extremely simple.
The compiler may replace the loop with a single call to memset, or use SIMD instructions for single-threaded parallelism. With the knowledge that it overwrites the same state as the vector initially had, the entire loop could be optimised away. It may be easier for the optimiser to do this for std::for_each than for a parallel implementation.
Furthermore, assuming the parallel loop uses threads, one must remember that thread creation and eventual synchronisation (in this case there is no need for synchronisation during processing) have overhead, which may be significant in relation to your trivial operation.
Threaded parallelism is often only worth it for computationally expensive tasks. v=0 is one of the least computationally expensive operations there is.
Your benchmark is faulty, I'm even surprised it takes time to run it.
You wrote:
std::for_each(v.begin(), v.end(),[](auto v){ v=0;});
As v is a by-value argument of the operator() and is never read, I would expect the assignment to be removed by your compiler.
As you now have a loop with an empty body, that loop can be removed as well, since it has no observable effect.
Similarly, the vector can be removed too, as nothing reads it.
So, without any side effects, this could all be removed. If you use a parallel algorithm, chances are you have some kind of synchronization, which makes optimizing this much harder, as there might be side effects in another thread. Proving there are none is more complex, not to mention the side effects of the thread management that could exist.
To solve this, a lot of benchmarks use tricks or macros to force the compiler to assume side effects. Use them in the lambda so the compiler doesn't remove it.
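One such trick can be sketched as follows. The helper keep_alive is hypothetical and only written in the spirit of the "do not optimize" utilities found in benchmarking libraries; on GCC and Clang it uses an empty inline-asm statement, elsewhere a volatile store. The loop also takes each element by reference, since a by-value parameter never touches the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: makes the compiler assume the memory behind p may be
// read by something it cannot see, so earlier stores to it must be kept.
inline void keep_alive(const void* p) {
#if defined(__GNUC__)
    asm volatile("" : : "g"(p) : "memory");
#else
    static const void* volatile sink;
    sink = p;
#endif
}

// Zero every element through a reference and keep the stores observable.
std::size_t zero_all(std::vector<float>& v) {
    for (auto& x : v) x = 0.0f;
    keep_alive(v.data());
    assert(v.empty() || v.front() == 0.0f);  // sanity check on the result
    return v.size();
}
```

This only keeps the work from being deleted; whether a parallel version then wins still depends on how expensive the per-element operation is.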

Puzzling performance difference between mac and a relatively powerful desktop

My original intention for writing this piece of code is to measure performance difference when an entire array is operated on by a function vs operating individual elements of an array.
i.e. comparing the following two statements:
function_vector(x, y, z, n);
vs
for(int i=0; i<n; i++){
function_scalar(x[i], y[i], z[i]);
}
where function_* does some substantial but identical calculations.
With -ffast-math turned on, the scalar version is roughly 2x faster on multiple machines I have tested on.
However, what's puzzling is the comparison of timings on two different machines, both using gcc 6.3.0:
# on desktop with Intel-Core-i7-4930K-Processor-12M-Cache-up-to-3_90-GHz
g++ loop_test.cpp -o loop_test -std=c++11 -O3
./loop_test
vector time = 12.3742 s
scalar time = 10.7406 s
g++ loop_test.cpp -o loop_test -std=c++11 -O3 -ffast-math
./loop_test
vector time = 11.2543 s
scalar time = 5.70873 s
# on mac with Intel-Core-i5-4258U-Processor-3M-Cache-up-to-2_90-GHz
g++ loop_test.cpp -o loop_test -std=c++11 -O3
./loop_test
vector time = 2.89193 s
scalar time = 1.87269 s
g++ loop_test.cpp -o loop_test -std=c++11 -O3 -ffast-math
./loop_test
vector time = 2.38422 s
scalar time = 0.995433 s
By all means the first machine is superior in terms of cache size, clock speed etc. Still the code runs 5x faster on the second machine.
Question:
Can this be explained? Or am I doing something wrong here?
Link to the code: https://gist.github.com/anandpratap/262a72bd017fdc6803e23ed326847643
Edit
After comments from ShadowRanger, I added the __restrict__ keyword to function_vector and -march=native compilation flag. This gives:
# on desktop with Intel-Core-i7-4930K-Processor-12M-Cache-up-to-3_90-GHz
vector time = 1.3767 s
scalar time = 1.28002 s
# on mac with Intel-Core-i5-4258U-Processor-3M-Cache-up-to-2_90-GHz
vector time = 1.05206 s
scalar time = 1.07556 s
Odds are possible pointer aliasing is limiting optimizations in the vectorized case.
Try changing the declaration of function_vector to:
void function_vector(double *__restrict__ x, double *__restrict__ y, double *__restrict__ z, const int n){
to use g++'s non-standard support for a feature matching C99's restrict keyword.
Without it, function_vector likely has to assume that the writes to x[i] could be modifying values in y or z, so it can't do read-ahead to get the values.
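To make the aliasing point concrete, here is a small sketch. The function add_scaled and the macro NOALIAS are invented for illustration and are not the code from the linked gist; the qualifier is a compiler extension, spelled __restrict__ in GCC/Clang and __restrict in MSVC.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER)
#define NOALIAS __restrict
#else
#define NOALIAS __restrict__
#endif

// Illustrative stand-in for function_vector. NOALIAS promises the compiler
// that x, y and z never overlap, so stores to x[i] cannot change y or z and
// the loop can be vectorized without runtime overlap checks.
void add_scaled(double* NOALIAS x, const double* NOALIAS y,
                const double* NOALIAS z, std::size_t n) {
    assert(x != y && x != z);  // partial check of the caller's promise
    for (std::size_t i = 0; i < n; ++i)
        x[i] = y[i] * 2.0 + z[i];
}
```

Calling such a function with overlapping arrays is undefined behaviour, so the qualifier is a promise the caller has to keep.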

What, in short words, does the GCC option -fipa-pta do?

According to the GCC manual, the -fipa-pta optimization does:
-fipa-pta: Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive memory and compile-time usage on large compilation units. It is not enabled by default at any optimization level.
What I assume is that GCC tries to differentiate mutable and immutable data based on pointers and references used in a procedure. Can someone with more in-depth GCC knowledge explain what -fipa-pta does?
I think the word "interprocedural" is the key here.
I'm not intimately familiar with gcc's optimizer, but I've worked on optimizing compilers before. The following is somewhat speculative; take it with a small grain of salt, or confirm it with someone who knows gcc's internals.
An optimizing compiler typically performs analysis and optimization only within each individual function (or subroutine, or procedure, depending on the language). For example, given code like this contrived example:
double *ptr = ...;
void foo(void) {
...
*ptr = 123.456;
some_other_function();
printf("*ptr = %f\n", *ptr);
}
the optimizer will not be able to determine whether the value of *ptr has been changed by the call to some_other_function().
If interprocedural analysis is enabled, then the optimizer can analyze the behavior of some_other_function(), and it may be able to prove that it can't modify *ptr. Given such analysis, it can determine that the expression *ptr must still evaluate to 123.456, and in principle it could even replace the printf call with puts("*ptr = 123.456000");.
(In fact, with a small program similar to the above code snippet I got the same generated code with -O3 and -O3 -fipa-pta, so I'm probably missing something.)
Since a typical program contains a large number of functions, with a huge number of possible call sequences, this kind of analysis can be very expensive.
As quoted from this article:
The "-fipa-pta" optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
*x = *y;
}
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15
int foo(void)
{
int a, b = 5;
bar(&a, &b);
return 15;
}
A more relevant example is the “slow” code from the “Integer division is slow” blog post
std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);
for (int i = 0; i < 1000000000; i++) {
volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.

an optimized memcpy for small, or fixed size data in gcc

I use memcpy to copy both variable sizes of data and fixed sized data. In some cases I copy small amounts of memory (only a handful of bytes). In GCC I recall that memcpy used to be an intrinsic/builtin. Profiling my code however (with valgrind) I see thousands of calls to the actual "memcpy" function in glibc.
What conditions have to be met to use the builtin function? I can roll my own memcpy quickly, but I'm sure the builtin is more efficient than what I can do.
NOTE: In most cases the amount of data to be copied is available as a compile-time constant.
CXXFLAGS: -O3 -DNDEBUG
This is the code I'm using now, which forces the builtin; if I take off the __builtin_ prefix, the builtin is not used. It is called from various other templates/functions using T=sizeof(type). The sizes that get used are 1, 2, multiples of 4, a few 50-100 byte sizes, and some larger structures.
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
__builtin_memcpy( address, data + at, T );
at += T;
}
For the cases where T is small, I'd specialise and use a native assignment.
For example, where T is 1, just assign a single char.
If you know the addresses are aligned, use an appropriately sized int type for your platform.
If the addresses are not aligned, you might be better off doing the appropriate number of char assignments.
The point of this is to avoid a branch and the upkeep of a counter.
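A minimal sketch of that specialisation idea follows. The names copy_fixed and Reader are made up for the example, and Reader only mimics the data/at/len members implied by the question; the generic case keeps memcpy with a compile-time size, which optimizing compilers can usually expand inline.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Generic case: N is a compile-time constant.
template <std::size_t N>
inline void copy_fixed(void* dst, const void* src) {
    std::memcpy(dst, src, N);
}

// One byte: a plain char assignment, no loop and no counter.
template <>
inline void copy_fixed<1>(void* dst, const void* src) {
    *static_cast<unsigned char*>(dst) = *static_cast<const unsigned char*>(src);
}

// Two bytes, possibly unaligned: two char assignments.
template <>
inline void copy_fixed<2>(void* dst, const void* src) {
    auto* d = static_cast<unsigned char*>(dst);
    auto* s = static_cast<const unsigned char*>(src);
    d[0] = s[0];
    d[1] = s[1];
}

// Hypothetical stand-in for the asker's stream class.
struct Reader {
    const unsigned char* data;
    std::size_t at;
    std::size_t len;

    template <std::size_t N>
    void load(void* address) {
        assert(at + N <= len);  // the original calls stream_error() here
        copy_fixed<N>(address, data + at);
        at += N;
    }
};
```

Whether the hand-written specialisations beat what the compiler already generates for a constant-size memcpy should be checked in the generated assembly.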
Where T is big, I'd be surprised if you do better than the library memcpy(), and the function call overhead is probably going to be lost in the noise. If you do want to optimise, look at the existing memcpy() implementations. There are variants that use extended instructions, etc.
Update:
Looking at your actual(!) question about inlining memcpy, questions like compiler versions and platform become relevant. Out of curiosity, have you tried using std::copy, something like this:
template<int T>
inline void load_binary_fixm(void *address)
{
if( (at + T) > len )
stream_error();
std::copy(data + at, data + at + T, static_cast<char*>(address));
at += T;
}