I wondered if unordered_map is implemented using type erasure, since an unordered_map<Key, A*> and unordered_map<Key, B*> can use exactly the same code (apart from casting, which is a no-op in machine code). That is, the implementation of both could be based on unordered_map<Key, void*> to save code size.
Update: This technique is commonly referred to as the Thin Template Idiom (Thanks to the commenters below for pointing that out).
Update 2: I would be particularly interested in Howard Hinnant's opinion. Let's hope he reads this.
So I wrote this small test:
#include <iostream>
#if BOOST
# include <boost/unordered_map.hpp>
using boost::unordered_map;
#else
# include <unordered_map>
using std::unordered_map;
#endif
struct A { A(int x) : x(x) {} int x; };
struct B { B(int x) : x(x) {} int x; };
int main()
{
#if SMALL
unordered_map<std::string, void*> ma, mb;
#else
unordered_map<std::string, A*> ma;
unordered_map<std::string, B*> mb;
#endif
ma["foo"] = new A(1);
mb["bar"] = new B(2);
std::cout << ((A*) ma["foo"])->x << std::endl;
std::cout << ((B*) mb["bar"])->x << std::endl;
// yes, it leaks.
}
And determined the size of the compiled output with various settings:
#!/bin/sh
for BOOST in 0 1 ; do
for OPT in 2 3 s ; do
for SMALL in 0 1 ; do
clang++ -stdlib=libc++ -O${OPT} -DSMALL=${SMALL} -DBOOST=${BOOST} map_test.cpp -o map_test
strip map_test
SIZE=$(echo "scale=1;$(stat -f "%z" map_test)/1024" | bc)
echo boost=$BOOST opt=$OPT small=$SMALL size=${SIZE}K
done
done
done
It turns out that, with all settings I tried, much of the inner code of unordered_map seems to be instantiated twice:
With Clang and libc++:
| -O2 | -O3 | -Os
-DSMALL=0 | 24.7K | 23.5K | 28.2K
-DSMALL=1 | 17.9K | 17.2K | 19.8K
With Clang and Boost:
| -O2 | -O3 | -Os
-DSMALL=0 | 23.9K | 23.9K | 32.5K
-DSMALL=1 | 17.4K | 17.4K | 22.3K
With GCC and Boost:
| -O2 | -O3 | -Os
-DSMALL=0 | 21.8K | 21.8K | 35.5K
-DSMALL=1 | 16.4K | 16.4K | 26.2K
(With the compilers from Apple's Xcode)
Now to the question: is there a convincing technical reason why the implementers have chosen to omit this simple optimization?
Also: why the hell is the effect of -Os exactly the opposite of what is advertised?
Update 3:
As suggested by Nicol Bolas, I have repeated the measurements with shared_ptr<void/A/B> instead of naked pointers (created with make_shared and cast with static_pointer_cast). The tendency in the results is the same:
With Clang and libc++:
| -O2 | -O3 | -Os
-DSMALL=0 | 27.9K | 26.7K | 30.9K
-DSMALL=1 | 25.0K | 20.3K | 26.8K
With Clang and Boost:
| -O2 | -O3 | -Os
-DSMALL=0 | 35.3K | 34.3K | 43.1K
-DSMALL=1 | 27.8K | 26.8K | 32.6K
Since I've been specifically asked to comment, I will, though I'm not sure I have much more to add than has already been said. (sorry it took me 8 days to get here)
I've implemented the thin template idiom before, for some containers, namely vector, deque and list. I don't currently have it implemented for any container in libc++. And I've never implemented it for the unordered containers.
It does save on code size. It also adds complexity, much more so than the referenced wikibooks link implies. One can also do it for more than just pointers. You can do it for all scalars which have the same size. For example, why have different instantiations for int and unsigned? Even ptrdiff_t can be stored in the same instantiation as T*. After all, it is all just a bag of bits at the bottom. But it is extremely tricky to get the member templates which take a range of iterators correct when playing these tricks.
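For illustration, here is a minimal sketch of the idiom applied to a pointer-holding container (the PtrVector/PtrVectorImpl names are just for this example, not libc++ code): a thin, fully inlined typed wrapper forwards to a single untyped implementation, so every T* instantiation shares the same out-of-line code.
#include <cstddef>
#include <vector>

// All pointer instantiations share the out-of-line code of this non-template base.
class PtrVectorImpl {
public:
    void push_back(void* p) { v_.push_back(p); }
    void* at(std::size_t i) const { return v_[i]; }
    std::size_t size() const { return v_.size(); }
private:
    std::vector<void*> v_;
};

// The typed layer only casts; it should inline away completely.
template <class T>
class PtrVector : private PtrVectorImpl {
public:
    void push_back(T* p) { PtrVectorImpl::push_back(p); }
    T* at(std::size_t i) const { return static_cast<T*>(PtrVectorImpl::at(i)); }
    using PtrVectorImpl::size;
};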
There are disadvantages though (besides difficulty of implementation). It doesn't play nearly as nicely with the debugger. At the very least it makes it much more difficult for the debugger to display container innards. And while the code size savings can be significant, I would stop short of calling the code size savings dramatic. Especially when compared to the memory required to store the photographs, animations, audio clips, street maps, years of email with all of the attachments from your best friends and family, etc. I.e. optimizing code size is important. But you should take into account that in many apps today (even on embedded devices), if you cut your code size in half, you might cut your app size by 5% (statistics admittedly pulled from thin air).
My current position is that this particular optimization is one best paid for and implemented in the linker instead of in the template container. Though I know this isn't easy to implement in the linker, I have heard of successful implementations.
That being said, I still do try to make code size optimizations in templates. For example, in libc++ the helper structures such as __hash_map_node_destructor are templated on as few parameters as possible, so if any of their code gets outlined, it is more likely that one instantiation of the helper can serve more than one instantiation of unordered_map. This technique is debugger friendly, and not that hard to get right. And it can even have some positive side effects for the client when applied to iterators (N2980).
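A rough sketch of that idea (hypothetical code, not the actual libc++ source): a node-destructor helper that depends only on the allocator, so one outlined instantiation can serve every map that ends up using the same node allocator.
#include <memory>

// Destroys a node's value (if constructed) and deallocates the node.
// Templated only on the allocator, not on Key, T, Hash, or Pred.
template <class Alloc>
struct node_destructor {
    using traits  = std::allocator_traits<Alloc>;
    using pointer = typename traits::pointer;

    Alloc& alloc_;
    bool value_constructed = false;

    explicit node_destructor(Alloc& a) : alloc_(a) {}

    void operator()(pointer p) noexcept {
        if (value_constructed)
            traits::destroy(alloc_, std::addressof(*p));
        traits::deallocate(alloc_, p, 1);
    }
};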
In summary, I wouldn't hold it against an implementation for going the extra mile and implementing this optimization. But I also wouldn't rank it as high a priority as I did a decade ago, both because linker technology has progressed and because the ratio of code size to application size has tended to decrease fairly dramatically.
When you have a void* parameter there is no type checking at compile-time.
Maps like the ones you propose would be a flaw in a program, since they would accept values of type A*, B*, and any other pointer type that has no business being in that map (for example int*, float*, std::string*, CString*, CWnd*... imagine the mess in your map).
Your optimisation is premature, and premature optimization is the root of all evil.
While creating a version of std::basic_string_view for a private project (choices made for me: C++11; no boost:: allowed; with a pinch of NIH, so no GSL either) I came to implementing std::basic_string_view::max_size() for which the standard (n4820 21.4.2.3 Capacity) simply says:
Returns: The largest possible number of char-like objects that can be referred to by a basic_string_view.
Logically, this would be the maximum number that std::basic_string_view::size_type can represent: std::numeric_limits<std::basic_string_view::size_type>::max() which comes out at 18446744073709551615 on my platform where size_type is std::size_t.
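In code, that reading is essentially a one-liner; a minimal sketch (my own, not taken from any existing implementation, with a hypothetical my_string_view for context):
#include <cstddef>
#include <limits>

// Sketch of max_size() under the "largest representable count" interpretation.
struct my_string_view {
    using size_type = std::size_t;

    constexpr size_type max_size() const noexcept
    {
        return std::numeric_limits<size_type>::max();
    }
};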
I figured that, since I want to be compatible with the standard libraries, I should ensure that I arrive at the same number as other implementations. This is where I get lost.
Given auto max_size = string_view{"foo"}.max_size(), I get the following results:
+--------------+--------------------------+
| Library | Result |
+--------------+--------------------------+
| libstdc++ | 4611686018427387899 |
| libc++ | 18446744073709551615 |
| boost 1.72.0 | 3 |
+--------------+--------------------------+
If my interpretation is correct, that means libc++ and I agree on what the value should be. I feel that boost gets it wrong, since the specification for max_size is to return the largest possible number that a string_view, not this string_view, can refer to. However, as noted in the comments, boost::string_view predates the standard, so it is unfair to call it "completely wrong". Further, looking at the implementations of all three libraries: libc++ returns
numeric_limits<size_type>::max();
libstdc++ returns
(npos - sizeof(size_type) - sizeof(void*)) / sizeof(value_type) / 4;
and boost returns:
len_;
Basically, two implementations appear to be wrong, but the question is: which one is correct?
In his book "Effective C++ 3rd Ed." Scott Meyer writes the following two lines of code
#define ASPECT_RATIO 1.653
and
const double AspectRatio = 1.653;
and explains
"...use of the constant may yield smaller code than using a #define. That’s because the pre-processor’s blind substitution of the macro name ASPECT_RATIO with 1.653 could result in multiple copies of 1.653 in your object code, while the use of the constant AspectRatio should never result in more than one copy."
Is this still true for current compilers? I played a little bit around with multiple uses of constants but got the same size for both variants with a current g++. Maybe someone could show me a working example?
Thanks.
...while the use of the constant AspectRatio should never result in more than one copy
Whether there is more than one copy depends on many things, especially the processor's instruction set and the compiler's optimization settings. When optimizing for speed, more than one copy may even lead to faster execution. This kind of blanket statement shouldn't be made; it can't be justified or supported.
Preprocessing
The contents of a #define macro are handled by the preprocessing phase of compilation: the macro's replacement text is substituted before translation proper begins. A simple example:
#include <iostream>
#define THREE (3)
const int FOUR = 4;
int main()
{
int value = THREE;
std::cout << "Value is: " << value << "\n";
return 0;
}
After preprocessing, the compiler sees:
// contents of iostream header
const int FOUR = 4;
int main()
{
int value = 3;
std::cout << "Value is: " << value << "\n";
return 0;
}
The #define macro is no different from pasting the number directly into the code.
Since the number is pasted directly into the code, the compiler deduces the type of the constant (signed vs. unsigned, integer vs. floating point) and then emits code to assign the number to the variable.
Identifiers / Symbols
When the compiler encounters the statement:
const int FOUR = 4;
the compiler creates the symbol "FOUR" and places it into a symbol table with the associated value of 4. (There may be other attributes associated with the symbol, but let's keep it simple for illustrative purposes.)
When the compiler encounters a statement like:
value = FOUR;
the compiler looks up the symbol "FOUR" in the symbol table, retrieves its value, and continues processing, much as it would for the statement value = 4;.
Implementation
The processor instructions emitted for either case depend on the processor and the optimization level of the compiler (and maybe the complexity of the compiler).
Immediate Mode
Processors have different addressing (fetching) modes. For simplicity, we are concerned with two: immediate (direct) mode and indirect mode. Immediate mode is where the instruction itself has a field for the value. Let's call the instruction LOAD (as in load a constant into a register):
+--------------------------------------+ +-------+
| LOAD operation/instruction | | |
+--------------------+-----------------+ | |
+ Instruction Number | Register Number | | Value |
+--------------------+-----------------+ +-------+
The LOAD instruction always comprises two fields: the instruction code and the value to load into the register. Note: the value field may be incorporated into the instruction word itself.
In this case, the compiler inserts the number into the value field of the instruction. No extra space is consumed and no extra instructions are emitted.
Indirect Mode
With indirect mode, the processor loads the register via a pointer (address). The processor takes an additional step of dereferencing the pointer to fetch the value.
+--------------------------------------+ +---------+
| LOAD operation/instruction | | Pointer |
+--------------------+-----------------+ | to |
+ Instruction Number | Register Number | | Value |
+--------------------+-----------------+ +---------+
Immediate vs. Indirect
Some processors have a limited range for the immediate value (for example 8 bits), and anything larger (such as an int or a double) requires indirect access (an additional word for the pointer/address). A lazy compiler could simplify matters and always use indirect mode, reserving immediate mode for higher optimization levels.
When optimizing for space, compilers may save room by using indirect mode for common constants (e.g. PI). Using a constant variable (instead of a macro) would make this task easier. However, a compiler may also do this with the value anyway (when it encounters 3.14159... it could store it in a table for later usage).
Summary
The performance and size impact of using a #define macro versus a const variable depends on the compiler's capabilities, the optimization level, and the processor's instruction set. A blanket claim that a macro is better or worse than a constant variable for space or execution speed cannot be justified; it depends too heavily on the compiler and the processor.
Common coding guidelines suggest using constant variables, as they have a type and prevent defects based on mismatched types (the compiler can issue warnings or errors).
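If you want to check this on your own toolchain, a rough sketch of an experiment (nothing authoritative; the file name and the USE_MACRO switch are mine) is to build both variants and compare the generated assembly or sizes, e.g. with g++ -O2 -S or size:
// compare_const.cpp -- build twice: g++ -O2 -DUSE_MACRO=1 and -DUSE_MACRO=0,
// then compare the assembly (g++ -S) or the binary sizes (size ./a.out).
#include <iostream>

#ifndef USE_MACRO
#define USE_MACRO 0
#endif

#define ASPECT_RATIO 1.653
const double AspectRatio = 1.653;

// RATIO expands either to the macro literal or to the named constant.
#if USE_MACRO
#  define RATIO ASPECT_RATIO
#else
#  define RATIO AspectRatio
#endif

// Several separate uses of the constant, to give the compiler a chance
// to emit (or not emit) multiple copies of 1.653.
double width_from_height(double h) { return h * RATIO; }
double height_from_width(double w) { return w / RATIO; }
double padded(double w) { return w / RATIO + RATIO; }

int main()
{
    std::cout << width_from_height(3.0) << ' '
              << height_from_width(4.0) << ' '
              << padded(5.0) << '\n';
    return 0;
}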
I am here to ask if my perception is actually true.
I originally thought that constructing vector<T> v(someSize, init_value) would call something like vector<T>::reserve rather than vector<T>::push_back. I found some related discussion here: std::vector push_back is bottleneck, but that question is slightly different in its idea.
Running some experiments, I noticed that vector<T> v(someSize, init_value) calls ::push_back all along. Is this true? I have the following report from uftrace (https://github.com/namhyung/uftrace).
Avg total Min total Max total Function
========== ========== ========== ====================================
858.323 ms 858.323 ms 858.323 ms main
618.245 ms 618.245 ms 618.245 ms sortKaway
234.795 ms 234.795 ms 234.795 ms std::sort
72.752 us 72.752 us 72.752 us std::vector::_M_fill_initialize
65.788 us 49.551 us 82.026 us std::vector::vector
20.292 us 11.387 us 68.629 us std::vector::_M_emplace_back_aux
18.722 us 17.263 us 20.181 us std::equal
18.472 us 18.472 us 18.472 us std::vector::~vector
17.891 us 10.002 us 102.079 us std::vector::push_back // push_back?!
Does vector<T>::reserve also end up calling vector<T>::push_back eventually? Is there a faster way to construct the vector?
The above was the original post. After some comments, I tested a simple version, and realized I was completely mistaken.
#include <vector>
#include <functional>
#include <queue>
#include <cassert>
using namespace std; // for the time being
int main () {
vector<int> v(10, 0);
return 0;
}
This actually results in the following, which doesn't involve std::vector<T>::push_back.
# Function Call Graph for 'main' (session: 9ce7f6bb33885ff7)
=============== BACKTRACE ===============
backtrace #0: hit 1, time 12.710 us
[0] main (0x4009c6)
========== FUNCTION CALL GRAPH ==========
12.710 us : (1) main
0.591 us : +-(1) std::allocator::allocator
0.096 us : | (1) __gnu_cxx::new_allocator::new_allocator
: |
6.880 us : +-(1) std::vector::vector
4.338 us : | +-(1) std::_Vector_base::_Vector_base
0.680 us : | | +-(1) std::_Vector_base::_Vector_impl::_Vector_impl
0.445 us : | | | (1) std::allocator::allocator
0.095 us : | | | (1) __gnu_cxx::new_allocator::new_allocator
: | | |
3.294 us : | | +-(1) std::_Vector_base::_M_create_storage
3.073 us : | | (1) std::_Vector_base::_M_allocate
2.849 us : | | (1) std::allocator_traits::allocate
2.623 us : | | (1) __gnu_cxx::new_allocator::allocate
0.095 us : | | +-(1) __gnu_cxx::new_allocator::max_size
: | | |
1.867 us : | | +-(1) operator new
: | |
2.183 us : | +-(1) std::vector::_M_fill_initialize
0.095 us : | +-(1) std::_Vector_base::_M_get_Tp_allocator
: | |
1.660 us : | +-(1) std::__uninitialized_fill_n_a
1.441 us : | (1) std::uninitialized_fill_n
1.215 us : | (1) std::__uninitialized_fill_n::__uninit_fill_n
0.988 us : | (1) std::fill_n
0.445 us : | +-(1) std::__niter_base
0.096 us : | | (1) std::_Iter_base::_S_base
: | |
0.133 us : | +-(1) std::__fill_n_a
Sorry for the confusion. Yes, the library implementation works as we expect: it doesn't involve push_back when the vector is constructed with an initial size.
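Roughly speaking, the fill constructor allocates once and then fills the raw storage, which matches the _M_fill_initialize / uninitialized_fill_n calls in the trace above. A simplified sketch (my own toy code, not the actual libstdc++ source):
#include <cstddef>
#include <memory>

// Toy fill constructor: one allocation, one bulk fill, no push_back.
template <class T>
struct toy_vector {
    T* begin_ = nullptr;
    T* end_   = nullptr;
    T* cap_   = nullptr;

    toy_vector(std::size_t n, const T& value) {
        std::allocator<T> a;
        begin_ = a.allocate(n);                                // single allocation
        end_   = std::uninitialized_fill_n(begin_, n, value);  // construct n copies in place
        cap_   = begin_ + n;
    }

    ~toy_vector() {
        std::destroy(begin_, end_);
        std::allocator<T>().deallocate(begin_, static_cast<std::size_t>(cap_ - begin_));
    }
};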
Phew, looks like you answered your own question! I was extremely confused for a moment. Hypothetically, I could imagine some obscure vector implementation building its fill constructor out of reserve and push_back, but definitely not in a high-quality implementation like the one that ships with GCC's standard library. It is possible for an obscure implementation to be written that way, but it is practically unheard of in any decent one.
Quite the contrary: almost two decades ago I tried to implement my own version of std::vector in hopes of matching its performance. It wasn't just some dumb exercise; the temptation came from the fact that we had a software development kit and wanted to use some basic C++ containers in it, but with the goal of letting people write plugins for our software using different compilers (and different standard library implementations) than the ones we were using. So we couldn't safely use std::vector in those contexts, since our version might not match the plugin writer's. We were forced to begrudgingly roll our own containers for the SDK.
Instead, I found std::vector to be incredibly efficient in ways that were hard to match, especially for plain old data types with trivial ctors and dtors. Again, this was over a decade ago, but I found that using the fill constructor with vector<int> in MSVC 5 or 6 (I forget which) actually translated to the same disassembly as using memset, in a way that my naive version, which just looped through elements and used placement new on them regardless of whether they were PODs or not, did not. The range ctor likewise effectively translated to a super fast memcpy for PODs. And that's precisely what made vector so hard to beat for me, at least back then. Without getting deep into type traits and special-casing PODs, I couldn't really match vector's performance for PODs. I could match it for UDTs, but most of our performance-critical code tended to use PODs.
So chances are that popular vector implementations today are just as efficient, if not more so, than back when I ran those tests, and I wanted to pitch in as a reassurance that your vector implementation is most likely damned fast. The last thing I'd expect it to do is implement its fill or range ctors in terms of push_back.
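The kind of POD special casing described above can be sketched roughly like this today (a hypothetical illustration, not any particular library's code):
#include <cstddef>
#include <cstring>
#include <new>
#include <type_traits>

// Dispatch the fill on a type trait: a byte-wise copy for trivially copyable
// types (which optimizers readily turn into memset/vectorized fills),
// a placement-new loop for everything else.
template <class T>
void fill_construct(T* dst, std::size_t n, const T& value) {
    if constexpr (std::is_trivially_copyable_v<T>) {
        for (std::size_t i = 0; i < n; ++i)
            std::memcpy(dst + i, &value, sizeof(T));
    } else {
        for (std::size_t i = 0; i < n; ++i)
            ::new (static_cast<void*>(dst + i)) T(value);
    }
}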
From what I've read about Eigen (here), it seems that operator=() acts as a "barrier" of sorts for lazy evaluation -- e.g. it causes Eigen to stop returning expression templates and actually perform the (optimized) computation, storing the result into the left-hand side of the =.
This would seem to mean that one's "coding style" has an impact on performance -- i.e. using named variables to store the result of intermediate computations might have a negative effect on performance by causing some portions of the computation to be evaluated "too early".
To try to verify my intuition, I wrote up an example and was surprised at the results (full code here):
using ArrayXf = Eigen::Array <float, Eigen::Dynamic, Eigen::Dynamic>;
using ArrayXcf = Eigen::Array <std::complex<float>, Eigen::Dynamic, Eigen::Dynamic>;
float test1( const MatrixXcf & mat )
{
ArrayXcf arr = mat.array();
ArrayXcf conj = arr.conjugate();
ArrayXcf magc = arr * conj;
ArrayXf mag = magc.real();
return mag.sum();
}
float test2( const MatrixXcf & mat )
{
return ( mat.array() * mat.array().conjugate() ).real().sum();
}
float test3( const MatrixXcf & mat )
{
ArrayXcf magc = ( mat.array() * mat.array().conjugate() );
ArrayXf mag = magc.real();
return mag.sum();
}
The above gives 3 different ways of computing the coefficient-wise sum of magnitudes in a complex-valued matrix.
test1 sort of takes each portion of the computation "one step at a time."
test2 does the whole computation in one expression.
test3 takes a "blended" approach -- with some amount of intermediate variables.
I sort of expected that since test2 packs the entire computation into one expression, Eigen would be able to take advantage of that and globally optimize the entire computation, providing the best performance.
However, the results were surprising (numbers shown are in total microseconds across 1000 executions of each test):
test1_us: 154994
test2_us: 365231
test3_us: 36613
(This was compiled with g++ -O3 -- see the gist for full details.)
The version I expected to be fastest (test2) was actually slowest. Also, the version that I expected to be slowest (test1) was actually in the middle.
So, my questions are:
Why does test3 perform so much better than the alternatives?
Is there a technique one can use (short of diving into the assembly code) to get some visibility into how Eigen is actually implementing your computations?
Is there a set of guidelines to follow to strike a good tradeoff between performance and readability (use of intermediate variables) in your Eigen code?
In more complex computations, doing everything in one expression could hinder readability, so I'm interested in finding the right way to write code that is both readable and performant.
It looks like a problem with GCC; the Intel compiler gives the expected result.
$ g++ -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 200087
test2_us: 320033
test3_us: 44539
$ icpc -I ~/program/include/eigen3 -std=c++11 -O3 a.cpp -o a && ./a
test1_us: 214537
test2_us: 23022
test3_us: 42099
Compared to the icpc version, gcc seems to have trouble optimizing your test2.
For more precise results, you may want to turn off the debug assertions with -DNDEBUG, as shown here.
EDIT
For question 1
@ggael gives an excellent answer: gcc fails to vectorize the sum loop. My experiments also find that test2 is as fast as a hand-written naive for-loop, both with gcc and icc, suggesting that vectorization is the reason; and no temporary memory allocation is detected in test2 by the method mentioned below, suggesting that Eigen evaluates the expression correctly.
For question 2
Avoiding intermediate memory allocations is the main reason Eigen uses expression templates. So Eigen provides the macro EIGEN_RUNTIME_NO_MALLOC and a simple function to let you check whether any intermediate memory is allocated while evaluating an expression. You can find a sample code here. Please note this may only work in debug mode.
EIGEN_RUNTIME_NO_MALLOC - if defined, a new switch is introduced which can be turned on and off by calling set_is_malloc_allowed(bool). If malloc is not allowed and Eigen tries to allocate memory dynamically anyway, an assertion failure results. Not defined by default.
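A minimal usage sketch (assuming a build with Eigen's assertions enabled, i.e. without -DNDEBUG; the function name is mine):
// Must be defined before any Eigen header is included.
#define EIGEN_RUNTIME_NO_MALLOC
#include <Eigen/Dense>

float sum_of_magnitudes(const Eigen::MatrixXcf& mat)
{
    Eigen::internal::set_is_malloc_allowed(false);  // assert on any hidden allocation
    float s = (mat.array() * mat.array().conjugate()).real().sum();
    Eigen::internal::set_is_malloc_allowed(true);
    return s;
}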
For question 3
There is a way to use intermediate variables and still get the performance benefit of lazy evaluation/expression templates.
The way is to give the intermediate variables the right type. Instead of using Eigen::Matrix/Array, which forces the expression to be evaluated, you should use the expression types (Eigen::MatrixBase/ArrayBase/DenseBase) so that the expression is only buffered, not evaluated. This means you store the expression as the intermediate, rather than the result of the expression, with the condition that the intermediate is used only once in the following code.
As determining the template parameters of the expression types can be painful, you can use auto instead. You can find some hints on when you should/should not use auto/expression types on this page. Another page also explains how to pass expressions as function parameters without evaluating them.
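For example, a sketch of that pattern (the helper name is mine): a function template accepts the unevaluated expression through the common base class.
#include <Eigen/Dense>

// Accept any array expression without forcing it into a concrete Array first.
template <typename Derived>
typename Eigen::ArrayBase<Derived>::RealScalar
sum_of_real(const Eigen::ArrayBase<Derived>& expr)
{
    return expr.real().sum();
}

// Usage: the product expression is passed through unevaluated.
// float s = sum_of_real(mat.array() * mat.array().conjugate());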
Based on the instructive experiment about .abs2() in @ggael's answer, I think another guideline is to avoid reinventing the wheel.
What happens is that, because of the .real() step, Eigen won't explicitly vectorize test2. It will thus call the standard complex::operator*, which, unfortunately, is never inlined by gcc. The other versions, on the other hand, use Eigen's own vectorized implementation of complex products.
In contrast, ICC does inline complex::operator*, making test2 the fastest for ICC. You can also rewrite test2 as:
return mat.array().abs2().sum();
to get even better performance on all compilers:
gcc:
test1_us: 66016
test2_us: 26654
test3_us: 34814
icpc:
test1_us: 87225
test2_us: 8274
test3_us: 44598
clang:
test1_us: 87543
test2_us: 26891
test3_us: 44617
The extremely good score of ICC in this case is due to its clever auto-vectorization engine.
Another way to work around gcc's inlining failure without modifying test2 is to define your own operator* for complex<float>. For instance, add the following at the top of your file:
namespace std {
complex<float> operator*(const complex<float> &a, const complex<float> &b) {
return complex<float>(real(a)*real(b) - imag(a)*imag(b), imag(a)*real(b) + real(a)*imag(b));
}
}
and then I get:
gcc:
test1_us: 69352
test2_us: 28171
test3_us: 36501
icpc:
test1_us: 93810
test2_us: 11350
test3_us: 51007
clang:
test1_us: 83138
test2_us: 26206
test3_us: 45224
Of course, this trick is not always recommended since, in contrast to the glib version, it might lead to overflow or numerical cancellation issues, but this is what icpc and the other vectorized versions compute anyway.
One thing I have done before is to make use of the auto keyword a lot. Keeping in mind that most Eigen expressions return special expression datatypes (e.g. CwiseBinaryOp), an assignment back to a Matrix may force the expression to be evaluated (which is what you are seeing). Using auto lets the compiler deduce the variable's type as whatever expression type it is, which avoids evaluation for as long as possible:
float test1( const MatrixXcf & mat )
{
auto arr = mat.array();
auto conj = arr.conjugate();
auto magc = arr * conj;
auto mag = magc.real();
return mag.sum();
}
This should essentially be closer to your second test case. In some cases I have had good performance improvements while keeping readability (you do not want to have to spell out the expression template types). Of course, your mileage may vary, so benchmark carefully :)
I just want to note that you did the profiling in a non-optimal way, so the issue could simply be your profiling method.
Since there are many things like cache locality to take into account, you should profile this way:
int warmUpCycles = 100;
int profileCycles = 1000;

// TEST 1
for (int i = 0; i < warmUpCycles; i++)   // warm up caches and branch predictors
    doTest1();

auto tick = std::chrono::steady_clock::now();
for (int i = 0; i < profileCycles; i++)  // only these iterations are timed
    doTest1();
auto tock = std::chrono::steady_clock::now();

auto test1_us = std::chrono::duration_cast<std::chrono::microseconds>(tock - tick).count();

// TEST 2
// TEST 3
Once you have done the test properly, then you can come to conclusions.
I highly suspect that, since you are profiling one operation at a time, you end up using cached data in the third test, since operations are likely to be re-ordered by the compiler.
Also, you should try different compilers to see if the problem is the unrolling of templates (there is a depth limit when optimizing templates; you can easily hit it with a single big expression).
Also, if Eigen supports move semantics, there's no reason why one version should be faster, since it is not always guaranteed that expressions can be optimized.
Please try it and let me know; that's interesting. Also be sure to have enabled optimizations with flags like -O3; profiling without optimization is meaningless.
To prevent the compiler from optimizing everything away, read the initial input from a file or from cin and then re-feed that input inside the functions.
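As a sketch of such a driver (it assumes the test1/test2/test3 functions from the gist are visible in the translation unit), reading the size from cin keeps the inputs unknown at compile time:
#include <iostream>
#include <Eigen/Dense>

int main()
{
    int n = 0;
    std::cin >> n;                                       // size unknown at compile time
    Eigen::MatrixXcf mat = Eigen::MatrixXcf::Random(n, n);

    float sink = 0.f;
    for (int i = 0; i < 1000; ++i)
        sink += test1(mat);                              // likewise for test2 / test3
    std::cout << sink << '\n';                           // keep the result observable
    return 0;
}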
Is there any study or set of benchmarks showing the performance degradation due to specifying -fno-strict-aliasing in GCC (or the equivalent in other compilers)?
It will vary a lot from compiler to compiler, as different compilers implement it with different levels of aggression. GCC is fairly aggressive about it: enabling strict aliasing will cause it to think that pointers that are "obviously" equivalent to a human (as in, foo *a; bar *b = (bar *) a;) cannot alias, which allows for some very aggressive transformations, but can obviously break non-carefully written code. Apple's GCC disables strict aliasing by default for this reason.
LLVM, by contrast, does not even have strict aliasing, and, while it is planned, the developers have said that they plan to implement it as a fall-back case when nothing else can judge equivalence. In the above example, it would still judge a and b equivalent. It would only use type-based aliasing if it could not determine their relationship in any other way.
In my experience, the performance impact of strict aliasing mostly has to do with loop invariant code motion, where type information can be used to prove that in-loop loads can't alias the array being iterated over, allowing them to be pulled out of the loop. YMMV.
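To make that concrete, here is the kind of loop where the difference typically shows up (a sketch; exact code generation varies by compiler and version):
// With strict aliasing, the compiler may hoist *count out of the loop, since a
// store through a float* cannot legally modify an int object. With
// -fno-strict-aliasing it must assume out[i] might alias *count and reload it
// on every iteration.
void scale_all(float* out, const float* in, const int* count)
{
    for (int i = 0; i < *count; ++i)
        out[i] = in[i] * 2.0f;
}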
What I can tell you from experience (having tested this with a large project on PS3, PowerPC being an architecture that, thanks to its many registers, can actually benefit from strict aliasing quite well) is that the optimizations you're going to see are generally very local (scope-wise) and small. On a 20MB executable it scraped off maybe 80kb of the .text section (= code), and all of that was in small scopes and loops.
This option can make your generated code a bit more lightweight and optimized than it is right now (think in the 1 to 5 percent range), but do not expect any big results. Hence, the effect of using -fno-strict-aliasing is probably not going to be a big influence on your performance, at all. That said, having code that requires -fno-strict-aliasing is a suboptimal situation at best.
Here is a link to a study conducted in 2004: http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1124&context=ecetr concerning, among other things, the impact of strict aliasing on code performance. Figure 2.5 shows a relative improvement of 3% to 10%.
The researchers' explanation of the performance degradation:
From inspecting the assembly code, we found that the degradation is an effect of the register allocation algorithm. GCC implements a graph coloring register allocator [2, 3]. With strict aliasing, the live ranges of the variables become longer, leading to high register pressure and spilling. With more conservative aliasing, the same variables incur memory transfers at the end of their (shorter) live ranges as well.
[2] Peter Bergner, Peter Dahl, David Engebretsen, and Matthew T. O'Keefe. Spill code minimization via interference region spilling. In SIGPLAN Conference on Programming Language Design and Implementation, pages 287–295, 1997.
[3] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. ACM Transactions on Programming Languages and Systems, 16(3):428–455, May 1994.
This flag can have impact on the loop-vectorization and thus the performance, as shown in the following example:
// A simple test case: the loop bound v1[2] lives in int storage which, without
// strict aliasing, the stores through the double* v might potentially modify.
#include <vector>

void add(double *v, double *b, double *c, int *idx, std::vector<int> &v1) {
    for (int i = v1[0]; i < v1[2]; i++) {
        v[i] = b[i] + c[i];
    }
}
If you compile the code on https://godbolt.org/ using GCC 11.2 with the flags -O3 -ftree-vectorize -ftree-loop-vectorize -fopt-info-vec-missed -fopt-info-vec-optimized -fno-strict-aliasing, you will see the message:
<source>:5:22: missed: couldn't vectorize loop
<source>:5:22: missed: not vectorized: number of iterations cannot be computed.
Now if you remove the -fno-strict-aliasing or replace it with -fstrict-aliasing, you will see:
<source>:5:22: optimized: loop vectorized using 16 byte vectors
<source>:5:22: optimized: loop versioned for vectorization because of possible aliasing