Achieving the effect of 'volatile' without the added MOV* instructions?

I have a piece of code that must run under all circumstances, as it modifies things outside of its own scope. Let's define that piece of code as:
// Extremely simplified C++ as an example.
#include <iostream>
#include <map>
#include <cstdint>

#if defined(__GNUC__) || defined(__GNUG__) || defined(__clang__)
#include <x86intrin.h>
#elif defined(_MSC_VER)
#include <intrin.h>
#endif

uint64_t some_time_function() {
    unsigned int waste;
    return __rdtscp(&waste);
}

void insert_into_map(std::map<uint64_t, float>& data, uint64_t t1, uint64_t t0, float v) {
    data.emplace((t1 - t0), v);
}

void fn(std::map<uint64_t, float>& map_outside_of_this_scope) {
    const float a = 1;
    const float b = 2;
    float v = 0;
    for (uint32_t i = 0; i < 1000000; i++) {
        uint64_t t0 = some_time_function();
        v = (v + b) - a;
        uint64_t t1 = some_time_function();
        insert_into_map(map_outside_of_this_scope, t1, t0, v);
    }
}

int main(int argc, const char** argv) {
    std::map<uint64_t, float> my_map;
    fn(my_map);
    std::cout << my_map.begin()->first << std::endl;
    return 0;
}
This looks like an ideal target for a compiler's optimizer, and that is what I have observed with my code as well: map_outside_of_this_scope ends up empty. Unfortunately, map_outside_of_this_scope is critical to operation and must contain data; otherwise the application crashes. The only way I have found to fix this is marking v as volatile; however, that makes the application significantly slower than an equivalent assembly-based function.
Is there a way to achieve the effect of volatile, without the MOV instructions of volatile?

Inasmuch as you assert in comments that you are most interested in a narrow answer to the question ...
Is there a way to achieve the effect of volatile, without the MOV instructions of volatile?
... the only thing we can say is that C and C++ do not specify the involvement of MOV or any other specific assembly instruction anywhere, for any purpose. If you observe such instructions in compiled binaries, they reflect implementation decisions by your compiler's developers. What's more, where you see MOVs associated with volatile accesses, they are most likely essential to implementing volatile's semantics.
Additionally, neither C nor C++ specifies any alternative feature that duplicates the rather specific semantics of volatile access (why would they?). You might be able to use inline assembly to achieve custom, different effects that serve your purpose, however.
With respect to the more general problem that inspired the above question, all we can really say is that the multiple compilers performing the unwanted optimization are likely justified in doing so, for reasons that are not clear from the code presented. With that in mind, I recommend broadening your problem-solving focus to figuring out why the compilers think they can perform the optimization when volatile is not involved. To that end, construct a minimal reproducible example (MRE) -- not for us, but because the exercise of MRE construction is a powerful debugging technique in its own right.

volatile is directly relevant to optimizers because reads and writes of volatile variables are observable behavior. That means the reads and writes cannot be removed.
Similarly, optimizers cannot remove writes to variables that are observable by other means, whether you write the variable to std::cout, a file, or a socket. And the burden of proof is on the compiler: a write can be eliminated only if it is provably dead.
In the example above, my_map.begin()->first is written to std::cout. That is observable behavior, so even in the absence of volatile that output must be preserved. But the exact details matter: an optimizer may spot that only the ->first member is ever observed in this particular example. Hence v (the ->second value) is not observed and can legally be optimized out.
But if you copy my_map.begin()->second to a volatile float sink, then that write to sink is observable behavior, and the compiler must make sure the right value is written. That pretty much means the v calculation inside the loop has to be preserved, even though v itself is not volatile.
The compiler could still perform loop unrolls that affect how v is read and written, because the individual v updates are not themselves observable. Only the value that is eventually written to the volatile float sink counts.
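For instance, a minimal sketch of that sink technique applied to the question's main (the sink name is illustrative):

#include <iostream>
#include <map>
#include <cstdint>

volatile float sink; // writes to sink are observable behavior

int main(int argc, const char** argv) {
    std::map<uint64_t, float> my_map;
    fn(my_map); // fn as defined in the question
    sink = my_map.begin()->second; // observable write: the v computation must be kept
    std::cout << my_map.begin()->first << std::endl;
    return 0;
}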

Related

Is it generally faster to do _mm_set1_ps once and reuse it, or to set it on the fly?

Here are two examples. Set the float values on a vector every time within the loop:

static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...

for (int i = 0; i < numValues; i++) {
    paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, _mm_set1_ps(0.5f)), _mm_set1_ps(2.0f));
    paramKnob.v = _mm_mul_ps(paramKnob.v, _mm_set1_ps(kFMScale));
    // code...
}
or do it once and "reload" the same register every time:
static constexpr float kFMScale = 10.0f / k2PIf; // somewhere...

__m128 v05 = _mm_set1_ps(0.5f);
__m128 v2 = _mm_set1_ps(2.0f);
__m128 vFMScale = _mm_set1_ps(kFMScale);

for (int i = 0; i < numValues; i++) {
    paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, v05), v2);
    paramKnob.v = _mm_mul_ps(paramKnob.v, vFMScale);
    // code...
}
Which one is generally the better-suited approach? I'd bet on the second, but vectorization fools me most of the time.
And what if I use them as const across the whole project instead of within a block? Would "load everywhere" instead of "set everywhere" be better?
I think it's all about cache misses rather than the latency/throughput of the operations.
I'm on a Windows 64-bit machine, using FLAGS += -O3 -march=nocona -funsafe-math-optimizations
In general, for best performance you want to make the loop body as concise as possible and move any code invariant to the iteration out of the loop body. Whether a given compiler is able to do it for you depends on its ability to perform this kind of optimization and on the currently enabled optimization level.
Modern gcc and clang versions are typically smart enough to convert _mm_set1_ps and similar intrinsics with constant arguments to in-memory constants, so even without hoisting this results in a fairly efficient binary code. On the other hand, MSVC is not very smart in SIMD optimizations.
As a rule of thumb I would still recommend moving constant generation (except for all-zero or all-ones constants) out of the loop body. There are several reasons to do so, even if your compiler is capable of doing it on its own:
You are not relying on the compiler to do the optimization, which makes your code more portable across compilers. The code will also perform more similarly between different optimization levels, which may be useful.
Moving the constant out of the loop body may convince the compiler to pre-load the constant before entering the loop instead of referencing it in-memory inside the loop. Again, this is subject to compiler's optimization capabilities and current optimization level.
The constants can be reused in multiple places, including multiple functions. This reduces the binary size and makes your code more cache-friendly at run time. (Some linkers are able to merge equivalent constants when linking the binary, but this feature is not universally supported, is subject to the optimization level, and in some cases makes the code non-compliant with the C/C++ standards, which can affect code correctness. For this reason, this feature is normally disabled by default, even when supported.)
If you are going to declare a constant in namespace scope, it is highly recommended to use an aligned array and/or a union to ensure static initialization. Compilers may not be able to perform static initialization if you use intrinsics to initialize the constant. You can use something like the following:
template< typename T >
union mm_constant128
{
    T as_array[sizeof(__m128i) / sizeof(T)];
    __m128i as_m128i;
    __m128 as_m128;
    __m128d as_m128d;

    operator __m128i () const noexcept { return as_m128i; }
    operator __m128 () const noexcept { return as_m128; }
    operator __m128d () const noexcept { return as_m128d; }
};

constexpr mm_constant128< float > k05 = {{ 0.5f, 0.5f, 0.5f, 0.5f }};

void foo()
{
    // Use as follows:
    _mm_sub_ps(paramKnob.v, k05);
}
If you are targeting a specific compiler at a given optimization level, you can inspect the generated assembler code to see whether the compiler is doing good enough optimization on its own.
Or do this:
Declare a compile-time constant __m128 outside your function, preferably in a "constants" namespace.
inline constexpr __m128 val05 = {0.5f, 0.5f, 0.5f, 0.5f};
// Your function here
This will save you from using the _mm_set1_ps function, which does not evaluate to a single instruction but to a rather inefficient sequence of instructions.
Since templates are resolved at compile-time, there will be little to no performance difference between this and the other answer. However, I think this code is significantly simpler.
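For illustration, here is a sketch of how such namespace-scope constants might be used in the original loop (names are hypothetical; note that brace-initializing __m128 directly relies on GCC/Clang vector-type semantics, so MSVC may need the union approach from the other answer):

namespace constants {
    inline constexpr __m128 v05 = {0.5f, 0.5f, 0.5f, 0.5f};
    inline constexpr __m128 v2  = {2.0f, 2.0f, 2.0f, 2.0f};
}

for (int i = 0; i < numValues; i++) {
    paramKnob.v = _mm_mul_ps(_mm_sub_ps(paramKnob.v, constants::v05), constants::v2);
    // code...
}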

Why don't compilers optimize trivial wrapper function pointers?

Consider the following code snippet
#include <vector>
#include <cstdlib>

void __attribute__ ((noinline)) calculate1(double& a, int x) { a += x; }
void __attribute__ ((noinline)) calculate2(double& a, int x) { a *= x; }

void wrapper1(double& a, int x) { calculate1(a, x); }
void wrapper2(double& a, int x) { calculate2(a, x); }

typedef void (*Func)(double&, int);

int main()
{
    std::vector<std::pair<double, Func>> pairs = {
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
        std::make_pair(0, (rand() % 2 ? &wrapper1 : &wrapper2)),
    };
    for (auto& [a, wrapper] : pairs)
        (*wrapper)(a, 5);
    return pairs[0].first + pairs[1].first;
}
With -O3 optimization the latest gcc and clang versions do not optimize the pointers to wrappers to pointers to underlying functions. See assembly here at line 22:
mov ebp, OFFSET FLAT:wrapper2(double&, int) # tmp118,
which later results in a call + jmp instead of just a call, as would have happened had the compiler stored a pointer to calculate1 instead.
Note that I specifically asked for non-inlined calculate functions for illustration; without noinline the result is another flavour of non-optimization, where the compiler generates two identical functions to be called through the pointer (so it still doesn't optimize, just in a different fashion).
What am I missing here? Is there any way to guide the compiler short of manually plugging in the correct functions (without wrappers)?
Edit 1. Following suggestions in the comments, here is a disassembly with all functions declared static, with exactly the same result (call + jmp instead of call).
Edit 2. Much simpler example of the same pattern:
#include <vector>
#include <cstdlib>

typedef void (*Func)(double&, int);

static void __attribute__ ((noinline)) calculate(double& a, int x) { a += x; }
static void wrapper(double& a, int x) { calculate(a, x); }

int main() {
    double a = 5.0;
    Func f;
    if (rand() % 2)
        f = &wrapper; // f = &calculate;
    else
        f = &wrapper;
    f(a, 0);
    return 0;
}
gcc 8.2 successfully optimizes this code by throwing the pointer to wrapper away and storing &calculate directly in its place (https://gcc.godbolt.org/z/nMIBeo). However, changing the line as per the comment (that is, performing part of the same optimization manually) breaks the magic and results in a pointless jmp.
You seem to be suggesting that &calculate1 should be stored in the vector instead of &wrapper1. In general this is not possible: later code might try to compare the stored pointer against &calculate1 and that must compare false.
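For example (a hypothetical snippet, not from the question), such a comparison is observable behavior:

// If the compiler stored &calculate1 in place of &wrapper1, this
// comparison would wrongly become false:
Func f = &wrapper1;
bool is_wrapper = (f == &wrapper1); // must be true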
I further assume that your suggestion is that the compiler might try to do some static analysis and determine that the function pointer values in the vector are never compared for equality with other function pointers, and in fact that none of the other operations done on the vector elements would produce a change in observable behaviour; and therefore, in this exact program, it could store &calculate1 instead.
Usually the answer to "why does the compiler not perform some particular optimization" is that nobody has conceived of and implemented that idea. Another common reason is that the static analysis involved is, in the general case, quite difficult and might lead to a slowdown in compilation with no benefit in real programs where the analysis could not be guaranteed to succeed.
You are making a lot of assumptions here. The first is about your syntax. The second is that compilers are perfect in the eye of the beholder and catch everything. The reality is that it is easy to find and hand-optimize compiler output; it is not difficult to write small functions that trip up a compiler you are well in tune with, or to write a decent-sized application with places you can hand-tune. This is all known and expected. Then opinion comes in: on my machine my blah is faster than blah, so the compiler should have emitted these instructions instead.
gcc is not a great compiler for performance; on some targets it has been getting worse for a number of major revisions. It is pretty good at what it does, better than pretty good: it deals with a number of preprocessors/languages, has a common middle end, and supports a number of backends. Some backends get better optimization applied front to back; others are just hanging on for the ride. There were a number of other compilers that could produce code that easily outperformed gcc.
These were mostly pay-for compilers, costing more than an individual would pay out of pocket: used-car prices, sometimes recurring annually.
There are things gcc can optimize that are simply amazing, and times when it goes totally in the wrong direction. The same goes for clang; often they do similar jobs with similar output, sometimes doing impressive things, sometimes just going off into the weeds. I now find it more fun to manipulate the optimizer into doing good or bad things rather than worry about why it didn't do what I "think" it should have on a particular occasion. If I need the code faster, I take the compiled output, hand-fix it, and use it as an assembly function.
You get what you pay for with gcc; if you were to look deep into its bowels you would find it is barely held together with duct tape and baling wire (llvm is catching up). But for a free tool it does a simply amazing job, and it is so widely used that you can get free support just about anywhere. We are sadly well into a time where folks think that because gcc interprets the language in a certain way, that is how the language is defined; sadly, that is not remotely true. And too many folks never try other compilers to find out what "implementation defined" really means.
Last and most important: it's open source. If you want to "fix" an optimization, just do it: keep the fix for yourself, post it, or try to push it upstream.

Fast and Lock Free Single Writer, Multiple Reader

I've got a single writer which has to increment a variable at a fairly high frequency, and one or more readers who access this variable at a lower frequency.
The write is triggered by an external interrupt.
Since I need to write at high speed, I don't want to use mutexes or other expensive locking mechanisms.
The approach I came up with was copying the value after writing to it. The reader can then compare the original with the copy. If they are equal, the variable's content is valid.
Here is my implementation in C++:
template<typename T>
class SafeValue
{
private:
    volatile T _value;
    volatile T _valueCheck;

public:
    void setValue(T newValue)
    {
        _value = newValue;
        _valueCheck = _value;
    }

    T getValue()
    {
        volatile T value;
        volatile T valueCheck;
        do
        {
            valueCheck = _valueCheck;
            value = _value;
        } while (value != valueCheck);
        return value;
    }
};
The idea behind this is to detect data races while reading and retry if they happen. However, I don't know if this will always work. I haven't found anything about this approach online, hence my question:
Is there any problem with my approach when used with a single writer and multiple readers?
I already know that high write frequencies may cause starvation of the readers. Are there more bad effects I have to be cautious of? Could it even be that this isn't thread-safe at all?
Edit 1:
My target system is an ARM Cortex-A15.
T should be able to be at least any primitive integral type.
Edit 2:
std::atomic is too slow on both the reader and writer side. I benchmarked it on my system: writes are roughly 30 times slower, and reads roughly 50 times slower, than unprotected primitive operations.
If this single variable is just an integer, pointer, or plain old value type, you can probably just use std::atomic.
You should try using std::atomic first, but make sure that your compiler knows and understands your target architecture. Since you are targeting Cortex-A15 (ARMv7-A cpu), make sure to use -march=armv7-a or even -mcpu=cortex-a15.
The former will generate the ldrexd instruction, which should be atomic according to the ARM docs:
Single-copy atomicity
In ARMv7, the single-copy atomic processor accesses are:
all byte accesses
all halfword accesses to halfword-aligned locations
all word accesses to word-aligned locations
memory accesses caused by LDREXD and STREXD instructions to doubleword-aligned locations.
The latter will generate the ldrd instruction, which should be atomic on targets supporting the Large Physical Address Extension:
In an implementation that includes the Large Physical Address Extension, LDRD and STRD accesses to 64-bit aligned locations are 64-bit single-copy atomic as seen by translation table walks and accesses to translation tables.
Note: The Large Physical Address Extension adds this requirement to avoid the need for complex measures to avoid atomicity issues when changing translation table entries, without creating a requirement that all locations in the memory system are 64-bit single-copy atomic.
You can also check how the Linux kernel implements those:
#ifdef CONFIG_ARM_LPAE
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("# atomic64_read\n"
        "   ldrd    %0, %H0, [%1]"
        : "=&r" (result)
        : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#else
static inline long long atomic64_read(const atomic64_t *v)
{
    long long result;

    __asm__ __volatile__("# atomic64_read\n"
        "   ldrexd  %0, %H0, [%1]"
        : "=&r" (result)
        : "r" (&v->counter), "Qo" (v->counter)
    );

    return result;
}
#endif
There's no way anyone can know. You would have to check whether your compiler documents any multi-threaded semantics that would guarantee this will work, or else look at the generated assembler code and convince yourself it will work. Be warned that in the latter case, it is always possible that a later version of the compiler, different optimization options, or a newer CPU might break the code.
I'd suggest testing std::atomic with the appropriate memory_order. If for some reason that's too slow, use inline assembly.
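For instance, a minimal sketch of a single-writer counter with std::atomic (the memory orders shown are a reasonable starting point, not the only valid choice):

#include <atomic>
#include <cstdint>

std::atomic<std::uint32_t> counter{0};

// Writer (single thread / interrupt): a separate load and store avoids a
// full atomic read-modify-write, which is safe because there is only one writer.
void increment() {
    counter.store(counter.load(std::memory_order_relaxed) + 1,
                  std::memory_order_release);
}

// Readers: acquire pairs with the writer's release.
std::uint32_t read() {
    return counter.load(std::memory_order_acquire);
}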
Another option is to have a buffer of non-atomic values the publisher produces and an atomic pointer to the latest.
#include <atomic>
#include <utility>

template<class T>
class PublisherValue {
    static auto constexpr N = 32;
    T values_[N];
    std::atomic<T*> current_{values_};

public:
    PublisherValue() = default;
    PublisherValue(PublisherValue const&) = delete;
    PublisherValue& operator=(PublisherValue const&) = delete;

    // Single writer thread only.
    template<class U>
    void store(U&& value) {
        T* p = current_.load(std::memory_order_relaxed);
        if (++p == values_ + N)
            p = values_;
        *p = std::forward<U>(value);
        current_.store(p, std::memory_order_release); // (1)
    }

    // Multiple readers. Make a copy to avoid referring to the value for too long.
    T load() const {
        return *current_.load(std::memory_order_consume); // Sync with (1).
    }
};
This is wait-free, but there is a small chance that a reader might be de-scheduled while copying the value and hence read the oldest value while it has been partially overwritten. Making N bigger reduces this risk.
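Usage might look like this (names and values are illustrative):

PublisherValue<int> pv;

// Writer thread / interrupt handler:
pv.store(42);

// Any reader thread: take a snapshot by value.
int snapshot = pv.load();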

Optimization barrier for microbenchmarks in MSVC: tell the optimizer you clobber memory?

Chandler Carruth introduced two functions in his CppCon2015 talk that can be used to do some fine-grained inhibition of the optimizer. They are useful to write micro-benchmarks that the optimizer won't simply nuke into meaninglessness.
void clobber() {
    asm volatile("" : : : "memory");
}

void escape(void* p) {
    asm volatile("" : : "g"(p) : "memory");
}
These use inline assembly statements to change the assumptions of the optimizer.
The assembly statement in clobber states that the assembly code in it can read and write anywhere in memory. The actual assembly code is empty, but the optimizer won't look into it because it's asm volatile. It believes it when we tell it the code might read and write everywhere in memory. This effectively prevents the optimizer from reordering or discarding memory writes prior to the call to clobber, and forces memory reads after the call to clobber†.
The one in escape additionally makes the pointer p visible to the assembly block. Again, because the optimizer won't look into the actual inline assembly code, that code can be empty, and the optimizer will still assume that the block uses the address pointed to by p. This effectively forces whatever p points to to be in memory and not in a register, because the assembly block might perform a read from that address.
(This is important because the clobber function won't force reads or writes for anything that the compiler decides to put in a register, since the assembly statement in clobber doesn't state that anything in particular must be visible to the assembly.)
All of this happens without any additional code being generated directly by these "barriers". They are purely compile-time artifacts.
These use language extensions supported in GCC and in Clang, though. Is there a way to have similar behaviour when using MSVC?
† To understand why the optimizer has to think this way, imagine if the assembly block were a loop adding 1 to every byte in memory.
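For reference, a typical usage pattern of these two barriers (borrowed from the talk; the benchmarked work here is only illustrative):

#include <vector>

void benchmark_iteration() {
    std::vector<int> v;
    v.reserve(1);
    escape(v.data()); // the pointer escapes: the store below cannot be elided
    v.push_back(42);
    clobber();        // memory may be read here: the write must actually happen
}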
Given your approximation of escape(), you should also be fine with the following approximation of clobber() (note that this is a draft idea, deferring some of the solution to the implementation of the function nextLocationToClobber()):

// Always returns false, but in an undeducible way.
bool isClobberingEnabled();

// The challenge is to implement this function in a way
// that will make even the smartest optimizer believe that
// it can deliver a valid pointer pointing anywhere in the heap,
// the stack or static memory.
volatile char* nextLocationToClobber();

const bool clobberingIsEnabled = isClobberingEnabled();
volatile char* clobberingPtr;

inline void clobber() {
    if (clobberingIsEnabled) {
        // This will never be executed, but the compiler
        // cannot know about it.
        clobberingPtr = nextLocationToClobber();
        *clobberingPtr = *clobberingPtr;
    }
}
UPDATE
Question: How would you ensure that isClobberingEnabled returns false "in an undeducible way"? Certainly it would be trivial to place the definition in another translation unit, but the minute you enable LTCG, that strategy is defeated. What did you have in mind?
Answer: We can take advantage of a hard-to-prove property from number theory, for example Fermat's Last Theorem:

#include <cstdint>
#include <cstdlib>
#include <ctime>

bool undeducible_false() {
    // It took mathematicians more than three centuries to prove Fermat's
    // last theorem in its most general form. Hardly has that knowledge
    // been put into compilers (nor will the compiler try hard enough
    // to check all one million possible combinations below).
    // Caveat: avoid integer overflow (Fermat's theorem
    // doesn't hold for modular arithmetic).
    std::uint32_t a = std::clock() % 100 + 1;
    std::uint32_t b = std::rand() % 100 + 1;
    std::uint32_t c = reinterpret_cast<std::uintptr_t>(&a) % 100 + 1;
    return a*a*a + b*b*b == c*c*c;
}
I have used the following in place of escape.
#ifdef _MSC_VER
#pragma optimize("", off)
template <typename T>
inline void escape(T* p) {
    *reinterpret_cast<char volatile*>(p) =
        *reinterpret_cast<char const volatile*>(p); // thanks, @milleniumbug
}
#pragma optimize("", on)
#endif
It's not perfect but it's close enough, I think.
Sadly, I don't have a way to emulate clobber.

What, in short words, does the GCC option -fipa-pta do?

According to the GCC manual, the -fipa-pta optimization does:
-fipa-pta: Perform interprocedural pointer analysis and interprocedural modification and reference analysis. This option can cause excessive memory and compile-time usage on large compilation units. It is not enabled by default at any optimization level.
What I assume is that GCC tries to differentiate mutable and immutable data based on pointers and references used in a procedure. Can someone with more in-depth GCC knowledge explain what -fipa-pta does?
I think the word "interprocedural" is the key here.
I'm not intimately familiar with gcc's optimizer, but I've worked on optimizing compilers before. The following is somewhat speculative; take it with a small grain of salt, or confirm it with someone who knows gcc's internals.
An optimizing compiler typically performs analysis and optimization only within each individual function (or subroutine, or procedure, depending on the language). For example, given code like this contrived example:
double *ptr = ...;

void foo(void) {
    ...
    *ptr = 123.456;
    some_other_function();
    printf("*ptr = %f\n", *ptr);
}
the optimizer will not be able to determine whether the value of *ptr has been changed by the call to some_other_function().
If interprocedural analysis is enabled, then the optimizer can analyze the behavior of some_other_function(), and it may be able to prove that it can't modify *ptr. Given such analysis, it can determine that the expression *ptr must still evaluate to 123.456, and in principle it could even replace the printf call with puts("*ptr = 123.456000");.
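Conceptually, the optimized function could then look like this (a sketch of the transformation, not actual gcc output):

void foo(void) {
    ...
    *ptr = 123.456;
    some_other_function();      // proven not to modify *ptr
    puts("*ptr = 123.456000");  // printf("%f", *ptr) folded to a constant string
}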
(In fact, with a small program similar to the above code snippet I got the same generated code with -O3 and -O3 -fipa-pta, so I'm probably missing something.)
Since a typical program contains a large number of functions, with a huge number of possible call sequences, this kind of analysis can be very expensive.
As quoted from this article:
The "-fipa-pta" optimization takes the bodies of the called functions into account when doing the analysis, so compiling
void __attribute__((noinline))
bar(int *x, int *y)
{
    *x = *y;
}

int foo(void)
{
    int a, b = 5;
    bar(&a, &b);
    return b + 10;
}
with -fipa-pta makes the compiler see that bar does not modify b, and the compiler optimizes foo by changing b+10 to 15
int foo(void)
{
    int a, b = 5;
    bar(&a, &b);
    return 15;
}
A more relevant example is the “slow” code from the “Integer division is slow” blog post:

std::random_device entropySource;
std::mt19937 randGenerator(entropySource());
std::uniform_int_distribution<int> theIntDist(0, 99);

for (int i = 0; i < 1000000000; i++) {
    volatile auto r = theIntDist(randGenerator);
}
Compiling this with -fipa-pta makes the compiler see that theIntDist is not modified within the loop, and the inlined code can thus be constant-folded in the same way as the “fast” version – with the result that it runs four times faster.