I know that a similar template exists in Intel's TBB; beyond that, I can't find any implementation on Google or in the Boost library.
You can find discussions about implementing this feature in Boost here: http://lists.boost.org/Archives/boost/2008/11/144803.php
> Can the N2427 - C++ Atomic Types and Operations be implemented
> without the help of the compiler?
No.
They don't need to be intrinsics: if you can write inline assembler (or separately-compiled assembler, for that matter) then you can write the operations themselves directly. You might even be able to use simple C++ (e.g. just plain assignment for load or store). The reason you need compiler support is to prevent inappropriate optimizations: atomic operations can't be optimized out, and generally must not be reordered before or after any other operations. This means that even non-atomic stores performed before an atomic store have to be complete, and can't be cached in a register (for example). Also, loads that occur after an atomic operation cannot be hoisted before the atomic op. On some compilers, just using inline assembler is enough. On others, calling an external function is enough. MSVC provides _ReadWriteBarrier() to provide the compiler ordering. Other compilers need other flags.
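As an illustration (a sketch for GCC/Clang on x86-64, not code from the answer above), here is what hand-rolled atomic operations can look like; the "memory" clobber is the compiler-ordering part the answer describes:

```cpp
#include <cstdint>

// Sketch (GCC/Clang, x86-64): hand-rolled atomic store and load.
// XCHG is implicitly locked on x86, so it acts as the store plus a full
// hardware barrier; the "memory" clobber is the *compiler* barrier that
// stops values being cached in registers or reordered across the op.
inline void my_atomic_store(volatile int32_t* p, int32_t v) {
    asm volatile("xchg %0, %1" : "+r"(v), "+m"(*p) : : "memory");
}

inline int32_t my_atomic_load(const volatile int32_t* p) {
    // Aligned 32-bit loads are atomic on x86; the empty asm statements
    // keep the compiler from hoisting or sinking other memory accesses
    // across the load.
    asm volatile("" : : : "memory");
    int32_t v = *p;
    asm volatile("" : : : "memory");
    return v;
}
```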
Related
In functions like std::atomic::compare_exchange, there are runtime parameters like std::memory_order_release and std::memory_order_relaxed
(http://en.cppreference.com/w/cpp/atomic/atomic/compare_exchange).
I'm not sure whether these memory order flags are guaranteed to exist on all kinds of CPUs/architectures. If some CPU doesn't support a flag, does using it lead to a crash, or something else? Some of these flags seem to be designed for Intel Itanium and the like, so I'm not sure whether std::memory_order-related code is portable.
Could you give some suggestions?
The C++ standard does have the concept of so-called "freestanding" implementations, which can support a subset of the standard library. However, it also defines bare-minimum functionality that even a freestanding implementation must support. Within that list is the entirety of the <atomic> header.
So yes, implementations must support these.
However, that doesn't mean that a particular flag will do exactly and only what that flag describes. The flags represent the minimum memory barrier, the specific things that are guaranteed to be visible. The implementation could issue a full memory barrier even for flags that don't require it, if the hardware implementation doesn't have lower-tier memory barriers.
So you should write code against what the standard says, and let the compiler sort out the details. If it proves to be inefficient on a platform, you can inspect the assembly to see if you might be able to improve matters.
But to answer your main question, yes, atomic-based code is portable (modulo compiler bugs).
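To make the "write against what the standard says" advice concrete, here is a minimal sketch (my example, not from the answer above) of portable atomic code: a spinlock whose memory orders are the minimum the algorithm needs. An implementation targeting hardware without weaker barriers is free to emit something stronger.

```cpp
#include <atomic>

// Sketch: a spinlock written against the standard's guarantees. The
// orders given are the minimum the algorithm needs; an implementation
// may emit stronger barriers if the hardware has nothing weaker.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        bool expected = false;
        while (!locked.compare_exchange_weak(expected, true,
                                             std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = false;  // on failure, 'expected' holds the observed value
        }
    }
    void unlock() {
        locked.store(false, std::memory_order_release);
    }
};
```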
In general, compilers are free to provide only the strongest memory guarantees, regardless of what you ask for.
On some platforms, there are relaxed guarantees that are sufficient. Not all platforms support these relaxed guarantees. On those platforms, compilers must provide strictly stronger guarantees.
So they are portable, in that conforming compilers must provide that guarantee or better when you ask for a particular guarantee.
Note that it isn't just the hardware that is of concern. Certain optimizations and reorderings may be legal or not depending on what memory order guarantee you ask for. I am unaware of any compiler that relies on that, but I am not a compiler expert.
In practice, consume semantics are always strengthened to acquire by current compilers, because it turned out to be very hard to implement consume safely without doing that. Either:

- the architecture provides acquire semantics on all loads, so load-consume does the same thing as load-acquire, which does the same thing as load-relaxed: nothing special (like x86);
- the architecture requires an acquire barrier even on dependent accesses (very, very rare, maybe only DEC Alpha), so the compiler has to use an acquire barrier for consume anyway;
- or the ISA guarantees dependency ordering for loads in asm while a full acquire needs a barrier (this is exactly what consume is trying to expose to the programmer), and the compiler should give you the benefit of avoiding the barrier for logical (not crazy) uses of consume. Then either:
  - the compiler tries to do that, but it's tricky and fails in corner cases where back-end optimizations break the dependency (the front end often doesn't communicate enough with the back end to avoid this just for consume);
  - or you don't trust the compiler and turn optimization off, which doesn't even guarantee that some trivial optimization done implicitly by the back end won't break consume (and makes performance very bad);
  - or the compiler writers didn't care about efficiency, or knew they couldn't do a reliable job of providing consume, so they provide acquire instead, which is semantically correct but much less efficient, and not the intent of the standard.

(And C++'s consume semantics are crazy; it's one of the most inconsistent parts of C++, which tells you a lot.)
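For reference, a minimal sketch (hypothetical names) of the pattern consume was designed for: the dependent read is ordered after the atomic load by the data dependency itself on most ISAs, yet, as described above, current compilers will still treat the consume below as acquire.

```cpp
#include <atomic>

struct Config { int value; };
std::atomic<Config*> g_config{nullptr};

// The read of p->value is data-dependent on the atomic load, which is
// exactly what consume is meant to exploit; in practice today's
// compilers strengthen this to acquire.
int read_config() {
    Config* p = g_config.load(std::memory_order_consume);
    return p ? p->value : 0;  // dependent load, ordered after the atomic load
}
```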
> Since this is implementation-dependent, is the only way to find out by looking at the disassembly?
You can always look at the STL sources to see if they use SIMD, but I believe this is compiler-specific, and the STL doesn't directly utilize SIMD and AVX. It is up to the compiler to vectorize where possible as part of optimization.
So I'd rather look at the optimization report for a specific loop to see whether the compiler was able to vectorize it, and the reason if not.
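For example, with GCC or Clang you can ask for that report directly (real flags in current releases; exact output varies by version):

```cpp
// A trivially vectorizable loop to test against the optimization report.
// Example invocations:
//   g++     -O3 -fopt-info-vec-optimized vec.cpp
//   clang++ -O3 -Rpass=loop-vectorize    vec.cpp
void scale(float* a, const float* b, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];  // independent iterations: a SIMD candidate
}
```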
> Since this is implementation-dependent, is the only way to find out by looking at the disassembly?
Yes, there's no other way. Nor there are any guarantees what is actually used.
I've done a bit of googling and can't seem to turn up a GCC option or libstdc++ macro for this. Is it possible to force the use of locking internally on all the std::atomic template specializations? On some platforms some of the specializations use locks anyway, so it certainly seems like a feasible option.
In the past I've found the use of std::atomic to be very painful when debugging data races with tools such as Valgrind (Helgrind or DRD), due to the enormous number of false positives. If use of atomics is pervasive enough, suppression files don't seem to be a very scalable solution.
There is no way, AFAIK. GCC implements C++11 atomics via builtin functions (__atomic_fetch_add, __atomic_test_and_set, etc.). Depending on what is available in the machine definition, GCC may emit an efficient lock-free instruction sequence or, as a last resort, use a compare-and-swap loop. If nothing useful is available, GCC just emits calls to external functions with the same names and arguments.
http://gcc.gnu.org/onlinedocs/gcc-4.8.1/gcc/_005f_005fatomic-Builtins.html#_005f_005fatomic-Builtins
P.S. Actually, you may compile with -m32 -march=i386 and provide the required external functions yourself.
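As a sketch of that "provide the functions yourself" idea (my simplified example; libatomic's real signatures and naming conventions are in the GCC documentation linked above), a deliberately lock-based replacement could look like this:

```cpp
#include <cstdint>
#include <mutex>

// Deliberately lock-based stand-in for one of GCC's out-of-line atomic
// helpers (name follows the builtin convention; signature simplified,
// and the memory-order argument is ignored here). One global mutex
// keeps the sketch short; it is not a scalable design.
static std::mutex g_atomic_mutex;

extern "C" uint32_t __atomic_fetch_add_4(uint32_t* ptr, uint32_t val,
                                         int /*memorder*/) {
    std::lock_guard<std::mutex> guard(g_atomic_mutex);
    uint32_t old = *ptr;
    *ptr = old + val;
    return old;
}
```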
I read about cache optimization in C++ and the mechanisms modern CPUs use to predict what data is needed next so it can be copied into the cache ahead of time. But is there a direct way in C++ for programmers who know what is actually needed next to determine what data gets copied into the CPU cache?
This varies with the processor and compiler you're using.
Assuming you're using an Intel x86/x64 or compatible (e.g., AMD) processor, the processor provides a number of prefetch instructions, and most compilers include intrinsics to invoke them. With VC++ you use _m_prefetch or _m_prefetchw. With gcc you use __builtin_prefetch.
Likewise, VC++ on an ARM provides a __prefetch intrinsic for the same purpose (no, I really don't know why they couldn't have used the same name as on x86; the signature and effect appear identical).
Most other reasonably modern, higher-end processors probably provide similar instructions, and I'd guess most compilers provide intrinsics to make them available, but just as with these, the names of the intrinsics will vary. For that matter, even though the functions are intrinsic to the compiler, most require that you include some header to use them, and the name of the header will also vary.
The prefetch intrinsics Jerry provided will do the trick. Keep in mind that there are several flavors, controlled by an argument to the function, that determine which levels of the cache (if any) will be used to keep the line. A prefetch with the NTA hint, for example, would not pollute the caches, but rather provide the line for immediate use only (it's meant for cases where you're going to use the data soon and exactly once).
Also keep in mind that these instructions are basically hints to the CPU (which already does quite well on its own at guessing which lines to prefetch). As such, they are not guaranteed to work, and may fail in many cases (e.g. if the memory subsystem is loaded, or the address got swapped out of memory).
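A minimal usage sketch (my example) combining the intrinsics above; the hint constant chooses the cache level, and per the caveat above the CPU is free to ignore the request entirely:

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_* (x86: MSVC/GCC/Clang)

// Walk an array while prefetching a fixed distance ahead. _MM_HINT_T0
// pulls the line into all cache levels; _MM_HINT_NTA would request a
// non-temporal fetch instead. Either way it is only a hint.
float sum(const float* data, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + 16 < n)
            _mm_prefetch(reinterpret_cast<const char*>(data + i + 16),
                         _MM_HINT_T0);
        s += data[i];
    }
    return s;
}
```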
I stepped into the assembly of the transcendental math functions of the C library with MSVC in fp:strict mode. They all seem to follow the same pattern, here's what happens for sin.
First there is a dispatch routine from a file called "disp_pentium4.inc". It checks if the variable ___use_sse2_mathfcns has been set; if so, calls __sin_pentium4, otherwise calls __sin_default.
__sin_pentium4 (in "sin_pentium4.asm") starts by transferring the argument from the x87 fpu to the xmm0 register, performs the calculation using SSE2 instructions, and loads the result back in the fpu.
__sin_default (in "sin.asm") keeps the variable on the x87 stack and simply calls fsin.
So in both cases, the operand is pushed on the x87 stack and returned on it as well, making it transparent to the caller, but if ___use_sse2_mathfcns is defined, the operation is actually performed in SSE2 rather than x87.
This behavior is very interesting to me because the x87 transcendental functions are notorious for having slightly different behaviors depending on the implementation, whereas a given piece of SSE2 code should always give reproducible results.
Is there a way to determine for certain, either at compile or run-time, that the SSE2 code path will be used? I am not proficient writing assembly, so if this involves writing any assembly, a code example would be appreciated.
I found the answer through careful investigation of math.h. The behavior is controlled by a function called _set_SSE2_enable, a public symbol documented here:
> Enables or disables the use of Streaming SIMD Extensions 2 (SSE2) instructions in CRT math routines. (This function is not available on x64 architectures because SSE2 is enabled by default.)
This causes the aforementioned ___use_sse2_mathfcns flag to be set to the provided value, effectively enabling or disabling use of the _pentium4 SSE2 routines.
The documentation mentions that this affects only certain transcendental functions, but looking at the disassembly, it seems to affect every one of them.
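For illustration, a minimal sketch of using it (32-bit MSVC only; per the documentation, _set_SSE2_enable returns nonzero when the SSE2 implementation is enabled):

```cpp
#include <math.h>
#include <cstdio>

// 32-bit MSVC only: force the SSE2 math paths on and verify the result.
int main() {
    if (_set_SSE2_enable(1))
        std::printf("SSE2 math routines enabled\n");
    else
        std::printf("SSE2 unavailable; x87 paths will be used\n");
}
```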
Edit: stepping into every function reveals that they're all available in SSE2 except for the following:
fmod
sinh
cosh
tanh
sqrt
Sqrt is the biggest offender, but it's trivial to implement in SSE2 using intrinsics. For the others, there's no simple solution except perhaps using a third-party library, but I can probably do without.
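For example, a trivial SSE2 sqrt along the lines mentioned above (my sketch, using the standard intrinsics):

```cpp
#include <emmintrin.h>  // SSE2

// sqrt computed entirely in an XMM register via sqrtsd, bypassing x87.
inline double sse2_sqrt(double x) {
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_sd(v, v));
}
```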
Why not use your own library instead of the C runtime? This would provide an even stronger guarantee of consistency across computers (presumably the C runtime is provided as a DLL and might change slightly in time).
I would recommend CRlibm. If you are already targeting SSE2, and as long as you did not intend to change the FPU's rounding mode, you are in the ideal conditions to use it, and you won't find a more accurate implementation.
The short answer is that you can't tell IN YOUR CODE for certain what the library will do, unless you are also involving library-implementation specific details. These would make the code completely unportable - even two different builds of the same compiler may change the internals of the library.
Of course, if portability isn't an issue, then using extern <type> ___use_sse2_mathfcns; and checking whether it's true would clearly work.
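That unportable check could look like the following sketch (the symbol's exact type and linkage name are CRT implementation details, so treat this as illustrative only):

```cpp
// Unportable by construction: peek at the CRT's internal flag. The
// actual type and linkage name are implementation details; int is a guess.
extern "C" int ___use_sse2_mathfcns;

bool crt_uses_sse2() {
    return ___use_sse2_mathfcns != 0;
}
```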
I expect that if the processor has SSE2 and you are using a modern enough library, it would use SSE2 wherever possible. But to say that for certain is a different matter.
If this is critical for your code, then implement your own transcendental functions and use those; that's the only way to guarantee the same result. Or, use some suitable inline assembler code to calculate selected sin, cos, etc. values, and compare those with the sin() and cos() functions provided by the library.