Is atomic.load() faster than accessing it normally? - C++

Recently I have been using atomics a lot in C++, since I use threading heavily and thread safety is important to me.
Well, I had a problem with the printf() function. Here is an example:
atomic_uint64_t count = {0};
printf("%lu",count);
// It gives a couple of errors like atomic(const atomic&) = delete; and "use of deleted function", so I had to write it like this to make it work
printf("%lu",count.load());
// Or
printf("%lu",(uint64_t)count);
Well, anyway, I don't know which is better for performance; I really care about speed.
So I started thinking about which is better for retrieving the value and using it in if conditions or anywhere else.
Like
if(count.load() < 8 ){
// Do smth
}
or
if(count < 8){
// Do smth
}
Which is better for speed and performance? Thanks.

They're exactly identical in their meaning (unless you pass a non-default memory order like count.load(std::memory_order_acquire)).
I'd expect there to be no difference in the generated assembly for all compilers across all ISAs, with optimization enabled of course. There isn't for GCC/clang/MSVC/ICC in code I've looked at on https://godbolt.org/. This is true regardless of surrounding code it's inlining into.
If there is ever a difference, and one is slower or takes more code-size, report that as a missed-optimization compiler bug in whatever compiler you're using. (Unless you had optimization disabled, then an extra level of calls to wrapper functions is possible.)
As for the error, that's because you're evaluating it in a context that doesn't already imply a type: as an operand for a variadic function (printf).
If there's enough context to imply that you want the underlying T value from an atomic<T> (which is what atomic_uint64_t is), then the operator T() overload is called, which is documented as being equivalent to .load(). Same deal for assignment and .store().
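For example, a minimal sketch of that equivalence (using PRIu64 for the format string, which the question's %lu glosses over; the function name is just for illustration):
#include <atomic>
#include <cinttypes>
#include <cstdio>

std::atomic<std::uint64_t> count{0};

void print_count() {
    std::uint64_t v = count;                    // operator T(): same as count.load()
    printf("%" PRIu64 "\n", v);
    printf("%" PRIu64 "\n", count.load());      // explicit load, identical codegen
}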
There aren't any other functions that let you access only the low 32 bits of an atomic 64-bit integer (unfortunately); even on a 32-bit machine, current compilers will actually go to the trouble of doing a 64-bit atomic load (which is efficient on some 32-bit machines, not on others), then discarding the high 32 if you cast the value to a narrower type. (This is a missed-optimization, but compilers truly don't optimize atomics for the moment.)
So there's no ambiguity being resolved by .load, or any way a cast can pick a different load.
One reason for the existence of .load() and .store() is that they take a std::memory_order parameter, which defaults to seq_cst but can be weaker if you only need acq/rel synchronization between threads, or none at all with relaxed, just atomicity.
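For example, a hypothetical hit counter where only atomicity matters and no ordering against other data is required:
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> hits{0};

void record() { hits.fetch_add(1, std::memory_order_relaxed); }          // just atomicity
std::uint64_t snapshot() { return hits.load(std::memory_order_relaxed); }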
Another reason is to let you write foo.load() to remind readers of your code that this is an atomic variable, not just a plain primitive type. For that style reason I'd prefer count.load(). Presumably if you changed its type away from uint64_t, you'd want to change how you printed it, not still cast it to uint64_t. Using .load() will let the compiler warn you about the format-string mismatch if you change its type.

Related

How does reordering numerical code in order to avoid temporary variables make the code faster?

My experience (this is a statement, not the question) is that avoiding non-const local variables in favor of const variables, or avoiding local variables altogether, enables the C++ compiler to generate faster code.
I assume that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
Any other explanation? E.g. the compiler giving up on certain optimizations as soon as the code gets too complex, in order to avoid astronomical compile times?
No, assignments don't force the compiler to insert a sync point. If the variables are local, and don't affect anything visible outside your function, the compiler will remove all unneeded variables, as part of the usual "register allocation" optimization it does.
If your code is so complex it approaches the limit of what the compiler can keep in memory, additional local variables can make the compiler give up and produce unoptimized code. However, this is a very rare edge-case; and it can be triggered on any change in code, not only regarding local variables.
Generally, compiler optimization is hard to reason about, outside of well-known problems (aliasing, loop-carried dependencies, etc). You might feel like you found some related consideration, but it could disappear when you upgrade your compiler or switch to a different one.
Assignments to local variables that you don't subsequently modify allow the compiler to assume that the value in that variable won't change. It might therefore decide (for example) to store it in a register for the 'usage-span' of the variable. This is a simple optimisation, and no self-respecting compiler is going to miss it (unless perhaps register pressure means it is forced to spill).
An example of where this might speed up the code (and maybe reduce code size a little also) is to assign a member variable to a local and then subsequently use that instead of the member variable. If you are confident that the value is not going to change, this might help the compiler generate better code. But then again, it might be a good way of introducing bugs; you do have to be careful playing games like this.
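A minimal sketch of that member-to-local trick (assuming nothing else modifies scale while the loop runs; the struct is hypothetical):
#include <cstddef>

struct Scaler {
    float scale;
    void apply(float* data, std::size_t n) {
        const float s = scale;              // read the member once
        for (std::size_t i = 0; i < n; ++i)
            data[i] *= s;                   // no repeated loads of this->scale
    }
};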
As Thomas Matthews said in the comments, another advantage of doing what you might consider to be a redundant assignment is to help with debugging. It allows the variable to be inspected (and perhaps adjusted) during a debugging run and that can be really handy. I'm not proud, I make mistakes, so I do it a lot.
Just my $0.02
It's unusual that temp vars hurt optimization; usually they're optimized away, or they help the compiler do a load or calculation once instead of repeating it (common subexpression elimination).
Repeated access to arr[i] might actually load multiple times if the compiler can't prove that no stores through other pointers to the same type could have modified that array element. float *__restrict arr can help the compiler figure it out, or float ai = arr[i]; can tell the compiler to read it once and keep using the same value, regardless of other stores.
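A sketch of that aliasing point, with hypothetical functions where out might point into the same array as arr:
// Without __restrict or a local copy, the store to out[0] could alias arr[i],
// so the compiler must reload arr[i] for the second use.
void scale2(float* arr, float* out, int i) {
    out[0] = arr[i] * 2.0f;
    out[1] = arr[i] * 3.0f;
}

// Caching the value in a local tells the compiler to read it once and reuse it.
void scale2_cached(float* arr, float* out, int i) {
    float ai = arr[i];
    out[0] = ai * 2.0f;
    out[1] = ai * 3.0f;
}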
Of course, if optimization is disabled, more statements are typically slower than fewer large expressions, and store/reload latency is usually the main bottleneck. See How to optimize these loops (with compiler optimization disabled)? . But -O0 (no optimization) is supposed to be slow. If you're compiling without at least -O2, preferably -O3 -march=native -ffast-math -flto, that's your problem.
I assume that this gives the compiler more freedom to interleave calculation of expressions, whereas assignments force the compiler to insert a sync point.
Is this assumption in fact the case?
"Sync point" isn't the right technical term for it, but ISO C++ rules for FP math do distinguish between optimization within one expression vs. across statements / expressions.
Contraction of a * b + c into fma(a,b,c) is only allowed within one expression, if at all.
GCC defaults to -ffp-contract=fast, allowing it across expressions. clang defaults to strict or no, but supports -ffp-contract=fast. See How to use Fused Multiply-Add (FMA) instructions with SSE/AVX . If fast makes the code with temp vars run as fast as without, strict FP-contraction rules were the reason why it was slower with temp vars.
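A minimal sketch of what the contraction rule means at the source level:
double fused(double a, double b, double c) {
    return a * b + c;      // one expression: contraction into fma(a, b, c) is allowed
}

double with_temp(double a, double b, double c) {
    double p = a * b;      // temp var: strict rules require rounding p here,
    return p + c;          // so no FMA unless -ffp-contract=fast
}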
On legacy x87 80-bit FP math, or other unusual machines with FLT_EVAL_METHOD != 0, FP math happens at higher precision, and rounding to float or double costs extra. Strict ISO C++ semantics require rounding at expression boundaries, e.g. on assignments. GCC defaults to ignoring that (-fno-float-store), but -std=c++11 or whatever (instead of -std=gnu++11) will enforce that extra rounding work (a store/reload, which costs throughput and latency).
This isn't a problem for x86 with SSE2 for scalar math; computation happens at either float or double according to the type of the data, with instructions like mulsd (scalar double) or mulss (scalar single). So it implements FLT_EVAL_METHOD == 0 instead of x87's 2. Hopefully nobody in 2023 is building number crunching code for 32-bit x87 and caring about the performance, especially without mentioning that obscure build choice. I mention this mostly for completeness.

Can I use char variable without lock in the multi-threading case

As c/c++ standard said, the size of char must be 1. As my understanding, that means CPU guarantees that any read or write on a char must be done in one instruction.
Let's say we have many threads, which share a char variable:
char target = 1;
// thread a
target = 0;
// thread b
target = 1;
// thread 1
while (target == 1) {
// do something
}
// thread 2
while (target == 1) {
// do something
}
In a word, there are two kinds of threads: some of them set target to 0 or 1, and the others do some tasks if target == 1. The goal is that we can control the task threads by modifying the value of target.
As my understanding, it doesn't seem that we need to use mutex/lock at all. But my coding experience gave me a strong feeling that we must use mutex/lock in this case.
I'm confused now. Should I use mutex/lock or not in this case?
You see, I can understand why we need a mutex/lock in other cases, such as i++. Because i++ can't be done in only one instruction. But target = 0 can be done in one instruction, right? If so, does it mean that we don't need a mutex/lock in this case?
Well, I know that we could use std::atomic, so my question is: is it OK to use neither a mutex/lock nor std::atomic?
std::atomic guarantees that accessing a variable is atomic. From cppreference:
Each instantiation and full specialization of the std::atomic template
defines an atomic type. If one thread writes to an atomic object while
another thread reads from it, the behavior is well-defined (see memory
model for details on data races).
When a char actually is atomic (being size 1 is not sufficient), then std::atomic<char> needs no extra synchronization. However, on a platform where char is not atomic, std::atomic<char> guarantees that it can be read and written atomically by using a mutex or similar.
In practice, I'd expect char to be atomic, but the standard does not guarantee that.
Also consider that operations like e.g. += read and write the value, hence atomic reads and writes alone are not sufficient to safely call +=, while std::atomic<T> has a proper operator+=.
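For illustration, a minimal sketch of the question's example rewritten with std::atomic (the function names are made up):
#include <atomic>

std::atomic<char> target{1};

void disable() { target = 0; }     // atomic store
void enable()  { target = 1; }     // atomic store

void worker() {
    while (target == 1) {          // atomic load on every iteration
        // do something
    }
}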
TL;DR
I'm confused now. Should I use mutex/lock or not in this case?
Let someone else take that decision for you. When you want something atomic, use a std::atomic<something> unless you want fine grained control over the synchronisation.
TL;DR
is it OK to use neither a mutex/lock nor std::atomic
No
In general, it's not ok to assume things. If you need guarantees for something, then make sure you have them.
This is closely related to a common logical fallacy. Just because you cannot imagine how something could fail does not mean that it cannot fail.
Longer version
As c/c++ standard said
There's no such thing as "C/C++" and definitely not "the C/C++ standard". They are two completely different languages with different standards. However, they do agree on this point. sizeof (char) is 1 in both languages.
(Sidenote: sizeof 'a' will yield different results.)
As my understanding, that means CPU guarantees that any read or write on a char must be done in one instruction.
That's not correct. The CPU has its own specification, completely separate from the language standards. And there's nothing that says that this has to be true, even if it probably is in most or all cases.
Because i++ can't be done in only one instruction.
That is CPU dependent. The x86 architecture has an instruction for this. https://c9x.me/x86/html/file_module_x86_id_140.html
As my understanding, it doesn't seem that we need to use mutex/lock at all. But my coding experience gave me a strong feeling that we must use mutex/lock in this case.
Even if the target CPU does read and write in one instruction, which it probably does, there's nothing that says that the C or C++ code needs to be compiled to just that instruction.
The standards for both C and C++ describes the behavior of the code. Not how it is converted to assembly.
So no, you cannot make the assumptions you're doing.
In general, it cannot be assumed that reading or writing a char is an atomic operation. However, the target architecture may provide that guarantee. For embedded C programs it is common practice to rely on such underlying guarantees to avoid the overhead of synchronization mechanisms in certain situations.
In the example in the question it must be noted that even if reading/writing target is an atomic operation, the value could be changed at any time, so there is no guarantee that it will be 1 inside the while loops.

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now - which should I prefer, and when: warpSize, or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile time constant and shouldn't be treated as one. It is an architecture specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX; however, that changed at least 6 years ago, if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
Don't declare your own constant for the warp size; that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory. You can read the warp size from the host API at runtime. If you have a statically declared in-kernel array you need to dimension from the warp size, use templates and select the correct instance at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
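A rough sketch of what that template approach could look like (the kernel and launch wrapper here are hypothetical, not a CUDA API):
template <int WarpSize>
__global__ void kernel() {
    __shared__ int buf[WarpSize * 4];                // sized from the template parameter
    buf[threadIdx.x % (WarpSize * 4)] = threadIdx.x; // ... use buf ...
}

void launch(const cudaDeviceProp& prop) {
    if (prop.warpSize == 32)
        kernel<32><<<1, 128>>>();
    // add further instantiations if a different warp size ever appears
}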
Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will have to be adjusted obviously
Many warp-level intrinsics will need their signatures adjusted, or a new version produced, e.g. int __ballot, and while int does not need to be 32-bit, it is most commonly so!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize this suddenly becomes a quite expensive operation!
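For illustration, what the literal 32 compiles down to:
warpIdx = threadIdx.x >> 5;    // threadIdx.x / 32
laneIdx = threadIdx.x & 31;    // threadIdx.x % 32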
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on the warpSize this forces you to use the dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- this forces you to do pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for newer architecture, you just need to alter this one piece of code.

C++ conversion from int to bool

I want to know if the compiled code of an int-to-bool conversion contains a branch (jump) operation.
For example, given void func(bool b) and int i:
Is the compiled code of calling func(i) equivalent to the compiled code of func(i? 1:0)?
Or is there a more elaborate way for the compiler to perform this without the branch operation?
Update:
In other words, what code does the compiler generate in order to push 1 or 0 into the stack before jumping to the address of the function?
I assume that it really comes down to the architecture of the CPU at hand, and that some specific processors (certain DSPs, for example) may support this. So my question refers to "conventional" general-purpose CPUs (assuming that this definition is acceptable).
In terms of pure software, the question can also be phrased as: is there an efficient way for converting an integer value to 1 when it's not 0, and to 0 otherwise, without using a conditional statement?
Thanks
It's not your (the compiler user's) job to make built-in type conversions efficient. If the compiler is not dumb, it will make that sort of thing map as closely as possible onto what the CPU provides.
For most commercial CPUs, bool and int are the exact same thing, and if(x) { ... }
translates into bit-ANDing (or bit-ORing, whichever is faster: they are normally immediate instructions) x with itself and a conditional jump to after the } if the zero flag is set. (Note that this is just a trick to force the zero-flag computation, which is an immediate consequence of the arithmetic unit's electronics.)
Variants are much more a matter of CPU electronics than of code, so don't care about it. ifs are not triggered by a bool, but by the result of the last arithmetic operation.
Whatever arithmetic operation a CPU performs produces a result and sets some flags that represent certain attributes of that result: whether it is zero, whether it produced a carry or borrow, whether it has an odd or even number of bits set to 1, etc. The result and the flags are two registers, and can be loaded and stored from/to memory.
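As a concrete sketch of the branch-free conversion the question asks about (on x86 this typically becomes a test/cmp plus setne, with no jump; the function name is made up):
int to_zero_or_one(int i) {
    return i != 0;     // or !!i; both normalize any non-zero value to 1 without branching
}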

What is the performance implication of converting to bool in C++?

[This question is related to but not the same as this one.]
My compiler warns about implicitly converting or casting certain types to bool whereas explicit conversions do not produce a warning:
long t = 0;
bool b = false;
b = t; // performance warning: forcing long to bool
b = (bool)t; // performance warning
b = bool(t); // performance warning
b = static_cast<bool>(t); // performance warning
b = t ? true : false; // ok, no warning
b = t != 0; // ok
b = !!t; // ok
This is with Visual C++ 2008 but I suspect other compilers may have similar warnings.
So my question is: what is the performance implication of casting/converting to bool? Does explicit conversion have better performance in some circumstance (e.g., for certain target architectures or processors)? Does implicit conversion somehow confuse the optimizer?
Microsoft's explanation of their warning is not particularly helpful. They imply that there is a good reason but they don't explain it.
I was puzzled by this behaviour, until I found this link:
http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=99633
Apparently, coming from the Microsoft Developer who "owns" this warning:
This warning is surprisingly helpful, and found a bug in my code just yesterday. I think Martin is taking "performance warning" out of context.
It's not about the generated code, it's about whether or not the programmer has signalled an intent to change a value from int to bool. There is a penalty for that, and the user has the choice to use "int" instead of "bool" consistently (or more likely vice versa) to avoid the "boolifying" codegen. [...]
It is an old warning, and may have outlived its purpose, but it's behaving as designed here.
So it seems to me the warning is more about style and avoiding some mistakes than anything else.
Hope this will answer your question...
:-p
The performance is identical across the board. It involves a couple of instructions on x86, maybe 3 on some other architectures.
On x86 / VC++, they all do
cmp DWORD PTR [whatever], 0
setne al
GCC generates the same thing, but without the warnings (at any warning-level).
The performance warning does actually make a little bit of sense. I've had it as well and my curiosity led me to investigate with the disassembler. It is trying to tell you that the compiler has to generate some code to coerce the value to either 0 or 1. Because you are insisting on a bool, the old school C idea of 0 or anything else doesn't apply.
You can avoid that tiny performance hit if you really want to. The best way is to avoid the cast altogether and use a bool from the start. If you must have an int, you could just use if( int ) instead of if( bool ). The code generated will simply check whether the int is 0 or not. No extra code to make sure the value is 1 if it's not 0 will be generated.
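A small sketch of that difference (do_something is hypothetical; with optimization on, compilers often see through the temporary anyway):
void do_something();

void with_bool(int i) {
    bool b = i;                 // conversion may add test + setne to force the value to 0 or 1
    if (b) do_something();
}

void with_int(int i) {
    if (i) do_something();      // just test + conditional jump, no boolification step
}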
Sounds like premature optimization to me. Are you expecting the performance of the cast to seriously affect the performance of your app? Maybe if you are writing kernel code or device drivers, but in most cases they should all be OK.
As far as I know, there is no warning on any other compiler for this. The only way I can think of that this would cause a performance loss is that the compiler has to compare the entire integer to 0 and then assign the bool appropriately (unlike a conversion such as char to bool, where the result can be copied over because a bool is one byte and so they are effectively the same), or an integral conversion, which involves copying some or all of the source to the destination, possibly after zeroing the destination if it's bigger than the source (in terms of memory).
It's yet another one of Microsoft's useless and unhelpful ideas as to what constitutes good code, and leads us to have to put up with stupid definitions like this:
template <typename T>
inline bool to_bool (const T& t)
{ return t ? true : false; }
long t;
bool b;
int i;
signed char c;
...
You get a warning when you do anything that would be "free" if bool wasn't required to be 0 or 1. b = !!t is effectively assigning the result of the (language built-in, non-overrideable) bool operator!(long).
You shouldn't expect the ! or != operators to cost zero asm instructions even with an optimizing compiler. It is usually true that int i = t is optimized away completely. Or even signed char c = t; (on x86/amd64, if t is in the %eax register, after c = t, using c just means using %al. amd64 has byte addressing for every register, BTW. IIRC, in x86 some registers don't have byte addressing.)
Anyway, b = t; i = b; isn't the same as c = t; i = c; it's i = !!t; instead of i = t & 0xff;
Err, I guess everyone already knows all that from the previous replies. My point was, the warning made sense to me, since it caught cases where the compiler had to do things you didn't really tell it to, like !!BOOL on return because you declared the function bool, but are returning an integral value that could be true and != 1. e.g. a lot of windows stuff returns BOOL (int).
This is one of MSVC's few warnings that G++ doesn't have. I'm a lot more used to g++, and it definitely warns about stuff MSVC doesn't, but that I'm glad it told me about. I wrote a portab.h header file with stubs for the MFC/Win32 classes/macros/functions I used. This got the MFC app I'm working on to compile on my GNU/Linux machine at home (and with cygwin). I mainly wanted to be able to compile-test what I was working on at home, but I ended up finding g++'s warnings very useful. It's also a lot stricter about e.g. templates...
On bool in general, I'm not sure it makes for better code when used as a return value or for parameter passing. Even for locals, g++ 4.3 doesn't seem to figure out that it doesn't have to coerce the value to 0 or 1 before branching on it. If it's a local variable and you never take its address, the compiler should keep it in whatever size is fastest. If it has to spill it from registers to the stack, it could just as well keep it in 4 bytes, since that may be slightly faster. (It uses a lot of movsx (sign-extension) instructions when loading/storing (non-local) bools, but I don't really remember what it did for automatic (local stack) variables. I do remember seeing it reserve an odd amount of stack space (not a multiple of 4) in functions that had some bool locals.)
Using bool flags was slower than int with the Digital Mars D compiler as of last year:
http://www.digitalmars.com/d/archives/digitalmars/D/opEquals_needs_to_return_bool_71813.html
(D is a lot like C++, but abandons full C backwards compat to define some nice new semantics, and good support for template metaprogramming. e.g. "static if" or "static assert" instead of template hacks or cpp macros. I'd really like to give D a try sometime. :)
For data structures, it can make sense, e.g. if you want to pack a couple flags before an int and then some doubles in a struct you're going to have quite a lot of.
Based on your link to MS' explanation, it appears that if the value is merely 1 or 0, there is no performance hit, but if it's any other non-0 value then the compiler must generate a comparison?
In C++ a bool is an int with only two values: 0 = false, 1 = true. The compiler only has to check one bit. To be perfectly clear, true != 0, so any int can be converted to bool; it just costs processing cycles to do so.
By using a long as in the code sample, you are forcing a lot more bit checks, which will cause a performance hit.
No, this is not premature optimization; it is quite crazy to use code that takes more processing time across the board. This is simply good coding practice.
Unless you're writing code for a really critical inner loop (simulator core, ray-tracer, etc.) there is no point in worrying about any performance hits in this case. There are other more important things to worry about in your code (and other more significant performance traps lurking, I'm sure).
Microsoft's explanation seems to be that what they're trying to say is:
Hey, if you're using an int, but are only storing true or false information in it, make it a bool!
I'm skeptical about how much would be gained performance-wise, but MS may have found that there was some gain (for their use, anyway). Microsoft's code does tend to run an awful lot, so maybe they've found the micro-optimization to be worthwhile. I believe that a fair bit of what goes into the MS compiler is to support stuff they find useful themselves (only makes sense, right?).
And you get rid of some dirty, little casts to boot.
I don't think performance is the issue here. The reason you get a warning is that information is lost during conversion from int to bool.