C++ Expression Evaluation: What Happens "Under The Hood"?

I'm still learning C++. I'm trying to understand how evaluation is carried out, in a rather step-by-step fashion. So using this simple example, an expression statement:
int x = 8 * 5 - 5;
This is what I believe happens. Please tell me how far off the mark I am:
The operands x, 8, 5, and 5 are "evaluated." Possibly, a temporary object is created to hold each value (I am not too sure about this).
8 * 5 evaluates to 40, which is stored in a temporary.
40 (temporary) - 5 evaluates to 35 (another temporary).
35 is copied into x.
All temporary objects are destroyed in the reverse order they were created in (the value is discarded).
Am I at least close to being right?

"Thank you, sir. Hm. What would happen if all the operands were named objects, rather than literals? Would it create temporaries on the fly, so to speak, rather than at compile time?"
As Sam mentioned, you are on the right track at a high level.
In your first example, the compiler would use CPU registers to store the temporaries (since they are not named objects). If they were named objects, how 'optimized' the generated code is depends on the optimization flags set on the compiler and on the complexity of the code. You can take a look at the disassembly to really see what happens. For example, if you write
int a = 5;
int b = 2;
int c = a * b;
the compiler will try to generate the most optimal code, and since in this case there are two constants known at compile time, and you do a multiplication by 2, it can take shortcuts: multiplications are sometimes replaced by bit operations, which are cheaper (multiplying by 2 is the same as shifting left by one bit).
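For instance, here is a minimal sketch (the wrapping main and the assertions are only there for illustration) that you can compile and inspect; with optimization on, c is usually folded to 10 before any code runs:
#include <cassert>

int main() {
    int a = 5;
    int b = 2;
    int c = a * b;  // both operands known at compile time: typically folded straight to 10

    // Strength reduction: multiplying by a power of two is the same as shifting left.
    assert((a * 2) == (a << 1));

    // The initializer from the original question is likewise a compile-time constant.
    static_assert(8 * 5 - 5 == 35, "folds to a constant");
    return c;
}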
Named variables have to live somewhere, either on the stack or on the heap, and the CPU will use the addresses of named objects to pass them around and operate on them. (If they are small enough they fit in registers and are operated on there; otherwise memory gets used, first the cache, and then it bleeds out to RAM.)
You could google 'abstract syntax tree' to get an idea of how readable C++ code is converted to machine code.
This is why it is important to learn about const correctness, aliasing, and pointers vs. references: it gives the compiler the best chance of generating optimal code for you (aside from the advantages a user gets from those practices).

Related

Memory steps when defining a variable - are they accurate?

Let's say we are defining a variable:
float myFloat{3};
I assume that these steps happen in memory while defining a variable, but I am not entirely sure.
Initial assumption: memory consists of addresses and corresponding values, and values are kept as binary codes.
1. Create the binary code (value) for the literal 3 at address1.
2. Convert this integer binary code of 3 to the float binary code of 3 at address2 (type conversion).
3. Copy this binary code (value) from address2 to the memory reserved for myFloat.
Are these steps accurate? I would like to hear from you. Thanks.
Conceptually that’s accurate, but with any optimization, the compiler will probably generate the 3.0f value at compile time, making it just a load of that constant to the right stack address. Furthermore, the optimizer may well optimize it out entirely. If the next line says myFloat *= 0.0f; return myFloat;, the compiler will turn the whole function into essentially return 0.0f; although it may spell it in a funny way. Check out Compiler Explorer to get a sense of it.
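As a minimal sketch of the kind of function described above (the function name is made up for illustration), an optimizing compiler will commonly reduce the whole body to returning 0.0f directly:
float make_float() {
    float myFloat{3};  // the int literal 3 becomes the constant 3.0f at compile time
    myFloat *= 0.0f;
    return myFloat;    // typically folded by the optimizer to just "return 0.0f;"
}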

Why is passing string_view by value faster than by const reference?

I checked this question, and most answers say that I should pass it by value even though that clearly passes more data (by value you pass 8 bytes while by reference only 4; on a 32-bit system sizeof(string_view) > sizeof(string_view*)).
Is that still relevant in C++17/20? And can someone explain why, exactly?
Indirection through a reference (as well as a pointer) has a cost. That cost can be more than the cost of copying a few bytes. As in most cases, you need to verify through measurement whether that is true for your use case / target system. Note that if the function is expanded inline, then there is unlikely to be any difference as you may end up with identical assembly in either case. Even if not, the difference may be extremely small and hard to measure.
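As a rough sketch of what the two signatures look like (the function names are made up), the by-value version hands the callee the pointer/length pair directly, often in registers, while the by-reference version hands it the address of a string_view that must be dereferenced first:
#include <cstddef>
#include <string_view>

std::size_t length_by_value(std::string_view sv) {  // copies two words (pointer + size)
    return sv.size();                                // the data members are already at hand
}

std::size_t length_by_ref(const std::string_view& sv) {  // passes the address of a string_view
    return sv.size();                                     // extra indirection to reach pointer + size
}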

When should I use CUDA's built-in warpSize, as opposed to my own proper constant?

nvcc device code has access to a built-in value, warpSize, which is set to the warp size of the device executing the kernel (i.e. 32 for the foreseeable future). Usually you can't tell it apart from a constant - but if you try to declare an array of length warpSize you get a complaint about it being non-const... (with CUDA 7.5)
So, at least for that purpose you are motivated to have something like (edit):
enum : unsigned int { warp_size = 32 };
somewhere in your headers. But now - which should I prefer, and when: warpSize or warp_size?
Edit: warpSize is apparently a compile-time constant in PTX. Still, the question stands.
Let's get a couple of points straight. The warp size isn't a compile-time constant and shouldn't be treated as one. It is an architecture-specific runtime immediate constant (and its value just happens to be 32 for all architectures to date). Once upon a time, the old Open64 compiler did emit a constant into PTX; however, that changed at least 6 years ago, if my memory doesn't fail me.
The value is available:
In CUDA C via warpSize, where it is not a compile-time constant (the PTX WARP_SZ variable is emitted by the compiler in such cases).
In PTX assembler via WARP_SZ, where it is a runtime immediate constant
From the runtime API as a device property
Don't declare your own constant for the warp size; that is just asking for trouble. The normal use case for an in-kernel array dimensioned to be some multiple of the warp size would be to use dynamically allocated shared memory, and you can read the warp size from the host API at runtime to size it. If you have a statically declared in-kernel array that you need to dimension from the warp size, use templates and select the correct instantiation at runtime. The latter might seem like unnecessary theatre, but it is the right thing to do for a use case that almost never arises in practice. The choice is yours.
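A rough sketch of that template approach (the kernel, its body, and the launch configuration are made up for illustration): the warp size is a template parameter so the static shared array has a compile-time dimension, and the host queries the device property and picks the matching instantiation:
template <int WarpSize>
__global__ void scale_kernel(float* data) {
    __shared__ float tile[WarpSize];  // statically dimensioned from the template parameter
    tile[threadIdx.x % WarpSize] = data[threadIdx.x] * 2.0f;
    __syncthreads();
    data[threadIdx.x] = tile[threadIdx.x % WarpSize];
}

void launch(float* data, int device) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);  // warp size read from the host API at runtime
    switch (prop.warpSize) {
        case 32: scale_kernel<32><<<1, 32>>>(data); break;
        default: break;  // an unexpected warp size: report or handle as appropriate
    }
}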
Contrary to talonmies's answer, I find a warp_size constant perfectly acceptable. The only reason to use warpSize is to make the code forward-compatible with possible future hardware that may have warps of a different size. However, when such hardware arrives, the kernel code will most likely require other alterations as well in order to remain efficient. CUDA is not a hardware-agnostic language - on the contrary, it is still quite a low-level programming language. Production code uses various intrinsic functions that come and go over time (e.g. __umul24).
The day we get a different warp size (e.g. 64) many things will change:
The warpSize will obviously have to be adjusted
Many warp-level intrinsics will need their signatures adjusted, or new versions produced, e.g. int __ballot, and while int does not need to be 32-bit, it most commonly is!
Iterative operations, such as warp-level reductions, will need their number of iterations adjusted. I have never seen anyone writing:
for (int i = 0; i < log2(warpSize); ++i) ...
that would be overly complex in something that is usually a time-critical piece of code.
warpIdx and laneIdx computation out of threadIdx would need to be adjusted. Currently, the most typical code I see for it is:
warpIdx = threadIdx.x/32;
laneIdx = threadIdx.x%32;
which reduces to simple right-shift and mask operations. However, if you replace 32 with warpSize, this suddenly becomes quite an expensive operation!
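For comparison, this is the shift/mask form the compiler can emit when the divisor is the literal 32 (a sketch assuming today's 32-thread warps):
warpIdx = threadIdx.x >> 5;  // divide by 32 as a right shift
laneIdx = threadIdx.x & 31;  // modulo 32 as a mask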
At the same time, using warpSize in the code prevents optimization, since formally it is not a compile-time known constant.
Also, if the amount of shared memory depends on warpSize, this forces you to use dynamically allocated shmem (as per talonmies's answer). However, the syntax for that is inconvenient to use, especially when you have several arrays -- it forces you to do the pointer arithmetic yourself and manually compute the sum of all memory usage.
Using templates for that warp_size is a partial solution, but adds a layer of syntactic complexity needed at every function call:
deviceFunction<warp_size>(params)
This obfuscates the code. The more boilerplate, the harder the code is to read and maintain.
My suggestion would be to have a single header that controls all the model-specific constants, e.g.
#if __CUDA_ARCH__ <= 600
//all devices of compute capability <= 6.0
static const int warp_size = 32;
#endif
Now the rest of your CUDA code can use it without any syntactic overhead. The day you decide to add support for a newer architecture, you just need to alter this one piece of code.

Performance of Initialization from Different Type

I'm porting some code, and the original author was evidently quite concerned with squeezing as much performance as possible out of the code.
Throughout (and there's hundreds of source files), there are lots of things like this:
float f = (float)(6);
type_float tf = (type_float)(0); //type_float is a typedef of float xor double
In short, the author tried to make the type of the RHS of assignments match the variable being assigned into. The aim, I presume, was to coerce the compiler into turning e.g. the 6 in the first example into 6.0f, so that no conversion overhead happens when that value is copied into the variable.
This would actually be useful for something like the second example, where the proper form of the literal (one of {0.0f,0.0}) isn't known/can be changed from a line far away. However, I can see it being problematic if the literal is converted and stored into a temporary and then copied, instead of the conversion happening on copy.
Is this author onto something here? Are all these literals actually being stored with the intended type? Or is this just a massive waste of source file bits? What is the best way to handle these sorts of cases in modern code?
Note: I believe this applies to both C and C++, so I have applied both tags.
This is a complete waste. No modern optimizing compiler will keep track of intermediate values; it will directly initialize the variable with the final, correct value. There is really no point in it: the default conversion should always do the right thing here. And yes, this applies to both C and C++, and they shouldn't differ much in behavior.
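As a quick sketch of why it makes no difference (the variable names here are just for illustration), both forms are required to yield the same value and the conversion happens at compile time either way:
float a = 6;           // implicit int -> float conversion, done by the compiler
float b = (float)(6);  // explicit cast: same value, no extra runtime work
static_assert(6 == 6.0f, "the converted literal is exact");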

Use of Literals, yay/nay in C++

I've recently heard that in some cases, programmers believe that you should never use literals in your code. I understand that in some cases, assigning a variable name to a given number can be helpful (especially in terms of maintenance if that number is used elsewhere). However, consider the following case studies:
Case Study 1: Use of Literals for "special" byte codes.
Say you have an if statement that checks for a specific value stored in (for the sake of argument) a uint16_t. Here are the two code samples:
Version 1:
// Descriptive comment as to why I'm using 0xBEEF goes here
if (my_var == 0xBEEF) {
//do something
}
Version 2:
const uint16_t kSuperDescriptiveVarName = 0xBEEF;
if (my_var == kSuperDescriptiveVarName) {
// do something
}
Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once. Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
Case Study 2: Use of sizeof
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns. Take the two code examples into account. The scenario is that you are computing the offset into a packet buffer (an array of uint8_t) where the first part of the packet is stored as my_packet_header, which let's say is a uint32_t.
Version 1:
const int offset = sizeof(my_packet_header);
Version 2:
const int offset = 4; // good comment telling reader where 4 came from
Clearly, version 1 is preferred, but what about for cases where you have multiple data fields to skip over? What if you have the following instead:
Version 1:
const int offset = sizeof(my_packet_header) + sizeof(data_field1) + sizeof(data_field2) + ... + sizeof(data_fieldn);
Version 2:
const int offset = 47;
Which is preferred in this case? Does it still make sense to show all the steps involved in computing the offset, or does the literal usage make sense here?
Thanks for the help in advance as I attempt to better my code practices.
Which is the "preferred" method in terms of good coding practice? I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
Sounds like you understand the main point... factoring values (and their comments) that are used in multiple places. Further, it can sometimes help to have a group of constants in one place - so their values can be inspected, verified, modified etc. without concern for where they're used in the code. Other times, there are many constants used in proximity and the comments needed to properly explain them would obfuscate the code in which they're used.
Countering that, having a const variable means all the programmers studying the code will be wondering whether it's used anywhere else, keeping it in mind as they inspect the rest of the scope in which it's declared, etc. - the fewer unnecessary things there are to remember, the surer the understanding of the important parts of the code will be.
Like so many things in programming, it's "an art" balancing the pros and cons of each approach, and best guided by experience and knowledge of the way the code's likely to be studied, maintained, and evolved.
Also, does the compiler do any optimizations to make both versions effectively the same executable code? That is, are there any performance implications here?
There are no performance implications in optimised code.
I fully understand that using sizeof versus a raw literal is preferred for portability and also readability concerns.
And other reasons too. A big factor in good programming is reducing the points of maintenance when changes are done. If you can modify the type of a variable and know that all the places using that variable will adjust accordingly, that's great - saves time and potential errors. Using sizeof helps with that.
Which is preferred [for calculating offsets in a struct]? Does it still make sense to show all the steps involved in computing the offset, or does the literal usage make sense here?
The offsetof macro (#include <cstddef>) is better for this... again reducing maintenance burden. With the this + that approach you illustrate, if the compiler decides to use any padding your offset will be wrong, and further you have to fix it every time you add or remove a field.
Ignoring the offsetof issues and just considering your this + that example as an illustration of a more complex value to assign, again it's a balancing act. You'd definitely want some explanation/comment/documentation re intent here (are you working out the binary size of earlier fields? calculating the offset of the next field?, deliberately missing some fields that might not be needed for the intended use or was that accidental?...). Still, a named constant might be enough documentation, so it's likely unimportant which way you lean....
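A minimal sketch of the offsetof approach (this packet layout is hypothetical, just reusing the names from the question):
#include <cstddef>  // offsetof
#include <cstdint>

struct Packet {
    uint32_t my_packet_header;
    uint16_t data_field1;
    uint8_t  data_field2;
    uint8_t  payload[32];
};

// Any padding the compiler inserts is accounted for automatically,
// unlike a hand-written sum of sizeof values.
const std::size_t offset = offsetof(Packet, payload);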
In every example you list, I would go with the name.
In your first example, you almost certainly used that special 0xBEEF number at least twice - once to write it and once to do your comparison. If you didn't write it, that number is still part of a contract with someone else (perhaps a file format definition).
In the last example, it is especially useful to show the computation that yielded the value. That way, if you encounter trouble down the line, you can easily see either that the number is trustworthy, or what you missed and fix it.
There are some cases where I prefer literals over named constants though. These are always cases where a name is no more meaningful than the number. For example, you have a game program that plays a dice game (perhaps Yahtzee), where there are specific rules for specific die rolls. You could define constants for One = 1, Two = 2, etc. But why bother?
Generally it is better to use a name instead of a value. After all, if you need to change it later, you can find it more easily. Also, it is not always clear why a particular number is used when you read the code, so having a meaningful name assigned to it makes this immediately clear to a programmer.
Performance-wise there is no difference, because the optimizers should take care of it. And even if an extra instruction were generated, it is rather unlikely that it would cause you trouble. If your code were that tight, you probably shouldn't rely on an optimizer effect anyway.
I can fully understand why you would prefer version 2 if kSuperDescriptiveVarName is used more than once.
I think kSuperDescriptiveVarName will definitely be used more than once: once for the check and at least once for an assignment, maybe in different parts of your program.
There will be no difference in performance, since an optimization called Constant Propagation exists in almost all compilers. Just enable optimization for your compiler.
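To illustrate what constant propagation does with the question's own example (the function is made up; the constant name comes from the question), the named-constant version compiles to the same comparison against the immediate 0xBEEF as the literal version once optimization is enabled:
#include <cstdint>

const uint16_t kSuperDescriptiveVarName = 0xBEEF;

bool check(uint16_t my_var) {
    // The constant is propagated into the comparison, so this is
    // identical to writing  my_var == 0xBEEF  directly.
    return my_var == kSuperDescriptiveVarName;
}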