First question here. In a few weeks or months I will need to write procedural code in which functions assign big (I mean really big) sets of data directly through pointers. Here is an example of the kind of code I will be writing:
void MyFuntion(string* str)
{
*str = "some data in a string";
}
As it surely is important : I am on windows 10, in visual-studio 2019, compiling with the default c++ compiler on release x86.
Imagine something like this but with strings that can contain several millions of characters, or with int/float arrays also with several millions of elements.
So, this is a single operation assigning an rvalue through a pointer, whose target is therefore on the heap. Of course, if I create a local variable containing the data, it will be more than 1MB and will therefore cause a stack overflow, right?
As I understand, since the data only exists as a rvalue here, it doesn't have a memory existence, but I would like to know : how is the rvalue assigned to the pointer ? Like, how is it done in assembly ? I must say I have never done any assembly, I have a few (very few) notions but I'd like to get into it when I have time.
Is it temporary created in the stack or heap before being put in the final memory address ? My guess is that the memory address (the pointer in which I am assigning the data) is directly filled with the data, like, bit by bit, so no existence of the rvalue in memory.
If I'm correct, the only things that exist in the stack here are : the function call, the pointer copy, then the instruction, which should be something like "assign rvalue X to lvalue Y" and the size of the instruction doesn't depend on the size of the rvalue and lvalue, so there should not be any problem regarding the stack here.
So, if I'm correct, this code should not cause any problem, no matter how big the rvalue is, but I would still like to know how it is done exactly, assembly-wise. Note that I am not only looking for an answer, but also for references, books or docs that explain this in detail. I guess what I am looking for won't be in a C++ book but rather in an assembly book, and this might be a good starting point to get myself into it!
Although a specific OS and compiler were mentioned, the example assembly in this answer will probably differ from what the querent's compiler would output, because I don't have a Windows 10 machine available at the time of writing and used a different environment having forgotten about Godbolt. However, this topic is general enough in my opinion that it shouldn't really matter in this specific case.
What even is a value on the right side of an assignment operator? What does assignment look like at the assembly level? Here's a simple example.
void assign_thing(int *p) {
*p = 42;
}
movl $42, (%rdi)
retq
"Move the 32-bit integer 42 into the memory location to which rdi is pointing." %rdi here represents p, and (%rdi) means *p. For something dead simple like an integer, it's pretty much that simple. How about a simple structure?
struct stuff {
int id;
float value;
char text[8];
};
void assign_thing(stuff *p) {
*p = {42, 1.5, "Hello!"};
}
movabsq $4593671619917905962, %rax
movq %rax, (%rdi)
movabsq $36762444129608, %rax
movq %rax, 8(%rdi)
retq
A little harder to read at first glance, but pretty much the same idea. The compiler was smart and packed the integer and float values 42 and 1.5 into a single 64-bit value and stuffs that directly into (%rdi). Likewise with the string "Hello!", which is short enough to fit into a single 64-bit value and gets stuffed into 8(%rdi) (8 bytes past p is the offset of text).
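The packing described above can be reproduced by hand. This is only a sketch, not part of the original answer: `pack_id_and_value` and `pack_text` are hypothetical helpers, and the results match the immediates in the listing only on a little-endian machine with a 32-bit int and an IEEE-754 float (as on x86-64).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Hypothetical helper: lays out id and value the way struct stuff does
// (id in bytes 0-3, value in bytes 4-7) and reads the 8 bytes back as one integer.
std::uint64_t pack_id_and_value(std::int32_t id, float value) {
    unsigned char bytes[8] = {};
    std::memcpy(bytes, &id, sizeof id);
    std::memcpy(bytes + 4, &value, sizeof value);
    std::uint64_t packed = 0;
    std::memcpy(&packed, bytes, sizeof packed);
    return packed;
}

// Hypothetical helper: does the same for the zero-padded 8-byte text field.
std::uint64_t pack_text(const char* text) {
    const std::size_t length = std::strlen(text);
    assert(length < 8);  // must fit in char text[8] together with its terminator
    unsigned char bytes[8] = {};
    std::memcpy(bytes, text, length);
    std::uint64_t packed = 0;
    std::memcpy(&packed, bytes, sizeof packed);
    return packed;
}
```

Under those assumptions, pack_id_and_value(42, 1.5f) yields 4593671619917905962 and pack_text("Hello!") yields 36762444129608, the two movabsq immediates.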
So far, none of the rvalues actually exist in memory when they get assigned. They're just part of the instructions. What if it's something a lot bigger, like a string?
// Overflow checking omitted for brevity.
void assign_thing(char *p) {
// Assignment with = doesn't actually do what you'd want here,
// so this'll have to do.
strcpy(p, "What if it's something a lot bigger, like a string?");
}
vmovups -5484(%rip), %ymm0
vmovups %ymm0, 20(%rdi) ; 20 is decimal: the two 32-byte copies overlap to cover all 52 bytes (51 characters + NUL)
vmovups -5517(%rip), %ymm0
vmovups %ymm0, (%rdi)
vzeroupper
retq
Now, the rvalue does reside in memory when it gets assigned. Do note that this is not because strcpy was used instead of =, but because the compiler decided that it would be better to store that "rvalue" string somewhere in a read-only area like .rodata and just copy it over. If I had used a much shorter string, any reasonably modern compiler would probably optimize it into a few mov or movabsq instructions like in the second example. Unless p points to a buffer on the stack and your strcpy ends up overflowing it, you won't get a stack overflow here.
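Since the strcpy example leaves out overflow checking, here is one way the check could look. It is a minimal sketch: `assign_text` is a made-up name, and the caller is assumed to know the size of the destination buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies text (including its terminator) into dest
// only if it fits; returns false and leaves dest untouched otherwise.
bool assign_text(char* dest, std::size_t dest_size, const char* text) {
    assert(dest != nullptr && text != nullptr);
    const std::size_t needed = std::strlen(text) + 1;  // +1 for the NUL terminator
    if (needed > dest_size) {
        return false;
    }
    std::memcpy(dest, text, needed);
    return true;
}
```

The literal still lives in a read-only area or in the instructions; the check only guards the destination buffer.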
Now what about your example? I'm guessing that your string type is really std::string, and that's not a trivial type. So what happens there? In C++, the assignment operator = is overloadable, and std::string indeed has its own overloads, so instead of directly stuffing or copying values into the object, a special member function operator= is called. That is to say, your *str = "some data in a string" is really a str->operator=("some data in a string"). How your rvalue string gets copied is up to the implementation of std::string::operator=, but it'll most likely be optimized into something like my last example. The actual string data of an std::string resides on the heap, so stack overflow still isn't a problem here.
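To check the last claim on your own machine, something like the following can help. It is a sketch under assumptions: `fill_big` stands in for the question's function, and `data_is_outside_object` is a hypothetical check that the characters are not stored inside the std::string object itself (for a large string they end up in separately allocated storage).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the question's function: gives *str a large value.
void fill_big(std::string* str, std::size_t count) {
    assert(str != nullptr);
    str->assign(count, 'x');  // the characters go into std::string's own buffer
}

// True if the character data lies outside the std::string object itself,
// i.e. in separately allocated storage rather than inside the object.
bool data_is_outside_object(const std::string& s) {
    const char* object_begin = reinterpret_cast<const char*>(&s);
    const char* object_end = object_begin + sizeof(std::string);
    std::less<const char*> before;  // well-defined ordering for unrelated pointers
    return before(s.data(), object_begin) || !before(s.data(), object_end);
}
```

With a local `std::string s;` and `fill_big(&s, 5000000)`, only the small std::string object sits on the stack; the five million characters do not.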
tl;dr (this answer + the comments, compressed into a few sentences)
If your string is small enough, it probably won't exist in memory during assignment. If it's big enough, it'll sit in a read-only area somewhere and get copied over when needed. The stack is often not even involved, so don't worry about overflow.
Related
From C Programming Language by Brian W. Kernighan
& operator only applies to objects in memory: variables and array
elements. It cannot be applied to expressions, constants or register
variables.
Where are expressions and constants stored if not in memory?
What does that quote mean?
E.g:
&(2 + 3)
Why can't we take its address? Where is it stored?
Will the answer be same for C++ also since C has been its parent?
This linked question explains that such expressions are rvalue objects and all rvalue objects do not have addresses.
My question is where are these expressions stored such that their addresses can't be retrieved?
Consider the following function:
unsigned sum_evens (unsigned number) {
number &= ~1; // ~1 = 0xfffffffe (32-bit CPU)
unsigned result = 0;
while (number) {
result += number;
number -= 2;
}
return result;
}
Now, let's play the compiler game and try to compile this by hand. I'm going to assume you're using x86 because that's what most desktop computers use. (x86 is the instruction set for Intel compatible CPUs.)
Let's go through a simple (unoptimized) version of what this routine could look like when compiled:
sum_evens:
and edi, 0xfffffffe ;edi is where the first argument goes
xor eax, eax ;set register eax to 0
cmp edi, 0 ;compare number to 0
jz .done ;if edi = 0, jump to .done
.loop:
add eax, edi ;eax = eax + edi
sub edi, 2 ;edi = edi - 2
jnz .loop ;if edi != 0, go back to .loop
.done:
ret ;return (value in eax is returned to caller)
Now, as you can see, the constants in the code (0, 2, 1) actually show up as part of the CPU instructions! In fact, 1 doesn't show up at all; the compiler (in this case, just me) already calculates ~1 and uses the result in the code.
While you can take the address of a CPU instruction, it often makes no sense to take the address of a part of it (in x86 you sometimes can, but in many other CPUs you simply cannot do this at all), and code addresses are fundamentally different from data addresses (which is why you cannot treat a function pointer (a code address) as a regular pointer (a data address)). In some CPU architectures, code addresses and data addresses are completely incompatible (although this is not the case of x86 in the way most modern OSes use it).
Do notice that while (number) is equivalent to while (number != 0). That 0 doesn't show up in the compiled code at all! It's implied by the jnz instruction (jump if not zero). This is another reason why you cannot take the address of that 0 — it doesn't have one, it's literally nowhere.
I hope this makes it clearer for you.
where are these expressions stored such that their addresses can't be retrieved?
Your question is not well-formed.
Conceptually
It's like asking why people can discuss ownership of nouns but not verbs. Nouns refer to things that may (potentially) be owned, and verbs refer to actions that are performed. You can't own an action or perform a thing.
In terms of language specification
Expressions are not stored in the first place, they are evaluated.
They may be evaluated by the compiler, at compile time, or they may be evaluated by the processor, at run time.
In terms of language implementation
Consider the statement
int a = 0;
This does two things: first, it declares an integer variable a. This is defined to be something whose address you can take. It's up to the compiler to do whatever makes sense on a given platform, to allow you to take the address of a.
Secondly, it sets that variable's value to zero. This does not mean an integer with value zero exists somewhere in your compiled program. It might commonly be implemented as
xor eax,eax
which is to say, XOR (exclusive-or) the eax register with itself. This always results in zero, whatever was there before. However, there is no fixed object of value 0 in the compiled code to match the integer literal 0 you wrote in the source.
As an aside, when I say that a above is something whose address you can take - it's worth pointing out that it may not really have an address unless you take it. For example, the eax register used in that example doesn't have an address. If the compiler can prove the program is still correct, a can live its whole life in that register and never exist in main memory. Conversely, if you use the expression &a somewhere, the compiler will take care to create some addressable space to store a's value in.
Note for comparison that I can easily choose a different language where I can take the address of an expression.
It'll probably be interpreted, because compilation usually discards these structures once the machine-executable output replaces them. For example Python has runtime introspection and code objects.
Or I can start from LISP and extend it to provide some kind of addressof operation on S-expressions.
The key thing they both have in common is that they are not C, which as a matter of design and definition does not provide those mechanisms.
Such expressions end up part of the machine code. An expression 2 + 3 likely gets translated to the machine code instruction "load 5 into register A". CPU registers don't have addresses.
It does not really make sense to take the address of an expression. The closest thing you can do is a function pointer. Expressions are not stored in the same sense as variables and objects.
Expressions are stored in the actual machine code. Of course you could find the address where the expression is evaluated, but it just doesn't make sense to do it.
Read a bit about assembly. Expressions are stored in the text segment, while variables are stored in other segments, such as data or stack.
https://en.wikipedia.org/wiki/Data_segment
Another way to explain it is that expressions are cpu instructions, while variables are pure data.
One more thing to consider: The compiler often optimizes away things. Consider this code:
int x=0;
while(x<10)
x+=1;
This code will probably be optimized to:
int x=10;
So what would the address of (x+=1) mean in this case? It is not even present in the machine code, so it has - by definition - no address at all.
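You can make the compiler do this evaluation explicitly. This is only an illustration: an optimizer is not required to fold the loop, whereas a constexpr function (C++14 or later) used in a static_assert must be evaluated at compile time. The name `count_up` is made up for the example.

```cpp
#include <cassert>

// The loop from the snippet above, wrapped in a constexpr function.
constexpr int count_up() {
    int x = 0;
    while (x < 10)
        x += 1;
    return x;
}

// Evaluated by the compiler: no loop has to run at run time for this check.
static_assert(count_up() == 10, "the loop always ends with x == 10");
```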
Where are expressions and constants stored if not in memory
In some (actually many) cases, a constant expression is not stored at all. In particular, think about optimizing compilers, and see CppCon 2017: Matt Godbolt's talk “What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid”
In your particular case of some C code having 2 + 3, most optimizing compilers would have constant folded that into 5, and that 5 constant might be just inside some machine code instruction (as some bitfield) of your code segment and not even have a well defined memory location. If that constant 5 was a loop limit, some compilers could have done loop unrolling, and that constant won't appear anymore in the binary code.
See also this answer, etc...
Be aware that C11 is a specification written in English. Read its n1570 standard. Read also the much bigger specification of C++11 (or later).
Taking the address of a constant is forbidden by the semantics of C (and of C++).
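As for the C++ part of the question, a small sketch (C++ only, not valid C) shows both sides: &(2 + 3) is rejected, but binding the expression to a const reference makes the compiler create a temporary object, and that object does have an address. The function name is hypothetical.

```cpp
#include <cassert>

// 2 + 3 is a constant expression, so the compiler can evaluate it itself.
static_assert(2 + 3 == 5, "evaluated at compile time");

int five_via_reference() {
    // const int* bad = &(2 + 3);  // does not compile: 2 + 3 is not an lvalue
    const int& ref = 2 + 3;        // a temporary is created and its lifetime extended
    const int* address = &ref;     // the temporary is an object, so this is allowed
    return *address;
}
```

The address belongs to the temporary the compiler created for the reference, not to the expression itself.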
This question has been bothering me for a while.
If I do int* a = new int[n], for example, I only have a pointer that points to the beginning of array a, but how does C/C++ know about n? I know if I want to pass this array to another function, then I have to pass the length of the array with it, so I guess C/C++ does not really know how long this array is.
I know we can infer the end of a character array char* by looking for the NUL terminator. But is there a similar mechanism for other arrays, like int? Meanwhile, char can be more than a character -- you can also treat it as an integer type. Then how does C++ know where this array ends then?
This question starts to bother me even more when I am developing embedded Python (If you are not familiar with embedded python, you may ignore this paragraph and just answer the above questions. I will still appreciate it). In Python there is a "ByteArray", and the only way to convert this "ByteArray" to C/C++ is to use PyString_AsString() to convert it to char*. But if this ByteArray has 0 in it, then C/C++ would think that char* array stops early. This is not the worst part. The worst part is, say I do a
char* arr = PyString_AsString(something)
void* pt = calloc(1, 1000);
If st happens to start with 0, then C/C++ is almost guaranteed to wipe out everything in arr, since it thinks arr ends right after a NUL appears. Then it might just wipe out everything in arr by allocating a chunk of memory to pt.
Thank you very much for your time! I really appreciate it.
C/C++ doesn't; it's the allocator (the little piece of code that implements malloc(), free(), etc.) that knows how long it is. C/C++ is welcome to wee all over itself, free of the constraints of having to worry about the length.
Also, PyString_AsStringAndSize().
Let's hit the disassembler! This is going to be different for C and C++. How free works in C is covered in another question, and here's how it works in C++:
struct T {
~T();
int data;
};
void test(T* p)
{
delete[] p;
}
And let's run the compiler to produce assembly. Here are the relevant bits, compiled for i386:
movl -4(%edi), %eax
leal (%edi,%eax,4), %esi
cmpl %esi, %edi
je L4
.align 4,0x90
L8:
subl $4, %esi
movl %esi, (%esp)
call L__ZN1TD1Ev$stub
cmpl %esi, %edi
jne L8
You can see the important part: There is an integer stored before the start of p containing the length of p, and the code then loops over the p array, calling the destructor for each item in the array. It then calls delete, which is usually fairly boring because it just calls free (the C function). So you can see how C++ delete is expressed in terms of free.
Destructors and Exceptions: Based on the above assembly, you can notice that if the destructor for T threw an exception, then part of the p array would get the destructor called and the rest of the array would not. Destructors should never throw exceptions.
Caveat: This is only one possible way that your compiler and runtime can solve this problem. (Here, the destructor is called by compiler-generated code and delete is part of the runtime.) There is quite a bit of leeway in how these are implemented, and yours could be different. This also shows why you should always call the correct operator, delete[] or delete -- calling the wrong one will cause all sorts of trouble, such as stomping on memory and freeing invalid pointers.
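You can observe the effect without reading assembly. The sketch below uses a hypothetical `Counted` type that counts destructor calls; it shows that delete[] destroys every element even though only a bare pointer is passed, but it says nothing about where a given implementation stores the count.

```cpp
#include <cassert>

// Hypothetical type that counts how many times its destructor runs.
struct Counted {
    static int destroyed;
    int data = 0;
    ~Counted() { ++destroyed; }
};
int Counted::destroyed = 0;

// Allocates count elements, releases them with delete[], and returns
// how many destructors ran.
int destroy_array(int count) {
    assert(count >= 0);
    Counted::destroyed = 0;
    Counted* p = new Counted[count];
    delete[] p;  // only the pointer is passed; the runtime recovers the count
    return Counted::destroyed;
}
```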
About NUL terminators: The only reason NUL terminators are a problem is because PyString_AsString and other similar functions call strlen to figure out how long the string is. However, free doesn't care about NUL terminators, instead, it keeps track of the length from the original malloc call separately. For PyString_AsString (and strdup, etc.) this is not an option because there is no portable way to get the size of a region of memory -- malloc and free do not expose this functionality. Besides, you can pass a pointer to PyString_AsString which is in the middle of a malloc block or somewhere else entirely.
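The difference between "find the terminator" and "carry the length" is easy to see in a small sketch. The helper names are made up; the point is that data with an embedded zero survives only if the length travels with the pointer, which is what PyString_AsStringAndSize gives you.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as a C string function sees it: stops at the first zero byte.
std::size_t length_by_terminator(const char* data) {
    return std::strlen(data);
}

// Copy that keeps every byte because the length is passed explicitly.
std::string copy_with_length(const char* data, std::size_t length) {
    return std::string(data, length);
}
```

For the five bytes 'a', 'b', 0, 'c', 'd', the first function reports 2 while the second keeps all 5.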
See also: How does free know how much to free?
C/C++ doesn't know the length of any array, so it is easy to access an array out of bounds. It doesn't know the length of a char array either.
A char* can point to a string, but it is not the same thing as a string. Terminating a string with NUL is a convention of C/C++.
This is my Vector struct: struct Vector{ float x, y; };
Should I pass it to functions by value or as const Vector&?
If you pass by value, the function will get a copy that it can modify locally without affecting the caller.
If you pass by const-reference, the function will only get a read-only reference. No copying involved, but the called function cannot modify it locally.
Given the size of the struct, the copy overhead will be very small. So choose what is best/easiest for the called function.
For such a small structure, either one may be the most efficient, depending on your platform and compiler.
(The name may be a bit confusing to other programmers, since vector in C++ usually means "dynamic array". How about Vector2D?)
It really depends on your platform and compiler, and whether the function is inline or not.
When passed by reference, the structure is not copied; only its address is passed (on the stack or in a register), not its contents. When passed by value, the contents are copied. On a 64-bit platform this struct is the same size as a pointer to it (assuming 64-bit pointers, which seems to be the more common situation), so the gain from passing by reference is not really clear here.
However, there is another thing to consider. Your structure contains float values. On Intel architectures, they can be held in FPU or SIMD registers before the call to the function. In that situation, if the function takes the parameter by reference, they have to be spilled to memory and the address of that memory passed to the function. This can be really slow. If they had been passed by value, no copy to memory would be needed (faster). And on some platforms (PS3), the compiler is not smart enough to remove the spill, even for an inline function.
In fact, as with every question of micro-optimisation, there is no single "good answer"; it all depends on how you use the function and what your compiler and platform want. The best approach is to measure (or use a tool to analyse the assembly) to check what is best for your platform/compiler combination.
I'm going to finish by quoting Jaymin Kessler from Q-Games, who is much more versed in these topics than I can ever be:
2) If a type fits in a register, pass it by value. DO NOT PASS VECTOR TYPES BY REFERENCE, ESPECIALLY CONST REFERENCE. If the function ends up getting inlined, GCC occasionally will go to memory when it hits the reference. I’ll say it again: If the type you are using fits in registers (float, int, or vector) do not pass it to the function by anything but value. In the case of non-sane compilers like Visual Studio for x86, it can’t maintain the alignment of objects on the stack, and therefore objects that have align directives must be passed to functions by reference. This may be fixed or the Xbox 360. If you are multiplatform, the best thing you can do is make a parameter passing typedef to avoid having to cater to the lowest common denominator.
Considering the following code:
struct Vector { float x, y; };
extern Vector DoSomething1(Vector v);
extern Vector DoSomething2(const Vector& v);
void Test1()
{
Vector v0 = { 1., 2. };
Vector v1 = DoSomething1(v0);
}
void Test2()
{
Vector v0 = { 1., 2. };
Vector v1 = DoSomething2(v0);
}
From the point of view of the code, the only difference between Test1 and Test2 is the calling convention used by DoSomething1 and DoSomething2 to receive the Vector struct. When compiled with g++ (version 4.2, architecture x86_64), the generated code is:
.globl __Z5Test1v
__Z5Test1v:
LFB2:
movabsq $4611686019492741120, %rax
movd %rax, %xmm0
jmp __Z12DoSomething16Vector
LFE2:
.globl __Z5Test2v
__Z5Test2v:
LFB3:
subq $24, %rsp
LCFI0:
movl $0x3f800000, (%rsp)
movl $0x40000000, 4(%rsp)
movq %rsp, %rdi
call __Z12DoSomething2RK6Vector
addq $24, %rsp
ret
LFE3:
We can see that in the case of Test1, the values are passed via the %xmm0 SIMD register once loaded from memory (so if they were the result of a previous computation, they would already be in the register and there would be no need to load them from memory). On the other hand, in the case of Test2, the values are passed on the stack (movl $0x3f800000, (%rsp) stores 1.0f on the stack). If they were the result of a previous computation, that would require copying them from the %xmm0 SIMD register, which can be really slow (it may well stall the pipeline until the value is available, and if the stack is not properly aligned, the copy will also be slow).
So if your function is not inline, prefer to pass by copy instead of const-reference. If the function is indeed inline, look at the generated code before making up your mind.
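If you want to compare the two conventions yourself, a pair of otherwise identical functions is a convenient starting point. This is a sketch with hypothetical function names; it shows only that the results agree, and the interesting part is the assembly your own compiler produces for each.

```cpp
#include <cassert>

struct Vector { float x, y; };

// By value: the two floats can travel in registers on many ABIs.
float dot_by_value(Vector a, Vector b) {
    return a.x * b.x + a.y * b.y;
}

// By const reference: the caller passes addresses, so the values
// must exist in memory unless the call is inlined.
float dot_by_ref(const Vector& a, const Vector& b) {
    return a.x * b.x + a.y * b.y;
}
```

Compiling both with your usual flags and comparing the output (for example with g++ -O3 -S) tells you more than any general rule.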
As a reference. That is more efficient than making a copy of the structure and passing that (i.e. passing by value).
The only exception is if your platform can fit the entire structure into a register.
This is an efficiency question about 64 bit ints. Assuming I don't need to modify the value of an "int" parameter, should I pass it by value or by reference?
Assuming 32 bit machine:
1) 32 bit int: I guess the answer is "pass by value" as "pass by reference" will have overhead of extra memory lookup.
2) 64 bit int: If I pass by reference, I only pass 32 bit address on the stack, but need an extra memory lookup. So which one of them is better (reference or value)?
What if the machine is 64 bit?
regards,
JP
Pass by value - definitely. If the system is 64-bit, it copies a 64-bit word extremely fast.
Even on a 64 bit machine pass by value is better (with some very few exceptions), because it can be passed as a register value.
Pass them as a boost::call_traits<int64_t>::param_type. This template captures the best practices for passing any type on the supported platforms. Hence, it will be different on 32 and 64 bits platforms, but you can use the same code everywhere. It even works inside other templates where you don't know the precise type yet.
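To show the idea without depending on Boost, here is a rough sketch of such a trait. It is NOT Boost's implementation: `param`, its size threshold of two pointers, and the two example functions are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical trait, not Boost's: by value for small, trivially copyable
// types, const reference for everything else.
template <typename T>
struct param {
    using type = typename std::conditional<
        std::is_trivially_copyable<T>::value && sizeof(T) <= 2 * sizeof(void*),
        T,
        const T&>::type;
};

static_assert(std::is_same<param<std::int64_t>::type, std::int64_t>::value,
              "a 64-bit integer is passed by value");
static_assert(std::is_same<param<std::string>::type, const std::string&>::value,
              "std::string is passed by const reference");

// Callers write the same signature on every platform.
std::int64_t twice(param<std::int64_t>::type value) {
    return value * 2;
}

std::size_t length_of(param<std::string>::type text) {
    return text.size();
}
```

The real boost::call_traits makes its choice differently; the sketch only shows how one alias can hide the by-value/by-reference decision.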
Use a little common sense,
if the object requires a complex copy constructor, it's probably worth passing by reference (saying that - quite a lot of boost's objects are designed to be passed-by-value rather than reference simply because internal implementation is quite trivial) There is one odd one which I haven't really worked out, std::string, I always pass this by reference...
If you intend to modify the value that is passed in, use a reference
Else, PASS-BY-VALUE!
Do you have a particular performance bottleneck with arguments to functions? Else, don't spend too much time worrying about which is the best way to pass...
Optimizing by worrying about how an int is passed in is like pi**ing in the sea...
For argument's sake, let's ignore the trivial case of optimisers removing differences. Let's also say you're using Microsoft's Intel 64-bit calling conventions (which do differ from the Linux ABI), then you've got 4 64-bit registers for passing such values before you have to resort to pushing them on the stack. That's clearly better.
For a 32-bit app, by value and they'd go straight onto the stack. By-reference may instead put a pointer in a register (again, a few such register uses are allowed before resorting to the stack). We can see this in some output from g++ -O3 -S, calling f1(99) by value and f2(101) by const reference:
void f1(int64_t);
void f2(const int64_t&);
int main()
{
f1(99);
f2(101);
}
...
pushl $0
pushl $99
call _Z2f1x // by value - pushed two halves to stack
leal -8(%ebp), %eax
movl %eax, (%esp)
movl $101, -8(%ebp)
movl $0, -4(%ebp)
call _Z2f2RKx // by const& - ugly isn't it!?!
The called function would then have to retrieve before first usage (if any). The called function's free to cache the values read in registers, so that's only needed once. With the stack approach, the value can be reread at will, so the register need not be reserved for that value. With the pointer approach, either the pointer or 64-bit value may need to be saved somewhere more predictable (e.g. pushed, or another less useful register) should that register need to be freed up momentarily for some other work, but the 64-bit int parameter be needed again later. All up, it's hard to guess which is faster - may be CPU/register-usage/optimiser/etc dependent, and it's not worth trying.
A nod to pst's advice...
"efficiency" :( KISS. pass it how you pass every other bloody integer. - pst
...though, sometimes you apply KISS to template parameters and make them all const T& even though some may fit in registers....
Suppose we've got two integer and character variables:
int adad=12345;
char character;
Assuming we're discussing a platform on which an integer variable is at least three bytes long, I want to access the third byte of this integer and put it in the character variable. I'd write it like this:
character=*((char *)(&adad)+2);
I'm not a compiler or assembly expert, but I know a little about addressing modes in assembly. For that line of code, I'm wondering whether the address of the third byte (or, I guess it's better to say, its offset) would be encoded within the instructions generated for that line, or whether it would be held in a separate variable whose address (or offset) is within those instructions?
The best thing to do in situations like this is to try it. Here's an example program:
int main(int argc, char **argv)
{
int adad=12345;
volatile char character;
character=*((char *)(&adad)+2);
return 0;
}
I added the volatile to avoid the assignment line being completely optimized away. Now, here's what the compiler came up with (for -Oz on my Mac):
_main:
pushq %rbp
movq %rsp,%rbp
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
xorl %eax,%eax
leave
ret
The only three lines that we care about are:
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
The movl is the initialization of adad. Then, as you can see, it reads out the 3rd byte of adad, and stores it back into memory (the volatile is forcing that store back).
I guess a good question is why does it matter to you what assembly gets generated? For example, just by changing my optimization flag to -O0, the assembly output for the interesting part of the code is:
movl $0x00003039,0xf8(%rbp)
leaq 0xf8(%rbp),%rax
addq $0x02,%rax
movzbl (%rax),%eax
movb %al,0xff(%rbp)
Which is pretty straightforwardly seen as the exact logical operations of your code:
Initialize adad
Take the address of adad
Add 2 to that address
Load one byte by dereferencing the new address
Store one byte into character
Various optimizations will change the output... if you really need some specific behaviour/addressing mode for some reason, you might have to write the assembly yourself.
Without knowing anything about the compiler and the underlying CPU architecture no definitive answer can be given. For example, not all CPU architectures allow the addressing of every arbitrary byte in memory (though I believe all the currently popular ones do): on a CPU that's word-addressed, instead of byte-addressed, what the compiler will generate is inevitably going to be the loading into some register of the whole word adad (presumably by an offset from a base pointer register, if the variable in question is on stack [1]), followed by shifting and masking to isolate the byte of interest.
[1] note that, without knowing what CPU architecture we're talking about and how the compiler uses it, we can't even say whether "load a word at a fixed offset from a base register" is something that's done inline within the instruction (as one might hope, and many popular architectures definitely do support;-) or needs separate address arithmetic in an auxiliary register.
IOW, whether it's a good idea or not, it's definitely possible to define a CPU architecture which cannot load / store registers except from other registers or memory addresses defined by other registers or constant, and some such architectures exist (though they may not be all that popular at this time;-).
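The two approaches mentioned in this thread, reading the byte at a memory offset and isolating it by shifting and masking, can be put side by side. This is a sketch with hypothetical function names; note that they answer slightly different questions, because the byte found at offset 2 in memory depends on the platform's byte order.

```cpp
#include <cassert>
#include <cstring>

// Reads the byte at offset 2 in memory, like *((char *)(&adad) + 2).
// Which byte of the value that is depends on the platform's byte order.
unsigned char third_byte_in_memory(unsigned int value) {
    static_assert(sizeof(unsigned int) >= 3, "needs at least three bytes");
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[2];
}

// Shift-and-mask version: always bits 16-23 of the value,
// regardless of byte order or of how the CPU addresses memory.
unsigned char third_byte_of_value(unsigned int value) {
    return static_cast<unsigned char>((value >> 16) & 0xFFu);
}
```

On a little-endian machine with a 4-byte int (x86, for instance) both return the same byte; for 12345 (0x3039) that byte is zero.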