How does C/C++ know how long a dynamically allocated array is?

This question has been bothering me for a while.
If I do int* a = new int[n], for example, I only have a pointer that points to the beginning of array a, but how does C/C++ know about n? I know if I want to pass this array to another function, then I have to pass the length of the array with it, so I guess C/C++ does not really know how long this array is.
I know we can infer the end of a character array char* by looking for the NUL terminator. But is there a similar mechanism for other arrays, like int? Meanwhile, char can be more than a character -- you can also treat it as an integer type. How does C++ know where this array ends then?
This question starts to bother me even more when I am developing embedded Python (If you are not familiar with embedded python, you may ignore this paragraph and just answer the above questions. I will still appreciate it). In Python there is a "ByteArray", and the only way to convert this "ByteArray" to C/C++ is to use PyString_AsString() to convert it to char*. But if this ByteArray has 0 in it, then C/C++ would think that char* array stops early. This is not the worst part. The worst part is, say I do a
char* arr = PyString_AsString(something)
void* pt = calloc(1, 1000);
If the data happens to start with 0, then C/C++ is almost guaranteed to wipe out everything in arr, since it thinks arr ends right after a NUL appears. It might then wipe out everything in arr by allocating a chunk of memory to pt.
Thank you very much for your time! I really appreciate it.

C/C++ doesn't; it's the allocator (the little piece of code that implements malloc(), free(), etc.) that knows how long it is. C/C++ is welcome to wee all over itself, free of the constraints of having to worry about the length.
Also, PyString_AsStringAndSize().
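A minimal sketch of why an explicit length matters, assuming a buffer that contains an embedded zero byte. The helper names copy_with_length and length_seen_by_strlen are hypothetical, and no Python API is used; the point is only that a pointer plus a length survives data that strlen would cut short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies exactly `len` bytes, embedded zero bytes included.
std::string copy_with_length(const char* data, std::size_t len) {
    assert(data != nullptr);
    return std::string(data, len);
}

// What any strlen-based consumer believes the length to be.
// The buffer must contain at least one zero byte.
std::size_t length_seen_by_strlen(const char* data) {
    assert(data != nullptr);
    return std::strlen(data);
}
```

For the five bytes 'a', 0, 'b', 'c', 0, copy_with_length(buf, 4) keeps all four payload bytes, while length_seen_by_strlen(buf) stops at 1. That is the same reason PyString_AsStringAndSize() hands back the length alongside the pointer.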

Let's hit the disassembler! This is going to be different for C and C++. How free works in C is covered in another question, and here's how it works in C++:
struct T {
~T();
int data;
};
void test(T* p)
{
delete[] p;
}
And let's run the compiler to produce assembly. Here are the relevant bits, compiled for i386:
movl -4(%edi), %eax
leal (%edi,%eax,4), %esi
cmpl %esi, %edi
je L4
.align 4,0x90
L8:
subl $4, %esi
movl %esi, (%esp)
call L__ZN1TD1Ev$stub
cmpl %esi, %edi
jne L8
You can see the important part: There is an integer stored before the start of p containing the length of p, and the code then loops over the p array, calling the destructor for each item in the array. It then calls delete, which is usually fairly boring because it just calls free (the C function). So you can see how C++ delete is expressed in terms of free.
Destructors and Exceptions: Based on the above assembly, you can notice that if the destructor for T threw an exception, then part of the p array would get the destructor called and the rest of the array would not. Destructors should never throw exceptions.
Caveat: This is only one possible way that your compiler and runtime can solve this problem. (Here, the destructor is called by compiler-generated code and delete is part of the runtime.) There is quite a bit of leeway in how these are implemented, and yours could be different. This also shows why you should always call the correct operator, delete[] or delete -- calling the wrong one will cause all sorts of trouble, such as stomping on memory and freeing invalid pointers.
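The bookkeeping can also be observed without a disassembler. A minimal sketch, assuming a hypothetical type Counted whose destructor bumps a counter; where the element count is actually stored is left to the implementation, as the caveat says.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical type: its destructor records that it ran.
struct Counted {
    static int destroyed;
    int data = 0;
    ~Counted() { ++destroyed; }
};
int Counted::destroyed = 0;

// Allocates n objects, releases them with delete[], and returns how many
// destructor calls were observed. The caller never passes n to delete[].
int destructors_run_by_array_delete(std::size_t n) {
    Counted::destroyed = 0;
    Counted* p = new Counted[n];
    delete[] p;  // the runtime recovers the element count on its own
    return Counted::destroyed;
}
```

Calling destructors_run_by_array_delete(5) reports 5 destructor calls even though only the pointer is handed to delete[].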
About NUL terminators: NUL terminators are only a problem because functions that receive a bare char*, such as PyString_FromString and strdup, call strlen to figure out how long the string is. (PyString_AsString goes the other way and just hands back a pointer to the object's buffer, which is why PyString_AsStringAndSize exists.) free, however, doesn't care about NUL terminators; it keeps track of the length from the original malloc call separately. For PyString_FromString (and strdup, etc.) this is not an option because there is no portable way to get the size of a region of memory -- malloc and free do not expose this functionality. Besides, you can pass PyString_FromString a pointer into the middle of a malloc block or to somewhere else entirely.
See also: How does free know how much to free?

C/C++ doesn't know the length of any array, so you can easily access an array out of bounds. It doesn't know the length of a char array either.
A char* can point to a string, but it is not the same thing as a string. Terminating strings with NUL is a convention of C/C++.

Related

How are rvalues assigned to lvalues in assembly?

First question here. I will in a few weeks/months need to create procedural code in which there will be functions assigning big (I mean really big) sets of data directly to pointers. Here is some example of code I will be doing :
void MyFuntion(string* str)
{
*str = "some data in a string";
}
As it surely is important : I am on windows 10, in visual-studio 2019, compiling with the default c++ compiler on release x86.
Imagine something like this but with strings that can contain several millions of characters, or with int/float arrays also with several millions of elements.
So, this is a single operation assigning an rvalue to a pointer, which is therefore on the heap. Of course, if I create a local variable containing the data, it will be more than 1MB and therefore will cause a stack overflow, right?
As I understand, since the data only exists as a rvalue here, it doesn't have a memory existence, but I would like to know : how is the rvalue assigned to the pointer ? Like, how is it done in assembly ? I must say I have never done any assembly, I have a few (very few) notions but I'd like to get into it when I have time.
Is it temporary created in the stack or heap before being put in the final memory address ? My guess is that the memory address (the pointer in which I am assigning the data) is directly filled with the data, like, bit by bit, so no existence of the rvalue in memory.
If I'm correct, the only things that exist in the stack here are : the function call, the pointer copy, then the instruction, which should be something like "assign rvalue X to lvalue Y" and the size of the instruction doesn't depend on the size of the rvalue and lvalue, so there should not be any problem regarding the stack here.
So, if I'm correct, this code should not cause any problem, no matter how big the rvalue is, but I would still like to know how it is done exactly, assembly-wise. Note that I am not only looking for an answer, but also for some references, books or docs that could explain it in detail. I guess what I am looking for won't be in a C++ book, but rather in an assembly book; this might be a good starting point to get myself into it!
Although a specific OS and compiler were mentioned, the example assembly in this answer will probably differ from what the querent's compiler would output, because I don't have a Windows 10 machine available at the time of writing and used a different environment having forgotten about Godbolt. However, this topic is general enough in my opinion that it shouldn't really matter in this specific case.
What even is a value on the right side of an assignment operator? What does assignment look like at the assembly level? Here's a simple example.
void assign_thing(int *p) {
*p = 42;
}
movl $42, (%rdi)
retq
"Move the 32-bit integer 42 into the memory location to which rdi is pointing." %rdi here represents p, and (%rdi) means *p. For something dead simple like an integer, it's pretty much that simple. How about a simple structure?
struct stuff {
int id;
float value;
char text[8];
};
void assign_thing(stuff *p) {
*p = {42, 1.5, "Hello!"};
}
movabsq $4593671619917905962, %rax
movq %rax, (%rdi)
movabsq $36762444129608, %rax
movq %rax, 8(%rdi)
retq
A little harder to read at first glance, but pretty much the same idea. The compiler was smart and packed the integer and float values 42 and 1.5 into a single 64-bit value and stuffs that directly into (%rdi). Likewise with the string "Hello!", which is short enough to fit into a single 64-bit value and gets stuffed into 8(%rdi) (8 bytes past p is the offset of text).
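The two constants can be checked by hand. A minimal sketch, assuming a little-endian target with 32-bit int and IEEE-754 float (as on x86-64); eight_bytes_at is a hypothetical helper that reads 8 raw bytes of the struct as one integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct stuff {
    int id;
    float value;
    char text[8];
};

// Hypothetical helper: reads 8 raw bytes of `s`, starting at `offset`,
// as one 64-bit integer in the machine's native byte order.
std::uint64_t eight_bytes_at(const stuff& s, std::size_t offset) {
    assert(offset + sizeof(std::uint64_t) <= sizeof(stuff));
    std::uint64_t bits = 0;
    std::memcpy(&bits, reinterpret_cast<const unsigned char*>(&s) + offset, sizeof bits);
    return bits;
}
```

Under those assumptions, a stuff initialized with {42, 1.5f, "Hello!"} yields 4593671619917905962 (0x3FC000000000002A: the float bits 0x3FC00000 above the integer 0x2A) at offset 0 and 36762444129608 (the bytes of "Hello!" followed by two zeros) at offset 8, which are the immediates in the listing.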
So far, none of the rvalues actually exist in memory when they get assigned. They're just part of the instructions. What if it's something a lot bigger, like a string?
// Overflow checking omitted for brevity.
void assign_thing(char *p) {
// Assignment with = doesn't actually do what you'd want here,
// so this'll have to do.
strcpy(p, "What if it's something a lot bigger, like a string?");
}
vmovups -5484(%rip), %ymm0
vmovups %ymm0, 20(%rdi) ; 20 is decimal: the string is 52 bytes with its NUL, so the two 32-byte stores overlap
vmovups -5517(%rip), %ymm0
vmovups %ymm0, (%rdi)
vzeroupper
retq
Now, the rvalue does reside in memory when it gets assigned. Do note that this is not because strcpy was used instead of =, but because the compiler decided that it would be better to store that "rvalue" string somewhere in a read-only area like .rodata and just copy it over. If I had used a much shorter string, any reasonably modern compiler would probably optimize it into a few mov or movabsq instructions like in the second example. Unless p points to a buffer on the stack and your strcpy ends up overflowing it, you won't get a stack overflow here.
Now what about your example? I'm guessing that your string type is really std::string, and that's not a trivial type. So what happens there? In C++, the assignment operator = is overloadable, and std::string indeed has its own overloads, so instead of directly stuffing or copying values into the object, a special member function operator= is called. That is to say, your *str = "some data in a string" is really a str->operator=("some data in a string"). How your rvalue string gets copied is up to the implementation of std::string::operator=, but it'll most likely be optimized into something like my last example. The actual string data of an std::string resides on the heap, so stack overflow still isn't a problem here.
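A minimal sketch of the equivalence described above, assuming the string type really is std::string. The function names are hypothetical; fill_large stands in for the "several million characters" case from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same shape as the function in the question.
void assign_with_equals(std::string* str) {
    assert(str != nullptr);
    *str = "some data in a string";
}

// The same assignment spelled as an explicit member-function call.
void assign_with_operator_call(std::string* str) {
    assert(str != nullptr);
    str->operator=("some data in a string");
}

// Hypothetical stress case: `count` characters assigned through a pointer.
// The characters live in storage owned by the string, not in this stack frame.
void fill_large(std::string* out, std::size_t count) {
    assert(out != nullptr);
    out->assign(count, 'x');
}
```

Both assignment functions leave the string with identical contents, and fill_large(&s, 5000000) only needs a few words of stack for its arguments regardless of count.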
tl;dr (this answer + the comments, compressed into a few sentences)
If your string is small enough, it probably won't exist in memory during assignment. If it's big enough, it'll sit in a read-only area somewhere and get copied over when needed. The stack is often not even involved, so don't worry about overflow.

Why does my disassembled C++ code use the instruction pointer and an offset to get string literals?

I have a C++ program of mine that I've disassembled, and it seems like the assembly is using the instruction pointer to get at string literals. For example:
leaq 0x15468(%rip), %rsi ## literal pool for: "special"
and
leaq 0x15457(%rip), %rsi ## literal pool for: "ordinary"
Why does the compiler use the instruction pointer to get at string literals? This seems like it would result in a substantial headache for any human programmer, although it's probably not as hard for the compiler.
My question, though, is why? Is there some machine based or historical reason or did the compiler writers just decide to use %rip arbitrarily?
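Remember that string literals in C++ are constant and non-modifiable. One way to ensure that is to place them together with the code in the code segment, which is loaded into memory pages marked as read-only. The literal then sits at a fixed distance from the instruction that uses it, so x86-64's RIP-relative addressing can reach it without knowing the absolute address the program was loaded at, which is what position-independent code needs.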

Is copying in a loop less efficient than memcpy()?

I started to study IT and I am discussing with a friend right now whether this code is inefficient or not.
// const char *pName
// char *m_pName = nullptr;
for (int i = 0; i < strlen(pName); i++)
m_pName[i] = pName[i];
He is claiming that for example memcopy would do the same like the for loop above. I wonder if that's true, I don't believe.
If there are more efficient ways or if this is inefficient, please tell me why!
Thanks in advance!
I took a look at actual g++ -O3 output for your code, to see just how bad it was.
char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.
I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.
#include <string.h>
class foo {
char *__restrict__ m_pName = nullptr;
void set_name(const char *__restrict__ pName);
void alloc_name(size_t sz) { m_pName = new char[sz]; }
};
// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
// char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
//char *__restrict__ dst = m_pName; // a local avoids the reload of m_pName, but still can't hoist strlen
#define dst m_pName
for (unsigned int i = 0; i < strlen(pName); i++)
dst[i] = pName[i];
}
Compiles to this asm (g++ -O3 for x86-64, SysV ABI):
...
.L7:
movzx edx, BYTE PTR [rbp+0+rbx] ; byte load from src. clang uses mov al, byte ..., instead of movzx. The difference is debatable.
mov rax, QWORD PTR [r12] ; reload this->m_pName
mov BYTE PTR [rax+rbx], dl ; byte store
add rbx, 1
.L3: ; first iteration entry point
mov rdi, rbp ; function arg for strlen
call strlen
cmp rbx, rax
jb .L7 ; compare-and-branch (unsigned)
Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.
So obviously this is horrible:
a strlen call for every byte copied
copying one byte at a time
reloading m_pName every time through the loop (can be avoided by loading it into a local).
Using strcpy avoids all these problems, because strcpy is allowed to assume that its src and dst don't overlap. Don't use strlen + memcpy unless you want to know strlen yourself. If the most efficient implementation of strcpy is to strlen + memcpy, the library function will internally do that. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is an SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great on microbenchmarks, but pollutes the instruction cache, uop-cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64-bit, 32-bit, and smaller chunks in the startup/cleanup sections.
Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is often safer, since you usually know the size of the dest buffer but not the size of the src; just remember that strncpy does not write a terminating '\0' when the source fills the whole buffer.
memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.
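A minimal sketch of the "length already known" case, assuming the caller has the length from somewhere else (a size field, an earlier strlen, and so on). duplicate_known_length is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: duplicates `len` characters from `src` and
// NUL-terminates the copy. The caller releases the result with delete[].
char* duplicate_known_length(const char* src, std::size_t len) {
    assert(src != nullptr);
    char* dst = new char[len + 1];  // +1 for the terminator
    std::memcpy(dst, src, len);     // no scan for NUL needed
    dst[len] = '\0';
    return dst;
}
```

Because the length is supplied, the source is read exactly once, and the function also works when only a prefix of a longer string should be copied.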
At best it's somewhat inefficient. At worst, it's quite inefficient.
In the good case, the compiler recognizes that it can hoist the call to strlen out of the loop. In this case, you end up traversing the input string once to compute the length, and then again to copy to the destination.
In the bad case, the compiler calls strlen every iteration of the loop, in which case the complexity becomes quadratic instead of linear.
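The quadratic cost in the bad case can be made visible by counting. A minimal sketch, assuming a hypothetical counting_strlen that stands in for the library strlen and tallies every character it examines.

```cpp
#include <cassert>
#include <cstddef>

// Running total of characters examined by counting_strlen.
static std::size_t g_chars_scanned = 0;

// Hypothetical stand-in for strlen that records how much work it does.
std::size_t counting_strlen(const char* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    g_chars_scanned += n + 1;  // n characters plus the terminator
    return n;
}

// The loop from the question, with the length recomputed in the condition.
// Returns the total number of characters scanned; dst needs room for
// strlen(src) characters.
std::size_t scanned_by_naive_copy(const char* src, char* dst) {
    g_chars_scanned = 0;
    for (std::size_t i = 0; i < counting_strlen(src); ++i) {
        dst[i] = src[i];
    }
    return g_chars_scanned;
}
```

For a 4-character source the condition runs 5 times and each run scans 5 characters, 25 in total, where a single pass would scan 5. In general the loop examines (n + 1)^2 characters for a string of length n.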
As far as how to do it efficiently, I'd tend to do something like this:
char *dest = m_pName;
for (char const *in = pName; *in; ++in)
*dest++ = *in;
*dest++ = '\0';
This traverses the input only once, so it's potentially about twice as fast as the first, even in the better case (and in the quadratic case, it can be many times faster, depending on the length of the string).
Of course, this is doing pretty much the same thing as strcpy would. That may or may not be more efficient still--I've certainly seen cases where it was. Since you'd normally assume strcpy is going to be used quite a lot, it can be worthwhile to spend more time optimizing it than some random guy on the internet typing in an answer in a couple minutes.
Yes, your code is inefficient. It takes what is called "O(n^2)" time. Why? The strlen() call sits in the loop condition, so the length of the string is recalculated on every iteration. You can make it faster by doing this:
size_t len = strlen(pName);
for (size_t i = 0; i < len; i++)
m_pName[i] = pName[i];
Now you calculate the string length only once, so this code takes "O(n)" time, which is much faster than O(n^2). That is about as good as a hand-written byte loop gets. However, a memcpy call can still be several times faster, because this code copies 1 byte at a time, whereas memcpy copies in chunks of at least your system's word length.
It depends on how you interpret efficiency. I'd claim that using memcpy() or strcpy() is more efficient, because you don't have to write such a loop every time you need a copy.
He is claiming that for example memcopy would do the same like the for loop above.
Well, not exactly the same: memcpy() takes the size once, while strlen(pName) might be called on every loop iteration. So in terms of performance, memcpy() would be better.
BTW from your commented code:
// char *m_pName = nullptr;
Initializing it like that and then writing through it is undefined behavior; you have to allocate memory for m_pName first:
char *m_pName = new char[strlen(pName) + 1];
Why the +1? Because you have to consider putting a '\0' indicating the end of the c-style string.
Yes, it's inefficient, not because you're using a loop instead of memcpy but because you're calling strlen on each iteration. strlen loops over the entire array until it finds the terminating zero byte.
Also, it's very unlikely that the strlen will be optimized out of the loop condition, see In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing).
So memcpy(m_pName, pName, strlen(pName)) would indeed be faster.
Even faster would be strcpy, because it avoids the strlen loop:
strcpy(m_pName, pName);
strcpy does the same as the loop in @JerryCoffin's answer.
For simple operations like that you should almost always say what you mean and nothing more.
In this instance if you had meant strcpy() then you should have said that, because strcpy() will copy the terminating NUL character, whereas that loop will not.
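The missing terminator is easy to demonstrate. A minimal sketch, assuming a destination buffer pre-filled with a marker character; loop_copy is a hypothetical name for the loop from the question (with the length hoisted, which does not change what gets written).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// The loop from the question: copies the characters of src, nothing else.
void loop_copy(char* dst, const char* src) {
    assert(dst != nullptr && src != nullptr);
    const std::size_t len = std::strlen(src);
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = src[i];
    }
    // Note: dst[len] is never written, so dst is not NUL-terminated here.
}
```

If an 8-byte buffer is filled with 'X' and "abc" is copied with loop_copy, the byte at index 3 is still 'X'; after std::strcpy with the same arguments it is '\0'.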
Neither one of you can win the debate. A modern compiler has seen a thousand different memcpy() implementations and there's a good chance it's just going to recognise yours and replace your code either with a call to memcpy() or with its own inlined implementation of the same.
It knows which one is best for your situation. Or at least it probably knows better than you do. When you second-guess that you run the risk of the compiler failing to recognise it and your version being worse than the collected clever tricks the compiler and/or library knows.
Here are a few considerations that you have to get right if you want to run your own code instead of the library code:
What's the largest read/write chunk size that is efficient (it's rarely bytes).
For what range of loop lengths is it worth the trouble of pre-aligning reads and writes so that larger chunks can be copied?
Is it better to align reads, align writes, do nothing, or to align both and perform permutations in arithmetic to compensate?
What about using SIMD registers? Are they faster?
How many reads should be performed before the first write? How much register file needs to be used for the most efficient burst accesses?
Should a prefetch instruction be included?
How far ahead?
How often?
Does the loop need extra complexity to avoid preloading over the end?
How many of these decisions can be resolved at run-time without causing too much overhead? Will the tests cause branch prediction failures?
Would inlining help, or is that just wasting icache?
Does the loop code benefit from cache line alignment? Does it need to be packed tightly into a single cache line? Are there constraints on other instructions within the same cache line?
Does the target CPU have dedicated instructions like rep movsb which perform better? Does it have them but they perform worse?
Going further; because memcpy() is such a fundamental operation it's possible that even the hardware will recognise what the compiler's trying to do and implement its own shortcuts that even the compiler doesn't know about.
Don't worry about the superfluous calls to strlen(). Compiler probably knows about that, too. (Compiler should know in some instances, but it doesn't seem to care) Compiler sees all. Compiler knows all. Compiler watches over you while you sleep. Trust the compiler.
Oh, except the compiler might not catch that null pointer reference. Stupid compiler!
This code is confused in various ways.
Just do m_pName = pName; because you're not actually copying the string.
You're just pointing to the one you've already got.
If you want to copy the string m_pName = strdup(pName); would do it.
If you already have storage, strcpy or memcpy would do it.
In any case, get strlen out of the loop.
This is the wrong time to worry about performance.
First get it right.
If you insist on worrying about performance, it's hard to beat strcpy.
What's more, you don't have to worry about it being right.
As a matter of fact, why do you need to copy at all? (either with the loop or with memcpy)
If you want to duplicate a memory block, that's a different question, but since it's a pointer, all you need is &pName[0] (the address of the first element of the array) and the length. Note that sizeof pName only gives the size of the pointer itself, so use strlen(pName) for a C string. You can reference any element by incrementing the address of the first byte, and you know the limit from the length. Why have all these pointers? (Let me know if there is more to this than theoretical debate.)

what will be the addressing mode in assembly code generated by the compiler here?

Suppose we've got two integer and character variables:
int adad=12345;
char character;
Assuming we're discussing a platform in which, length of an integer variable is longer than or equal to three bytes, I want to access third byte of this integer and put it in the character variable, with that said I'd write it like this:
character=*((char *)(&adad)+2);
Considering that line of code and the fact that I'm not a compiler or assembly expert, I know a little about addressing modes in assembly and I'm wondering the address of the third byte (or I guess it's better to say offset of the third byte) here would be within the instructions generated by that line of code themselves, or it'd be in a separate variable whose address (or offset) is within those instructions ?
The best thing to do in situations like this is to try it. Here's an example program:
int main(int argc, char **argv)
{
int adad=12345;
volatile char character;
character=*((char *)(&adad)+2);
return 0;
}
I added the volatile to avoid the assignment line being completely optimized away. Now, here's what the compiler came up with (for -Oz on my Mac):
_main:
pushq %rbp
movq %rsp,%rbp
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
xorl %eax,%eax
leave
ret
The only three lines that we care about are:
movl $0x00003039,0xf8(%rbp)
movb 0xfa(%rbp),%al
movb %al,0xff(%rbp)
The movl is the initialization of adad. Then, as you can see, it reads out the 3rd byte of adad, and stores it back into memory (the volatile is forcing that store back).
I guess a good question is why does it matter to you what assembly gets generated? For example, just by changing my optimization flag to -O0, the assembly output for the interesting part of the code is:
movl $0x00003039,0xf8(%rbp)
leaq 0xf8(%rbp),%rax
addq $0x02,%rax
movzbl (%rax),%eax
movb %al,0xff(%rbp)
Which is pretty straightforwardly seen as the exact logical operations of your code:
Initialize adad
Take the address of adad
Add 2 to that address
Load one byte by dereferencing the new address
Store one byte into character
Various optimizations will change the output... if you really need some specific behaviour/addressing mode for some reason, you might have to write the assembly yourself.
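If the goal is simply "the byte at offset 2", the access can be written so that the compiler is free to pick whatever addressing mode it likes. A minimal sketch; byte_at is a hypothetical helper name, and which byte of the value lands at index 2 depends on the platform's byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: returns the byte at position `index` of the object
// representation of `value` (index 2 is the "third byte" from the question).
unsigned char byte_at(int value, std::size_t index) {
    assert(index < sizeof value);
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);
    return bytes[index];
}
```

On a little-endian machine with 4-byte int, 12345 is stored as 39 30 00 00, so byte_at(12345, 2) is 0 there, which matches the movb from 0xfa(%rbp) in the listing reading a zero byte.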
Without knowing anything about the compiler and the underlying CPU architecture no definitive answer can be given. For example, not all CPU architectures allow the addressing of every arbitrary byte in memory (though I believe all the currently popular ones do): on a CPU that's word-addressed, instead of byte-addressed, what the compiler will generate is inevitably going to be the loading into some register of the whole word adad (presumably by an offset from a base pointer register, if the variable in question is on stack [1]), followed by shifting and masking to isolate the byte of interest.
[1] note that, without knowing what CPU architecture we're talking about and how the compiler uses it, we can't even say whether "load a word at a fixed offset from a base register" is something that's done inline within the instruction (as one might hope, and many popular architectures definitely do support;-) or needs separate address arithmetic in an auxiliary register.
IOW, whether it's a good idea or not, it's definitely possible to define a CPU architecture which cannot load / store registers except from other registers or memory addresses defined by other registers or constant, and some such architectures exist (though they may not be all that popular at this time;-).
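The shift-and-mask sequence a word-addressed machine would need can be written out in source form. A minimal sketch, assuming a 32-bit word; byte_by_shift is a hypothetical helper name, and index 0 means the least significant byte of the value rather than the lowest address.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: extracts byte number `index` (0 = least significant)
// from a 32-bit word using only whole-word operations.
std::uint8_t byte_by_shift(std::uint32_t word, unsigned index) {
    assert(index < 4);
    return static_cast<std::uint8_t>((word >> (8 * index)) & 0xFFu);
}
```

Because this works on the value rather than on memory, the result does not depend on byte order: byte_by_shift(0x00ABCDEF, 2) is 0xAB on any platform.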

off-by-one error with string functions (C/C++) and security potentials

So this code has the off-by-one error:
void foo (const char * str) {
char buffer[64];
strncpy(buffer, str, sizeof(buffer));
buffer[sizeof(buffer)] = '\0';
printf("whoa: %s", buffer);
}
What can a malicious attacker do if she figures out how the function foo() works?
Basically, what kinds of security problems is this code vulnerable to?
I personally thought that the attacker can't really do anything in this case, but I heard that they can do a lot of things even if they are limited to work with 1 byte.
The only off-by-one error I see here is this line:
buffer[sizeof(buffer)] = '\0';
Is that what you're talking about? I'm not an expert on these things, so maybe I'm overlooking something, but since the only thing that will ever get written to that wrong byte is a zero, I think the possibilities are quite limited. The attacker can't control what's being written there. Most likely it would just cause a crash, but it could also cause tons of other odd behavior, all of it specific to your application. I don't see any code injection vulnerability here unless this error causes your app to expose another such vulnerability that would be used as the vector for the actual attack.
Again, take with a grain of salt...
Read The Shellcoder's Handbook, 2nd Edition, for lots of information.
Disclaimer: This is inferred knowledge from some research I just did, and should not be taken as gospel.
It's going to overwrite one byte of your saved frame pointer with a null byte (assuming the buffer sits directly below it on the stack) - that's the reference point that your calling function will use to offset its memory accesses. So at that point the calling function's memory operations are going to a different location. I don't know what that location will be, but you don't want to be accessing the wrong memory. I won't say you can do anything, but you might be able to do something.
How do I know this (really, how did I infer this)? Smashing the stack for Fun and Profit by Aleph One. It's quite old, and I don't know if Windows or Compilers have changed the way the stack behaves to avoid these problems. But it's a starting point.
example1.c:
void function(int a, int b, int c) {
char buffer1[5];
char buffer2[10];
}
void main() {
function(1,2,3);
}
To understand what the program does to call function() we compile it with
gcc using the -S switch to generate assembly code output:
$ gcc -S -o example1.s example1.c
By looking at the assembly language output we see that the call to
function() is translated to:
pushl $3
pushl $2
pushl $1
call function
This pushes the 3 arguments to function backwards into the stack, and
calls function(). The instruction 'call' will push the instruction pointer
(IP) onto the stack. We'll call the saved IP the return address (RET). The
first thing done in function is the procedure prolog:
pushl %ebp
movl %esp,%ebp
subl $20,%esp
This pushes EBP, the frame pointer, onto the stack. It then copies the
current SP onto EBP, making it the new FP pointer. We'll call the saved FP
pointer SFP. It then allocates space for the local variables by subtracting
their size from SP.
We must remember that memory can only be addressed in multiples of the
word size. A word in our case is 4 bytes, or 32 bits. So our 5 byte buffer
is really going to take 8 bytes (2 words) of memory, and our 10 byte buffer
is going to take 12 bytes (3 words) of memory. That is why SP is being
subtracted by 20. With that in mind our stack looks like this when
function() is called (each space represents a byte):
bottom of                                                            top of
memory                                                               memory
           buffer2       buffer1   sfp   ret   a     b     c
<------   [            ][        ][    ][    ][    ][    ][    ]
top of                                                            bottom of
stack                                                                 stack
What can malicious attackers do if she
figured out how the function foo()
works? Basically, to what kind of
security potential problems is this
code vulnerable?
This is probably not the best example of a bug that could be easily exploited for security purposes, although it could be exploited to potentially crash the code simply by using a string of 64 characters or longer.
While it certainly is a bug that will corrupt the address immediately after the array (on the stack) with a single zero byte, there is no easy way for a hacker to inject data into the corrupted area. Calling the printf() function will push parameters on the stack and may clear the zero that was written out of array bounds and lead to a potentially unterminated string being passed to printf.
However, without intimate knowledge of what goes on in printf (and needing to exploit printf as well as foo), a hacker would be hard pressed to do anything other than crash your code.
FWIW, this is a good reason to compile with warnings on or to use functions like strncpy_s, which both respect the buffer size and include a terminating null even if the copied string is larger than the buffer. With strncpy_s, the line "buffer[sizeof(buffer)] = '\0';" is not even necessary.
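Where strncpy_s is not available, the same effect can be had with plain strncpy. A minimal sketch; copy_truncated is a hypothetical helper name. The terminator goes at index size - 1, the last valid element, rather than at index size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into dst, truncating if necessary, and
// always leaves dst NUL-terminated without writing past dst_size bytes.
void copy_truncated(char* dst, std::size_t dst_size, const char* src) {
    assert(dst != nullptr && src != nullptr);
    assert(dst_size > 0);
    std::strncpy(dst, src, dst_size - 1);  // leave room for the terminator
    dst[dst_size - 1] = '\0';              // last valid index, not dst_size
}
```

In foo() this would be called as copy_truncated(buffer, sizeof(buffer), str), which keeps at most 63 characters of the input and never touches the byte after the array.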
The issue is that you don't have permission to write to the item after the array. When you asked for 64 chars for buffer, the system is required to give you at least 64 bytes. It's normal for the system to give you more than that -- in which case the memory belongs to you and there is no problem in practice -- but it is possible that even the first byte after the array belongs to "somebody else."
So what happens if you overwrite it? If the "somebody else" is actually inside your program (maybe in a different structure or thread) the operating system probably won't notice you trampled on that data, but that other structure or thread might. There's no telling what data should be there or how trampling over it will affect things.
In this case you allocated buffer on the stack, which means (1) the somebody else is you, and in fact is your current stack frame, and (2) it's not in another thread (but could affect other local variables in the current stack frame).