(Why) does an empty string have an address? - c++

I guessed no, but the output of something like this shows that it does:
string s="";
cout<<&s;
what is the point of having empty string with an address ?
Do you think that should not cost any memory at all ?

Yes, every variable that you keep in memory has an address. As for what the "point" is, there may be several:
Your (literal) string is not actually "empty", it contains a single '\0' character. The std::string object that is created to contain it may allocate its own character buffer for holding this data, so it is not necessarily empty either.
If you are using a language in which strings are mutable (as is the case in C++), then there is no guarantee that an empty string will remain empty.
In an object-oriented language, a string instance with no data associated with it can still be used to call various instance methods on the string class. This requires a valid object instance in memory.
There is a difference between an empty string and a null string. Sometimes the distinction can be important.
And yes, I very much agree with the implementation of the language that an "empty" variable should still exist in and consume memory. In an object-oriented language an instance of an object is more than just the data that it stores, and there's nothing wrong with having an instance of an object that is not currently storing any actual data.

Following your logic, int i; would also not allocate any memory space, since you are not assigning any value to it. But how is it possible then, that this subsequent operation i = 10; works after that?
When you declare a variable, you are actually allocating memory space of a certain size (depending on the variable's type) to store something. Whether you use this space right away or not is up to you, but the declaration of the variable is what triggers memory allocation for it.
Some coding practices say you shouldn't declare a variable until the moment you need to use it.

An 'empty' string object is still an object - there may be more to its internal implementation than just the memory required to store the literal string itself. Besides that, most C-style strings (like the ones used in C++) are null-terminated, meaning even that "empty" string still uses one byte for the terminator.

Every named object in C++ has an address. There is even a specific requirement that the size of every type be at least 1 so that T[N] and T[N+1] are different, or so that in T a, b; both variables have distinct addresses.
In your case, s is a named object of type std::string, so it has an address. The fact that you constructed s from a particular value is immaterial. What matters is that s has been constructed, so it is an object, so it has an address.
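A minimal sketch of the two claims above, assuming any C++11 or later compiler. The names Empty and distinct_addresses are made up for illustration.

```cpp
#include <cassert>
#include <string>

// A class with no data members still has a nonzero size.
struct Empty {};
static_assert(sizeof(Empty) >= 1, "every complete object type has size >= 1");
static_assert(sizeof(std::string) >= 1, "the string object itself occupies storage");

// Two separately declared empty strings are distinct objects, so their
// addresses differ even though neither one holds any characters.
bool distinct_addresses() {
    std::string a = "";
    std::string b = "";
    assert(a.empty() && b.empty());
    return &a != &b;
}
```

Calling distinct_addresses() should return true: what gets an address is the object, not the characters stored in it.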

s is a string object so it has an address. It has some internal data structures keeping track of the string. For example, current length of the string, current storage reserved for string, etc.
More generally, the C++ standard requires all objects to have a nonzero size. This helps ensure that every object has a unique address.
9 Classes
Complete objects and member subobjects of class type shall have nonzero size.

In C++, all classes are a specific, unchanging size (varying by compiler and library, but fixed at compile time). The std::string usually consists of a pointer, a length of allocation, and a length used. That's ~12 bytes on a 32-bit platform (typically more on a 64-bit one), no matter how long the string is, and you have allocated std::string s on the call stack. When you display the address of the std::string, cout displays the location of the std::string in memory.
If the string doesn't point at anything, it won't allocate any space from the heap, which is like what you're thinking. But all C-strings end in a trailing null character, so the C-string "" is one character long, not zero. This means that when you assign the C-string "" to the std::string, the std::string may allocate 1 (or more) bytes to hold the trailing null character ('\0'); implementations with a small string optimization keep it inside the object itself instead.
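To make the "one character long, not zero" point concrete, here is a small sketch, assuming C++11 or later. The function name empty_string_is_terminated is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// The literal "" is an array of exactly one char: the terminator.
static_assert(sizeof("") == 1, "an empty literal still stores '\\0'");

// An empty std::string reports size 0, yet c_str() still points at a '\0'.
bool empty_string_is_terminated() {
    std::string s = "";
    return s.size() == 0
        && std::strlen(s.c_str()) == 0
        && s.c_str()[0] == '\0';
}
```

Whether that terminator lives on the heap or inside the string object is up to the library; the sketch only checks that it exists.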

If there truly was no point to the empty string, then the programmer would not write the instruction at all. The language is loyal and trusting! And will never assume memory you allocate to be "wasted". Even if you are lost and heading over a cliff, it will hold your hand to the bitter end.
I think it'd be interesting to know, just as a curiosity though, that if you create a variable that isn't 'used' later, such as your empty string, the compiler may very well optimize it away so it incurs no cost to begin with. I guess compilers aren't as trusting...

Related

What happens if I write less than 12 bytes to a 12 byte buffer?

Understandably, going over a buffer errors out (or creates an overflow), but what happens if there are less than 12 bytes used in a 12 byte buffer? Is it possible or does the empty trailing always fill with 0s? Orthogonal question that may help: what is contained in a buffer when it is instantiated but not used by the application yet?
I have looked at a few pet programs in Visual Studio and it seems that they are appended with 0s (or null characters) but I am not sure if this is a MS implementation that may vary across language/ compiler.
Take the following example (within a block of code, not global):
char data[12];
memcpy(data, "Selbie", 6);
Or even this example:
char* data = new char[12];
memcpy(data, "Selbie", 6);
In both of the above cases, the first 6 bytes of data are S,e,l,b,i, and e. The remaining 6 bytes of data are considered "unspecified" (could be anything).
Is it possible or does the empty trailing always fill with 0s?
Not guaranteed at all. The only allocator that I know of that guarantees zero byte fill is calloc. Example:
char* data = (char*)calloc(12, 1); // allocates an array of 12 bytes and zero-initializes each byte
memcpy(data, "Selbie", 6);
what is contained in a buffer when it is instantiated but not used by the application yet?
As per the most recent C++ standards, the bytes delivered by the allocator are technically considered "unspecified". You should assume that it's garbage data (anything). Make no assumptions about the content.
Debug builds with Visual Studio will often initialize buffers with 0xcc or 0xcd values, but that is not the case in release builds. There are, however, compiler flags and memory allocation techniques for Windows and Visual Studio with which you can guarantee zero-initialized memory allocations, but they are not portable.
Consider your buffer, filled with zeroes:
[00][00][00][00][00][00][00][00][00][00][00][00]
Now, let's write 10 bytes to it. Values incrementing from 1:
[01][02][03][04][05][06][07][08][09][10][00][00]
And now again, this time, 4 times 0xFF:
[FF][FF][FF][FF][05][06][07][08][09][10][00][00]
what happens if there are less than 12 bytes used in a 12 byte buffer? Is it possible or does the empty trailing always fill with 0s?
You write as much as you want, the remaining bytes are left unchanged.
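The three buffer states drawn above can be reproduced with a short sketch. The function name partial_writes is made up, and the buffer is zeroed explicitly so that the result is predictable.

```cpp
#include <array>
#include <cassert>
#include <cstring>

// Reproduces the three buffer states from the diagram.
std::array<unsigned char, 12> partial_writes() {
    std::array<unsigned char, 12> buf{};  // value-initialized: all zeroes

    // Write 10 bytes, counting up from 1; bytes 10 and 11 are untouched.
    const unsigned char first[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::memcpy(buf.data(), first, sizeof first);

    // Overwrite only the first 4 bytes with 0xFF; bytes 4..11 keep their values.
    const unsigned char second[4] = {0xFF, 0xFF, 0xFF, 0xFF};
    std::memcpy(buf.data(), second, sizeof second);

    return buf;
}
```

Bytes 4 through 9 still hold 5 through 10 from the first write, and the last two bytes still hold the zeroes from initialization.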
Orthogonal question that may help: what is contained in a buffer when
it is instantiated but not used by the application yet?
Unspecified. Expect junk left by programs (or other parts of your program) that used this memory before.
I have looked at a few pet programs in Visual Studio and it seems that they are appended with 0s (or null characters) but I am not sure if this is a MS implementation that may vary across language/ compiler.
It is exactly what you think it is. Somebody had done that for you this time, but there are no guarantees it will happen again. It could be a compiler flag that attaches cleaning code. Some versions of MSVC used to fill fresh memory with 0xCD when run in debug but not in release. It can also be a system security feature that wipes memory before giving it to your process (so you can't spy on other apps). Always remember to use memset to initialize your buffer where it matters. Alternatively, mandate a certain compiler flag in the readme if you depend on a fresh buffer containing a certain value.
But cleaning is not really necessary. You take a 12 byte-long buffer. You fill it with 7 bytes. You then pass it somewhere - and you say "here is 7 bytes for you". The size of the buffer is not relevant when reading from it. You expect other functions to read as much as you've written, not as much as possible. In fact, in C it is usually not possible to tell how long the buffer is.
And a side note:
Understandably, going over a buffer errors out (or creates an overflow)
It doesn't, that's the problem. That's why it's a huge security issue: there is no error and the program tries to continue, so it sometimes executes malicious content it was never meant to. So we had to add a bunch of mechanisms to the OS, like ASLR, that increase the probability of the program crashing and decrease the probability of it continuing with corrupted memory. So, never depend on those afterthought guards and watch your buffer boundaries yourself.
C++ has storage classes including global, automatic and static. The initialization depends on how the variable is declared.
char global[12]; // all 0
static char s_global[12]; // all 0
void foo()
{
static char s_local[12]; // all 0
char local[12]; // automatic storage variables are uninitialized, accessing before initialization is undefined behavior
}
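A quick way to check the "all 0" cases is sketched below. The names g_buffer and statics_start_zeroed are hypothetical. The automatic case is deliberately left out, since reading an uninitialized local would be undefined behavior.

```cpp
#include <cassert>
#include <cstddef>

char g_buffer[12];  // static storage duration: zero-initialized before main runs

// Returns true if every byte of both static-duration arrays is zero.
bool statics_start_zeroed() {
    static char s_local[12];  // also static storage duration, so also zeroed
    for (std::size_t i = 0; i < sizeof g_buffer; ++i) {
        if (g_buffer[i] != 0 || s_local[i] != 0) {
            return false;
        }
    }
    return true;
}
```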
Some interesting details here.
The program knows the length of a string because it ends it with a null-terminator, a character of value zero.
This is why in order to fit a string in a buffer, the buffer has to be at least 1 character longer than the number of characters in the string, so that it can fit the string plus the null-terminator too.
Any space after that in the buffer is left untouched. If there was data there previously, it is still there. This is what we call garbage.
It is wrong to assume this space is zero-filled just because you haven't used it yet, you don't know what that particular memory space was used for before your program got to that point. Uninitialized memory should be handled as if what is in it is random and unreliable.
All of the previous answers are very good and very detailed, but the OP appears to be new to C programming. So, I thought a Real World example might be helpful.
Imagine you have a cardboard beverage holder that can hold six bottles. It's been sitting around in your garage so instead of six bottles, it contains various unsavory things that accumulate in the corners of garages: spiders, mouse houses, et al.
A computer buffer is a bit like this just after you allocate it. You can't really be sure what's in it, you just know how big it is.
Now, let's say you put four bottles in your holder. Your holder hasn't changed size, but you now know what's in four of the spaces. The other two spaces, complete with their questionable contents, are still there.
Computer buffers are the same way. That's why you frequently see a bufferSize variable to track how much of the buffer is in use. A better name might be numberOfBytesUsedInMyBuffer but programmers tend to be maddeningly terse.
Writing part of a buffer will not affect the unwritten part of the buffer; it will contain whatever was there beforehand (which naturally depends entirely on how you got the buffer in the first place).
As the other answer notes, static and global variables will be initialized to 0, but local variables will not be initialized (and instead contain whatever was on the stack beforehand). This is in keeping with the zero-overhead principle: initializing local variables would, in some cases, be an unnecessary and unwanted run-time cost, while static and global variables are allocated at load-time as part of a data segment.
Initialization of heap storage is at the option of the memory manager, but in general it will not be initialized, either.
In general, it's not at all unusual for buffers to be underfull. It's often good practice to allocate buffers bigger than they need to be. (Trying to always compute an exact buffer size is a frequent source of error, and often a waste of time.)
When a buffer is bigger than it needs to be, when the buffer contains less data than its allocated size, it's obviously important to keep track of how much data is there. In general there are two ways of doing this: (1) with an explicit count, kept in a separate variable, or (2) with a "sentinel" value, such as the \0 character which marks the end of a string in C.
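The two bookkeeping styles can be sketched side by side. The helper names fill_counted and fill_terminated are invented for this example and are not standard functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// (1) Explicit count: copy up to cap bytes and return how many are in use.
std::size_t fill_counted(char* buf, std::size_t cap, const char* src, std::size_t n) {
    std::size_t used = (n < cap) ? n : cap;  // clamp to the buffer's capacity
    std::memcpy(buf, src, used);
    return used;                             // the caller keeps this count
}

// (2) Sentinel: copy at most cap-1 characters and always write a '\0' after them.
void fill_terminated(char* buf, std::size_t cap, const char* src) {
    if (cap == 0) {
        return;                              // no room even for the terminator
    }
    std::size_t len = std::strlen(src);
    if (len > cap - 1) {
        len = cap - 1;
    }
    std::memcpy(buf, src, len);
    buf[len] = '\0';
}
```

With a 12-byte buffer and the source "Selbie", the first helper returns 6, and after the second helper strlen on the buffer gives 6. In both cases the bytes past the used part are never looked at.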
But then there's the question, if not all of a buffer is in use, what do the unused entries contain?
One answer is, of course, that it doesn't matter. That's what "unused" means. You care about the values of the entries that are used, that are accounted for by your count or your sentinel value. You don't care about the unused values.
There are basically four situations in which you can predict the initial values of the unused entries in a buffer:
When you allocate an array (including a character array) with static duration, all unused entries are initialized to 0.
When you allocate an array and give it an explicit initializer, all unused entries are initialized to 0.
When you call calloc, the allocated memory is initialized to all-bits-0.
When you call strncpy, the destination string is padded out to size n with \0 characters.
In all other cases, the unused parts of a buffer are unpredictable, and generally contain whatever they did last time (whatever that means). In particular, you cannot predict the contents of an uninitialized array with automatic duration (that is, one that's local to a function and isn't declared with static), and you cannot predict the contents of memory obtained with malloc. (Some of the time, in those two cases the memory tends to start out as all-bits-zero the first time, but you definitely don't want to ever depend on this.)
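Three of the predictable cases listed above (explicit initializer, calloc, strncpy) can be checked with a short sketch. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Explicit initializer: elements not covered by the initializer are zeroed.
bool initializer_zero_fills() {
    char a[8] = "hi";  // 'h', 'i', then six '\0' bytes
    for (int i = 2; i < 8; ++i) {
        if (a[i] != '\0') return false;
    }
    return true;
}

// calloc: the whole block comes back as all-bits-zero.
bool calloc_zero_fills() {
    char* p = static_cast<char*>(std::calloc(12, 1));
    if (p == nullptr) return false;
    bool ok = true;
    for (int i = 0; i < 12; ++i) {
        if (p[i] != 0) ok = false;
    }
    std::free(p);
    return ok;
}

// strncpy: when the source is shorter than n, the rest is padded with '\0'.
bool strncpy_pads() {
    char d[8];
    std::memset(d, 'x', sizeof d);  // start from known non-zero contents
    std::strncpy(d, "hi", sizeof d);
    for (int i = 2; i < 8; ++i) {
        if (d[i] != '\0') return false;
    }
    return d[0] == 'h' && d[1] == 'i';
}
```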
It depends on the storage class specifier, your implementation, and its settings.
Some interesting examples:
- Uninitialized stack variables may be set to 0xCCCCCCCC
- Uninitialized heap variables may be set to 0xCDCDCDCD
- Uninitialized static or global variables may be set to 0x00000000
- or it could be garbage.
It's risky to make any assumptions about any of this.
I think the correct answer is that you should always keep track of how many char are written.
Low-level functions like read and write likewise take or return the number of characters read or written. In the same way, std::string keeps track of the number of characters in its implementation.
Declared objects of static duration (those declared outside a function, or with a static qualifier) which have no specified initializer are initialized to whatever value would be represented by a literal zero [i.e. an integer zero, floating-point zero, or null pointer, as appropriate, or a structure or union containing such values]. If the declaration of any object (including those of automatic duration) includes an initializer, portions whose values are specified by that initializer will be set as specified, and the remainder will be zeroed as with static objects.
For automatic objects without initializers, the situation is somewhat more ambiguous. Given something like:
#include <string.h>
unsigned char static1[5], static2[5];
void test(void)
{
unsigned char temp[5];
strcpy(temp, "Hey");
memcpy(static1, temp, 5);
memcpy(static2, temp, 5);
}
the Standard is clear that test would not invoke Undefined Behavior, even though it copies portions of temp that were not initialized. The text of the Standard, at least as of C11, is unclear as to whether anything is guaranteed about the values of static1[4] and static2[4], most notably whether they might be left holding different values. A defect report states that the Standard was not intended to forbid a compiler from behaving as though the code had been:
unsigned char static1[5]={1,1,1,1,1}, static2[5]={2,2,2,2,2};
void test(void)
{
unsigned char temp[4];
strcpy(temp, "Hey");
memcpy(static1, temp, 4);
memcpy(static2, temp, 4);
}
which could leave static1[4] and static2[4] holding different values. The Standard is silent on whether quality compilers intended for various purposes should behave in that fashion. The Standard also offers no guidance as to how the function should be written if the programmer requires that static1[4] and static2[4] hold the same value, but doesn't care what that value is.

What does reinterpret_cast do binary-wise?

I'm writing a logger in C++, and I've come to the part where I'd like to take a log record and write in to a file.
I have created a LogRecord struct, and would like to serialize it and write it to a file in binary mode.
I have read some posts about serialization in C++, and one of the answers included this following snippet:
reinterpret_cast<char*>(&logRec)
I've tried reading about reinterpret_cast and what it does, but I couldn't fully understand what's really happening in the background.
From what I understand, it takes a pointer to my struct, and turns it into a pointer to a char, so it thinks that the chunk of memory that holds my struct is actually a string, is that true? How can that work?
A memory address is just a memory address. Memory isn't inherently special - it's just a huge array of bytes, for all we care. What gives memory its meaning is what we do with it, and the lenses through which we view it.
A pointer to a struct is just an integer that specifies some offset into memory - surely you can treat one integer in any way you want, in your case, as a pointer to some arbitrary number of bytes (chars).
reinterpret_cast() doesn't do anything special except allow you to convert one view of a memory address into another view of a memory address. It's still up to you to treat that memory address correctly.
For instance, char* is the conventional way to refer to a string of characters in C++ - but the type char* literally means "a pointer to a single char". How does it come to mean a pointer to a null-terminated string of characters? By convention, that's how. We treat the type differently depending on the context, but it's up to us to make sure we do so correctly.
For instance, how do you know how many bytes to read through your char* pointer to your struct? The type itself gives you zero information - it's up to you to know that you've really got a byte-oriented pointer to a struct of fixed length.
Remember, under the hood, the machine has no types. A piece of paper doesn't care if you write an essay on each line, or if you scribble all over the thing. It's how we treat it - and how the tools we use (C++) treat it.
Binary-wise, it does nothing at all. This casting is a higher-level concept that has no bearing in any actual machine instructions.
At a low level, a pointer is just a numeric value that holds a memory address. There is nothing to be done in telling the compiler "although you thought the destination memory contained a struct, now please think that it contains a char". The actual address itself doesn't change in any way.
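A small sketch of "the address doesn't change", assuming a made-up SampleRecord struct standing in for the asker's LogRecord:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record layout, for illustration only.
struct SampleRecord {
    std::int32_t level;
    char message[28];
};

// The cast changes the pointer's type, not the address it holds.
bool cast_keeps_address() {
    SampleRecord rec{};
    char* bytes = reinterpret_cast<char*>(&rec);
    return static_cast<void*>(bytes) == static_cast<void*>(&rec);
}
```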
From what I understand, it takes a pointer to my struct, and turns it into a pointer to a char, so it thinks that the chunk of memory that holds my struct is actually a string, is that true?
Yes.
How can that work?
A string is just a sequence of bytes, and your object is just a sequence of bytes, so that's how it works.
But it won't if your object is logically more than just a sequence of bytes. Any indirection, and you're hosed. Furthermore, any implementation-defined padding or representation/endianness and your data is non-portable. This might be acceptable; it really depends on your requirements.
Casting a struct into an array of bytes (chars) is a classic low impact method of binary serialization. This is based on the assumption that the content of the struct exists contiguously in memory. The casting allows us write this data to a file or socket using the normal APIs.
This only works though if the data is contiguous. This is true for C style structs or PODs in C++ terminology. It will not work with complex C++ objects or any struct with pointers to storage outside the struct. For text data you will need to use fixed size character arrays.
struct {
int num;
char name[50];
};
will serialize correctly.
struct {
int num;
char* name;
};
will not serialize correctly since the data for the string is stored outside the struct;
If you are sending data across a network you will also need to ensure that the struct is packed, or at least of known alignment, and that integers are converted to a consistent endianness (network byte order is normally big endian).
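Here is a minimal round-trip sketch of this technique, using an in-memory stream instead of a real file. The LogRecord layout and the helpers write_record, read_record and round_trip are assumptions made for the example. It reads the bytes back on the same machine, so padding and endianness are not an issue here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <type_traits>

// Hypothetical record: fixed-size members only, no pointers to outside storage.
struct LogRecord {
    std::int32_t num;
    char name[50];
};
static_assert(std::is_trivially_copyable<LogRecord>::value,
              "raw byte serialization only makes sense for trivially copyable types");

// Writes the raw bytes of rec to the stream.
void write_record(std::ostream& out, const LogRecord& rec) {
    out.write(reinterpret_cast<const char*>(&rec), sizeof rec);
}

// Reads the raw bytes back; returns false on a short read.
bool read_record(std::istream& in, LogRecord& rec) {
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return in.gcount() == static_cast<std::streamsize>(sizeof rec);
}

// Serializes one record and checks that it comes back unchanged.
bool round_trip() {
    LogRecord a{};  // zero-initialized, so name is already terminated
    a.num = 42;
    std::strncpy(a.name, "startup", sizeof a.name - 1);

    std::stringstream ss(std::ios::in | std::ios::out | std::ios::binary);
    write_record(ss, a);

    LogRecord b{};
    return read_record(ss, b) && b.num == 42 && std::strcmp(b.name, "startup") == 0;
}
```

If name were a char* instead, the stream would only receive the pointer value, which is exactly the failure described above.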

Why uninitialized char array is filled with random symbols?

I have this fragment of code in C++:
char x[50];
cout << x << endl;
which outputs some random symbols as seen here:
So my first question: what is the reason behind this output? Shouldn't it be spaces or at least same symbols?
The reason I am concerned with this is that I am writing program in CUDA and I'm doing some character manipulations inside __global__ function, hence the use of string gives a "calling host function is not allowed" error.
But if I am using "big enough" char array (each chunk of text I am operating with differs in size, meaning that it will not always utilize char array fully) it's sometimes not fully filled and I left with junk like in the picture below hanging at the end of text:
So my second question: is there any way to avoid this?
what is the reason behind this output?
The values in an automatic variable are indeterminate. The standard doesn't specify it, so it might be spaces as you said, it might be random content.
[...] sometimes not fully filled and I left with junk [...]
Strings in C are null-terminated, so any routine dedicated to printing a string will loop as long as no null byte is encountered. In uninitialized memory, this null byte occurs randomly (or not at all). These weird, trailing characters are a result of that.
is there any way to avoid this?
Yes. Initialize it.
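A sketch of what initializing buys you, in plain host-side C++ (the function names are made up). Because every element starts as '\0', a partially filled buffer is automatically terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// A zero-initialized buffer is an empty C string.
std::size_t length_of_zeroed_buffer() {
    char x[50] = {};  // every element is '\0'
    return std::strlen(x);
}

// Copying 3 characters without a terminator is still safe to print,
// because x[3] already holds '\0' from the initialization.
std::size_t length_after_partial_fill() {
    char x[50] = {};
    std::memcpy(x, "abc", 3);
    return std::strlen(x);
}
```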
(will assume x86 in this post)
what is the reason behind this output?
Here's roughly what happens, in assembly, when you do char x[50];:
SUB ESP, 0x34 ; 52 bytes
Essentially, the stack pointer is moved down by 0x34 bytes (the size is rounded up to a multiple of 4; on x86 the stack grows toward lower addresses). Then, that space on the stack becomes x. There's no cleaning, no changes or pushes or pops, just this space becoming x. Anything that was there before (abandoned params, return addresses, variables from previous function calls) will be in x.
Here's roughly what happens when you do new char[50]:
1. Control gets passed to the allocator
2. The allocator looks for any heap of sufficient size (read as: an already allocated but uncommitted heap)
3. If 2 fails, the allocator makes a new heap
4. The allocator takes the heap (either the found or allocated one) and commits it
5. The address of that heap is returned to your code where it is used as a char*
The same as with a stack, you get whatever data is there. Some programs or systems may have allocators that zero out heaps when they are allocated or committed, where others may only zero when allocated but not committed, and some may not zero at all. Depending on the allocator, you may get clean memory or you may get re-used and dirty memory. This is why the values here can be non-zero and aren't predictable.
is there any way to avoid this?
In the case of heap memory, you can overload the new and delete operators in C++ and always zero newly allocated memory. You can see examples of overloading these operators here. As for memory on the stack, you just have to live with zeroing it out every time.
ZeroMemory(myArray, sizeof(myArray));
Alternatively, for both methods, you could stay away from naked arrays and use std::vector or other wrappers that take care of initialization for you. You'll still want to make sure to initialize integers and other numeric or pointer data-types, though.
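For the wrapper route, a minimal sketch (function names are hypothetical). Both std::vector<char>(n) and std::array<char, N>{} value-initialize their elements, which for char means zero.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// std::vector<char>(50) creates 50 value-initialized (zero) chars on the heap.
bool vector_starts_zeroed() {
    std::vector<char> v(50);
    return std::all_of(v.begin(), v.end(), [](char c) { return c == 0; });
}

// std::array with empty braces does the same for a fixed-size stack buffer.
bool array_starts_zeroed() {
    std::array<char, 50> a{};
    return std::all_of(a.begin(), a.end(), [](char c) { return c == 0; });
}
```

Note that std::array<char, 50> a; without the braces would leave the elements uninitialized, just like a naked array.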
No, there is no way to avoid it. C++ does not initialize automatic variables of built-in types (such as arrays of built-in types in your case) automatically, you need to initialize them yourself.
Why are you having issues with this code?
char x[50];
cout << new char[50] << endl;
cout << x << endl;
You're leaking memory with the new char[50] without a corresponding delete[].
Also, uninitialized memory is undefined as others have said and in most cases you get garbage within that memory block. A better method is to initialize it:
char x[50] = {};
char* y = new char[50]();
Then just remember to call delete[] on y later to free the memory. Yes, the OS will do it for you when the process exits, but relying on that is never a way to write good programs.

Does the compiler copy a std::string into stack while passing it to a function in C++?

I have a simple question. I have a long std::string that I want to pass it to a function.
I want to know whether this string will be copied onto the stack and a copy of it passed, or whether something like a pointer will be passed so that no additional space is required.
(C++)
I have another little question: how much memory does an element of a string take? Just like char?
Yes, it will be deep copied, so passing it by const reference is recommended.
void fun(const std::string & arg)
Typically std::string has 2 fields, a pointer pointing to dynamic allocated memory and the length, so it is 16+actual length on 64bit machines.
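One way to see the difference is to compare addresses inside the callee. This is only a sketch and the function names are made up.

```cpp
#include <cassert>
#include <string>

// By value: the parameter is a separate object (a copy) from the caller's string.
bool by_value_is_same_object(std::string arg, const std::string* caller) {
    return &arg == caller;
}

// By const reference: the parameter is just another name for the caller's string.
bool by_ref_is_same_object(const std::string& arg, const std::string* caller) {
    return &arg == caller;
}
```

For std::string s(1000, 'x'), by_value_is_same_object(s, &s) is false while by_ref_is_same_object(s, &s) is true: only the first call had to duplicate the 1000 characters.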
Spoiler alert: my answer won't be that relevant, it's just an optimization technique.
If you don't want to duplicate the string, write your own customized string class, which has two pointers or one pointer with a size. In the past this has saved me a lot of duplicates. It works as long as the string is read-only and uses copy-on-write, i.e. it duplicates only when you encounter a write.
When passing an argument by value in C++ it is conceptually copied. Whether this copy really happens is another question, though, and depends on how the argument is passed and, to some extent, on the compiler: the compiler is explicitly allowed to elide certain copies, in particular copies of temporary objects. For example, when you return an object from a function and it is clear that the object will be returned, the copy is likely to be elided. Similarly, when passing the result of a function directly on to another function, it is likely not to be copied.
Beyond this, C++ 2011 added another dimension of possibilities by supporting move constructors. These cover similar ground to some extent but also give you better control: you can explicitly indicate that it would be acceptable for an object to be moved rather than copied. Still, in no event will an argument declared by value be passed by reference.
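A sketch of what "explicitly indicate" looks like with std::move. The function consume and the two wrappers are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Takes its argument by value; the caller decides whether it is copied or moved in.
std::size_t consume(std::string text) {
    return text.size();
}

// Passing an lvalue copies: the caller's string is untouched afterwards.
bool copy_leaves_source_intact() {
    std::string s(1000, 'x');
    std::size_t n = consume(s);
    return n == 1000 && s.size() == 1000;
}

// Passing std::move(s) lets the parameter take over the contents instead.
bool move_transfers_contents() {
    std::string s(1000, 'x');
    std::size_t n = consume(std::move(s));  // s is now valid but unspecified
    return n == 1000;
}
```

After the move, s may still be assigned to or destroyed, but its contents should not be relied on.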
With respect to the used bytes per element, the std::string uses just sizeof(cT) bytes (where cT is the character template argument of the std::basic_string). However, the string will overallocate the space in many cases, and certainly when characters are added to the string. You can determine the overallocation by comparing size() and capacity() and control it to some extent with reserve(), although this function isn't required to get rid of any overallocation; the capacity() just has to be at least as large as the amount last reserve()d. If the string is small (e.g. at most 15 characters) modern implementations won't make any allocation. This is called the small string optimization.
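The size()/capacity()/reserve() relationship can be checked with a short sketch (the function name is hypothetical). The exact capacity is implementation-specific, so only the guaranteed lower bound is tested.

```cpp
#include <cassert>
#include <string>

// reserve() may raise capacity() but never changes size().
bool reserve_raises_capacity() {
    std::string s = "small";
    s.reserve(100);
    return s.size() == 5 && s.capacity() >= 100;
}
```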
With respect to the actual representation of the string: unless it is small it will use one word for the address of the storage, one word each for the size and the capacity, and for strings with stateful allocators the size of the allocator (typically another word). Given alignment requirements this effectively means that in most cases the string will take four words in addition to the elements. Typically the small string optimization uses these words to store characters if the string fits there unless, of course, it needs to store a stateful allocator.

Can C++ automatic variables vary in size?

In the following C++ program:
#include <string>
using namespace std;
int main()
{
string s = "small";
s = "bigger";
}
is it more correct to say that the variable s has a fixed size or that the variable s varies in size?
It depends on what you mean by "size".
The static size of s (as returned by sizeof(s)) will be the same.
However, the size occupied on the heap will vary between the two cases.
What do you want to do with the information?
I'll say yes and no.
s will be the same string instance, but its internal buffer (which is preallocated depending on your STL implementation) will contain a copy of the constant string you wanted to assign to it.
Should the constant string (or any other char* or string) be bigger than the internal preallocated buffer of s, the buffer will be reallocated according to the string buffer reallocation algorithm of your STL implementation.
This is going to lead to a dangerous discussion because the concept of "size" is not well defined in your question.
The size of a class object such as s is known at compile time: it's simply the sum of the sizes of its members plus whatever extra information needs to be kept for classes (I'll admit I don't know all the details). The important thing to get out of this, however, is that sizeof(s) will NOT change between assignments.
HOWEVER, the memory footprint of s can change during runtime through the use of heap allocations. So as you assign the bigger string to s, its memory footprint will increase because it will probably need more space allocated on the heap. You should probably try to specify what you want.
The std::string variable never changes its size. It just refers to a different piece of memory with a different size and different data.
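A sketch that separates the two meanings of "size" for the program in the question. The Sizes struct and the measure function are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The two notions of size discussed above.
struct Sizes {
    std::size_t object_bytes;  // sizeof: fixed at compile time
    std::size_t length;        // size(): number of characters currently stored
};

Sizes measure(const std::string& s) {
    return {sizeof(s), s.size()};
}
```

Measuring s before and after s = "bigger"; gives the same object_bytes both times, while length goes from 5 to 6.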
Neither, exactly. The variable s is referring to a string object.
#include <string>
using namespace std;
int main()
{
string s = "small"; //s is a string object, initialized with a copy of "small"
s = "bigger"; //s is modified using an overloaded operator
}
Edit, corrected some details and clarified point
See: http://www.cplusplus.com/reference/string/string/ and in particular http://www.cplusplus.com/reference/string/string/operator=/
The assignment results in the original content being dropped and the content of the right side of the operation being copied into the object. similar to doing s.assign("bigger"), but assign has a broader range of acceptable parameters.
To get to your original question, the contents of the object s can have variable size. See http://www.cplusplus.com/reference/string/string/resize/ for more details on this.
A variable is an object we refer to by a name. The "physical" size of an object -- sizeof(s) in this case -- doesn't change, ever. The type is still std::string and the size of a std::string is always constant. However, things like strings and vectors (and other containers for that matter) have a "logical size" that tells us how many elements of some type they store. A string "logically" stores characters. I say "logically" because a string object doesn't really contain the characters directly. Usually it has only a couple of pointers as "physical members". Since the string object manages a dynamically allocated array of characters and provides proper copy semantics and convenient access to the characters, we can think of those characters as members ("logical members"). Since growing a string is a matter of reallocating memory and updating pointers, we don't even need sizeof(s) to change.
I would say this is a string object, and it has the capability to grow and shrink dynamically.