std::string vs. char* - c++

Does std::string store data differently than a char* on either stack or heap, or is it just derived from char* into a class?

char*
Is the size of one pointer for your CPU architecture.
May be a value returned from malloc or calloc or new or new[].
If so, must be passed to free or delete or delete[] when you're done.
If so, the characters are stored on the heap.
May result from the "decay" of a char[ N ] (constant N) array or, as a const char*, of a string literal.
Generically, no way to tell if a char* argument points to stack, heap, or global space.
Is not a class type. It participates in expressions but has no member functions.
Nevertheless implements the RandomAccessIterator interface for use with <algorithm> and such.
std::string
Is the size of several pointers, often three.
Constructs itself when created: no need for new or delete.
Owns a copy of the string, if the string may be altered.
Can copy this string from a char*.
By default, internally allocates through std::allocator, which calls operator new, much as you would use new[] to obtain a char*.
Provides for implicit conversion which makes transparent the construction from a char* or literal.
Is a class type. Defines other operators for expressions such as catenation.
Defines c_str() which returns a const char* for temporary use.
Implements std::string::iterator type with begin() and end().
string::iterator is flexible: an implementation may make it a range-checked super-safe debugging helper or simply a super-efficient char* at the flip of a switch.
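A few of the points in the two lists can be sketched in code. This is only an illustration; the helper name round_trip_length is made up for the example and is not part of any library.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies a C string into a std::string, alters the copy,
// and reports the length seen through c_str().
std::size_t round_trip_length(const char* text) {
    std::string owned = text;          // implicit construction copies the characters
    owned += "!";                      // the copy can be altered without touching `text`
    const char* view = owned.c_str();  // const char*, valid until `owned` is modified or destroyed
    return std::strlen(view);          // no delete/free needed: `owned` cleans up after itself
}
```

Calling round_trip_length("abc") yields 4. On most current implementations sizeof(std::string) is 24 or 32 bytes, compared with 8 for a char* on a 64-bit target, but those numbers are implementation details.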

If you mean, does it store contiguously, then the answer is that C++98/03 did not require it, but all known (to me, anyway) implementations do so, and since C++11 the standard requires it. This is most likely to support the c_str() and data() member requirements, which are to return a contiguous string (null-terminated in the case of c_str()).
As far as where the memory is stored, it's usually on the heap. But some implementations employ the "Short String Optimization", whereby short string contents are stored within a small internal buffer. So, in the case that the string object is on the stack, it's possible that the stored contents are also on the stack. But this should make no difference to how you use it, since once the object is destroyed, the memory storing the string data is invalidated in either case.
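If you are curious whether your implementation uses the short string optimization, a rough check is to see whether the character buffer lies inside the string object itself. The function name stores_inline is an assumption made for this sketch, and comparing addresses as integers is implementation-defined, although it behaves as expected on mainstream platforms.

```cpp
#include <cstdint>
#include <string>

// Returns true if the character buffer lies inside the std::string object itself,
// which is what the short string optimization does for small contents.
bool stores_inline(const std::string& s) {
    const auto obj_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto obj_end   = obj_begin + sizeof(std::string);
    const auto data      = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= obj_begin && data < obj_end;
}
```

A string of 1000 characters cannot fit inside the object, so stores_inline reports false for it. For something like "hi" the result depends on the library; current libstdc++, libc++ and MSVC would be expected to report true.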
(btw, here's an article on a similar technique applied generally, which explains the optimization.)

These solve different problems. char* (or char const*) points to a C style string which isn't necessarily owned by the one storing the char* pointer. In C, because of the lack of a string type, necessarily you often use char* as "the string type".
std::string owns the string data it points to. So if you need to store a string somewhere in your class, chances are good you want to use std::string or your library's string class instead of char*.
On contiguity of the storage of std::string, other people already answered.

Related

Why does string::c_str() return a const char* when strings are allocated dynamically?

Why does it return a constant char pointer? The C++11 standard says:
The pointer returned points to the internal array currently used by the string object to store the characters that conform its value.
How can something that is dynamically allocated be constant(const char*)?
In C and C++, const translates more or less to "read only".
So, when something returns a char const *, that doesn't necessarily mean the data it's pointing at is actually const--it just means that the pointer you're receiving only supports reading, not writing, the data it points at.
The string object itself may be able to modify that data--but (at least via the pointer you're receiving) you're not allowed to modify the data directly.
The pointer returned by c_str is declared to point to a const char to prevent modifying the internal string buffer via that pointer.
The string buffer is indeed dynamically allocated, and the pointer returned by c_str is only valid while the string itself does not change. Quoting from cppreference.com:
The pointer obtained from c_str() may be invalidated by:
Passing a non-const reference to the string to any standard library function, or
Calling non-const member functions on the string, excluding operator[], at(), front(), back(), begin(), rbegin(), end() and rend().
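A small sketch of what that invalidation rule means in practice; the function name grown_by is invented for the example. The point is only that the pointer is fetched again after the string has been modified, never reused.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: appends `extra` characters and reports how much the
// C-string view grew.
std::size_t grown_by(std::string s, std::size_t extra) {
    const char* p = s.c_str();                  // read-only view of the current buffer
    const std::size_t before = std::strlen(p);
    s.append(extra, 'x');                       // non-const member call: p may now dangle
    p = s.c_str();                              // fetch a fresh pointer before reading again
    return std::strlen(p) - before;
}
```

Reading through the old pointer after append() might appear to work when no reallocation happened, but nothing guarantees that.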
I'm going to interpret the question as why can't you cast it back to a char * and write to it and expect it to work.
The standard library reserves the option to itself to lazy-copy strings in the copy constructor; thus if you wrote to it via the result of c_str() you would potentially write to other strings. As most uses of c_str() would not need to write to the string, de-duplicating on the call to c_str() would impose too large a penalty.

How to make std::string compatible with char*?

I often find functions that take a char* as a parameter. But I heard that std::string is more recommended in C++. How can I use a std::string object with functions taking char*s as parameters? So far I know about c_str(), but it doesn't work when the content of the string should be modified.
For that purpose, you use std::string::data(), which returns a pointer to the internal data (a non-const pointer since C++17). Be careful not to free this memory or anything like that, as this memory is managed by the string object.
You can use the address of the first element after C++11 like this:
void some_c_function(char* s, int n);
// ...
std::string s = "some text";
some_c_function(&s[0], s.size());
Before C++11 there was no guarantee that the internal string was stored in a contiguous buffer or that it would be null terminated. In those cases making a copy of the string was the only safe option.
Since C++17 you can use this:
some_c_function(s.data(), s.size());
In C++17 a non-const return value of std::string::data() was added in addition to the const version.
Since C++17 std::string::data() returns a pointer to the underlying char array that allows modifying the contents of the string. However, as usual with strings, you may not write beyond its end. From cppreference:
Modifying the past-the-end null terminator stored at data()+size() to any value other than CharT() has undefined behavior.
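Putting this together, here is a minimal sketch of handing a std::string to a function that takes char*. The function to_upper_in_place is a hypothetical stand-in for whatever C-style API you actually have; it only touches the first n characters, so the terminator is left alone.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical C-style function that modifies the buffer in place.
void to_upper_in_place(char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        s[i] = static_cast<char>(std::toupper(static_cast<unsigned char>(s[i])));
}

std::string shout(std::string s) {
    if (!s.empty())
        to_upper_in_place(&s[0], s.size());  // since C++17, s.data() works here as well
    return s;
}
```

If the C function needs to produce more characters than the string currently holds, resize() the string first so that the buffer really is that large.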

std::string& vs boost::string_ref

Does it matter anymore if I use boost::string_ref over std::string& ? I mean, is it really more efficient to use boost::string_ref over the std version when you are processing strings ? I don't really get the explanation offered here: http://www.boost.org/doc/libs/1_61_0/libs/utility/doc/html/string_ref.html . What really confuses me is the fact that std::string is also a handle class that only points to the allocated memory, and since c++11, with move semantics the copy operations noted in the article above are not going to happen. So, which one is more efficient ?
The use case for string_ref (or string_view in recent Boost and C++17) is for substring references.
The case where
the source string happens to be std::string
and the full length of a source string is referenced
is an (atypical) special case, where it does indeed resemble std::string const&.
Note also that operations on string_ref (like sref.substr(...)) automatically return more string_ref objects, instead of allocating a new std::string.
I have never used it but it seems to me that its purpose is to provide an interface similar to std::string but without having to allocate a string for manipulation. Take the example given extract_part(): it is given a hard-coded C array "ABCDEFG", but because the initial function takes a std::string an allocation takes place (std::string will have its own version of "ABCDEFG"). Using string_ref, no allocation occurs, it uses the reference to the initial "ABCDEFG". The constraint is that the string is read-only.
This answer uses the new name string_view to mean the same as string_ref.
What really confuses me is the fact that std::string is also a handle class that only points to the allocated memory
A string allocates, owns, and manages its own memory. A string_view is a handle to some memory that was already allocated. The memory is managed by some other mechanism, unrelated to the string_view.
If you already have some text data, for example in a char array, then the additional memory allocation involved in constructing a string might be redundant. A string_view could be more efficient because it would allow you to operate directly on the original data in the char array. However, it would not permit the data to be modified; string_view allows no non-const access, because it doesn't own the data it refers to.
and since c++11, with move semantics the copy operations noted in the article above are not going to happen.
You can only move from an object that is ready to be discarded. Copying still serves a purpose and is necessary in many cases.
The example in the article constructs two new strings (not copies) and also constructs two copies of existing strings. In C++98 the copies could already be elided by RVO without move semantics, so they're not a big deal. By using string_view it avoids constructing the two new strings. Move semantics are irrelevant here.
In the call to extract_part("ABCDEFG") a string_view is constructed which refers to the char array represented by the string literal. Constructing a string here would have involved a memory allocation and a copy of the char array.
In the call to bar.substr(2,3) a string_view is constructed which refers to parts of the data already referred to by the first string_view. Using a string here would have involved another memory allocation and copy of part of the data.
So, which one is more efficient?
This is a bit like asking if a hammer is more efficient than a screwdriver. They serve different purposes, so it depends what it is you're trying to accomplish.
You need to be careful when using string_view that the memory it refers to remains valid throughout its lifetime.
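To make the "no allocation" point concrete, here is a sketch of an extract_part() written against std::string_view (so it assumes C++17). It is my own reconstruction for illustration, not the code from the Boost article.

```cpp
#include <string_view>

// Hypothetical counterpart of the article's extract_part(), taking and
// returning views instead of strings.
std::string_view extract_part(std::string_view bar) {
    return bar.substr(2, 3);  // no allocation: the result refers into bar's characters
}
```

Called as extract_part("ABCDEFG"), the returned view compares equal to "CDE" and its data() points two characters into the literal itself. That is also why the lifetime warning matters: the view is only usable while the characters it refers to are still alive.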
If you stick to std::string it does not matter, but boost::string_ref also supports const char*. That is, do you intend to call your string processing function foo with std::string only?
void foo(const std::string&);
foo("a literal"); // compiles, but constructs a temporary std::string (a copy, often an allocation)
Since boost::string_ref is constructable from const char*, it is more flexible since it works with both const char* and std::string.
The proposal N3442 might be helpful.
In short: The main benefit of std::string_view over const std::string& is that you can pass both const char* and std::string objects without doing a copy. As others have said, it also allows you to pass substrings without copying, although (in my experience) this is somewhat less often important.
Consider the following (silly) function (yes I know you could just call s.at(2)):
char getThird(std::string s)
{
if (s.size() < 3) throw std::runtime_error("String too short");
return s[2];
}
This function works, but the string is passed by value. This means the whole length of the string is copied even though we don't look at all of it, and it also (often) incurs a dynamic memory allocation. Doing this in a tight loop can be very expensive. One solution to this is to pass the string by const reference instead:
char getThird(const std::string& s);
This works a lot better if you have a std::string variable and you pass it as a parameter to getThird. But now there's a problem: what if you have a null-terminated const char* string? When you call this function, a temporary std::string will get constructed, so you still get the copy and dynamic memory allocation.
Here's another attempt:
char getThird(const char* s)
{
if (std::strlen(s) < 3) throw std::runtime_error("String too short");
return s[2];
}
This will obviously now work fine for const char* variables. It will also work for std::string variables, but calling it is a little awkward: getThird(myStr.c_str()). What's more, std::string supports embedded null characters, and getThird will misinterpret the string as ended at the first of these. At worst this could cause a security vulnerability - imagine if the function were called checkStringForBadHacks!
Another problem is simply that it's annoying to write a function in terms of old null-terminated strings instead of std::string objects with their handy methods. Did you notice, for example, that this function looks at the whole length of the string even though only the first few characters are important? It's hidden in std::strlen, which iterates over all characters looking for the null terminator. We could replace that with a manual check that the first three characters aren't null, but you can see this is a lot less convenient than the other versions.
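The embedded-null problem is easy to reproduce. In this sketch both function names are invented for the example; the string is built with an explicit length so that the '\0' in the middle is kept.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Builds a 7-character string with a null character in the middle.
std::string with_embedded_null() {
    return std::string("abc\0def", 7);  // length given explicitly, so the '\0' is kept
}

// What a const char* based function would believe the length to be.
std::size_t length_seen_by_c(const std::string& s) {
    return std::strlen(s.c_str());      // stops at the first '\0'
}
```

Here size() reports 7 while the C view reports 3, so anything after the first null is invisible to a function written in terms of const char*.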
Enter std::string_view (or boost::string_view, previously known as boost::string_ref):
char getThird(std::string_view s)
{
if (s.size() < 3) throw std::runtime_error("String too short");
return s[2];
}
This gives you the nice methods you expect from a proper string class, like .size(), and it works in both the situations discussed above:
It works with std::string objects, which can be implicitly converted to std::string_view objects.
It works with const char* null-terminated strings, which can also be implicitly converted to std::string_view objects.
This does have the potential disadvantage that constructing the std::string_view from a const char* requires iterating over the whole string to find the length, even if the function that uses it never needs it (as is the case here). However, if a caller is using a const char* as a parameter to several functions (or one function in a loop) that take std::string_view objects it could always manually construct that object beforehand. This could even give a performance increase, because if those functions do need the length then it is precomputed once and reused.
As other answers have mentioned, it also avoids a copy when you only want to pass a substring. For example, this is very useful in parsing. But std::string_view is justified even without this feature.
It's worth noting that there is a case where the original function signature, taking a std::string by value, may actually be better than a std::string_view. That's where you were going to make a copy of the string anyway, for example to store in some other variable or to return from the function. Imagine this function:
std::string changeThird(std::string s, char c)
{
if (s.size() < 3) throw std::runtime_error("String too short");
s[2] = c;
return s;
}
// vs.
std::string changeThird(std::string_view s, char c)
{
if (s.size() < 3) throw std::runtime_error("String too short");
std::string result(s); // the conversion from std::string_view is explicit
result[2] = c;
return result;
}
Note that both of these involve exactly one copy: In the first case this is done implicitly when the parameter s is constructed from whatever is passed in (including if it is another std::string). In the second case we do it explicitly when we create result. But the return statement does not do a copy, because it uses move semantics (as if we had done std::move(result)), or more likely uses the return value optimisation.
The reason the first version can be better is that it is actually possible for it to perform zero copies, if the caller moves the argument:
std::string something = getMyString();
std::string other = changeThird(std::move(something), 'x');
In this case, the first changeThird does not involve any copy at all, whereas the second one does.

returning const char* to char* and then changing the data

I am confused about the following code:
string _str = "SDFDFSD";
char* pStr = (char*)_str.data();
for (int i = 0; i < iSize; i++)
pStr[i] = ::tolower(pStr[i]);
Here _str.data() returns const char*. But we are assigning it to a char*. My question is,
_str.data() is returning a pointer to constant data. How is it possible to store it in a pointer to non-constant data? The data was constant, right? If we assign it to a char pointer then we can change it, like we are doing inside the for statement, which should not be possible for constant data.
Don't do that. It may be fine in this case, but as the documentation for data() says:
The pointer returned may be invalidated by further calls to other
member functions that modify the object.
A program shall not alter any of the characters in this sequence.
So you could very accidentally write to invalid memory if you keep that pointer around. Or, in fact, ruin the implementation of std::string. I would almost go as far as to say that this function shouldn't be exposed.
std::string offers a non-const operator[] for that purpose.
string _str = "SDFDFSD";
for (std::size_t i = 0; i < _str.size(); i++)
_str[i] = ::tolower(_str[i]);
What you are doing is not valid at the standard library level (you're violating std::string contract) but valid at the C++ core language level.
The char * returned from data should not be written to because for example it could be in theory(*) shared between different strings with the same value.
If you want to modify a string just use std::string::operator[] that will inform the object of the intention and will take care of creating a private buffer for the specific instance in case the string was originally shared instead.
Technically you are allowed to cast away const-ness from a pointer or a reference, but whether it's a valid operation or not depends on the semantics of the specific case. The reason the operation is allowed is that the main philosophy of C++ is that programmers make no mistakes and know what they are doing. For example, it is technically legal from a C++ language point of view to do memcpy(&x, "hello", 5) where x is a class instance, but the results are most probably "undefined behavior".
If you think that your code "works" it's because you have the wrong understanding of what "works" really should mean (hint: "works" doesn't mean that someone once observed the code doing what seemed reasonable, but that it will work in all cases). A valid C++ implementation is free to do anything it wants if you run that program: that you observed something you think is fine doesn't really mean anything; maybe you didn't look closely enough, or maybe you were just lucky (unfortunate, actually) that no crash happened right away.
(*) In modern times the COW (copy-on-write) implementations of std::string are low in popularity because they pose a lot of problems (e.g. with multithreading) and memory is a lot cheaper now. Still std::string contract says you're not allowed to change the memory pointed by the return value of data(); if you do anything may happen.
You must never change the data returned from std::string::data() or std::string::c_str() directly.
To create a copy of a std::string:
std::string str1 = "test";
std::string str2 = str1; // copy.
Change characters in a string:
std::string str1 = "test";
str1[0] = 'T';
The "correct" way would be to use std::transform instead:
std::transform(_str.begin(), _str.end(), _str.begin(), ::tolower);
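A slightly more defensive variant of the same idea, as a sketch: passing ::tolower directly can run into trouble if char is signed and the string contains values outside the basic character set, so the lambda goes through unsigned char. The function name to_lower_copy is made up for the example.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns a lower-cased copy; the string is modified through its own iterators,
// never through a cast-away-const pointer.
std::string to_lower_copy(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}
```

With this, to_lower_copy("SDFDFSD") gives "sdfdfsd", and no const is cast away anywhere.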
The simple answer to your question is that in C++ you can cast away the 'const' of a variable.
You probably shouldn't though.
See this for const correctness in C++
std::string keeps its characters in memory it owns (usually on the heap, or in a small internal buffer for short strings), so this is not actually const data, it is just marked so (in method data() signature) to prevent modification.
But nothing is impossible in C++, so with a simple cast, though unsafe, you can now treat the same memory space as modifiable.
String literals in a C/C++ program (like "SDFDFSD" below) are typically stored in a separate read-only section such as .rodata. This section is mapped as read-only when the binary is loaded into memory during execution.
int main()
{
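char* ptr = const_cast<char*>("SDFDFSD"); // a plain char* ptr = "..." no longer compiles since C++11
ptr[0]='x'; // undefined behaviour; typically a segmentation fault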
return 0;
}
Hence any attempt to modify the data at that location will typically result in a run-time error, i.e. a segmentation fault.
Coming to the above question, when creating a string and assigning a string to it, a new copy in memory now exists (memory used to hold the properties of the string object _str). This is on the heap and NOT mapped to a read-only section. The member function _str.data() points to the location in memory which is mapped read/write.
The const qualifier on the return type ensures that the returned pointer is NOT accidentally passed to string manipulation functions which expect a non-const char* pointer.
In your current iteration there was no limitation on the memory location itself that was holding the string object's data; i.e. it was mapped with both read/write permissions. Hence modifying the location using another non-const pointer worked i.e. pStr[i] on the left hand side of an assignment did NOT result in a run-time error as there were no inherent restrictions on the memory location itself.
Again, this is NOT guaranteed to work; it is just implementation-specific behaviour that you have observed (i.e. it simply happens to work for you) and you cannot depend on it.

What is happening? Three different types for the same stuff in four lines of code?

I borrowed this from another question about fragmentation, but I'm not bothered by that. I'm more worried that I don't understand the function at all, in terms of types and data lifetimes :(
The same data is represented by a std::vector (a dynamic array type with internal metadata), a pointer to the string data therein (the return expression), and the declared return type, which is a std::string.
QUESTIONS:
How does the data get safely out of the function when the std::vector is going to be destroyed? Is it implicitly copied? Is the dynamic array of char in the vector 'detached' from the vector and returned as a std::string type so that no bulk copy is required? Sometimes, I think that C++ and the std library is trying to get me...
I've been using C++ for some while but stuff like this does my 'ed in.
std::string TestFragmentation()
{
std::vector<char> buffer(500);
SomeCApiFunction( &buffer[0], buffer.size() ); // Sets buffer to null-terminated string data
return &buffer[0];
}
The data stored by a std::vector is guaranteed to be contiguous, so &buffer[0] gets you a raw pointer to the beginning of that data. [1]
And std::string has a constructor which takes a const char *, which copies the data up to the null terminator. That is being implicitly called in the return statement (the compiler is allowed to apply at most one user-defined implicit conversion when converting the returned expression to the return type).
In both cases (the vector and the string), the memory for the corresponding backing buffer is managed by the container class, so there is no possibility of a memory leak or similar (so long as your raw C function creates a valid null-terminated C-style string and doesn't trample beyond the buffer bounds).
1. Note, however, that there are no guarantees that it will stay in one place. As soon as you grow or shrink the vector, it's likely that it will be copied elsewhere in memory, invalidating all raw pointers that were pointing at it. And if the vector itself is destructed, then of course the backing data is no longer valid.
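For reference, here is a self-contained version of the pattern with the copy spelled out. fake_c_api is a made-up stand-in for SomeCApiFunction, which was not shown in the question, and read_via_buffer is likewise just an illustrative name.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the C API: writes a null-terminated string into `out`.
void fake_c_api(char* out, std::size_t capacity) {
    std::snprintf(out, capacity, "%s", "hello from C");
}

std::string read_via_buffer() {
    std::vector<char> buffer(500);             // owned by the vector, freed on return
    fake_c_api(buffer.data(), buffer.size());  // buffer.data() is the same pointer as &buffer[0]
    return std::string(buffer.data());         // explicit copy of the characters up to the '\0'
}
```

The returned std::string owns its own copy of the characters, so it is unaffected when the vector is destroyed at the end of the function.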