Why is `std::string_view` not implemented differently? - c++

Given the following code we can see that std::string_view is invalidated when the string grows beyond capacity (here SSO is in effect initially, then the contents are put on the heap):
#include <iostream>
#include <cassert>
#include <string>
#include <string_view>
using std::cout;
using std::endl;
int main() {
std::string s = "hi";
std::string_view v = s;
cout << v << endl;
s = "this is a long long long string now";
cout << v << endl;
}
output:
hi
#
so if I store a string_view to a string then change the contents of the string I can be in big trouble.
Would it be possible, given the existing std::string implementations, to make a smarter string_view that would not face such a drawback? We could store a pointer to the string object itself and then determine if the string is in SSO mode or not and work accordingly. (Not sure how this would work with literal strings though, so maybe that is why it was not done this way?)
I am aware that string_view is akin to storing the return value of string::c_str() but given we have this wrapper around std::string I do not think this gotcha would occur to a lot of people using this feature. Most disclaimers are to make sure the pointed to std::string is within scope but this is a different issue altogether.

string_view knows nothing about string. It is not a "wrapper" around a string. It has no idea that std::string even exists as a type; the conversion from string to string_view happens within std::string. string_view has no association with or reliance on std::string.
In fact, that is the entire purpose of string_view: to be able to have a non-modifiable sized string without knowing how it is allocated or managed. That it can reference any string type that stores its characters contiguously is the point of the thing. It allows you to create an interface that takes a string_view without knowing or caring whether the caller is using std::string, CString, or any other string type.
Since the owning string's behavior is not string_view's business, there is no possible mechanism for string_view to be told when the string it references is no longer valid.
We could store a pointer to the string object itself and then determine if the string is in SSO mode or not and work accordingly.
For the sake of argument, let us ignore that string_view is not supposed to know or care whether its characters come from std::string. Let's assume that string_view only works with std::string (even though that makes the type completely worthless).
Even then, this would not work. Or rather, it would only work if the type was functionally no different from a std::string const&.
If string_view stores a pointer to the first character and a size, then any modification to the std::string might change this. It could change the size even without breaking small-string optimization. It could change the size without causing reallocation. The only way to correct this is to have the string_view always ask the std::string it references what its character data and size are.
And that's no different from just using a std::string const& directly.
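To make the contrast concrete, here is a minimal sketch (the helper names are made up for illustration). The reference always goes back to the string for its data and size, while a view only stays usable if it is taken again after the string has been modified:

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: observes the string through a reference.
// The reference asks `s` for its current size, so it stays correct.
std::size_t size_seen_by_reference() {
    std::string s = "hi";
    const std::string& ref = s;
    s = "this is a long long long string now";  // may reallocate
    return ref.size();                           // reflects the new contents
}

// Hypothetical helper: a view captured before the assignment is stale and
// must not be read afterwards; taking a new view is the safe pattern.
std::size_t size_seen_by_fresh_view() {
    std::string s = "hi";
    std::string_view stale = s;  // pointer + size captured at this moment
    s = "this is a long long long string now";
    (void)stale;                 // deliberately never read: it may no longer be valid
    std::string_view fresh = s;  // re-acquired after the modification
    return fresh.size();
}
```

Both helpers report the size of the new contents; the difference is that the view had to be rebuilt by hand, which is exactly the bookkeeping a std::string const& does for you.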

Related

Is implicit construction of `const std::string` from `const char *` efficient?

Like many people I'm in the habit of writing new string functions as functions of const std::string &. The advantages are efficiency (you can pass existing std::string objects without incurring overhead for copying/moving) and flexibility/readability (if all you have is a const char * you can just pass it and have the construction done implicitly, without cluttering up your code with an explicit std::string construction):
#include <string>
#include <iostream>
unsigned int LengthOfStringlikeObject(const std::string & s)
{
return s.length();
}
int main(int argc, const char * argv[])
{
unsigned int n = LengthOfStringlikeObject(argv[0]);
std::cout << "'" << argv[0] << "' has " << n << " characters\n";
}
My aim is to write efficient cross-platform code that can handle long strings efficiently. My question is, what happens during the implicit construction? Are there any guarantees that the string will not be copied? It strikes me that, because everything is const, copying is not necessary—a thin STL wrapper around the existing pointer is all that's needed—but I'm not sure how compiler- and platform-dependent I should expect that behavior to be. Would it be safer to always explicitly write two versions of the function, one for const std::string & and one for const char *?
If you pass a const char* to something that takes a std::string, reference or not, a string will be constructed. A compiler might even complain if you send it to a reference with a warning that there is an implicit temporary object.
Now this may be optimized by the compiler and also some implementations will not allocate memory for small strings. The compiler might also internally optimize it to use a C++17 string_view. It essentially depends on what you will do to the string in your code. If you only use constant member functions, a clever compiler might optimize the temporary away.
But that is up to the implementation and outside your control. You can explicitly use std::string_view if you want to take control.
If you don't want copying, then string_view is what you want.
However, with this benefit comes problems. Specifically, you have to ensure that the storage that you pass lasts "long enough".
For string literals, that's no problem. For argv[0], that's almost certainly not a problem. For arbitrary sequences of characters, then you'll need to think about them.
but you can write:
unsigned int LengthOfStringlikeObject(std::string_view sv)
{
return sv.length();
}
and call it with a string, or a const char *, and it will be fine.
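As a rough sketch of what "it will be fine" looks like at the call site (length_of and total_length are hypothetical names, chosen so they don't clash with the function above):

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Same idea as the string_view function above, under a hypothetical name.
std::size_t length_of(std::string_view sv) {
    return sv.length();
}

// Hypothetical caller showing the three common kinds of argument.
std::size_t total_length(const std::string& owned, const char* c_string) {
    return length_of(owned)       // std::string converts implicitly; characters are not copied
         + length_of(c_string)    // const char*: length is found by scanning for '\0', no allocation
         + length_of("literal");  // string literal with static storage duration
}
```

None of these calls constructs a std::string; each one only builds a small pointer-plus-length view.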
It strikes me that, because everything is const, copying is not
necessary—a thin STL wrapper around the existing pointer is all that's
needed
I don't think this assumption is correct. Just because you have a pointer to const, it does not imply that the underlying value cannot change. It only implies that the value cannot be changed through that pointer. The pointer could be pointing to non-const storage which can change at any time.
Because of this, the library must make its own copy (to provide the "correct" string observable behavior). A quick review of libstdc++ shows that it always makes a copy. The construction from char* is not inline, so it cannot be optimized away without static linking and LTO.
While extremely trivial statically linked programs might have the copy optimized away with LTO (I wasn't able to reproduce this), I think in general it would be unlikely this optimization could be performed (especially considering the aliasing rules for char*). g++ doesn't even perform this optimization for a string literal.
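The point about pointer-to-const can be shown with a small sketch (the function name is made up for illustration): the storage behind a const char* can still be written through another path, so an owning std::string has to copy if it wants stable contents, while a non-owning view simply sees the change.

```cpp
#include <string>
#include <string_view>

// Hypothetical demonstration: returns true if the owning std::string kept
// its original contents while a non-owning view observed the later write.
bool copy_is_isolated_from_later_writes() {
    char buffer[] = "abc";     // mutable storage
    const char* p = buffer;    // pointer-to-const into that mutable storage

    std::string owned(p);      // copies the characters
    std::string_view view(p);  // only records pointer + length

    buffer[0] = 'x';           // legal: the storage itself is not const

    return owned == "abc" && view == "xbc";
}
```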

"Use & to create a pointer to a member" when using c_str

I'm trying to add a char* to a vector of char*'s by casting it over from a string. Here's the code I'm using:
vector<char*> actionLog;
// lots of code
int value = ...
// lots of code
string str = "string";
cout << str << value << endl;
str += std::to_string(player->scrap);
actionLog.push_back(str.c_str());
The problem is that I get the specified "Use & to create a pointer to a member" error for the push_back line. str.c_str should return a char*, which is the type that actionLog uses. I'm either incorrect about how c_str works, or doing something else wrong. Pushing to actionLog with
actionLog.push_back("something");
Works fine, but not what I mentioned. What am I doing wrong?
EDIT: I was actually using c_str() as a function, I just copied it incorrectly
There are actually several problems with what you are trying to do.
Firstly, c_str is a member function. You would have to call it, using (): str.c_str().
c_str() returns a const char*, so you won't be able to store it in a vector<char*>. This is so you can't break the std::string by changing its internals in ways it doesn't expect.
You really should not store the result of c_str(). It only remains valid until you do some non-const operation on the std::string it came from. I.e. if you make a change to the content of the std::string, then try to use the corresponding element in the vector, you have Undefined Behaviour! And from the way you have laid out your example, it looks like the lifetime of the string will be much shorter than the lifetime of the vector, so the vector would point to something that doesn't even exist any more.
Maybe it's better to just use a std::vector<std::string>. If you don't need the original string again after this, you could even std::move it into the vector, and avoid extra copying.
As an aside, please reconsider your use of what are often considered bad practices: using namespace std; and endl. The latter is a bit contentious, but at least understand why and make an informed decision.
std::basic_string::c_str() is a member function, not a data member - you need to invoke it by using ().
The correct code is:
actionLog.push_back(str.c_str());
Note that std::basic_string::c_str() returns a pointer to const char - your actionLog vectors should be of type std::vector<const char*>.
Vittorio's answer tells you what you did wrong in the details. But I would argue that what you're doing wrong is really using a vector<char*> instead of a vector<string> in the first place.
With the vector of pointers, you have to worry about lifetime of the underlying strings, about them getting invalidated, changed from under you, and so on. The name actionLog suggests that the thing is long-lived, and the code you use to add to it suggests that str is a local helper variable used to build the log string, and nothing else. The moment str goes out of scope, the vector contains a dangling pointer.
Change the vector to a vector<string>, do a actionLog.push_back(str), and don't worry about lifetimes or invalidation.
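A minimal sketch of that change, assuming the log only ever needs to own its entries (ActionLog and log_scrap are hypothetical names, and the int parameter stands in for player->scrap from the question):

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the question's log; it owns its strings.
using ActionLog = std::vector<std::string>;

// Builds one entry and moves it into the log.
void log_scrap(ActionLog& actionLog, int scrap) {
    std::string str = "string";
    str += std::to_string(scrap);
    actionLog.push_back(std::move(str));  // str is not needed afterwards, so move instead of copy
}
```

Because the vector stores std::string objects, nothing dangles when the local str goes out of scope.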
You forgot (), c_str is a method, not a data member. Just write actionLog.push_back(str.c_str());

std::string& vs boost::string_ref

Does it matter anymore if I use boost::string_ref over std::string& ? I mean, is it really more efficient to use boost::string_ref over the std version when you are processing strings ? I don't really get the explanation offered here: http://www.boost.org/doc/libs/1_61_0/libs/utility/doc/html/string_ref.html . What really confuses me is the fact that std::string is also a handle class that only points to the allocated memory, and since c++11, with move semantics the copy operations noted in the article above are not going to happen. So, which one is more efficient ?
The use case for string_ref (or string_view in recent Boost and C++17) is for substring references.
The case where
the source string happens to be std::string
and the full length of a source string is referenced
is an (atypical) special case, where it does indeed resemble std::string const&.
Note also that operations on string_ref (like sref.substr(...)) automatically return more string_ref objects, instead of allocating a new std::string.
I have never used it, but it seems to me that its purpose is to provide an interface similar to std::string but without having to allocate a string for manipulation. Take the example given, extract_part(): it is given a hard-coded C array "ABCDEFG", but because the initial function takes a std::string an allocation takes place (std::string will have its own version of "ABCDEFG"). Using string_ref, no allocation occurs; it uses the reference to the initial "ABCDEFG". The constraint is that the string is read-only.
This answer uses the new name string_view to mean the same as string_ref.
What really confuses me is the fact that std::string is also a handle class that only points to the allocated memory
A string allocates, owns, and manages its own memory. A string_view is a handle to some memory that was already allocated. The memory is managed by some other mechanism, unrelated to the string_view.
If you already have some text data, for example in a char array, then the additional memory allocation involved in constructing a string might be redundant. A string_view could be more efficient because it would allow you to operate directly on the original data in the char array. However, it would not permit the data to be modified; string_view allows no non-const access, because it doesn't own the data it refers to.
and since c++11, with move semantics the copy operations noted in the article above are not going to happen.
You can only move from an object that is ready to be discarded. Copying still serves a purpose and is necessary in many cases.
The example in the article constructs two new strings (not copies) and also constructs two copies of existing strings. In C++98 the copies could already be elided by RVO without move semantics, so they're not a big deal. By using string_view it avoids constructing the two new strings. Move semantics are irrelevant here.
In the call to extract_part("ABCDEFG") a string_view is constructed which refers to the char array represented by the string literal. Constructing a string here would have involved a memory allocation and a copy of the char array.
In the call to bar.substr(2,3) a string_view is constructed which refers to parts of the data already referred to by the first string_view. Using a string here would have involved another memory allocation and copy of part of the data.
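For reference, a sketch modelled on the article's extract_part example, written with std::string_view (C++17) in place of boost::string_ref; the exact body is an assumption based on the description above:

```cpp
#include <string_view>

// Sketch of the article's example; std::string_view stands in for boost::string_ref.
std::string_view extract_part(std::string_view bar) {
    return bar.substr(2, 3);  // a new view into the same characters, no allocation
}
```

Calling extract_part("ABCDEFG") gives a view of "CDE" whose data() points into the original literal, two characters past its start.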
So, which one is more efficient?
This is a bit like asking if a hammer is more efficient than a screwdriver. They serve different purposes, so it depends what it is you're trying to accomplish.
You need to be careful when using string_view that the memory it refers to remains valid throughout its lifetime.
If you stick to std::string it does not matter much, but boost::string_ref also supports const char* directly. That is, do you intend to call your string processing function foo with std::string only?
void foo(const std::string&);
foo("compiles, but costly"); // constructs a temporary std::string from the `const char*`
Since boost::string_ref is constructible from const char* without allocating or copying, it is more flexible: it works efficiently with both const char* and std::string.
The proposal N3442 might be helpful.
In short: The main benefit of std::string_view over const std::string& is that you can pass both const char* and std::string objects without doing a copy. As others have said, it also allows you to pass substrings without copying, although (in my experience) this is somewhat less often important.
Consider the following (silly) function (yes I know you could just call s.at(2)):
char getThird(std::string s)
{
if (s.size() < 3) throw std::runtime_error("String too short");
return s[2];
}
This function works, but the string is passed by value. This means the whole length of the string is copied even though we don't look at all of it, and it also (often) incurs a dynamic memory allocation. Doing this in a tight loop can be very expensive. One solution to this is to pass the string by const reference instead:
char getThird(const std::string& s);
This works a lot better if you have a std::string variable and you pass it as a parameter to getThird. But now there's a problem: what if you have a null-terminated const char* string? When you call this function, a temporary std::string will get constructed, so you still get the copy and dynamic memory allocation.
Here's another attempt:
char getThird(const char* s)
{
if (std::strlen(s) < 3) throw std::runtime_error("String too short");
return s[2];
}
This will obviously now work fine for const char* variables. It will also work for std::string variables, but calling it is a little awkward: getThird(myStr.c_str()). What's more, std::string supports embedded null characters, and getThird will misinterpret the string as ended at the first of these. At worst this could cause a security vulnerability - imagine if the function were called checkStringForBadHacks!
Another problem is simply that it's annoying to write a function in terms of old null-terminated strings instead of std::string objects with their handy methods. Did you notice, for example, that this function looks at the whole length of the string even though only the first few characters are important? It's hidden in std::strlen, which iterates over all characters looking for the null terminator. We could replace that with a manual check that the first three characters aren't null, but you can see this is a lot less convenient than the other versions.
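The embedded-null problem is easy to reproduce. A small sketch (all three function names are made up for illustration):

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical helper: measures the string the C way.
std::size_t length_via_strlen(const std::string& s) {
    return std::strlen(s.c_str());  // stops at the first '\0'
}

// Hypothetical helper: measures the string through a view.
std::size_t length_via_view(std::string_view s) {
    return s.size();                // uses the stored size, embedded '\0' included
}

// Builds a 5-character string with an embedded null: 'a', 'b', '\0', 'c', 'd'.
std::string make_string_with_embedded_null() {
    return std::string("ab\0cd", 5);
}
```

For that string, the strlen-based version sees only the first two characters, while the size-based version sees all five.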
Step in std::string_view (or boost::string_view, previously known as boost::string_ref):
char getThird(std::string_view s)
{
if (s.size() < 3) throw std::runtime_error("String too short");
return s[2];
}
This gives you the nice methods you expect from a proper string class, like .size(), and it works in both the situations discussed above, plus another:
It works with std::string objects, which can be implicitly converted to std::string_view objects.
It works with const char* null-terminated strings, which can also be implicitly converted to std::string_view objects.
This does have the potential disadvantage that constructing the std::string_view from a const char* requires iterating over the whole string to find the length, even if the function that uses it never needs it (as is the case here). However, if a caller is using a const char* as a parameter to several functions (or one function in a loop) that take std::string_view objects, it could always manually construct that object beforehand. This could even give a performance increase, because if those functions do need the length then it is precomputed once and reused.
As other answers have mentioned, it also avoids a copy when you only want to pass a substring. For example, this is very useful in parsing. But std::string_view is justified even without this feature.
It's worth noting that there is a case where the original function signature, taking a std::string by value, may actually be better than a std::string_view. That's where you were going to make a copy of the string anyway, for example to store in some other variable or to return from the function. Imagine this function:
std::string changeThird(std::string s, char c)
{
if (s.size() < 3) throw std::runtime_error("String too short");
s[2] = c;
return s;
}
// vs.
std::string changeThird(std::string_view s, char c)
{
if (s.size() < 3) throw std::runtime_error("String too short");
std::string result(s); // direct-initialization: the string_view constructor is explicit
result[2] = c;
return result;
}
Note that both of these involve exactly one copy: In the first case this is done implicitly when the parameter s is constructed from whatever is passed in (including if it is another std::string). In the second case we do it explicitly when we create result. But the return statement does not do a copy, because it uses move semantics (as if we had done std::move(result)), or more likely uses the return value optimisation.
The reason the first version can be better is that it is actually possible for it to perform zero copies, if the caller moves the argument:
std::string something = getMyString();
std::string other = changeThird(std::move(something), 'x');
In this case, the first changeThird does not involve any copy at all, whereas the second one does.

C++ const cast, unsure if this is secure

It may seem to be a silly question, but I really need to clarify this:
Will this bring any danger to my program?
Is the const_cast even needed?
If I change the input pointer's values in place, will it work safely with std::string or will it create undefined behaviour?
So far the only concern is that this could affect the string "some_text" whenever I modify the input pointer and make it unusable.
std::string some_text = "Text with some input";
char * input = const_cast<char*>(some_text.c_str());
Thanks for giving me some hints, I would like to avoid shooting myself in the foot.
As an example of evil behavior: the interaction with gcc's Copy On Write implementation.
#include <string>
#include <iostream>
int main() {
std::string const original = "Hello, World!";
std::string copy = original;
char* c = const_cast<char*>(copy.c_str());
c[0] = 'J';
std::cout << original << "\n";
}
In action at ideone.
Jello, World!
The issue? As the name implies, gcc's Copy On Write implementation of std::string (libstdc++'s original ABI, the default before GCC 5) uses a ref-counted shared buffer under the covers. When a string is modified, the implementation will neatly check if the buffer is shared at the moment, and if it is, copy it before modifying it, ensuring that other strings sharing this buffer are not affected by the new write (thus the name, copy on write).
Now, with your evil program, you access the shared buffer via a const-method (promising not to modify anything), but you do modify it!
Note that with MSVC's implementation, which does not use Copy On Write, the behavior would be different ("Hello, World!" would be correctly printed).
This is exactly the essence of Undefined Behavior.
Modifying an inherently const object by casting away its constness using const_cast is Undefined Behavior.
string::c_str() returns a const char *, i.e. a pointer to a constant C-style string. Technically, modifying this will result in Undefined Behavior.
Note that the intended use of const_cast is when you have a const pointer to non-const data and you wish to modify that non-const data.
Simply casting will not bring forth undefined behavior. Modifying the data pointed at, however, will. (Also see ISO 14882:98 5.2.11-7).
If you want a pointer to modifiable data, you can have a
std::vector<char> wtf(str.begin(), str.end());
char* lol= &wtf[0];
The std::string manages its own memory internally, which is why c_str() returns a pointer to that memory directly. It makes sure the pointer is to const so that your compiler will warn you if you try to modify it.
Using const_cast in that way literally casts away such safety and is only an arguably acceptable practice if you are absolutely sure that memory will not be modified.
If you can't guarantee this then you must copy the string and use the copy; it's certainly a lot safer to do this in any event (you can use strcpy).
See the C++ reference website:
const char* c_str ( ) const;
"Generates a null-terminated sequence of characters (c-string) with the same content as the string object and returns it as a pointer to an array of characters.
A terminating null character is automatically appended.
The returned array points to an internal location with the required storage space for this sequence of characters plus its terminating null-character, but the values in this array should not be modified in the program and are only guaranteed to remain unchanged until the next call to a non-constant member function of the string object."
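If a writable char* really is needed, one option is to work on a copy and never cast away const. A minimal sketch, assuming C++17 (where the non-const data() overload returns char*) and a hypothetical legacy-style function uppercase_in_place:

```cpp
#include <cctype>
#include <string>

// Hypothetical legacy-style function that writes through a char*.
void uppercase_in_place(char* text) {
    for (; *text != '\0'; ++text) {
        *text = static_cast<char>(std::toupper(static_cast<unsigned char>(*text)));
    }
}

// Returns an upper-cased copy; `original` is never written to.
std::string uppercased_copy(const std::string& original) {
    std::string copy = original;      // independent buffer
    uppercase_in_place(copy.data());  // non-const data() (C++17) yields a writable char*
    return copy;
}
```

The original string is untouched, and no const_cast is involved.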
Yes, it will bring danger, because
input points to whatever c_str happens to be right now, but if some_text ever changes or goes away, you'll be left with a pointer that points to garbage. The value of c_str is guaranteed to be valid only as long as the string doesn't change. And even, formally, only if you don't call c_str() on other strings too.
Why do you need to cast away the const? You're not planning on writing to *input, are you? That is a no-no!
This is a very bad thing to do. Check out what std::string::c_str() does and agree with me.
Second, consider why you want a non-const access to the internals of the std::string. Apparently you want to modify the contents, because otherwise you would use a const char pointer. Also you are concerned that you don't want to change the original string. Why not write
std::string input( some_text );
Then you have a std::string that you can mess with without affecting the original, and you have std::string functionality instead of having to work with a raw C++ pointer...
Another spin on this is that it makes code extremely difficult to maintain. Case in point: a few years ago I had to refactor some code containing long functions. The author had written the function signatures to accept const parameters but then was const_casting them within the function to remove the constness. This broke the implied guarantee given by the function and made it very difficult to know whether the parameter has changed or not within the rest of the body of the code.
In short, if you have control over the string and you think you'll need to change it, make it non-const in the first place. If you don't then you'll have to take a copy and work with that.
it is UB.
For example, you can do something like this:
size_t const size = (sizeof(int) == 4 ? 1024 : 2048);
int arr[size];
without any cast and the compiler will not report an error. But this code is illegal.
The moral is that you need to consider each action carefully.

C++ - char* vs. string*

If I have a pointer that points to a string variable array of chars, is there a difference between typing:
char *name = "name";
And,
string name = "name";
Yes, there’s a difference. Mainly because you can modify your string but you cannot modify your first version – but the C++ compiler won’t even warn you that this is forbidden if you try.
So always use the second version.
If you need to use a char pointer for whatever reason, make it const:
char const* str = "name";
Now, if you try to modify the contents of str, the compiler will forbid this (correctly). You should also push the warning level of your compiler up a notch: then it will warn that your first code (i.e. char* str = "name") is legal but deprecated.
For starters, you probably want to change
string *name = "name";
to read
string name = "name";
The first version won't compile, because a string* and a char* are fundamentally different types.
The difference between a string and a char* is that the char* is just a pointer to the sequence. This approach of manipulating strings is based on the C programming language and is the native way in which strings are encoded in C++. C strings are a bit tricky to work with - you need to be sure to allocate space for them properly, to avoid walking off the end of the buffer they occupy, to put them in mutable memory to avoid segmentation faults, etc. The main functions for manipulating them are in <cstring>. Most C++ programmers advise against the use of C-style strings, as they are inherently harder to work with, but they are still supported both for backwards compatibility and as a "lowest common denominator" on which low-level APIs can build.
A C++-style string is an object encapsulating a string. The details of its memory management are not visible to the user (though you can be guaranteed that all the memory is contiguous). It uses operator overloading to make some common operations like concatenation easier to use, and also supports several member functions designed to do high-level operations like searching, replacing, substrings, etc. They also are designed to interoperate with the STL algorithms, though C-style strings can do this as well.
In short, as a C++ programmer you are probably better off using the string type. It's safer and a bit easier to use. It's still good to know about C-style strings because you will certainly encounter them in your programming career, but it's probably best not to use them in your programs where string can also be used unless there's a compelling reason to do so.
Yes, the second one isn't valid C++! (It won't compile).
You can create a string in many ways, but one way is as follows:
string name = "name";
Note that there's no need for the *, as we don't need to declare it as a pointer.
char* name = "name" should be invalid but compiles on most systems for backward compatibility to the old days when there was no const and that it would break large amounts of legacy code if it did not compile. It usually gets a warning though.
The danger is that you get a pointer to writable data (writable according to the rules of C++) but if you actually tried writing to it you would invoke Undefined Behaviour, and the language rules should attempt to protect you from that as much as is reasonably possible.
The correct construct is
const char * name = "name";
There is nothing wrong with the above, even in C++. Using string is not always more correct.
Your second statement should really be
std::string name = "name";
string is a class (actually a typedef of basic_string<char, char_traits<char>, allocator<char>>) defined in the standard library and therefore in namespace std (as are basic_string, char_traits and allocator).
There are various scenarios where using string is far preferable to using arrays of char. In your immediate case, for example, you CAN modify it. So
name[0] = 'N';
(convert the first letter to upper-case) is valid with string and not with the char* (undefined behaviour) or const char * (won't compile). You would be allowed to modify the string if you had char name[] = "name";
However, if you want to append a character to the string, the std::string construct is the only one that will allow you to do that cleanly. With the old C API you would have to use strcat(), but that would not be valid unless you had allocated enough memory to do that.
std::string manages the memory for you so you do not have to call malloc() etc. Actually allocator, the 3rd template parameter, manages the memory underneath - basic_string makes the requests for how much memory it needs but is decoupled from the actual memory allocation technique used, so you can use memory pools, etc. for efficiency even with std::string.
In addition basic_string does not actually perform many of the string operations which are done instead through char_traits. (This allows it to use specialist C-functions underneath which are well optimised).
std::string therefore is the best way to manage your strings when you are handling dynamic strings constructed and passed around at run-time (rather than just literals).
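A short sketch of the two operations mentioned above, modifying a character and appending, with std::string doing the memory management (capitalise_and_append is a hypothetical helper name):

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: upper-cases the first character and appends a suffix.
std::string capitalise_and_append(std::string name, const std::string& suffix) {
    if (!name.empty()) {
        name[0] = static_cast<char>(std::toupper(static_cast<unsigned char>(name[0])));
    }
    name += suffix;  // std::string grows its own storage; no strcat or malloc needed
    return name;
}
```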
You will rarely use a string* (a pointer to a string). If you do so it would be a pointer to an object, like any other pointer. You would not be able to allocate it the way you did.
The C++ string class encapsulates a C-like char string. It is much more convenient (http://www.cplusplus.com/reference/string/string/).
For legacy code you can always "extract" a char pointer from a string variable and deal with it as a char pointer:
char * cstr;
string str ("Please split this phrase into tokens");
cstr = new char [str.size()+1];
strcpy (cstr, str.c_str()); // str.c_str() returns a null-terminated const char*
// since C++11, str.data() is equivalent and is also null-terminated
// ... use cstr, then release it with delete[] cstr;
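If manual new[]/delete[] feels error-prone, the same copy can be held in a smart pointer. This is only a sketch of one alternative, assuming C++14 for std::make_unique; to_mutable_cstr is a hypothetical name:

```cpp
#include <cstring>
#include <memory>
#include <string>

// Hypothetical helper: copies the string, terminator included, into a buffer
// that is freed automatically when the unique_ptr goes out of scope.
std::unique_ptr<char[]> to_mutable_cstr(const std::string& str) {
    auto buffer = std::make_unique<char[]>(str.size() + 1);
    std::memcpy(buffer.get(), str.c_str(), str.size() + 1);  // +1 copies the '\0'
    return buffer;
}
```

The returned buffer can be handed to a legacy API via get() and modified freely without affecting the original string.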
Yes, char* is a pointer to an array of characters, which is a string. string* is a pointer to an array of std::string (which is very rarely used).
string *name = "name";
"name" is a const char*, and it would never be converted to a std::string*. This will result in a compile error.
The valid declaration:
string name = "name";
or
const char* name = "name"; // char* name = "name" is valid, but deprecated
string *name = "name";
Does not compile in GCC.